It’s a bird… it’s a plane… it… depends on your classifier’s threshold

Evaluation of an information retrieval system (a search engine, for example) generally focuses on two things:
1. How relevant are the retrieved results? (precision)
2. Did the system retrieve many of the truly relevant documents? (recall)

For those who aren’t familiar, I’ll explain what precision and recall are, and for those who are, I’ll explain some of the confusion in the literature when comparing precision-recall curves.

Geese and airplanes

Suppose you have an image collection consisting of airplanes and geese.

Images of geese and airplanes
You want your system to retrieve all the airplane images and none of the geese images.
Given a set of images that your system retrieves from this collection, we can define four accuracy counts:
True positives: Airplane images that your system correctly retrieved
True negatives: Geese images that your system correctly did not retrieve
False positives: Geese images that your system incorrectly retrieved, believing them to be airplanes
False negatives: Airplane images that your system incorrectly did not retrieve, believing them to be geese

Collection of geese and airplanes
In this example retrieval, there are three true positives and one false positive.

Using the terms I just defined, in this example retrieval, there are three true positives and one false positive. How many false negatives are there? How many true negatives are there?

There are two false negatives (the airplanes that the system failed to retrieve) and four true negatives (the geese that the system did not retrieve).

Precision and recall

Now, you’ll be able to understand more exactly what precision and recall are.

Precision is the percentage of true positives in the retrieved results. That is:

$$\text{precision} = \frac{tp}{tp + fp} = \frac{tp}{n}$$

where n is equal to the total number of images retrieved (tp + fp).

Recall is the percentage of the airplanes that the system retrieves. That is:

$$\text{recall} = \frac{tp}{tp + fn}$$

In our example above, with 3 true positives, 1 false positive, 4 true negatives, and 2 false negatives, precision = 0.75, and recall = 0.6.

75% of the retrieved results were airplanes, and 60% of the airplanes were retrieved.
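
As a quick check of the arithmetic above, here is a minimal Python sketch that computes both values from the raw counts. The function name `precision_recall` is mine, and returning 0.0 when a denominator is zero is an assumption made only to keep the sketch from crashing; the post doesn't discuss that case.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw retrieval counts."""
    retrieved = tp + fp  # n: everything the system returned
    relevant = tp + fn   # every airplane in the collection
    # Assumption: report 0.0 when nothing was retrieved or nothing is relevant.
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    return precision, recall


# The example retrieval: 3 true positives, 1 false positive, 2 false negatives.
precision, recall = precision_recall(tp=3, fp=1, fn=2)
print(precision, recall)  # 0.75 0.6
```

Note that the true negative count never appears: neither precision nor recall depends on how many geese were correctly left out.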

Adjusting the threshold

What if we’re not happy with that performance? We could ask the system to return more examples. This would be done by relaxing our threshold of what we want our system to consider as an airplane. We could also ask our system to be more strict, and return fewer examples. In our example so far, the system retrieved four examples. That corresponds to a particular threshold (shown below by a blue line). The system retrieved the examples that appeared more airplane-like than that threshold.

This is a hypothetical ordering that our airplane retrieval system could give to the images in our collection. More airplane-like images are at the top of the list. The blue line is the threshold that gave our example retrieval.

We can move that threshold up and down to get a different set of retrieved documents. At each position of the threshold, we would get a different precision and recall value. Specifically, if we retrieved only the top example, precision would be 100% and recall would be 20%. If we retrieved the top two examples, precision would still be 100%, and recall would go up to 40%. The following table gives precision and recall for the above hypothetical ordering at all the possible thresholds.

Retrieval cutoff    Precision    Recall
Top 1 image         100%         20%
Top 2 images        100%         40%
Top 3 images        66%          40%
Top 4 images        75%          60%
Top 5 images        60%          60%
Top 6 images        66%          80%
Top 7 images        57%          80%
Top 8 images        50%          80%
Top 9 images        44%          80%
Top 10 images       50%          100%
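
The table can be regenerated with a few lines of Python. This is only a sketch: the `ranking` list below is my reconstruction of the hypothetical ordering, inferred from the precision and recall values in the table (the original figure isn't reproduced here), and the function name is made up for illustration. It assumes the collection contains at least one airplane.

```python
# Ranked list, most airplane-like first; True marks an airplane, False a goose.
# Assumption: this ordering is inferred from the table, not taken from real data.
ranking = [True, True, False, True, False, True, False, False, False, True]


def precision_recall_at_cutoffs(ranking):
    """Return a list of (k, precision, recall) for every cutoff k = 1..N."""
    total_relevant = sum(ranking)  # number of airplanes in the collection
    rows = []
    tp = 0
    for k, is_relevant in enumerate(ranking, start=1):
        tp += is_relevant  # True counts as 1, False as 0
        rows.append((k, tp / k, tp / total_relevant))
    return rows


for k, p, r in precision_recall_at_cutoffs(ranking):
    print(f"Top {k:2d}: precision {p:.0%}, recall {r:.0%}")
```

The printed percentages are rounded, so 2/3 shows up as 67% where the table truncates it to 66%.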

Precision-recall curves

A good way to characterize the performance of a classifier is to look at how precision and recall change as you change the threshold. A good classifier will be good at ranking actual airplane images near the top of the list, and be able to retrieve a lot of airplane images before retrieving any geese: its precision will stay high as recall increases. A poor classifier will have to take a large hit in precision to get higher recall. Usually, a publication will present a precision-recall curve to show how this tradeoff looks for their classifier. This is a plot of precision p as a function of recall r.

The precision-recall curve for our example airplane classifier. It can achieve 40% recall without sacrificing any precision, but to get 100% recall, its precision drops to 50%.

Average precision

Rather than comparing curves, it’s sometimes useful to have a single number that characterizes the performance of a classifier. A common metric is the average precision. This can actually mean one of several things.

Average precision

Strictly, the average precision is precision averaged across all values of recall between 0 and 1:

$$\int_0^1 p(r)\,dr$$

That’s equal to taking the area under the curve. In practice, the integral is closely approximated by a sum over the precisions at every possible threshold value, multiplied by the change in recall:

$$\sum_{k=1}^{N} P(k)\,\Delta r(k)$$

where N is the total number of images in the collection, P(k) is the precision at a cutoff of k images, and delta r(k) is the change in recall that happened between cutoff k-1 and cutoff k.

In our example, this is (1 * 0.2) + (1 * 0.2) + (0.66 * 0) + (0.75 * 0.2) + (0.6 * 0) + (0.66 * 0.2) + (0.57 * 0) + (0.5 * 0) + (0.44 * 0) + (0.5 * 0.2) = 0.782.

Notice that the points at which the recall doesn’t change don’t contribute to this sum (in the graph, these points are on the vertical sections of the plot, where it’s dropping straight down). This makes sense: since we’re computing the area under the curve, those sections of the curve aren’t adding any area.
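
The sum above can be sketched in Python as follows. The `ranking` list is again my reconstruction of the hypothetical ordering from the table, and `average_precision` is an illustrative name, not a library function.

```python
# Assumption: ordering inferred from the precision/recall table; True = airplane.
ranking = [True, True, False, True, False, True, False, False, False, True]


def average_precision(ranking):
    """Approximate the area under the precision-recall curve as sum P(k) * delta_r(k)."""
    total_relevant = sum(ranking)
    ap = 0.0
    tp = 0
    prev_recall = 0.0
    for k, is_relevant in enumerate(ranking, start=1):
        tp += is_relevant
        precision = tp / k
        recall = tp / total_relevant
        # Cutoffs where recall doesn't change add precision * 0 = 0.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


print(round(average_precision(ranking), 3))  # 0.783
```

With unrounded precisions the sum comes to 47/60, or about 0.783. The 0.782 in the worked sum appears to come from writing 2/3 as 0.66.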

Interpolated average precision

Some authors choose an alternate approximation that is called the interpolated average precision. Often, they still call it average precision. Instead of using P(k), the precision at a retrieval cutoff of k images, the interpolated average precision uses:

$$\max_{\tilde{k} \ge k} P(\tilde{k})$$

In other words, instead of using the precision that was actually observed at cutoff k, the interpolated average precision uses the maximum precision observed across all cutoffs with equal or higher recall. The full equation for computing the interpolated average precision is:

$$\sum_{k=1}^{N} \max_{\tilde{k} \ge k} P(\tilde{k})\,\Delta r(k)$$

Visually, here’s how the interpolated average precision compares to the approximated average precision (to show a more interesting plot, this one isn’t from the earlier example):

The approximated average precision closely hugs the actually observed curve. The interpolated average precision overestimates the precision at many points and produces a higher average precision value than the approximated average precision.

Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at a fixed 11 points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}. This is called the 11-point interpolated average precision. Others sample at every k where the recall changes.
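
To make the differences concrete, here is a rough Python sketch of the three variants side by side. All function names are mine, and the five-image `hypothetical` ranking is invented purely for illustration (it is not from the post or from any dataset). The 11-point version follows the usual definition: at each recall level, take the highest precision among cutoffs whose recall is at least that level, then average the 11 values.

```python
def precision_recall_points(ranking):
    """Return [(precision, recall)] for every cutoff k = 1..N."""
    total_relevant = sum(ranking)
    points, tp = [], 0
    for k, is_relevant in enumerate(ranking, start=1):
        tp += is_relevant
        points.append((tp / k, tp / total_relevant))
    return points


def approximated_ap(ranking):
    """Sum of P(k) * delta_r(k), using the precision actually observed."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in precision_recall_points(ranking):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def interpolated_ap(ranking):
    """Same sum, but P(k) is replaced by the best precision at any cutoff >= k."""
    points = precision_recall_points(ranking)
    ap, prev_recall = 0.0, 0.0
    for k, (_, recall) in enumerate(points):
        best_later = max(p for p, _ in points[k:])
        ap += best_later * (recall - prev_recall)
        prev_recall = recall
    return ap


def eleven_point_ap(ranking):
    """Average interpolated precision at recall levels 0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranking)
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for p, r in points if r >= level]
        total += max(candidates, default=0.0)
    return total / 11


# Hypothetical ranking, invented for this sketch: True = relevant image.
hypothetical = [True, False, False, True, True]
print(round(approximated_ap(hypothetical), 3))  # 0.7
print(round(interpolated_ap(hypothetical), 3))  # 0.733
print(round(eleven_point_ap(hypothetical), 3))  # 0.745
```

Three different "average precision" values come out of the same ranking. For the geese-and-airplanes ordering in the table earlier, the approximated and interpolated sums happen to coincide, because at every cutoff where recall increases, the observed precision is already at least as high as at any later cutoff.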


Some important publications use the interpolated average precision as their metric and still call it average precision. For example, the PASCAL Visual Object Classes (VOC) Challenge has used this as their evaluation metric since 2007. I don’t think their justification is strong. They say, “the intention in interpolating the precision/recall curve in this way is to reduce the impact of the “wiggles” in the precision/recall curve”. Regardless, everyone compares against each other on this metric, so within the competition, this is not an issue. However, the rest of us need to be careful when comparing “average precision” values against other published results. Are we using the VOC’s interpolated average precision, while previous work had used the non-interpolated average precision? This would incorrectly show improvement of a new method when compared to the previous work.


Precision and recall are useful metrics for evaluating the performance of a classifier.

Precision and recall vary with the strictness of your classifier’s threshold.

There are several ways to summarize the precision-recall curve with a single number called average precision; be sure you’re using the same metric as the previous work that you’re comparing with.

92 things

A few months ago, I wrote about things I discarded. Now, I write about things I’ve kept. I moved to Mountain View for the summer, and I only have 92 things. In hindsight, that’s still 14 things too many (the red items), but I’m happy that I got that close to living with only what I’ve needed (for my own definition of “need”). These are the things I brought, the things I bought, and the things I borrowed.

Things I brought

Luggage etc.
1. Carry-on sized suitcase
2. Hiking backpack
3. Backpack

I packed the hiking backpack inside of my carry-on, so I just had a carry-on and a backpack and was able to avoid checking in any luggage. It turns out I only use the hiking backpack for day-to-day use. I think if I had not brought the unnecessary things below, I’d have been able to pack everything into the suitcase and the smaller hiking backpack.

4. Hoodie
5-7. Jeans x 3
8. Pants
9. Shorts
10-12. t-shirts x 3
13. Long sleeve merino wool shirt
14-16. Merino wool t-shirts x 3
17-23. Underwear x 7
24-30. Socks x 7 pairs
31. New Balance shoes

I’ve used the hoodie as my pillow, so I guess it was useful to bring, although I could have made something else into a pillow otherwise. It’s way too hot to need a hoodie in Mountain View. If I’d been living in San Francisco like last summer, a hoodie would make more sense. I’ve only used one of my pairs of jeans. Although I’ve used all the t-shirts I brought, the Merino wool shirts have been pretty amazing and I’d have been able to get through the summer just with them. I wear them about 80% of the time. The great thing about them is that they evaporate sweat away so quickly that they don’t end up with even the slightest smell until about the fifth wear (so I’ve been told; I wash them after two or three wears).

Sports gear
32. Running tights
33. Swim shorts
34. Swim goggles
35. Ultimate jersey
36. Long sleeve base layer
37. Waterproof jacket
38. Shorts
39. Ultimate disc
40. Cleats

It’s too warm here to need bottoms for running. I was never really going to swim. Also, summer in Mountain View… why did I bring a waterproof jacket? My cleats had needed replacement for a few months, so I should have just bought a new pair when I arrived here.

41. iPhone
42. iPhone charger
43. Laptop
44. Laptop charger
45. Camera
46. Camera battery charger
47. Camera USB cable
48. Mini-dvi to VGA converter
49. Headphones

I forgot that Google has projector converters for laptops in every conference room. The iPhone has been borderline unnecessary, since I’d be charged roaming data rates down here, but I have used it a few times. Its battery is usually empty, and I use Google Voice on my computer to make and receive calls. That’s backfired a couple of times, though (sorry!)

50-52. 3 pens
53. Notebook
54. Smaller notebook

Personal care
55. Razor
56. Contact lenses
57. Small first aid kit
58. Medicine
59. Travel towel

Documents etc.
60. A folder with Visa and other documents
61. Wallet (and cards)
62. Passport
63. Pilot licence

Again, I didn’t make the time to convert my Canadian pilot licence to a US pilot licence.

64. House keys
65. Water bottle
66. Monopoly Deal

Things I’ve acquired

Luggage etc.
67. A Google branded Patagonia backpack

This was given to us on the first day. I’m giving this away; let me know if you want it.

Sports gear
68. A 2nd Ultimate disc
69. New cleats
70. New running shoes

71. FitBit

On my first day, a person in my office called out to me as I walked by and tossed a small box in my direction. Inside it was a FitBit. It’s been a fun way of tracking my activity and sleep patterns.

Personal care
72. A crappy bath towel
73. A bath towel
74. Q-Tips
75. Toothpaste
76. Floss
77. Shaving gel
78. Nail clippers
79. Contact lens solution
80. First aid/athletic tape
81. Neosporin
82. Non-stick pads
83. Gauze
84. Sunscreen
85. Aleve
86. Vitamin D

I learned that I should spend more than $3 on a bath towel. Four of these things (guess which ones) wouldn’t have been needed if I didn’t fail at riding a bike.

Documents etc.
87-88. 2 library cards

89. A second water bottle

My recruiter gave me a free water bottle part-way through the summer. I like that it’s metal, so I’ll get rid of my plastic water bottle.

Things I’ve borrowed

90. Laptop
91. A Lovecraft compilation from the library
92. A bike

Things I miss

I do miss some things, though! I miss my panniers and I miss my Playstation. I miss real Dominion cards and Zendo. I miss being by the beach and I miss the mountains. Those last two don’t count though, since I’d never be able to bring them anyway.

What would be on your list of 92 things?

Stress fractures

I had a stress fracture. I think it is healed now. This is the story of how it came and went.

What are stress fractures?

When I would tell somebody I had a stress fracture, they would usually ask, “what’s a stress fracture”, or “what happened?”

A stress fracture is a fracture, but an incomplete one (the separation doesn’t go through the entire bone) and it’s caused by repeated impact stress over time and not a single, acute impact. Mine was in my leg (the medial tibia, at the junction of the middle and distal one thirds, to be exact). That’s the most common place for a stress fracture in adults.

What happened?

This is hard to answer, because of the gradual onset of the fracture. This spring, I was playing with UBC Men’s Ultimate B team and training casually for the Vancouver Sun Run, a 10k race. The schedule didn’t seem too intense. Ultimate practices were twice a week, two hours each. Weekend tournaments (infrequent) were two days long, where we would play up to five games in a day. I occasionally went to the gym and included plyometrics in my workouts. I had only one lingering injury… stiffness from a sprained right ankle (this could have reduced my ability to absorb impacts as naturally).

What did it feel like?

I first noticed a general pain along my medial right tibia in March. It felt kind of nice to push on it like a massage. The pain gradually increased, but I got used to it. During a game or practice, the pain seemed to go away, but would come back afterwards, even during rest. Eventually, it hurt to step up onto things. The pain was localized to a single point about 1cm in diameter. I could poke at exactly the point on my tibia that hurt.

The Sun Run was when I finally realized/admitted that something was wrong. I ran with a strained left MCL, probably putting more stress onto my right leg than before. MCL strained, a stiff right ankle, and a burgeoning stress fracture in my right tibia: it was too much all at once.

After that run, I had all the symptoms of a tibial stress fracture: pin-point pain, I couldn’t jump on that leg, it hurt to walk for a few steps after standing up from a chair, it hurt to walk up or down stairs, and it hurt to stand on my right leg. But, I didn’t know this was a stress fracture yet.


My first visit to the doctor was the week after the Sun Run. I waited a few days to confirm that it wasn’t getting any better on its own, then went to a sports doctor at UBC. From my description of the symptoms alone, he was pretty much ready to diagnose me with a stress fracture, but he couldn’t rule out a bone bruise. I was to do my regular activities as able, and if the pain didn’t go away, then he’d be more confident that it was a stress fracture (in which case, I shouldn’t have been doing my regular activities).

I played two more games of ultimate before returning to the clinic and saw a different doctor who ordered a bone scan to confirm the stress fracture. The bone scan was really cool. First, they injected me with technetium-99m-MDP. It’s a radioactive material that’s absorbed by bones undergoing more rapid turnover, like the healing site of a stress fracture. After a few hours of letting that go through my body, I returned for the scan. I lay down while the scanner imaged my leg. It was a slow scan — several minutes at a time for each of the views they wanted to get — and I could see the results of the scan as they filled in on the screen above me. It basically looked something like this:

Bone scan showing a stress fracture of the tibia
The bright white spot is where the radioactive material was being collected more heavily by the damaged and healing bone.

Treatment and progress

The bone scan confirmed my stress fracture. I’d probably had it since about mid-March, played ultimate and ran the Sun Run on it. But now, I had to give it rest. That is the only way a stress fracture will heal. No running, no ultimate. My instructions were to rest until I’d experienced ten consecutive days without pain from the normal activities of life. I was allowed to walk around as normal, and do non-impact activities like biking, but I avoided even jogging to catch up to a bus I was about to miss.

After about two days, it no longer hurt to bike. After five days, it didn’t hurt to walk or stand up anymore. After two weeks, it didn’t hurt to walk up or down stairs, and I’d started to forget occasionally that I was injured (not always a good thing). After three weeks of rest, I went to an ultimate clinic and was able to jog around lightly to participate in some of the drills. After the fourth week, I tried again to do some light drills, but this time, the fracture site hurt again for a few days… too much, too soon. I backed off and took another three weeks off from any jogging. I did a bike trip up the Sunshine Coast during this time, but biking had felt fine for weeks. After those three weeks (6th week after diagnosis), it really felt a lot better. I started practicing ultimate twice a week again, although at a lighter pace at the start. After week 8, I played in a weekend ultimate tournament. It’s now 15 weeks after the diagnosis, and it feels really good. I’ve sometimes tried to ramp up my activities too quickly, and that causes discomfort and sometimes pain at the old injury site. When that happens, I just take time off until it feels better (it’s needed a week at most) and then try again. It’s always been fine the second time.


The main advice seems to be to give your bone time to adapt and rebuild in response to increased activity. If you add too much new stress at once, the bone will simply weaken. However, if you increase activity gradually, the bone will have time to heal, adapt, and strengthen in response.

An adequate intake of vitamin D and calcium is also important for maintaining bone density. Taking extra vitamin D supplements seems to decrease the likelihood of a stress fracture.

Actual doctors

If you want to hear what actual doctors have to say, here are some references that I found helpful and interesting:

What’s your story?

Was your recovery similar? Any advice that I missed?

Fewer things

Two years ago, I started pruning my possessions while processing the clutter around my apartment.

Last year, I deleted my Facebook account.

In December, I saw Tron and Sam Flynn’s shipping container home. Was this possible? The internet told me: yes. I found sites dedicated to shipping container homes, simple living, and minimalism. That’s when I really started getting rid of things:

  • A sombrero chip and dip platter I’d never used.
  • A S’Mores maker I’d never used.
  • Shampoo. I use soap instead.
  • A wine rack. I never have more than one bottle at a time.
  • Three old backpacks.
  • A pair of rollerblades.
  • Many books. I donated and recycled these.
  • CD cases. I liked the art and design of some of the booklets, so I kept those.
  • A futon.
  • Clothing. A lot of this, I hadn’t worn in over a year.
  • Chest of drawers. (I’d gotten rid of so much clothing, I needed three fewer drawers.)
  • Two bookshelves and a TV stand. I replaced these with an Expedit. I’d gotten rid of so many books that this was a better fit. The Expedit was perfect: book storage, places for my consoles, TV, board games, and even a drawer unit in one of the squares. It’s the only visible storage in my apartment.
  • Two bins worth of “stuff”: baseball cards, old certificates, posters, trophies (junk, basically).
  • Two storage bins.
  • Two shelving units.
  • My 1 bedroom apartment. I moved into a bachelor apartment, because after getting rid of all this stuff, I needed much less space.

Very few of these were thrown in the garbage. I was able to donate, sell, or give away much of it. I’m still far from living out of a shipping container, but much closer than before. I’m going to start trying to get rid of one more thing each day, because there’s still a lot of excess. There’s only a small marginal cost to each item, but together, they make for a more costly and distracting experience that takes more time, attention, and space to maintain.

Donate your books

I’m filing this under “Productivity” because minimizing clutter helps you devote your time, attention, and space to things that are important to you.

Old books are common clutter items. But, where can you get rid of them in Vancouver?

Selling books is usually not worth your time, especially for very out-of-date textbooks. This only works out if you sell a textbook immediately after you are finished with it and if it’s required reading for the next term.

In Vancouver, you can recycle both hard and soft-cover books (from Vancouver recycling’s FAQ):

For soft cover books, they can go into the “mixed paper” bag or “paper products” cart for recycling. For the yellow bag, you have to be careful not to exceed the 20 kilogram (44 pound) limit as the crews lift these by hand, so you may have to do it over a number of weeks. You can also drop off the soft cover books in the mixed paper bin at our recycling depot for free.
For hardcover books, remove the paper from the hardcover and binding/glue on the spine by cutting or tearing. The paper can be recycled as mixed paper (listed above) and the covers and binding/glue will go into the garbage.

You can also donate your books. One charity that takes any type of book, including textbooks, is Reading Tree. They have collection bins all around Vancouver, and even will arrange to pick them up from your place. Find a bin near you using their bin locator.

Update: Reading Tree has ceased operations and transferred management of their collection boxes to Discover Books. Find collection bins near you at


Marking stones for the game Zendo.

I almost have Zendo! I ordered some pyramids that should arrive tomorrow, and just bought some glass marking stones.
This is a fun logic game. One person is chosen as the master. The master creates a secret rule or test. You win the game by being the first player to correctly guess the master’s rule. The rule describes valid arrangements of the pyramids. An example of a rule is: “the arrangement must have at least one green pyramid”.
To begin the game, the master creates two example arrangements, one that follows the rule, and one that does not follow the rule. The master places a light marking stone beside the arrangement that follows the rule, and a dark marking stone beside the arrangement that does not follow the rule.
Players then, in turn, create arrangements and ask the master to mark them. The master marks arrangements that follow the rule with a light stone and marks those that do not follow the rule with a dark stone. By doing this, players gain information about the rule.
There are special guessing stones that you need to obtain if you want to make a guess at the rule, and there’s a special way to acquire these. It’s all explained in more detail at the Wikipedia article.
When you finally are ready to make a guess at what the rule is, you spend one guessing stone and tell the master your guess. If you are correct, you win the game. If your guess is incorrect, the master constructs a counter-example, and marks it appropriately. The master either builds an example that your rule would have marked light, but is in fact dark, or builds an example that your rule would have marked dark, but is in fact light.
Zendo’s a good game if you’re looking for something abstract. There’s no board, so it’s very portable. All you need is some space on a table to set up the examples. Depending on the difficulty of the rule, how good the players are at guessing, and how much information comes from the master’s examples, each game should take between 5 and 20 minutes. There’s as much fun in being the master as there is in being a player. The pyramids cost $24 from Looney Labs during their closeout sale, and the stones about $10 from a craft store.