Learning TensorFlow

Over the past two weeks, I’ve been teaching myself TensorFlow, Google’s open source library for deep neural networks (and, really, graph computation in general).

It was so easy to get started with TensorFlow that I was fooled into thinking I’d be writing a character-based recurrent-neural-network language model in a couple of days.

The TensorFlow website gives a few learning paths: Get Started, Programmer’s Guide, and Tutorials. These were written for various versions of the API and they don’t use consistent idioms or up-to-date functions. Regardless, I found these useful to go through to give me a sense of what it would be like to use TensorFlow.

After going through a few tutorials, I made the following learning plan and now feel comfortable defining and training non-distributed models in TensorFlow:

  • Create a simple single-layer network to try to learn a function from random data. This teaches how graph definition is separate from running the computation in a session, and how to feed data into placeholder input variables.
  • Output some summary data and use TensorBoard to visualize that the loss doesn’t decrease.
  • Create some synthetic data for a simple function. I used y = x[0] < x[1]. This just lets you confirm that the loss actually decreases during training. You can also visualize the weights as they change during training. (A sketch combining these first steps appears after this list.)
  • Replace the synthetic data with data loaded from a file using an input queue. Input queues were the most confusing part of TensorFlow so far. Here is a minimal example of an input queue that reads records from a file and creates shuffled batches. One thing that made this more confusing than necessary: TensorBoard was telling me that the shuffle_batch queue never filled up, but that was only because my simple model was being evaluated so quickly during the optimization step that the queue drained as fast as it filled. Once I increased the complexity of the model by adding a few more fully-connected layers, the optimization step took long enough for the queue to actually be helpful.
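Here’s a minimal sketch combining the first three steps: a graph defined separately from the session that runs it, placeholders fed with synthetic y = x[0] < x[1] data, and a scalar summary written for TensorBoard. This uses TensorFlow 1.x idioms; the layer sizes, learning rate, and log directory are arbitrary choices, not anything from my actual code.

```python
import numpy as np
import tensorflow as tf

# Graph definition: nothing is computed here.
x = tf.placeholder(tf.float32, shape=[None, 2], name="x")
y = tf.placeholder(tf.float32, shape=[None, 1], name="y")

hidden = tf.layers.dense(x, 16, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 1)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

tf.summary.scalar("loss", loss)
merged_summaries = tf.summary.merge_all()

# Running the computation happens in a session, feeding the placeholders.
with tf.Session() as sess:
    writer = tf.summary.FileWriter("/tmp/toy_logs", sess.graph)
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        xs = np.random.rand(64, 2).astype(np.float32)
        ys = (xs[:, 0] < xs[:, 1]).astype(np.float32).reshape(-1, 1)
        _, summary = sess.run([train_op, merged_summaries],
                              feed_dict={x: xs, y: ys})
        # Inspect with: tensorboard --logdir /tmp/toy_logs
        writer.add_summary(summary, step)
```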

The MonitoredTrainingSession is very helpful. It initializes variables, watches for stopping criteria, saves checkpoints and summaries, and restarts from checkpoint files if training gets interrupted.
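For instance, a training loop built around it can be as small as this sketch (the stand-in loss, checkpoint directory, and stopping step are placeholder values):

```python
import tensorflow as tf

# A trivial stand-in loss so the example is self-contained.
w = tf.Variable([5.0, -3.0])
loss = tf.reduce_mean(tf.square(w))

global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

with tf.train.MonitoredTrainingSession(
        checkpoint_dir="/tmp/train_logs",  # checkpoints and summaries go here
        hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
    # Note what's missing: no initializer, no Saver, no summary-writing
    # code. The monitored session initializes variables, restores from a
    # checkpoint if one exists, saves periodically, and stops via the hook.
    while not sess.should_stop():
        sess.run(train_op)
```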

My first real TensorFlow model was a char-rnn (used to model text by predicting the next character based on the previous sequence of characters). The part of the TensorFlow API that deals with recurrent neural networks has changed a lot over the past year, so the various examples you’ll find online present different ways of doing things.

  • TensorFlow’s own tutorial does not use tf.nn.dynamic_rnn to create the recurrent neural network from a prototype cell. Instead, it shows an example that explicitly codes the loop over timesteps and explicitly handles the recurrent state between calls to the prototype cell.
  • This blog post by Denny Britz is a good explanation of how to use dynamic_rnn to avoid having to do all of that by hand. It mentions a helpful function: sequence_loss_by_example, but that appears to have been superseded by sequence_loss.
  • This blog post by Danijar Hafner is a second example showing how to use dynamic_rnn. It also shows how to flatten the outputs from the recurrent cell across timesteps so that you can easily apply the weights used for the output projection. However, this example doesn’t take advantage of the sequence_loss function, instead computing the sequence labelling cost by doing a partial summation and then averaging. (A minimal sketch of the dynamic_rnn-plus-sequence_loss pattern follows this list.)
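Here is a minimal sketch of the pattern those posts describe: dynamic_rnn unrolling a prototype cell, flattening across timesteps for the output projection, and sequence_loss for the cost. This is TensorFlow 1.x style, and the vocabulary and layer sizes are placeholder values, not the ones from my model.

```python
import tensorflow as tf

batch_size, seq_len, vocab_size, hidden_size = 32, 50, 128, 256

inputs = tf.placeholder(tf.int32, [batch_size, seq_len], name="inputs")
targets = tf.placeholder(tf.int32, [batch_size, seq_len], name="targets")

embeddings = tf.get_variable("embeddings", [vocab_size, hidden_size])
embedded = tf.nn.embedding_lookup(embeddings, inputs)

# dynamic_rnn unrolls the prototype cell over time: no hand-written loop
# over timesteps, no manual threading of the recurrent state.
cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
outputs, final_state = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)

# Flatten across timesteps so one output projection applies to every step.
flat = tf.reshape(outputs, [-1, hidden_size])
logits = tf.reshape(tf.layers.dense(flat, vocab_size),
                    [batch_size, seq_len, vocab_size])

# sequence_loss averages the per-timestep cross-entropy for us.
loss = tf.contrib.seq2seq.sequence_loss(
    logits, targets, weights=tf.ones([batch_size, seq_len]))
```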

My main point is: don’t assume you’ve misunderstood something when you can’t reconcile two different examples that claim to demonstrate the same thing. It’s likely just an API change.

My own example is here. It’s not perfect either. I’m not passing state from the end of one batch to the beginning of the next batch, so this isn’t standard truncated back-propagation through time. But the dataset I’m learning on doesn’t appear to have dependencies that are longer than the length I chose for input sequences. R2RT discusses the distinctions between a couple of different styles of back-propagation through time. The approach I ended up implementing is almost what R2RT calls “TensorFlow style”.

Further, I wasn’t thinking ahead to how I would load the trained weights for sampling when I wrote the training script. Instead, I redefined parts of the model structure in my sampling script. This is not good. A better approach is to define the graph structure in a class (like in this example). This lets you use the exact same model during evaluation/sampling as was used during training, which is important for matching the saved weights to their variables based on their keys (names).
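A rough sketch of what that class-based pattern might look like (hypothetical names throughout; the point is only that both scripts construct identically-named variables):

```python
import tensorflow as tf

class CharRNNModel(object):
    """Builds the same graph structure for both training and sampling."""

    def __init__(self, vocab_size, hidden_size, batch_size, seq_len):
        self.inputs = tf.placeholder(tf.int32, [batch_size, seq_len])
        embeddings = tf.get_variable("embeddings", [vocab_size, hidden_size])
        embedded = tf.nn.embedding_lookup(embeddings, self.inputs)
        cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
        outputs, self.final_state = tf.nn.dynamic_rnn(
            cell, embedded, dtype=tf.float32)
        flat = tf.reshape(outputs, [-1, hidden_size])
        self.logits = tf.layers.dense(flat, vocab_size, name="projection")

# train.py:  model = CharRNNModel(128, 256, batch_size=32, seq_len=50)
# sample.py: model = CharRNNModel(128, 256, batch_size=1, seq_len=1)
# Because both scripts create variables with the same names, a
# tf.train.Saver can restore the trained weights into the sampling graph.
```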

If you’ve already been using TensorFlow for some time, I’d appreciate any feedback you have for me on my early TensorFlow code that I’ve posted on GitHub. Are there TensorFlow design patterns I’m missing, or helper functions I don’t know about? Let me know!


Meanwhile, back in Canada

During the 2015 Federal Election campaign, Mr. Trudeau promised to end first-past-the-post elections in Canada. We voted with the understanding that a Liberal victory would mean the end of first-past-the-post elections. The Liberals won.

Mr. Trudeau and the Liberals made this promise knowing that Canadians are not united against first-past-the-post, nor united around a particular alternative.


In 2005, the single transferable vote (STV) was put before British Columbians in a referendum, and 43% voted for the status quo. STV did not reach the required 60% supermajority and was not adopted.

In 2009, BC held another referendum in which 60% voted for the status quo; 40% for STV.

A survey conducted by the Broadbent Institute just after the 2015 election found that “44% of Canadians prefer one of the proportional voting systems while 43% prefer the status quo, the single member plurality system.” This is consistent with previous surveys and historical data.

Despite the lack of clear consensus for a concrete alternative, the Liberals promised to do the hard work of selecting an alternative, educating the people, and passing the legislation needed to change our electoral system.

Why? Because proportional representation would produce better public policy.

And, they won. Sure, only 39.5% of Canadians voted for the Liberal Party in this past election, but that gave them 54% of the seats in Parliament, and a majority government.

Then, on February 1, 2017, Mr. Trudeau published this mandate letter. It said:

A clear preference for a new electoral system, let alone a consensus, has not emerged. Furthermore, without a clear preference or a clear question, a referendum would not be in Canada’s interest. Changing the electoral system will not be in your mandate.

Mr. Trudeau did not make his promise dependent on a consensus “emerging”. This consensus did not emerge in the past 25 years. It was never going to emerge in 12 months. In his promise, he committed to doing the hard work and expending the political capital to educate Canadians and develop whatever consensus is possible. A plan to passively wait for consensus to “emerge” is no plan at all.

If lack of consensus is all it takes to stymie the Liberal agenda, I don’t understand how they are proceeding with any of their promises (60.5% of Canadian voters didn’t vote for them), or how they are selecting which promises to work on and which to walk away from.


Washington v. Trump

Today, the 9th Circuit denied President Trump’s appeal that would have reinstated his Executive Order on immigration. This opinion addressed only a preliminary question of whether the Executive Order will remain in effect while its constitutionality is fully argued in a lower court, but it reveals the trouble that the Government will have defending it.

One of the Government’s arguments was that “the President’s decisions about immigration policy, particularly when motivated by national security concerns, are unreviewable, even if those actions potentially contravene constitutional rights and protections.”

The court rejected that argument.

There is no precedent to support this claimed unreviewability, which runs contrary to the fundamental structure of our constitutional democracy.

The Government also argued that “if the four corners of the Executive Order offer a facially legitimate and bona fide reason for it […] the court can’t look behind that.” In this argument, they were trying to prevent the court from considering statements, tweets, and interviews by President Trump and his advisors that could reveal that the Executive Order was, in part, religiously-motivated.

The court rejected that argument.

The States argue that the Executive Order violates the Establishment and Equal Protection Clauses because it was intended to disfavor Muslims. In support of this argument, the States have offered evidence of numerous statements by the President about his intent to implement a “Muslim ban” as well as evidence they claim suggests that the Executive Order was intended to be that ban, including sections 5(b) and 5(e) of the Order. It is well established that evidence of purpose beyond the face of the challenged law may be considered in evaluating Establishment and Equal Protection Clause claims.

The Government also tried to rely on “authoritative guidance” from White House counsel that the Executive Order does not affect legal permanent residents. The Government argued that the court should understand the Executive Order based on the most recent interpretation by the White House counsel. The court was concerned about the Government’s “shifting interpretation”, and rejected that argument.

Nor has the Government established that the White House counsel’s interpretation of the Executive Order is binding on all executive branch officials responsible for enforcing the Executive Order. The White House counsel is not the President, and he is not known to be in the chain of command for any of the Executive Departments. Moreover, in light of the Government’s shifting interpretations of the Executive Order, we cannot say that the current interpretation by White House counsel, even if authoritative and binding, will persist past the immediate stage of these proceedings. On this record, therefore, we cannot conclude that the Government has shown that it is “absolutely clear that the allegedly wrongful behavior could not reasonably be expected to recur.”

The most interesting part of my past week was listening, along with a hundred thousand other people, to this case’s oral argument. It was a display of the kind of work the judiciary does every day: checking whether the case should even be before the courts, probing the limits of the arguments presented by each side, and at the core, just trying to understand the case and arguments before them so they can correctly apply the law.

There is nothing better than an adversarial dispute to crystallize the meaning of a statute, the limits of Government power, or the extent of our rights. I’ve spent as much time reading appellate opinions as any other material over the past few years. It’s not because I miss first year philosophy or want to be a lawyer; it’s because they contain tough questions that reveal how the various parts of our society fit together. And, much of it is decent writing. They are written as much for us as they are for lawyers. Good journalism answers “so what”, but nothing can sub in for the opinion itself.

Here are some of the law people I’m following on Twitter who give context to significant cases and insight based on their personal experiences with the courts (and also, some entertainment).

And here are a couple of sites that present primary sources: oral argument audio, transcripts, briefs, opinions.

I haven’t found anything close to the same for Canada. But, you can search our Supreme Court’s judgements by date, topic, party, etc. here. (Try to find the one where a farmer harvested, saved, and planted Monsanto seed across 95% of his farm and then claimed he wasn’t using it.)


Washington, D.C.

January 16, Martin Luther King Jr. Day

With thousands of others who waited hours for free tickets, I got to see Gladys Knight and the Let Freedom Ring Choir perform at the Kennedy Center. You can watch the event here. It was a joy-filled celebration of a man, a movement, and imperfect, incomplete success.

In this temple, as in the hearts of the people, for whom he saved the union, the memory of Abraham Lincoln is enshrined forever.

After the concert, I walked to the Lincoln Memorial. I read the words above his head with a comma: “the people, for whom he saved the union, …” Without the comma, it reads, “in the hearts of the people for whom he saved the union”, connoting that he saved the union only for some people. He saved the union for all people, at least a more expansive concept of people than at the outset of the union.


January 17, Visit to the Capitol and the Library of Congress

My visit to D.C. let me see the people and institutions that might check executive power. The ten-minute pro forma session of the Senate was an example of that, and a reminder that people trust in the power of their institutions. The Senate would not have held many of its pro forma sessions except that they prevent the President from making recess appointments.

Next was the Library of Congress.


I cannot live without books, but fewer will suffice…

Before I spent some time in the Reading Room, I visited Thomas Jefferson’s library. Jefferson offered to Congress his entire collection after the Library of Congress was largely destroyed in the 1814 burning of Washington. He numbered them, and arranged them by subject. It was filled with history, fiction, science, politics, religion, law, literary criticism, math… He had a Koran. Anticipating that Congress might think this collection too diverse, he wrote to them: “there is in fact no subject to which a member of Congress may not have occasion to refer.”

January 18, United States Supreme Court


I chose to attend the oral argument of Lee v. Tam. It was a First Amendment case, it had a sympathetic plaintiff (Mr. Tam and the Slants), and the outcome will almost certainly determine the outcome for the “Redskins” trademark. A couple of people told me that I should arrive at 4am or even 3am to be guaranteed a spot in the audience. I arrived at 3:08 and was 8th in line, behind mostly line-holders who looked like they had been there overnight. By 4am, the line was past capacity.

At the front of the line, I was surrounded by people very familiar with the case: family of the attorney who would argue the case for Mr. Tam, an author of an NSFW amicus brief from the Cato Institute, somebody close to the legal team for Pro-Football, Inc…

The First Amendment protection of speech is an important check on the government. Expressive speech comes in many forms: journalism, literature, comedy, music. Today, much of this expression takes place in the commercial sphere. Sometimes, a speaker is emboldened to choose a particular message because they can trust in the protection of trademark law to secure exclusive use of that message as an indicator of their goods or services. The government has decided to withhold from a certain category of speech (speech that disparages) the special protections that trademark registration brings.

This case asks: is this kind of restriction a burden on speech? Is that burden viewpoint-based? And if so, is it nonetheless justified by the purpose of the government’s trademark registration program?

The session started promptly at 10am with four minutes spent admitting attorneys from around the country to the Supreme Court Bar. Then, from Chief Justice Roberts: “Justice Sotomayor has our opinion this morning…” — I had already seen on SCOTUSBlog’s calendar that there would be an opinion announced today, so this wasn’t a complete surprise, but these don’t happen every session, and you never know what opinion will be presented — “…in case No. 14-1055, Lightfoot versus Cendant Mortgage Corporation.” I was unfamiliar with this case. She read her prepared opinion summary. Turns out that federal courts don’t have automatic jurisdiction over cases that happen to involve Fannie Mae. Who knew?

Chief Justice Roberts then introduced the first case, “We’ll hear argument first this morning in case No. 15-1293, Lee versus Tam.” You can listen to the argument here. Counsel for Mr. Tam and the Slants did not have a good day.

A third of the audience departed after Lee v. Tam, so the room was much less full for the second case, Ziglar v. Abbasi. This case relates to how hundreds of Middle Eastern men were detained and treated in the weeks and months after 9/11. The main question before the court was: can the men who were detained sue then-Attorney General John Ashcroft (and others responsible for the detention and conditions) in his personal capacity?

Only six justices heard the case, the minimum allowed. Justices Kagan and Sotomayor had each recused themselves from the already shorthanded court. This case has been around so long that they each probably worked on it in some capacity before their appointments to the Supreme Court.

It was very well argued by Ms. Meeropol (grandchild of Ethel and Julius Rosenberg). She took every opportunity to turn the argument back to points she wanted to emphasize or clarify. She was prepared for every question and handled hypotheticals with consistency. The oral argument audio is here.

This was also Mr. Gershengorn’s final argument as Acting Solicitor General for President Obama.


January 19, Museum Day

NPR HQ! They were not hosting tours that day, but I still got to see their history exhibit and gift shop, and now I have a new mug.

The Smithsonian National Museum of African American History and Culture (that needs a shorter name) was out of passes, so I spent the afternoon in the Natural History Museum and then wandered the National Mall amongst the people who had already arrived for the inauguration the next day. The day-before-the-inauguration vibe was mostly one of spectacle and people watching people.


Pilots, you don’t need to carry your radio licence


It is a widespread myth in Canadian aviation that you must carry your radio operator’s certificate with you on board the aircraft. It’s not true! Here are some examples of that claim in the wild:

There are a number of documents that must be on board in order to fly […] I carry a Restricted Radio Operator Certificate, restricted to aviation. (Bits of Paper)

During flight operations in Canada, the following documents must be carried aboard the aircraft […] Pilot radiotelephone operator’s certificate… (Required documents)

So an operating certificate is always needed wherever a Canadian pilot is operating a radio on a Canadian aircraft […] They have not been inspecting Canadian pilots recently to ensure that pilots are carrying this licence, but can do so at any time. (The AOPA/COPA Guide to Cross Border Operations, Page 31)

The following must be carried by the pilot […] A Restricted Radiotelephone Operator’s Certificate is only required if you intend to transmit on an aviation-band radio. (Are you legal?)

Also on board must be […] the radio operator’s licence of the pilot […] (From the Ground Up, 27th Revised Edition 1996, p. 100)

What do the regulations actually say?

A person may operate radio apparatus in the aeronautical service, maritime service or amateur radio service only where the person holds an appropriate radio operator certificate as set out in column I of any of items 1 and 3 to 15 of Schedule II. (Radiocommunication Regulations, s 33)

The holder of a radio authorization shall, at the request of an inspector appointed pursuant to the Act, show the radio authorization or a copy thereof to the inspector within 48 hours after the request. (Radiocommunication Regulations, s 38)

You only need to produce the radio operator certificate (or even just a copy!) within 48 hours of a request by an Industry Canada inspector. You do not need to be able to produce the document while exercising the privileges. When Canada wants you to have the document with you, it knows how to say that:

[…] no person shall act as a flight crew member or exercise the privileges of a flight crew permit, licence or rating unless (a) the person holds the appropriate permit, licence or rating […]; and (d) the person can produce the permit, licence or rating, and the certificate, when exercising those privileges. (Canadian Aviation Regulations, s 401.03)

You must be able to produce your pilot licence and medical certificate while exercising their privileges. You don’t need to do this for the radio operator certificate.


It’s a bird… it’s a plane… it… depends on your classifier’s threshold

Evaluation of an information retrieval system (a search engine, for example) generally focuses on two things:
1. How relevant are the retrieved results? (precision)
2. Did the system retrieve many of the truly relevant documents? (recall)

For those that aren’t familiar, I’ll explain what precision and recall are, and for those that are familiar, I’ll explain some of the confusion in the literature when comparing precision-recall curves.

Geese and airplanes

Suppose you have an image collection consisting of airplanes and geese.

Images of geese and airplanes
You want your system to retrieve all the airplane images and none of the geese images.
Given a set of images that your system retrieves from this collection, we can define four accuracy counts:
True positives: Airplane images that your system correctly retrieved
True negatives: Geese images that your system correctly did not retrieve
False positives: Geese images that your system incorrectly retrieved, believing them to be airplanes
False negatives: Airplane images that your system incorrectly did not retrieve, believing them to be geese

Collection of geese and airplanes
In this example retrieval, there are three true positives and one false positive.

Using the terms I just defined, in this example retrieval, there are three true positives and one false positive. How many false negatives are there? How many true negatives are there?

There are two false negatives (the airplanes that the system failed to retrieve) and four true negatives (the geese that the system did not retrieve).

Precision and recall

Now, you’ll be able to understand more exactly what precision and recall are.

Precision is the percentage of true positives in the retrieved results. That is:

$$\text{precision} = \frac{tp}{n}$$

where n is equal to the total number of images retrieved (tp + fp).

Recall is the percentage of the airplanes that the system retrieves. That is:

$$\text{recall} = \frac{tp}{tp + fn}$$
In our example above, with 3 true positives, 1 false positive, 4 true negatives, and 2 false negatives, precision = 0.75, and recall = 0.6.

75% of the retrieved results were airplanes, and 60% of the airplanes were retrieved.
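In code, the two definitions are one line each. A small sketch, using the counts from the example:

```python
def precision_recall(tp, fp, fn):
    precision = tp / float(tp + fp)  # fraction of retrieved that are relevant
    recall = tp / float(tp + fn)     # fraction of relevant that were retrieved
    return precision, recall

# 3 true positives, 1 false positive, 2 false negatives:
print(precision_recall(3, 1, 2))  # (0.75, 0.6)
```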

Adjusting the threshold

What if we’re not happy with that performance? We could ask the system to return more examples, by relaxing our threshold for what it considers an airplane. Or we could ask our system to be more strict and return fewer examples. In our example so far, the system retrieved four examples. That corresponds to a particular threshold (shown below by a blue line): the system retrieved the examples that appeared more airplane-like than that threshold.

This is a hypothetical ordering that our airplane retrieval system could give to the images in our collection. More airplane-like are at the top of the list. The blue line is the threshold that gave our example retrieval.

We can move that threshold up and down to get a different set of retrieved documents. At each position of the threshold, we would get a different precision and recall value. Specifically, if we retrieved only the top example, precision would be 100% and recall would be 20%. If we retrieved the top two examples, precision would still be 100%, and recall will have gone up to 40%. The following chart gives precision and recall for the above hypothetical ordering at all the possible thresholds.

Retrieval cutoff Precision Recall
Top 1 image 100% 20%
Top 2 images 100% 40%
Top 3 images 66% 40%
Top 4 images 75% 60%
Top 5 images 60% 60%
Top 6 images 66% 80%
Top 7 images 57% 80%
Top 8 images 50% 80%
Top 9 images 44% 80%
Top 10 images 50% 100%
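The table can be generated by sweeping the threshold down a ranked list. A sketch, where `ranking` encodes the hypothetical ordering implied by the table (1 for airplane, 0 for goose, best-ranked first):

```python
def precision_recall_at_cutoffs(ranked_labels):
    """Precision and recall at every cutoff of a ranked list."""
    total_positives = sum(ranked_labels)
    tp, results = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        results.append((k, tp / float(k), tp / float(total_positives)))
    return results

ranking = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]
for k, p, r in precision_recall_at_cutoffs(ranking):
    print("Top %2d images: precision %3.0f%%, recall %3.0f%%"
          % (k, 100 * p, 100 * r))
```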

Precision-recall curves

A good way to characterize the performance of a classifier is to look at how precision and recall change as you change the threshold. A good classifier will be good at ranking actual airplane images near the top of the list, and be able to retrieve a lot of airplane images before retrieving any geese: its precision will stay high as recall increases. A poor classifier will have to take a large hit in precision to get higher recall. Usually, a publication will present a precision-recall curve to show how this tradeoff looks for their classifier. This is a plot of precision p as a function of recall r.

The precision-recall curve for our example airplane classifier. It can achieve 40% recall without sacrificing any precision, but to get 100% recall, its precision drops to 50%.

Average precision

Rather than comparing curves, it’s sometimes useful to have a single number that characterizes the performance of a classifier. A common metric is the average precision. This can actually mean one of several things.

Average precision

Strictly, the average precision is precision averaged across all values of recall between 0 and 1:

$$\text{AP} = \int_0^1 p(r) \, dr$$

That’s equal to taking the area under the curve. In practice, the integral is closely approximated by a sum over the precisions at every possible threshold value, multiplied by the change in recall:

$$\text{AP} = \sum_{k=1}^{N} P(k) \, \Delta r(k)$$

where N is the total number of images in the collection, P(k) is the precision at a cutoff of k images, and Δr(k) is the change in recall that happened between cutoff k−1 and cutoff k.

In our example, this is (1 * 0.2) + (1 * 0.2) + (0.66 * 0) + (0.75 * 0.2) + (0.6 * 0) + (0.66 * 0.2) + (0.57 * 0) + (0.5 * 0) + (0.44 * 0) + (0.5 * 0.2) = 0.782.

Notice that the points at which the recall doesn’t change don’t contribute to this sum (in the graph, these points are on the vertical sections of the plot, where it’s dropping straight down). This makes sense: since we’re computing the area under the curve, those sections of the curve aren’t adding any area.
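Here’s that sum as a sketch, reusing the hypothetical ranking from above. (The exact value is about 0.783; the 0.782 above comes from using the rounded precisions like 0.66.)

```python
def average_precision(ranked_labels):
    """Non-interpolated AP: sum of P(k) times the change in recall at k."""
    total_positives = sum(ranked_labels)
    tp, prev_recall, ap = 0, 0.0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        recall = tp / float(total_positives)
        ap += (tp / float(k)) * (recall - prev_recall)  # zero where recall is flat
        prev_recall = recall
    return ap

print(average_precision([1, 1, 0, 1, 0, 1, 0, 0, 0, 1]))  # ~0.783
```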

Interpolated average precision

Some authors choose an alternate approximation that is called the interpolated average precision. Often, they still call it average precision. Instead of using P(k), the precision at a retrieval cutoff of k images, the interpolated average precision uses:

$$P_{\text{interp}}(k) = \max_{\tilde{k} \geq k} P(\tilde{k})$$

In other words, instead of using the precision that was actually observed at cutoff k, the interpolated average precision uses the maximum precision observed across all cutoffs with higher recall. The full equation for computing the interpolated average precision is:

$$\text{AP}_{\text{interp}} = \sum_{k=1}^{N} P_{\text{interp}}(k) \, \Delta r(k)$$

Visually, here’s how the interpolated average precision compares to the approximated average precision (to show a more interesting plot, this one isn’t from the earlier example):

The approximated average precision closely hugs the actually observed curve. The interpolated average precision overestimates the precision at many points and produces a higher average precision value than the approximated average precision.

Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at 11 fixed points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}. This is called the 11-point interpolated average precision. Others sample at every k where the recall changes.
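A sketch of the every-k variant, which replaces each precision with the running maximum taken from the bottom of the ranked list upward:

```python
def interpolated_average_precision(ranked_labels):
    """Interpolated AP: use the max precision at any cutoff with recall >= r(k)."""
    total_positives = sum(ranked_labels)
    tp, precisions, recalls = 0, [], []
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        precisions.append(tp / float(k))
        recalls.append(tp / float(total_positives))
    # Walk backwards, carrying the maximum precision seen so far.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```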

Confusion

Some important publications use the interpolated average precision as their metric and still call it average precision. For example, the PASCAL Visual Objects Challenge has used this as their evaluation metric since 2007. I don’t think their justification is strong. They say, “the intention in interpolating the precision/recall curve in this way is to reduce the impact of the ‘wiggles’ in the precision/recall curve”. Regardless, everyone compares against each other on this metric, so within the competition, this is not an issue. However, the rest of us need to be careful when comparing “average precision” values against other published results. Are we using the VOC’s interpolated average precision, while previous work had used the non-interpolated average precision? This would incorrectly show improvement of a new method when compared to the previous work.

Summary

Precision and recall are useful metrics for evaluating the performance of a classifier.

Precision and recall vary with the strictness of your classifier’s threshold.

There are several ways to summarize the precision-recall curve with a single number called average precision; be sure you’re using the same metric as the previous work that you’re comparing with.
