Over the past two weeks, I’ve been teaching myself TensorFlow, Google’s open source library for deep neural networks (actually, for graph computation in general).
It was so easy to get started with TensorFlow that I was fooled into thinking I’d be writing a character-based recurrent-neural-network language model in a couple of days.
The TensorFlow website gives a few learning paths: Get Started, Programmer’s Guide, and Tutorials. These were written for various versions of the API and they don’t use consistent idioms or up-to-date functions. Regardless, I found these useful to go through to give me a sense of what it would be like to use TensorFlow.
After going through a few tutorials, I made the following learning plan and now feel comfortable defining and training non-distributed models in TensorFlow:
- Create a simple single-layer network to try to learn a function from random data. This teaches how graph definition is separate from running the computation in a session, and how to feed data into placeholder input variables.
- Output some summary data and use TensorBoard to visualize that the loss doesn’t decrease.
- Create some synthetic data for a simple function. I used y = x < x. This just lets you confirm that the loss actually decreases during training. You can also visualize the weights as they change during training.
- Replace the synthetic data with data loaded from a file using an input queue. Input queues were the most confusing part of TensorFlow so far. Here is a minimal example of an input queue that reads records from a file and creates shuffled batches. One thing made this more confusing than necessary: TensorBoard was telling me that the shuffle_batch queue never filled up, but that was only because my simple model was being evaluated so quickly during the optimization step that it drained the queue as fast as the readers could fill it. Once I increased the complexity of the model by adding a few more fully-connected layers, the optimization step took long enough for the queue to actually be helpful.
The MonitoredTrainingSession is very helpful. It initializes variables, watches for stopping criteria, saves checkpoints and summaries, and restarts from checkpoint files if training gets interrupted.
My first real TensorFlow model was a char-rnn (used to model text by predicting the next character based on the previous sequence of characters). The part of the TensorFlow API that deals with recurrent neural networks has changed a lot over the past year, so various examples you’ll find online present different ways of doing things.
- TensorFlow’s own tutorial does not use tf.nn.dynamic_rnn to create the recurrent neural network based on a prototype cell. Instead, it shows an example that explicitly codes the loop over timesteps and explicitly handles the recurrent state between calls to the prototype cell.
- This blog post by Denny Britz is a good explanation of how to use dynamic_rnn to avoid having to do all of that by hand. It mentions a helpful function: sequence_loss_by_example, but that appears to have been superseded by sequence_loss.
- This blog post by Danijar Hafner is a second example showing how to use dynamic_rnn. It also shows how to flatten the outputs from the recurrent cell across timesteps so that you can easily apply the weights used for the output projection. However, this example doesn’t take advantage of the sequence_loss function and instead computes the sequence labelling cost by doing a partial summation and then averaging.
My main point is: don’t assume you’ve misunderstood something when you can’t reconcile two different examples that claim to demonstrate the same thing. It’s likely just an API change.
My own example is here. It’s not perfect either. I’m not passing state from the end of one batch to the beginning of the next batch, so this isn’t standard truncated back-propagation through time. But, the dataset I’m learning on doesn’t appear to have dependencies that are longer than the length that I chose for input sequences. R2RT discusses the distinctions between a couple of different styles of back-propagation through time. The approach I ended up implementing is almost what R2RT is calling “TensorFlow style”.
Further, I wasn’t thinking ahead to how I would load the trained weights for sampling when I wrote the training script. Instead, I redefined parts of the model structure in my sampling script. This is not good. A better approach is to define the graph structure in a class (like in this example). This lets you use the exact same model during evaluation/sampling as was used during training, which is important for matching the saved weights to their variables based on their keys (names).
If you’ve already been using TensorFlow for some time, I’d appreciate any feedback you have for me on my early TensorFlow code that I’ve posted on GitHub. Are there TensorFlow design patterns I’m missing, or helper functions I don’t know about? Let me know!