Logging and Monitoring Basics with tf.contrib.learn

When training a model, it’s often valuable to track and evaluate progress in real time. In this tutorial, you’ll learn how to use TensorFlow’s logging capabilities and the Monitor API to audit the in-progress training of a neural network classifier for categorizing irises. This tutorial builds on the code developed in tf.contrib.learn Quickstart so if you haven't yet completed that tutorial, you may want to explore it first, especially if you're looking for an intro/refresher on tf.contrib.learn basics.

Setup

For this tutorial, you'll be building upon the following code from tf.contrib.learn Quickstart:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np

# Data sets
IRIS_TRAINING = "iris_training.csv"
IRIS_TEST = "iris_test.csv"

# Load datasets.
training_set = tf.contrib.learn.datasets.base.load_csv(filename=IRIS_TRAINING,
                                                       target_dtype=np.int)
test_set = tf.contrib.learn.datasets.base.load_csv(filename=IRIS_TEST,
                                                   target_dtype=np.int)

# Specify that all features have real-value data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir="/tmp/iris_model")

# Fit model.
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000)

# Evaluate accuracy.
accuracy_score = classifier.evaluate(x=test_set.data,
                                     y=test_set.target)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))

# Classify two new flower samples.
new_samples = np.array(
    [[6.4, 3.2, 4.5, 1.5], [5.8, 3.1, 5.0, 1.7]], dtype=float)
y = classifier.predict(new_samples)
print('Predictions: {}'.format(str(y)))

Copy the above code into a file, and download the corresponding training and test data sets to the same directory.

In the following sections, you'll progressively make updates to the above code to add logging and monitoring capabilities. Final code incorporating all updates is available for download here.

Overview

The tf.contrib.learn Quickstart tutorial walked through how to implement a neural net classifier to categorize Iris examples into one of three species.

But when the code from this tutorial is run, the output contains no logging tracking how model training is progressing—only the results of the print statements that were included:

Accuracy: 0.933333
Predictions: [1 2]

Without any logging, model training feels like a bit of a black box; you can't see what's happening as TensorFlow steps through gradient descent, get a sense of whether the model is converging appropriately, or audit to determine whether early stopping might be appropriate.

One way to address this problem would be to split model training into multiple fit calls with smaller numbers of steps in order to evaluate accuracy more progressively. However, this is not recommended practice, as it greatly slows down model training. Fortunately, tf.contrib.learn offers another solution: a Monitor APIdesigned to help you log metrics and evaluate your model while training is in progress. In the following sections, you'll learn how to enable logging in TensorFlow, set up a ValidationMonitor to do streaming evaluations, and visualize your metrics using TensorBoard.

Enabling Logging with TensorFlow

TensorFlow uses five different levels for log messages. In order of ascending severity, they are DEBUG, INFO, WARN, ERROR, and FATAL. When you configure logging at any of these levels, TensorFlow will output all log messages corresponding to that level and all levels of higher severity. For example, if you set a logging level of ERROR, you'll get log output containing ERROR and FATAL messages, and if you set a level of DEBUG, you'll get log messages from all five levels.

By default, TensorFlow is configured at a logging level of WARN, but when tracking model training, you'll want to adjust the level to INFO, which will provide additional feedback as fit operations are in progress.

Add the following line to the beginning of your code (right after your imports):

tf.logging.set_verbosity(tf.logging.INFO)

Now when you run the code, you'll see additional log output like the following:

INFO:tensorflow:Training steps [0,200)
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Step 1: loss_1:0 = 1.48073
INFO:tensorflow:training step 100, loss = 0.19847 (0.001 sec/batch).
INFO:tensorflow:Step 101: loss_1:0 = 0.192693
INFO:tensorflow:Step 200: loss_1:0 = 0.0958682
INFO:tensorflow:training step 200, loss = 0.09587 (0.003 sec/batch).

With INFO-level logging, tf.contrib.learn automatically outputs training-loss metrics to stderr after every 100 steps.

Configuring a ValidationMonitor for Streaming Evaluation

Logging training loss is helpful to get a sense whether your model is converging, but what if you want further insight into what's happening during training? tf.contrib.learn provides several high-level Monitors you can attach to your fit operations to further track metrics and/or debug lower-level TensorFlow operations during model training, including:

Monitor Description
CaptureVariable Saves a specified variable's values into a collection

: : at every n steps of training : | PrintTensor | Logs a specified tensor's values at every n steps | : : of training : | SummarySaver | Saves [Summary] | : : (../../apidocs/python/train.md#summary-operations) : : : [protocol buffers] : : : (https\://developers.google.com/protocol-buffers/) : : : for a given tensor using a [SummaryWriter] : : : (../../api_docs/python/train.md#SummaryWriter) at : : : every _n steps of training : | ValidationMonitor | Logs a specified set of evaluation metrics at every | : : n steps of training, and, if desired, implements : : : early stopping under certain conditions :

Evaluating Every N Steps

For the Iris neural network classifier, while logging training loss, you might also want to simultaneously evaluate against test data to see how well the model is generalizing. You can accomplish this by configuring a ValidationMonitor with the test data (test_set.data and test_set.target), and setting how often to evaluate with every_n_steps. The default value of every_n_steps is 100; here, set every_n_steps to 50 to evaluate after every 50 steps of model training:

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50)

Place this code right before the line instantiating the classifier.

ValidationMonitors rely on saved checkpoints to perform evaluation operations, so you'll want to modify instantiation of the classifier to add a RunConfig that includes save_checkpoints_secs, which specifies how many seconds should elapse between checkpoint saves during training. Because the Iris data set is quite small, and thus trains quickly, it makes sense to set save_checkpoints_secs to 1 (saving a checkpoint every second) to ensure a sufficient number of checkpoints:

classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir="/tmp/iris_model",
                                            config=tf.contrib.learn.RunConfig(
                                                save_checkpoints_secs=1))

NOTE: The model_dir parameter specifies an explicit directory (/tmp/iris_model) for model data to be stored; this directory path will be easier to reference later on than an autogenerated one. Each time you run the code, any existing data in /tmp/iris_model will be loaded, and model training will continue where it left off in the last run (e.g., running the script twice in succession will execute 4000 steps during training—2000 during each fit operation). To start over model training from scratch, delete /tmp/iris_model before running the code.

Finally, to attach your validation_monitor, update the fit call to include a monitors param, which takes a list of all monitors to run during model training:

classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000,
               monitors=[validation_monitor])

Now, when you rerun the code, you should see validation metrics in your log output, e.g.:

INFO:tensorflow:Validation (step 50): loss = 1.71139, global_step = 0, accuracy = 0.266667
...
INFO:tensorflow:Validation (step 300): loss = 0.0714158, global_step = 268, accuracy = 0.966667
...
INFO:tensorflow:Validation (step 1750): loss = 0.0574449, global_step = 1729, accuracy = 0.966667

Customizing the Evaluation Metrics

By default, if no evaluation metrics are specified, ValidationMonitor will log both loss and accuracy, but you can customize the list of metrics that will be run every 50 steps. The tf.contrib.metrics module provides a variety of additional metric functions for classification models that you can use out of the box with ValidationMonitor, including streaming_precision and streaming_recall. To specify the exact metrics you'd like to run in each evaluation pass, add a metrics param to the ValidationMonitor constructor. metrics takes a dict of key/value pairs, where each key is the name you'd like logged for the metric, and the corresponding value is the function that calculates it.

Revise the ValidationMonitor constructor as follows to add logging for precision and recall, in addition to accuracy (loss is always logged, and doesn't need to be explicity specified):

validation_metrics = {"accuracy": tf.contrib.metrics.streaming_accuracy,
                      "precision": tf.contrib.metrics.streaming_precision,
                      "recall": tf.contrib.metrics.streaming_recall}
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics)

Rerun the code, and you should see precision and recall included in your log output, e.g.:

INFO:tensorflow:Validation (step 50): recall = 0.0, accuracy = 0.266667, global_step = 0, precision = 0.0, loss = 1.71139
...
INFO:tensorflow:Validation (step 150): recall = 1.0, accuracy = 0.966667, global_step = 132, precision = 1.0, loss = 0.157797
...
INFO:tensorflow:Validation (step 1600): recall = 1.0, accuracy = 0.966667, global_step = 1589, precision = 1.0, loss = 0.055873

Early Stopping with ValidationMonitor

Note that in the above log output, by step 150, the model has already achieved precision and recall rates of 1.0. This raises the question as to whether model training could benefit from early stopping.

In addition to logging eval metrics, ValidationMonitors make it easy to implement early stopping when specified conditions are met, via three params:

Param Description
early_stopping_metric Metric that triggers early stopping

: : (e.g., loss or accuracy) under conditions : : : specified in early_stopping_rounds and : : : early_stopping_metric_minimize. Default : : : is "loss". : | early_stopping_metric_minimize | True if desired model behavior is to | : : minimize the value of : : : early_stopping_metric; False if : : : desired model behavior is to maximize the : : : value of early_stopping_metric. Default : : : is True. : | early_stopping_rounds | Sets a number of steps during which if | : : the early_stopping_metric does not : : : decrease (if : : : early_stopping_metric_minimize is : : : True) or increase (if : : : early_stopping_metric_minimize is : : : False), training will be stopped. Default : : : is None, which means early stopping : : : will never occur. :

The following revision to the ValidationMonitor constructor specifies that if loss (early_stopping_metric="loss") does not decrease (early_stopping_metric_minimize=True) over a period of 200 steps (early_stopping_rounds=200), model training will stop immediately at that point, and not complete the full 2000 steps specified in fit:

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=200)

Rerun the code to see if model training stops early:

...
INFO:tensorflow:Validation (step 1450): recall = 1.0, accuracy = 0.966667, global_step = 1431, precision = 1.0, loss = 0.0550445
INFO:tensorflow:Stopping. Best step: 1150 with loss = 0.0506100878119.

Indeed, here training stops at step 1450, indicating that for the past 200 steps, loss did not decrease, and that overall, step 1150 produced the smallest loss value against the test data set. This suggests that additional calibration of hyperparameters by decreasing the step count might further improve the model.

Visualizing Log Data with TensorBoard

Reading through the log produced by ValidationMonitor provides plenty of raw data on model performance during training, but it may also be helpful to see visualizations of this data to get further insight into trends—for example, how accuracy is changing over step count. You can use TensorBoard (a separate program packaged with TensorFlow) to plot graphs like this by setting the logdir command-line argument to the directory where you saved your model training data (here, /tmp/iris_model). Run the following on your command line:

$ tensorboard --logdir=/tmp/iris_model/
Starting TensorBoard 22 on port 6006
(You can navigate to http://0.0.0.0:6006)

Then load the provided URL (here, http://0.0.0.0:6006) in your browser. If you click on the accuracy field, you'll see an image like the following, which shows accuracy plotted against step count:

Accuracy over step count in TensorBoard

For more on using TensorBoard, see TensorBoard: Visualizing Learning and TensorBoard: Graph Visualization.

results matching ""

    No results matching ""