Get Started with Computer Vision by Building a Digit Recognition Model with TensorFlow

Computer vision gives eyes to the computer, allowing it to see and distinguish among various images. It is one of the hottest topics in the deep learning world. Computer vision has many applications, including self-driving cars, robotics, and face recognition.

In this guide, we'll build an end-to-end computer vision model for recognizing hand-written digits using TensorFlow, an excellent library for building machine learning and deep learning models.

Getting Started

Environment

Throughout this guide, we'll be using Google Colab - a free Jupyter notebook environment that runs in the cloud. Colab comes with the common machine learning and deep learning tools pre-installed and provides free access to GPUs and TPUs, making it a convenient environment for running experiments, especially on vision data.

To create a new Colab notebook, head over to Google Colab and click the "New Notebook" button at the bottom of the dialog that opens.

Data

We're going to use the MNIST dataset in this guide. It is a collection of 70,000 hand-written digit images, which is plenty of data for training a small deep learning model.

Loading the data

There are several ways we can load the MNIST dataset, but to keep it simple, we'll use the Keras Datasets API.

Let's import our MNIST dataset into train and test sets.

from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

The load_data() function of the mnist module returns the train and test sets in the form of nested tuples, which we have destructured in the lines above into X_train, y_train, X_test, and y_test.

Exploring the Data

Now that we have our data loaded, let's get familiar with it.

# Check how many examples we have in our train and test sets
print(f"We have {len(X_train)} images in the training set and {len(X_test)} images in the test set.")

Running the above code returns the following output:

We have 60000 images in the training set and 10000 images in the test set.

Look at that! We've got 60,000 training images and 10,000 test images! That's a really good amount of data for our digit recognition model.

Now let's have a look at the shape of our samples.

# Let's check the shape of the first sample in our training set
X_train[0].shape

Running the above code cell, we get:

(28, 28)

This output shows that each of the samples is a 28x28 image.

Visualizing the Images

Let's plot the first image in our dataset and see how it looks:

import matplotlib.pyplot as plt
plt.imshow(X_train[0])

First image in the dataset

Do we need the axes here? Definitely not. Let's make our plot a little nicer.

plt.figure(figsize=(3, 3))
plt.imshow(X_train[0], cmap="gray")
plt.title(y_train[0])
plt.axis(False);

output

This looks a lot better than what we had before.

So far, we have visualized only one image, which is not enough to get familiar with the data. Let's plot a few more randomly picked images.

import random
# randint includes both endpoints, so the upper bound must be the last valid index
random_image = random.randint(0, len(X_train) - 1)

plt.figure(figsize=(3, 3))
plt.imshow(X_train[random_image], cmap="gray")

plt.title(y_train[random_image])
plt.axis(False);

You'll see a random image each time you run this code cell. Run this at least 10 to 20 times so you get more familiar with the images.
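Instead of re-running the cell many times, several random images can be drawn at once. The following is a minimal sketch, not part of the original notebook: the helper name plot_random_grid and its parameters are made up for illustration, and random stand-in arrays are used so that the snippet runs on its own.

```python
import random

import matplotlib.pyplot as plt
import numpy as np


def plot_random_grid(images, labels, rows=3, cols=3, seed=None):
    """Plot rows * cols randomly chosen images with their labels as titles."""
    rng = random.Random(seed)
    # sample() draws distinct indices from 0 .. len(images) - 1
    indices = rng.sample(range(len(images)), rows * cols)

    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows))
    for ax, idx in zip(np.ravel(axes), indices):
        ax.imshow(images[idx], cmap="gray")
        ax.set_title(str(labels[idx]))
        ax.axis("off")
    fig.tight_layout()
    return fig, indices


# Stand-in data so the sketch is self-contained; in the notebook,
# pass X_train and y_train instead.
demo_images = np.random.randint(0, 256, size=(50, 28, 28), dtype=np.uint8)
demo_labels = np.random.randint(0, 10, size=50)
fig, shown = plot_random_grid(demo_images, demo_labels, seed=0)
```

In Colab, calling plot_random_grid(X_train, y_train) should display a 3x3 grid of digits with their labels.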

Preprocessing Our Data

Now that we've visualized enough images, it's time to preprocess our data to be in the right shape for our model.

Let's check the shape of our images once again.

X_train.shape
# Output: (60000, 28, 28)

The output shows that we have 60,000 training images of size 28x28 each.

The Conv2D layer in a convolutional model expects each image in the shape [height, width, color_channels], but our images only have the height and width dimensions so far. Since the images are grayscale, they have a single color channel. Let's reshape our train and test data to add the missing color_channels dimension.

X_train = X_train.reshape(X_train.shape + (1,))
X_test = X_test.reshape(X_test.shape + (1, ))

X_train.shape # (60000, 28, 28, 1)

Neural networks tend to like normalized data and perform better on it. Normalization, in simple terms, means putting the data on a common scale, in our case between 0 and 1. The pixel values are integers from 0 to 255, so dividing by 255 does the job. Let's normalize our train and test images.

X_train = X_train / 255.
X_test = X_test / 255.

The division above produces float64 arrays. Let's cast our training and test sets to float32, the datatype TensorFlow uses by default.

import numpy as np

X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
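The three preprocessing steps above can also be bundled into one function, so that exactly the same transformation is applied to any image batch. This is only a sketch: the function name preprocess is made up for illustration, and a random stand-in array replaces the real data so that the snippet runs on its own.

```python
import numpy as np


def preprocess(images):
    """Add a channel axis, scale pixels to [0, 1], and cast to float32."""
    # (N, 28, 28) -> (N, 28, 28, 1)
    images = np.expand_dims(images, axis=-1)
    # uint8 values in 0..255 -> float32 values in 0..1
    return (images / 255.0).astype(np.float32)


# Stand-in for raw MNIST images (uint8 values between 0 and 255)
demo = np.random.randint(0, 256, size=(5, 28, 28), dtype=np.uint8)
processed = preprocess(demo)
print(processed.shape, processed.dtype)  # (5, 28, 28, 1) float32
```

Note that the function should be applied to the raw images only once; running it on data that is already normalized would divide by 255 a second time.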

Building our Convolutional Model

To build our image recognition model, we'll follow the TinyVGG architecture, as shown in the image below:

tiny-vgg

Let's build our convolutional model using the Keras Sequential API.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(filters=10,
                kernel_size=3, 
                activation="relu", 
                input_shape=(28,  28,  1)),
    layers.Conv2D(10,  3, activation="relu"),
    layers.MaxPool2D(),
    layers.Conv2D(10,  3, activation="relu"),
    layers.Conv2D(10,  3, activation="relu"),
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax")
])

Let's discuss the layers in our model.

The Conv2D layer applies a number of filters (in our case, 10) to the images. The kernel size (3 in our case) determines the dimensions of each filter's kernel (3 is shorthand for (3, 3)). To learn more about filters and kernels, head over to the CNN Explainer website, which is an awesome resource for learning about convolutional neural networks.
The MaxPool2D layer downsamples (i.e. condenses) its input by keeping only the largest value in each 2x2 window. Again, CNN Explainer is a great resource for learning about the pooling layer.
The output Dense layer requires its input to be one-dimensional, which is why we've added the Flatten layer between the second MaxPool2D layer and the output Dense layer. It has 10 units, one per digit, and its softmax activation turns the outputs into class probabilities.

Let's now check the summary of our model.

model.summary()

model summary

Do you see the pattern in the output shapes of the layers? Let's get a visual summary of our model to see what's going on.

model summary

The diagram shows that each time an image passes through a Conv2D layer, its width and height decrease by 2 (a 3x3 kernel without padding cannot be centered on the border pixels), and each time it passes through a pooling layer, they are halved.
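The pattern in the output shapes can be reproduced with a little arithmetic. The sketch below assumes the Keras defaults used in our model (no padding and a stride of 1 for Conv2D, a 2x2 window for MaxPool2D); the helper names are made up for illustration.

```python
def conv_output(size, kernel=3):
    """Side length after a convolution with no padding and stride 1."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Side length after max pooling with a pool x pool window."""
    return size // pool


size = 28
trace = [size]
# Same layer order as in the model: two conv layers, a pool, and again
for layer in ["conv", "conv", "pool", "conv", "conv", "pool"]:
    size = conv_output(size) if layer == "conv" else pool_output(size)
    trace.append(size)

# The last Conv2D layer has 10 filters, so Flatten sees 4 * 4 * 10 values
flattened = size * size * 10

print(trace)      # [28, 26, 24, 12, 10, 8, 4]
print(flattened)  # 160
```

So the Flatten layer hands a vector of 160 values to the output Dense layer.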

Let's move on to compiling our model.

Compiling the model

To compile our model, we'll use sparse_categorical_crossentropy as the loss function because our labels are integer-encoded (0 to 9) rather than one-hot encoded, Adam as the optimizer because it is a solid default for many problems, and accuracy as the evaluation metric.

model.compile(loss="sparse_categorical_crossentropy", 
            optimizer=tf.keras.optimizers.Adam(),
            metrics=["accuracy"])
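To get a feel for what this loss measures, here is a simplified NumPy sketch of sparse categorical cross-entropy: for each sample, take the negative log of the probability the model assigned to the true class, then average. It is an illustration rather than the Keras implementation (which, among other things, clips probabilities for numerical stability), and the probabilities below are made up.

```python
import numpy as np


def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=np.float64)
    # For each row, pick the probability at the index of the true label
    picked = y_prob[np.arange(len(y_true)), y_true]
    return float(np.mean(-np.log(picked)))


# Two made-up predictions over three classes (not real model output)
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
labels = [0, 2]  # integer labels, like the ones in y_train
loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))  # 0.2899
```

The more probability the model puts on the correct digit, the closer the loss gets to 0.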

Finally, it's time to train our model! Are you excited? Yes? Let's fit it on the training data over 10 epochs.

model.fit(X_train, y_train, epochs=10)

Wait until the model trains, then run the following cell to see how well our model performs on test data.

model.evaluate(X_test, y_test)

You'll see something like this:

313/313 [==============================] - 1s 3ms/step - loss: 0.0348 - accuracy: 0.9895
[0.03484882786870003, 0.9894999861717224]

Look at that! Our model has ~99% accuracy on the test data! That's huge!
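The accuracy figure comes from comparing the most probable class of each prediction with the true label. The sketch below shows that step on made-up softmax outputs; in the notebook, the probabilities would come from something like model.predict(X_test) and the labels from y_test.

```python
import numpy as np

# Made-up softmax outputs for three images over the ten digit classes
probs = np.full((3, 10), 0.01)
probs[[0, 1, 2], [7, 2, 1]] = 0.91  # each row now sums to 1
true_labels = np.array([7, 2, 4])   # made-up labels; the third one disagrees

# The predicted digit is the index of the highest probability in each row
predicted = np.argmax(probs, axis=1)
accuracy = float(np.mean(predicted == true_labels))

print(predicted)  # [7 2 1]
print(accuracy)   # two of the three predictions are correct
```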

As a final step, let's save our model so that we can use it later in any application.

model.save("digit-model.h5")

Make sure to download the saved model to your local machine as the files in Colab get deleted once the runtime is closed.

Congratulations on building your first computer vision model, the digit recognizer model. I'll see you in the next guide, where we'll build a web app for this model. Check out the demo here: digit-recognizer-tensorflow.herokuapp.com

Make sure to follow me at @TalhaQuddoosPK to stay tuned.

Talha Quddoos