The keras package in R can be used to classify images using convolutional neural networks (CNNs). These are machine learning models having successive layers of representation of image or other spatial data. The "deep" in deep learning refers to this layering (deeper = more layers).
It is a good idea to refresh your memory on how CNNs work. I can recommend a couple of YouTube videos for this: 3Blue1Brown and Brandon Rohrer.
Instructions below use the MNIST data set, a database of images of handwritten single digits 0-9. I include an example of a CNN to identify the digit from an image.
The R keras package is an interface to the Python packages Keras and TensorFlow. To run, you can either install all the packages on your own computer, or use Google Colab for free in your browser.
You’ll need a Google account (e.g., the one you use for GMail or Google Calendar). Open a browser and go to https://colab.google. Click “New Notebook”, which starts a new Jupyter notebook.
Free notebooks can run for at most 12 hours, but this depends on availability and usage. Access to GPUs is more limited than CPUs. See the FAQ for more information.
If you use Colab, then you need to reinstall all the packages you need every time you reconnect – they are not saved for longer than each session.
Expose the header by clicking the down-arrow button at the far top right corner of the page (“Toggle header visibility”).
You can name your notebook at the top left and save it in your Google Drive.
From the menu, select “Runtime” and then “Change Runtime Type”. Select “R” under Runtime Type. Select “CPU” for now and save. The GPUs will be faster but there might be more competition for access. You can go back and change the Runtime at any time, but then you’re starting again.
Click "+Code" to create a new command cell. To enter an R command, type the command in the cell and click the run button on the left. E.g.,
print("R is good")
## [1] "R is good"
Install the keras package (and the packages it relies upon). Also install the imager package, so you can see the images being processed. Unfortunately, this step is very slow, maybe 10 minutes. Be careful not to abandon the notebook for too long while you wait or it will time out and you will need to begin again.
install.packages("keras")
install.packages("imager")
You are set.
If instead you install everything on your own computer, the installation is required only once. The following commands worked for me on a Mac M2 running macOS 14.1.2. Do they work for you?
install.packages("keras")
library(keras)
library(reticulate)
install_keras()
install.packages("imager")
The MNIST data set is included with the keras package. It consists of a large number of grayscale images of handwritten digits 0 to 9.
library(keras)
library(imager)
Read the data into an R object. Then pull out the training and testing data. The training data will be used to fit the CNN model, and the test data set will be used to evaluate the accuracy of the model.
The objects x_train and x_test are arrays containing the images, whereas y_train and y_test are vectors that indicate the known (correct) digit corresponding to each image.
# Load the data (will take a minute)
mnist <- dataset_mnist()
# Extract image arrays
x_train <- mnist$train$x
x_test <- mnist$test$x
# Extract classifications
y_train <- mnist$train$y
y_test <- mnist$test$y
x_train and x_test are 3-dimensional arrays. Each image is represented as a 28 x 28 matrix of numbers (a matrix is just a 2D array), stored in the second and third dimensions of the array.
To see the dimensions of the array, enter
dim(x_train)
## [1] 60000 28 28
dim(x_test)
## [1] 10000 28 28
The output means that x_train contains 60,000 images, each 28 x 28 pixels. x_test contains 10,000 images, also 28 x 28 pixels. The first dimension of each array indexes the images.
A grayscale image is typically represented by pixels containing numbers from 0 to 255. A black pixel is 0, whereas a white pixel has the value 255. Numbers in between are shades of gray.
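For instance, here is a minimal sketch of a tiny 2 x 2 grayscale "image" (purely illustrative; it uses the imager package loaded above):
# 0 = black, 255 = white, values in between are gray
tiny <- matrix(c(0, 85, 170, 255), nrow = 2)
plot(as.cimg(tiny), axes = FALSE)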
To see how the first image in the array is represented, type the following. If your output window was wide enough to show all 28 columns, you would see that the non-zero numbers seem to trace a handwritten digit “5”.
x_train[1, , ]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 3
## [7,] 0 0 0 0 0 0 0 0 30 36 94 154 170
## [8,] 0 0 0 0 0 0 0 49 238 253 253 253 253
## [9,] 0 0 0 0 0 0 0 18 219 253 253 253 253
## [10,] 0 0 0 0 0 0 0 0 80 156 107 253 253
## [11,] 0 0 0 0 0 0 0 0 0 14 1 154 253
## [12,] 0 0 0 0 0 0 0 0 0 0 0 139 253
## [13,] 0 0 0 0 0 0 0 0 0 0 0 11 190
## [14,] 0 0 0 0 0 0 0 0 0 0 0 0 35
## [15,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [16,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [17,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [18,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [19,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [20,] 0 0 0 0 0 0 0 0 0 0 0 0 39
## [21,] 0 0 0 0 0 0 0 0 0 0 24 114 221
## [22,] 0 0 0 0 0 0 0 0 23 66 213 253 253
## [23,] 0 0 0 0 0 0 18 171 219 253 253 253 253
## [24,] 0 0 0 0 55 172 226 253 253 253 253 244 133
## [25,] 0 0 0 0 136 253 253 253 212 135 132 16 0
## [26,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [27,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [28,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 18 18 18 126 136 175 26 166 255 247 127 0
## [7,] 253 253 253 253 253 225 172 253 242 195 64 0
## [8,] 253 253 253 253 251 93 82 82 56 39 0 0
## [9,] 253 198 182 247 241 0 0 0 0 0 0 0
## [10,] 205 11 0 43 154 0 0 0 0 0 0 0
## [11,] 90 0 0 0 0 0 0 0 0 0 0 0
## [12,] 190 2 0 0 0 0 0 0 0 0 0 0
## [13,] 253 70 0 0 0 0 0 0 0 0 0 0
## [14,] 241 225 160 108 1 0 0 0 0 0 0 0
## [15,] 81 240 253 253 119 25 0 0 0 0 0 0
## [16,] 0 45 186 253 253 150 27 0 0 0 0 0
## [17,] 0 0 16 93 252 253 187 0 0 0 0 0
## [18,] 0 0 0 0 249 253 249 64 0 0 0 0
## [19,] 0 46 130 183 253 253 207 2 0 0 0 0
## [20,] 148 229 253 253 253 250 182 0 0 0 0 0
## [21,] 253 253 253 253 201 78 0 0 0 0 0 0
## [22,] 253 253 198 81 2 0 0 0 0 0 0 0
## [23,] 195 80 9 0 0 0 0 0 0 0 0 0
## [24,] 11 0 0 0 0 0 0 0 0 0 0 0
## [25,] 0 0 0 0 0 0 0 0 0 0 0 0
## [26,] 0 0 0 0 0 0 0 0 0 0 0 0
## [27,] 0 0 0 0 0 0 0 0 0 0 0 0
## [28,] 0 0 0 0 0 0 0 0 0 0 0 0
## [,26] [,27] [,28]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
## [5,] 0 0 0
## [6,] 0 0 0
## [7,] 0 0 0
## [8,] 0 0 0
## [9,] 0 0 0
## [10,] 0 0 0
## [11,] 0 0 0
## [12,] 0 0 0
## [13,] 0 0 0
## [14,] 0 0 0
## [15,] 0 0 0
## [16,] 0 0 0
## [17,] 0 0 0
## [18,] 0 0 0
## [19,] 0 0 0
## [20,] 0 0 0
## [21,] 0 0 0
## [22,] 0 0 0
## [23,] 0 0 0
## [24,] 0 0 0
## [25,] 0 0 0
## [26,] 0 0 0
## [27,] 0 0 0
## [28,] 0 0 0
The vectors y_train and y_test contain the known digit corresponding to each image in the training and test data sets. For example, here are the correct digits corresponding to each of the first few images in the training data set.
head(y_train)
## [1] 5 0 4 1 9 2
To view an image with imager, enter the following. The image is a "5" but it is transposed, a behavior of the imager plot() function.
plot(as.cimg(x_train[1, , ]), axes = FALSE)
The matrix must therefore be transposed with t() to plot correctly.
x1 <- t(x_train[1, , ])
plot(as.cimg(x1), axes = FALSE)
We can plot several images at once by binding the matrices.
x5 <- rbind(t(x_train[1, , ]), t(x_train[2, , ]),
            t(x_train[3, , ]), t(x_train[4, , ]), t(x_train[5, , ]))
plot(as.cimg(x5), axes = FALSE)
Building a CNN using R's keras package requires that images be represented with three dimensions, whereas our arrays have the images coded as 28 x 28 pixel 2D matrices. We need to add a third dimension, which will be a 1 for grayscale images.
x_train_reshape <- array_reshape(x_train, c(dim(x_train), 1))
dim(x_train_reshape)
## [1] 60000 28 28 1
x_test_reshape <- array_reshape(x_test, c(dim(x_test), 1))
dim(x_test_reshape)
## [1] 10000 28 28 1
The reason is that color images are represented as 3D arrays. An 8-bit RGB color image has 3 channels (R, G, and B), and so an RGB color image would be represented by three matrices of numbers, one matrix for each channel. For example, an RGB color image of 28 x 28 pixels would be stored in an array with dimensions c(28, 28, 3). Even though a grayscale image has just one channel, for consistency the CNN expects it to be represented with three dimensions, with the last dimension a 1 (rather than a 3 as for an RGB color image).
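For example, a hypothetical RGB image with random pixel values could be constructed and inspected like this (purely illustrative):
# Three 28 x 28 channel matrices stacked in a 3D array
rgb_image <- array(runif(28 * 28 * 3), dim = c(28, 28, 3))
dim(rgb_image)         # 28 28 3
dim(rgb_image[, , 1])  # one channel is a 28 x 28 matrix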
For the CNN, the grayscale numbers, which range from 0 (black) to 255 (white), need to be rescaled to range from 0 to 1.
x_train_rescale <- x_train_reshape / 255
range(x_train_rescale)
## [1] 0 1
x_test_rescale <- x_test_reshape / 255
range(x_test_rescale)
## [1] 0 1
For the CNN, the classification vectors containing the true digits corresponding to each image must be converted to binary class matrices. The number of response classes is 10, i.e., all the integers from 0 to 9. A value of “0” is represented by the row vector of indicators c(1,0,0,0,0,0,0,0,0,0), i.e., a 1 to indicate the first class and 0 otherwise. The value “1” is indicated by c(0,1,0,0,0,0,0,0,0,0), the number “2” is c(0,0,1,0,0,0,0,0,0,0), and so on.
This is accomplished in R as follows.
# Number of categories
num_classes <- length(unique(y_train))
num_classes
## [1] 10
# Convert
y_train_matrix <- to_categorical(y_train, num_classes)
head(y_train)
## [1] 5 0 4 1 9 2
head(y_train_matrix)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 1 0 0 0 0
## [2,] 1 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 1 0 0 0 0 0
## [4,] 0 1 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 1
## [6,] 0 0 1 0 0 0 0 0 0 0
y_test_matrix <- to_categorical(y_test, num_classes)
head(y_test)
## [1] 7 2 1 0 4 1
head(y_test_matrix)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 1 0 0
## [2,] 0 0 1 0 0 0 0 0 0 0
## [3,] 0 1 0 0 0 0 0 0 0 0
## [4,] 1 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0
## [6,] 0 1 0 0 0 0 0 0 0 0
y_train_matrix has 60,000 rows, one for each of the 60,000 images in the training data set. Similarly, y_test_matrix has 10,000 rows, one for each image in the test data set.
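A useful way to see what to_categorical is doing: the binary class matrix is just rows of the 10 x 10 identity matrix, indexed by digit. A base-R sketch:
# Row y + 1 of the identity matrix is the indicator vector for digit y
manual <- diag(num_classes)[y_train + 1, ]
all(manual == y_train_matrix)  # should be TRUE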
We can think of the images of handwritten numbers as the predictors (x variables) and the known digit categories corresponding to each image (0-9) as the response (y) variable in a model. We will build a CNN model that predicts the digit from the images with a low error rate.
At this point the number of options is enormous. To get started I
looked at other analyses of this same data set online. I also asked
ChatGPT for help constructing a convolutional neural net in the R
keras
package, describing the nature of the data. The
following model was not overly complicated and seemed to work reasonably
well.
The first line of the code initializes a sequential model (my_model). Sequential models in keras are linear stacks of layers. We create a model by successively adding layers to this initial model object, as in the steps below. Each function call takes my_model as the first argument and returns the modified model, which is then reassigned to my_model.
# Initialize
my_model <- keras_model_sequential()
# Add layers
my_model <- layer_conv_2d(my_model, filters = 32, kernel_size = c(3, 3),
                          activation = 'relu', input_shape = c(28, 28, 1))
my_model <- layer_max_pooling_2d(my_model, pool_size = c(2, 2))
my_model <- layer_conv_2d(my_model, filters = 32, kernel_size = c(3, 3),
                          activation = 'relu')
my_model <- layer_max_pooling_2d(my_model, pool_size = c(2, 2))
my_model <- layer_flatten(my_model)
my_model <- layer_dense(my_model, units = 32, activation = 'relu')
my_model <- layer_dense(my_model, units = num_classes, activation = 'softmax')
If you've watched the videos, you'll have an idea of what these layers mean.
The first layer is a convolutional layer. The parameters specify that this layer will have 32 filters, each of size 3x3 pixels. The activation function used is 'relu' (Rectified Linear Unit). The input_shape indicates the dimensions of the input images.
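The 'relu' activation simply replaces negative values with zero, applied element-wise. A one-line base-R sketch (relu here is an illustrative helper, not a keras function):
relu <- function(x) pmax(x, 0)
relu(c(-2, -0.5, 0, 1.5, 3))  # returns 0 0 0 1.5 3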
The second layer adds a max pooling layer to the model, which reduces the spatial dimensions (height and width) of the input volume for the next convolutional layer. The pool_size of c(2, 2) means that the max pooling operation is applied over 2x2 regions, reducing the size of the feature maps by a factor of 2.
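To make the pooling operation concrete, here is a base-R sketch of 2x2 max pooling applied to a toy 4 x 4 feature map (illustrative only; keras does this internally):
m <- matrix(1:16, nrow = 4)
pool <- matrix(NA, 2, 2)
for (i in 1:2) for (j in 1:2) {
  pool[i, j] <- max(m[(2*i - 1):(2*i), (2*j - 1):(2*j)])
}
pool  # each entry is the maximum of one 2x2 block of m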
The third layer is another 2D convolutional layer with the same configuration as the first one: 32 filters of size 3x3 with ‘relu’ activation.
The fourth layer is another max pooling layer similar to the previous max pooling layer. It further reduces the spatial dimensions of the feature maps, reducing the number of parameters needed in subsequent layers.
The fifth layer is a flattening layer. After extracting features through convolutional and pooling layers, this layer flattens the 3D feature maps into 1D, allowing them to be fed into the dense layers that follow.
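In base-R terms, flattening just unrolls an array into a vector. After the second pooling layer the feature maps are 5 x 5 x 32, hence 800 values, as the model summary below confirms:
length(as.vector(array(0, dim = c(5, 5, 32))))  # 800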
The sixth layer is the first dense (fully connected) layer. It receives the flattened input and applies a ‘relu’ activation function across 32 units (or neurons). This layer enables the network to learn non-linear combinations of the high-level features extracted by the convolutional layers.
The seventh layer outputs the probability distribution over the target classes, with the number of units equal to the number of classes (num_classes). The ‘softmax’ activation function is used to convert the outputs to probabilities, ensuring that they sum up to 1. This layer essentially classifies the input image into one of the num_classes categories based on the features learned by the network.
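The softmax function itself is simple. A base-R sketch (illustrative only):
# softmax: exp(z) / sum(exp(z)) converts raw scores to probabilities
softmax <- function(z) exp(z) / sum(exp(z))
p <- softmax(c(2.0, 1.0, 0.1))
round(p, 3)  # 0.659 0.242 0.099
sum(p)       # 1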
Each step builds upon the previous to progressively construct a model that can classify images into one of several categories based on their visual content.
Here is a compact summary of the model. The number of parameters being fitted seems ridiculously high, on the same order as the number of data points. Whether this is a cause for concern depends mainly on whether the model ends up overfitting the data, which is evaluated by its accuracy when applied to the test data after training.
summary(my_model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## conv2d (Conv2D) (None, 26, 26, 32) 320
## max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
## conv2d_1 (Conv2D) (None, 11, 11, 32) 9248
## max_pooling2d_1 (MaxPooling2D) (None, 5, 5, 32) 0
## flatten (Flatten) (None, 800) 0
## dense (Dense) (None, 32) 25632
## dense_1 (Dense) (None, 10) 330
## ================================================================================
## Total params: 35530 (138.79 KB)
## Trainable params: 35530 (138.79 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
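As a check on your understanding, you can reproduce the parameter counts in the summary by hand. A convolutional layer has (kernel height x kernel width x input channels + 1 bias) x filters parameters, and a dense layer has (inputs + 1 bias) x units:
(3 * 3 * 1 + 1) * 32      # conv2d: 320
(3 * 3 * 32 + 1) * 32     # conv2d_1: 9248
(800 + 1) * 32            # dense: 25632
(32 + 1) * 10             # dense_1: 330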
The model needs to be configured for training. Here we specify the loss function (loss_categorical_crossentropy, suitable for a response variable with multiple classes), the optimizer to be used (RMSprop), and the metric used to evaluate the model during training and testing (accuracy of classification).
my_model <- compile(my_model,
                    loss = loss_categorical_crossentropy,
                    optimizer = 'rmsprop',
                    metrics = c('accuracy'))
Finally, fit the model to the data. epochs sets the number of epochs to train the model. An epoch refers to one complete pass of the entire training dataset through the CNN. More epochs result in better fitting of the training data set, but too many can lead to overfitting. The samples are passed through the model in batches of size batch_size.
validation_split, if included, sets aside the indicated proportion of samples and uses it to evaluate the metrics of fit at each epoch. This is a useful way to track whether the model is overfitting the data. Overfitting will be detected as a decline in the accuracy of model predictions applied to the validation set even as the accuracy of the model applied to the training data improves with increasing numbers of epochs.
Your results will differ a little from those shown here because there are multiple sources of randomness in the training process (initialization of weights, the stochastic gradient descent process, the partitioning of training and validation sets, etc.).
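If you want runs that are closer to repeatable, the tensorflow R package provides a helper that sets the relevant random seeds in one call (a sketch; exact reproducibility is still not guaranteed on all hardware):
library(tensorflow)
set_random_seed(42)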
model_fit <- fit(my_model,
                 x_train_rescale, y_train_matrix,
                 batch_size = 128,
                 epochs = 8,
                 validation_split = 0.2)
## Epoch 1/8
## 375/375 - 9s - loss: 0.3296 - accuracy: 0.9005 - val_loss: 0.1327 - val_accuracy: 0.9592 - 9s/epoch - 25ms/step
## Epoch 2/8
## 375/375 - 6s - loss: 0.1120 - accuracy: 0.9661 - val_loss: 0.0877 - val_accuracy: 0.9736 - 6s/epoch - 17ms/step
## Epoch 3/8
## 375/375 - 6s - loss: 0.0991 - accuracy: 0.9725 - val_loss: 0.1147 - val_accuracy: 0.9719 - 6s/epoch - 16ms/step
## Epoch 4/8
## 375/375 - 6s - loss: 0.1036 - accuracy: 0.9743 - val_loss: 0.1054 - val_accuracy: 0.9759 - 6s/epoch - 17ms/step
## Epoch 5/8
## 375/375 - 6s - loss: 0.1173 - accuracy: 0.9750 - val_loss: 0.1610 - val_accuracy: 0.9726 - 6s/epoch - 16ms/step
## Epoch 6/8
## 375/375 - 6s - loss: 0.1469 - accuracy: 0.9753 - val_loss: 0.1774 - val_accuracy: 0.9752 - 6s/epoch - 16ms/step
## Epoch 7/8
## 375/375 - 6s - loss: 0.1835 - accuracy: 0.9754 - val_loss: 0.3672 - val_accuracy: 0.9589 - 6s/epoch - 16ms/step
## Epoch 8/8
## 375/375 - 6s - loss: 0.2137 - accuracy: 0.9753 - val_loss: 0.3364 - val_accuracy: 0.9718 - 6s/epoch - 16ms/step
Pay particular attention to the accuracy of the model applied to the validation set with increasing number of training epochs. It should rise and flatten. A decrease at later epochs indicates overfitting, in which case you might go back and try again with a simpler model.
plot(model_fit)
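Rather than choosing the number of epochs by eye, you can also have keras stop training automatically when the validation metric stops improving. A sketch using an early stopping callback (the epochs and patience values here are illustrative):
model_fit <- fit(my_model,
                 x_train_rescale, y_train_matrix,
                 batch_size = 128,
                 epochs = 20,
                 validation_split = 0.2,
                 callbacks = list(
                   callback_early_stopping(monitor = "val_accuracy",
                                           patience = 2,
                                           restore_best_weights = TRUE)))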
See how accurate it is with the test data.
evaluate(my_model, x_test_reshape, y_test_matrix)
## 313/313 - 2s - loss: 65.4425 - accuracy: 0.9745 - 2s/epoch - 7ms/step
## loss accuracy
## 65.44247 0.97450
The accuracy is close to that on the validation set. The loss is inflated, however, because this call passed the unrescaled images (x_test_reshape); evaluating with x_test_rescale instead, matching how the model was trained, should yield a loss comparable to the validation loss.
Use the model to predict the known response (digit) for each of the test data images. y_test_probs will contain the probabilities of image assignment to the different digit classes. y_test_pred will contain the single class with the highest probability for each image.
y_test_probs <- predict(my_model, x_test_rescale)
## 313/313 - 1s - 501ms/epoch - 2ms/step
y_test_pred <- as.vector(k_argmax(y_test_probs))
Compare the known y-values (digits) in the test data set with the model-predicted y-values (y_test_pred). You can see that, in this run, handwritten 7's are sometimes mistakenly classified as 2's, and handwritten 5's are sometimes classified as 3's.
table(y_test, y_test_pred)
## y_test_pred
## y_test 0 1 2 3 4 5 6 7 8 9
## 0 966 0 2 0 0 2 5 2 2 1
## 1 0 1122 5 0 1 0 5 0 2 0
## 2 1 0 1025 1 0 0 0 3 2 0
## 3 0 0 3 1003 0 0 0 0 3 1
## 4 2 0 5 1 950 0 1 2 1 20
## 5 2 0 0 23 0 843 10 1 3 10
## 6 2 1 2 0 3 3 947 0 0 0
## 7 1 2 17 13 0 0 0 983 1 11
## 8 4 0 7 13 1 1 9 2 924 13
## 9 2 3 2 3 3 0 0 2 2 992
View some of the misclassified images. Here are a few of the test images of handwritten 7’s that were misclassified as 2’s.
z <- which(y_test == 7 & y_test_pred == 2)
z
## [1] 1040 1227 1327 1755 1904 2196 2579 3167 3190 3752 3768 5247 7433 9010 9016
## [16] 9020 9025
plot(as.cimg(t(x_test[z[1], , ])), axes = FALSE)
Save the model so that you can reload it later without refitting. Ignore the warning about the legacy HDF5 format.
save_model_hdf5(my_model, "mymodel.h5")
# reload the saved model
my_model <- load_model_hdf5("mymodel.h5")
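The warning arises because HDF5 is considered a legacy format. Recent versions of keras can instead save to the TensorFlow SavedModel format, which produces no warning (a sketch):
save_model_tf(my_model, "mymodel")
my_model <- load_model_tf("mymodel")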