Training AlexNet using Swift for TensorFlow

Last March at the TensorFlow Dev Summit, Chris Lattner and Richard Wei unveiled Swift for TensorFlow. As someone who has done quite a bit of work in Swift, and in particular machine learning using Swift, this was of great interest to me. The justifications for why Swift is an ideal language to pair with a framework like TensorFlow are described here, and the engineering work behind automatic differentiation alone is impressive.

I've been keeping an eye on the project since it was announced, and with the recent open source release of Swift for TensorFlow's deep learning library, it looked like it was finally possible to use the project for non-trivial applications.

As a result, I decided to create and train the classic AlexNet convolutional neural network architecture on a toy image dataset using Swift for TensorFlow and see if that was now possible. It is indeed, and I wanted to document what I learned from the process and share the code as an example of how image classification models can be trained in Swift for TensorFlow.


AlexNet is the convolutional neural network architecture that won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an important event in recent machine vision history. I like to use the architecture as a kind of "hello world" example when working with new neural network training or inference frameworks, because its linear structure is relatively simple to implement, yet it works well across broad classes of image classification tasks.

My final code can be found in this GitHub repository. It takes the form of an application that builds and runs on Mac (via Xcode or the Swift Package Manager) or Ubuntu 18.04. This should build with the API as it was on February 28, 2019, and I’ll try to keep it up to date as the API evolves.

The primary task that the base AlexNet network is tuned for is image classification: given an image and a set of possible categories, sort that image into one of those categories. In this toy example, I chose to build a binary cat-or-dog classifier. I took 30 images of dogs and 30 images of cats from Pexels to form my training set, and a separate set of 30 different images of each to form a validation set. The goal of the trained network is to determine whether a photo shows a cat or a dog.


This is a pretty tiny dataset for an image classification task (I commonly work with 10,000+ images per dataset for this), so we're going to lean heavily on transfer learning to make this work. Transfer learning is where you train a neural network on a very large image dataset (ImageNet 2012 is a common starting point), use the weights and biases learned from that initial training run as a starting point, and then re-train the network on your smaller target dataset. With a dataset of only 60 images, a randomly initialized AlexNet won't converge during training and won't recognize much, but one seeded with weights from training on ImageNet will rapidly converge on a network that can identify cats and dogs reasonably well.


In my GitHub project, you'll find six files. The first is AlexNet.swift, which defines the structure of the AlexNet convolutional neural network. It does so within the AlexNet struct, which conforms to the Layer protocol from the Swift deep learning API. This model contains within it other components that conform to the Layer protocol, such as Conv2D (a struct that implements 2-D convolutions), Dense (a struct that implements fully connected, or dense, layers), MaxPool2D (a maximum pooling operation), and a custom Layer called LRN that I added in LRN.swift.

The original AlexNet implementation used local response normalization (LRN), an operation that's not as common in more recent architectures, so it hadn't yet been added to the Swift deep learning API. However, you can create your own custom Layers and place them in a model, so that's what I did. In LRN.swift, I created a Layer that performs the LRN calculation. It does this by calling into TensorFlow via the Raw functions that are exposed to Swift (in this case Raw.lRN). Dan Zheng helpfully provided the code for calculating the gradient of the LRN function and satisfying the needs of automatic differentiation.
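A minimal sketch of what such a custom Layer can look like follows. The names approximate the deep learning API as it stood in early 2019, and I've elided the custom gradient registration; the actual LRN.swift (including Dan Zheng's gradient code, which uses the corresponding raw gradient op) is the authoritative version:

```swift
// Hedged sketch of a custom Layer wrapping the raw TensorFlow LRN op.
// The real implementation must also supply a custom gradient so that
// automatic differentiation can see through the Raw call.
public struct LRN: Layer {
    @noDerivative let depthRadius: Int64
    @noDerivative let bias: Double
    @noDerivative let alpha: Double
    @noDerivative let beta: Double

    @differentiable(wrt: (self, input))
    public func applied(to input: Tensor<Float>, in context: Context) -> Tensor<Float> {
        // Forward pass: delegate the normalization to TensorFlow's LRN kernel.
        return Raw.lRN(input, depthRadius: depthRadius,
                       bias: bias, alpha: alpha, beta: beta)
    }
}
```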

In the AlexNet initializer, I configure all of the layers per the original publication, although I let you tweak the width of the fully-connected layers to make the model a little smaller and easier to train. In order to properly initialize the network so that it'll train on this tiny dataset, I load weights and biases from a pretrained network and set those to the filter and bias properties on the convolution layers:

let conv1Weights = try loadWeights(from: "conv1.weights", directory: directory, filterShape: (11, 11, 3, 96))
let conv1Bias = try loadBiases(from: "conv1.biases", directory: directory)
self.conv1 = Conv2D<Float>(filter: conv1Weights, bias: conv1Bias, activation: relu, strides: (4, 4), padding: .valid)

The overall network is defined as follows:

self.conv1 = Conv2D(filterShape: (11, 11, 3, 96), strides: (4, 4), padding: .valid, activation: relu)
self.conv2 = Conv2D(filterShape: (5, 5, 96, 256), strides: (1, 1), padding: .same, activation: relu)
self.conv3 = Conv2D(filterShape: (3, 3, 256, 384), strides: (1, 1), padding: .same, activation: relu)
self.conv4 = Conv2D(filterShape: (3, 3, 384, 384), strides: (1, 1), padding: .same, activation: relu)
self.conv5 = Conv2D(filterShape: (3, 3, 384, 256), strides: (1, 1), padding: .same, activation: relu)        
self.norm1 = LRN(depthRadius: 5, bias: 1.0, alpha: 0.0001, beta: 0.75)
self.pool1 = MaxPool2D(poolSize: (3, 3), strides: (2, 2), padding: .valid)
self.norm2 = LRN(depthRadius: 5, bias: 1.0, alpha: 0.0001, beta: 0.75)
self.pool2 = MaxPool2D(poolSize: (3, 3), strides: (2, 2), padding: .valid)
self.pool5 = MaxPool2D(poolSize: (3, 3), strides: (2, 2), padding: .valid)
self.fc6 = Dense(inputSize: 9216, outputSize: fullyConnectedWidth, activation: relu) // 6 * 6 * 256 on input
self.drop6 = Dropout<Float>(probability: 0.5)
self.fc7 = Dense(inputSize: fullyConnectedWidth, outputSize: fullyConnectedWidth, activation: relu)
self.drop7 = Dropout<Float>(probability: 0.5)        
self.fc8 = Dense(inputSize: fullyConnectedWidth, outputSize: classCount, activation: { $0 } )

The convolution weights are derived from a version of AlexNet that I trained using the Caffe framework against the ImageNet 2012 dataset. I trained my own variant of AlexNet from scratch because the original AlexNet model split its convolutions into two groups across two separate GPUs, and the Conv2D Layer doesn't currently provide a way to specify grouping. This kind of grouped convolution became uncommon in later architectures. Therefore, I needed to train a model that would produce the proper number of weights for a network without convolution grouping, and the fastest way was to train my own. I used Caffe simply because I'd already set up an environment for it in Nvidia's DIGITS tool.

The weights were extracted from the trained Caffe model using a separate Swift command-line application I wrote. The Swift Protobuf framework comes in handy here for parsing Caffe's protobuf-based file format. The weights and biases were dumped out as raw binary files for each convolution layer into the /weights directory in this repository. WeightLoading.swift has the code needed to load these Caffe-formatted weights and biases and reorder them for TensorFlow. Caffe, TensorFlow, and Apple's Metal Performance Shaders all use different orderings for weights, which makes translating trained models between the three a lot of fun.
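As a concrete illustration of that reordering (a sketch, not the exact WeightLoading.swift code): Caffe stores convolution filters in [outputChannels, inputChannels, height, width] order, while these TensorFlow Conv2D layers expect [height, width, inputChannels, outputChannels], so the loaded scalars need a transpose along these lines:

```swift
// Hedged sketch: rawScalars is assumed to hold conv1's weights as read
// from the Caffe binary dump, in Caffe's (O, I, H, W) filter layout.
let caffeFilter = Tensor<Float>(shape: [96, 3, 11, 11], scalars: rawScalars)
// Permute to TensorFlow's (H, W, I, O) layout, giving shape [11, 11, 3, 96].
let tensorFlowFilter = caffeFilter.transposed(withPermutations: 2, 3, 1, 0)
```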

Once the network layers have been initialized, you need to specify how the data flows through the network (the graph). This happens in the applied(to:in:) method, which you can see in the code is marked as @differentiable. The gradients obtained by differentiating this method are vital to backpropagation during network training. A functional style is used to define the inputs and outputs for each layer (similar to what you might do in Keras):

@differentiable(wrt: (self, input))
public func applied(to input: Tensor<Float>, in context: Context) -> Tensor<Float> {
    let conv1Result = conv1.applied(to: input, in: context)
    let norm1Result = norm1.applied(to: conv1Result, in: context)
    let pool1Result = pool1.applied(to: norm1Result, in: context)
    let conv2Result = conv2.applied(to: pool1Result, in: context)
    let norm2Result = norm2.applied(to: conv2Result, in: context)
    let pool2Result = pool2.applied(to: norm2Result, in: context)
    let conv3Result = conv3.applied(to: pool2Result, in: context)
    let conv4Result = conv4.applied(to: conv3Result, in: context)
    let conv5Result = conv5.applied(to: conv4Result, in: context)
    let pool5Result = pool5.applied(to: conv5Result, in: context)
    let reshapedIntermediate = pool5Result.reshaped(toShape: Tensor<Int32>([pool5Result.shape[Int32(0)], 9216]))
    let fc6Result = fc6.applied(to: reshapedIntermediate, in: context)
    let drop6Result = drop6.applied(to: fc6Result, in: context)
    let fc7Result = fc7.applied(to: drop6Result, in: context)
    let drop7Result = drop7.applied(to: fc7Result, in: context)
    let fc8Result = fc8.applied(to: drop7Result, in: context)

    return fc8Result
}

There is also a sequential API available (again, similar to Keras).

The inputs and outputs of each of these layers are of the Tensor type. Tensors are multidimensional arrays that use Swift's generics in interesting ways to extend to different internal data types. The Tensors passed between the layers in this classification network start out as four-dimensional arrays of Floats: batches of images, each with a width, height, and three color channels (images are batched for processing efficiency). Convolution layers then expand the number of channels as filters are applied across a batch. Before the final Dense layers and the classification output, each image's activations are flattened into a one-dimensional vector of Floats.
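To make those shapes concrete, here is a trace of one batch through the layers as configured above, assuming the 227×227 input size commonly used with AlexNet (which is consistent with the 6 × 6 × 256 = 9216 input to fc6 noted in the code; the LRN layers don't change shapes):

```swift
// Activation shapes for a batch of N 227×227 RGB images; spatial sizes
// follow from the strides and padding in the layer definitions above.
// input:               [N, 227, 227, 3]
// conv1 (11×11, s4):   [N, 55, 55, 96]     // (227 - 11) / 4 + 1 = 55
// pool1 (3×3, s2):     [N, 27, 27, 96]     // (55 - 3) / 2 + 1 = 27
// conv2 (same pad):    [N, 27, 27, 256]
// pool2 (3×3, s2):     [N, 13, 13, 256]
// conv3, conv4 (same): [N, 13, 13, 384]
// conv5 (same pad):    [N, 13, 13, 256]
// pool5 (3×3, s2):     [N, 6, 6, 256]
// reshape:             [N, 9216]           // 6 * 6 * 256
```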

The loss function for this classification case is softmax cross entropy, which compares the output of the classification network (an array of floats in which the position of the largest value is the network's strongest classification guess) against the one-hot ground truth determined from the labeled data. I've also added an accuracy measurement function to provide human-understandable measurements of how accurate the network is against its training images and how well it generalizes to validation images it has never seen before.
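In code, the loss and the accuracy measurement look roughly like this (a sketch against the early-2019 API; imageBatch, oneHotLabels, and labels are hypothetical names, and function signatures may have shifted since):

```swift
// Hedged sketch: softmax cross entropy against one-hot labels, plus a
// simple accuracy measure (fraction of argmax predictions that match).
let logits = model.applied(to: imageBatch, in: trainingContext)
let loss = softmaxCrossEntropy(logits: logits, oneHotLabels: oneHotLabels)

func accuracy(predictions: Tensor<Int32>, truths: Tensor<Int32>) -> Float {
    // Convert the elementwise comparison to 0/1 Floats and average.
    return Tensor<Float>(predictions .== truths).mean().scalarized()
}
let trainingAccuracy = accuracy(predictions: logits.argmax(squeezingAxis: 1),
                                truths: labels)
```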

Eventually, I imagine the deep learning API's Dataset type will provide full image loading capabilities, but at present that functionality is missing. Therefore, I wrote functions and a struct in ImageDataset.swift that use either Core Graphics or TensorFlow to load images from disk, resize them to the input size of the convolutional network, and convert them into Tensor<Float>s for use in training and validation. The TensorFlow-based functions again use the Raw functions to enable this, and they work across all supported platforms:

func loadImageUsingTensorFlow(from fileURL: URL, size: (Int, Int), byteOrdering: ImageDataset.ByteOrdering, pixelMeanToSubtract: Float) -> [Float]? {
    let loadedFile = Raw.readFile(filename: StringTensor(fileURL.absoluteString))
    let loadedJpeg = Raw.decodeJpeg(contents: loadedFile, channels: 3, dctMethod: "")
    let resizedImage = Raw.resizeBilinear(images: Tensor<UInt8>([loadedJpeg]), size: Tensor<Int32>([Int32(size.0), Int32(size.1)])) - pixelMeanToSubtract
    if (byteOrdering == .bgr) {
        let reversedChannelImage = Raw.reverse(resizedImage, dims: Tensor<Bool>([false, false, false, true]))
        return reversedChannelImage.scalars
    } else {
        return resizedImage.scalars
    }
}
The ImageDataset struct does make limited use of the Dataset type to perform randomized shuffling of images and labels during training. I found that randomized shuffling of the images during each pass through the dataset was crucial in order to get decent training performance.
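That per-epoch shuffling looks something like the following (a sketch; the exact element type, names like labeledImages, and seed handling in ImageDataset.swift may differ):

```swift
// Hedged sketch: reshuffle image/label pairs at the start of each epoch
// so every pass through the dataset sees the examples in a new order.
let shuffledDataset = Dataset(elements: labeledImages)
    .shuffled(sampleCount: imageCount, randomSeed: Int64(epoch))
for batch in shuffledDataset.batched(batchSize) {
    // ... run one training step on this batch ...
}
```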

One thing I’ll note is that some of these functions could have been performed by bridging over to Python using the new Swift–Python interoperability (which I really like), but I wanted to see if I could make everything work without relying on Python libraries for any part of it.

Training and results

All of this is pulled together in the main.swift file, which defines the application. Our training dataset is loaded from images/train and a validation set from images/val. Weights are loaded from the weights/ directory and an AlexNet implementation is constructed using them. With all that, it's time to train a network.

Training is done using a stochastic gradient descent (SGD) optimizer, using learning parameters I've found to work with this network, although a little tuning there might help. For example, the learning rate is held constant here where it usually is decreased over the length of training, and I’ve noticed occasional instability at points in the training process.

The training loop works by retrieving a randomly shuffled batch of images, determining gradients across the network based on these inputs, and then updating the training weights via backpropagation. I log out the epoch (number of times through the whole dataset), loss, and accuracy against the training set on each pass, along with validation accuracy each tenth pass.
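Sketched out (again approximating the early-2019 API; epochCount, trainingBatches, and the learning parameters here are placeholders, not the values used in main.swift), the loop has this shape:

```swift
// Hedged sketch of the SGD training loop described above.
let optimizer = SGD<AlexNet, Float>(learningRate: 0.001, momentum: 0.9)
for epoch in 0..<epochCount {
    for batch in trainingBatches(shuffled: true) {
        // Differentiate the loss with respect to the model's parameters.
        let gradients = gradient(at: model) { model -> Tensor<Float> in
            let logits = model.applied(to: batch.images, in: trainingContext)
            return softmaxCrossEntropy(logits: logits, oneHotLabels: batch.labels)
        }
        // Backpropagation: apply the gradients to the trainable weights.
        optimizer.update(&model.allDifferentiableVariables, along: gradients)
    }
}
```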

Running this on the CPU on the Mac takes a few minutes to reach ~100% accuracy on the training set and 80%+ accuracy on the validation set. Using the same Swift code on my Ubuntu training computer with a GTX 1080 is about ten times faster, as you'd expect. Here's what typical graphs of training loss and accuracy look like for this application:


To dig deeper, I wanted to prove to myself that the calculations at each layer were being performed correctly, so I created functions that dump out activation heat maps from each neural network layer. This is triggered by setting dumpTensorImages to true in main.swift, and the tensor visualization functions are in TensorVisualization.swift. Below are images of network activations captured at multiple points in both Caffe (left) and Swift for TensorFlow (right), using identical weights and network architectures. The starting test image and its color split are at the top:

The output from each filter in a convolution is a square in the tensor visualization, with the relative strength of the activation (normalized by the minimum and maximum value in the activations at that layer) denoted by a blue-red heatmap. As you can see, aside from my slightly different heatmap colorspace, the neural network calculations match between Caffe (left) and Swift for TensorFlow (right). Thanks go to Delia for being my validation pug in these tests.

Next steps and learning more

It's still early days for the Swift for TensorFlow project, as can be seen by the components I needed to write to get this to work. Over time, it will get a lot easier to construct models as the framework is built out and these capabilities are provided for you. However, you now have enough of the building blocks to make this possible, if you’re willing to fill in some of the pieces yourself. The LRN layer above is a good example of how a capability was missing in the deep learning API, but we have the ability to code a replacement ourselves.

I'm excited about the prospect of building machine learning models in Swift. There's the ease and flexibility of writing model code in Xcode, a Swift Playground, or Jupyter, testing it out there, and then deploying without any changes to my Ubuntu training computers with powerful Nvidia GPUs. Swift's type system has been a tremendous help in catching bugs early for our robotics software, as well as for our custom in-house convolutional network inference framework, and I look forward to seeing how it can help prevent bugs that otherwise lead to wasted hours of training.

If you're interested in learning more, I've linked a number of references above for various aspects of Swift for TensorFlow and training image classification models. The main repository and mailing list are pretty active, and everyone involved is helpful and willing to answer questions. In particular, I'd like to thank Richard Wei and Dan Zheng for their help in getting me up to speed on how this works and even tweaking the API in response to suggestions.

If you have anything you'd like to discuss about this model, please feel free to contact me (Brad Larson) here at Perceptual Labs, or find me on Twitter.