Training AlexNet using Swift for TensorFlow

Last March at the TensorFlow Dev Summit, Chris Lattner and Richard Wei unveiled Swift for TensorFlow. As someone who has done quite a bit of work in Swift, and in particular machine learning using Swift, this was of great interest to me. The justifications for why Swift is an ideal language to pair with a framework like TensorFlow are described here, and the engineering work behind automatic differentiation alone is impressive.

I've been keeping an eye on the project since it was announced, and with the recent open source release of Swift for TensorFlow's deep learning library, it looked like it was finally possible to use the project for non-trivial applications.

As a result, I decided to create and train the classic AlexNet convolutional neural network architecture on a toy image dataset using Swift for TensorFlow and see if that was now possible. It is indeed, and I wanted to document what I learned from the process and share the code as an example of how image classification models can be trained in Swift for TensorFlow.


AlexNet is the convolutional neural network architecture that won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an important event in recent machine vision history. I like to use the architecture as a kind of "hello world" example when working with new neural network training or inference frameworks, because its linear architecture is relatively simple to implement, yet it works well in broad classes of image classification tasks.

My final code can be found in this GitHub repository. It takes the form of an application that builds and runs on Mac (via Xcode or the Swift Package Manager) or Ubuntu 18.04. This should build with the API as it was on February 28, 2019, and I’ll try to keep it up to date as the API evolves.

The primary task that the base AlexNet network is tuned for is image classification: given an image and a set of possible categories, sort that image into one of these categories. In this toy example, I chose to make a cat or dog binary classifier. I took 30 images of dogs and 30 images of cats from Pexels to form my training set, and a separate set of 30 different images each to form a validation set. The goal of the network is to determine if a photo is either of a cat or a dog.


This is a pretty tiny dataset for an image classification task (I commonly work with 10,000+ images per dataset for this), so we're going to lean heavily on transfer learning to make this work. Transfer learning is where you train a neural network on a very large image dataset (ImageNet 2012 is a common starting point), use the weights and biases learned from that initial training run as a starting point, and then re-train the network on your smaller target dataset. With a dataset of only 60 images, a randomly initialized AlexNet won't converge during training and won't recognize much, but one seeded with weights from training on ImageNet will rapidly converge on a network that can identify cats and dogs reasonably well.


In my GitHub project, you'll find six files. The first is AlexNet.swift, which defines the structure of the AlexNet convolutional neural network. It does so within the AlexNet struct, which conforms to the Layer protocol from the Swift deep learning API. This model contains other components that also conform to the Layer protocol, such as Conv2D (a struct that implements 2-D convolutions), Dense (a struct that implements fully connected, or dense, layers), MaxPool2D (a maximum pooling operation), and a custom Layer called LRN that I added in LRN.swift.

The original AlexNet implementation used local response normalization (LRN), an operation that's not as common in more recent architectures, so it hadn't yet been added to the Swift deep learning API. However, you can create your own custom Layers and place them in a model, so that's what I did. In LRN.swift, I created a Layer that performed the LRN calculation. It does this by calling into TensorFlow via the Raw functions that are exposed to Swift (in this case Raw.lRN). Dan Zheng helpfully provided the code for calculating the gradient of the LRN function and satisfying the needs of automatic differentiation.
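To make the LRN operation concrete, here's the calculation that Raw.lRN performs, sketched in plain Swift over a single pixel's vector of channel values (the standalone function and its name are mine for illustration; the real Layer operates on full 4-D tensors): each channel is divided by (bias + alpha × sum of squares over a window of ±depthRadius neighboring channels) raised to the beta power.

```swift
import Foundation

// Local response normalization across channels for one pixel, following
// TensorFlow's semantics: output[i] = input[i] / (bias + alpha * sqrSum)^beta,
// where sqrSum is taken over channels within depthRadius of channel i.
func localResponseNormalization(_ channels: [Float], depthRadius: Int,
                                bias: Float, alpha: Float, beta: Float) -> [Float] {
    return channels.indices.map { i in
        let window = max(0, i - depthRadius)...min(channels.count - 1, i + depthRadius)
        let squareSum = window.reduce(Float(0)) { $0 + channels[$1] * channels[$1] }
        return channels[i] / pow(bias + alpha * squareSum, beta)
    }
}
```

The effect is that a strong activation in one channel suppresses activations in its neighboring channels, a form of lateral inhibition described in the original AlexNet paper.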

In the AlexNet initializer, I configure all of the layers per the original publication, although I let you tweak the width of the fully-connected layers to make the model a little smaller and easier to train. In order to properly initialize the network so that it'll train on this tiny dataset, I load weights and biases from a pretrained network and set those to the filter and bias properties on the convolution layers:

let conv1Weights = try loadWeights(from: "conv1.weights", directory: directory, filterShape: (11, 11, 3, 96))
let conv1Bias = try loadBiases(from: "conv1.biases", directory: directory)
self.conv1 = Conv2D<Float>(filter: conv1Weights, bias: conv1Bias, activation: relu, strides: (4, 4), padding: .valid)

The overall network is defined as follows:

self.conv1 = Conv2D(filterShape: (11, 11, 3, 96), strides: (4, 4), padding: .valid, activation: relu)
self.conv2 = Conv2D(filterShape: (5, 5, 96, 256), strides: (1, 1), padding: .same, activation: relu)
self.conv3 = Conv2D(filterShape: (3, 3, 256, 384), strides: (1, 1), padding: .same, activation: relu)
self.conv4 = Conv2D(filterShape: (3, 3, 384, 384), strides: (1, 1), padding: .same, activation: relu)
self.conv5 = Conv2D(filterShape: (3, 3, 384, 256), strides: (1, 1), padding: .same, activation: relu)        
self.norm1 = LRN(depthRadius: 5, bias: 1.0, alpha: 0.0001, beta: 0.75)
self.pool1 = MaxPool2D(poolSize: (3, 3), strides: (2, 2), padding: .valid)
self.norm2 = LRN(depthRadius: 5, bias: 1.0, alpha: 0.0001, beta: 0.75)
self.pool2 = MaxPool2D(poolSize: (3, 3), strides: (2, 2), padding: .valid)
self.pool5 = MaxPool2D(poolSize: (3, 3), strides: (2, 2), padding: .valid)
self.fc6 = Dense(inputSize: 9216, outputSize: fullyConnectedWidth, activation: relu) // 6 * 6 * 256 on input
self.drop6 = Dropout<Float>(probability: 0.5)
self.fc7 = Dense(inputSize: fullyConnectedWidth, outputSize: fullyConnectedWidth, activation: relu)
self.drop7 = Dropout<Float>(probability: 0.5)        
self.fc8 = Dense(inputSize: fullyConnectedWidth, outputSize: classCount, activation: { $0 } )

The convolution weights are derived from a version of AlexNet that I trained using the Caffe framework against the ImageNet 2012 dataset. I trained my own variant of AlexNet from scratch because the original AlexNet model grouped convolutions on two separate GPUs, and that presented problems. This kind of grouping in convolution layers was uncommon afterwards, and the Conv2D Layers don't have a way of specifying grouping at present. Therefore, I needed to train a model that would produce the proper number of weights to use in a model without convolution grouping, and the fastest way was to train my own. I used Caffe simply because I'd already set up an environment for that in Nvidia's DIGITS tool.

The weights were extracted from the trained Caffe model using a separate Swift command-line application I wrote. The Swift Protobuf framework comes in handy for this so that you can parse the Caffe protobuf-based file format. The weights and biases were dumped out as raw binary files for each convolution layer to the /weights directory in this repository. WeightLoading.swift has the code needed to load these Caffe-formatted weights and biases and reorder them for TensorFlow. Caffe, TensorFlow, and Apple's Metal Performance Shaders all use different ordering for weights, which makes translating trained models between the three a lot of fun.
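The reordering itself is an index transposition: Caffe stores convolution weights as [outputChannels, inputChannels, height, width], while TensorFlow expects [height, width, inputChannels, outputChannels]. A minimal sketch of that transposition in plain Swift (the function name is mine; the repository's actual loading code lives in WeightLoading.swift):

```swift
// Reorder a flat Caffe weight buffer ([O, I, H, W] layout) into
// TensorFlow's expected [H, W, I, O] layout.
func reorderCaffeWeightsForTensorFlow(_ caffeWeights: [Float],
                                      outputChannels: Int, inputChannels: Int,
                                      height: Int, width: Int) -> [Float] {
    var tfWeights = [Float](repeating: 0, count: caffeWeights.count)
    for o in 0..<outputChannels {
        for i in 0..<inputChannels {
            for h in 0..<height {
                for w in 0..<width {
                    let caffeIndex = ((o * inputChannels + i) * height + h) * width + w
                    let tfIndex = ((h * width + w) * inputChannels + i) * outputChannels + o
                    tfWeights[tfIndex] = caffeWeights[caffeIndex]
                }
            }
        }
    }
    return tfWeights
}
```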

Once the network layers have been initialized, you need to specify how the data flows through the network (the graph). This happens in the applied(to:) method, which you can see in the code is marked as @differentiable. The gradients determined from differentiating this method are vital to backpropagation during network training. A functional style is used to define the inputs and outputs for each layer (similar to what you might do in Keras):

@differentiable(wrt: (self, input))
public func applied(to input: Tensor<Float>, in context: Context) -> Tensor<Float> {
    let conv1Result = conv1.applied(to: input, in: context)
    let norm1Result = norm1.applied(to: conv1Result, in: context)
    let pool1Result = pool1.applied(to: norm1Result, in: context)
    let conv2Result = conv2.applied(to: pool1Result, in: context)
    let norm2Result = norm2.applied(to: conv2Result, in: context)
    let pool2Result = pool2.applied(to: norm2Result, in: context)
    let conv3Result = conv3.applied(to: pool2Result, in: context)
    let conv4Result = conv4.applied(to: conv3Result, in: context)
    let conv5Result = conv5.applied(to: conv4Result, in: context)
    let pool5Result = pool5.applied(to: conv5Result, in: context)
    let reshapedIntermediate = pool5Result.reshaped(toShape: Tensor<Int32>([pool5Result.shape[Int32(0)], 9216]))
    let fc6Result = fc6.applied(to: reshapedIntermediate, in: context)
    let drop6Result = drop6.applied(to: fc6Result, in: context)
    let fc7Result = fc7.applied(to: drop6Result, in: context)
    let drop7Result = drop7.applied(to: fc7Result, in: context)
    let fc8Result = fc8.applied(to: drop7Result, in: context)

    return fc8Result
}

There is also a sequential API available (again, similar to Keras).

The inputs and outputs of each of these layers in the network are of the Tensor type. Tensors are multidimensional arrays that use Swift's generics in interesting ways to extend to different internal data types. The Tensors that are passed between the layers in this classification network are initially four-dimensional arrays of Floats, but each image is flattened to a 1-D array of Floats before being passed to the final Dense layers and the classification output. The image inputs are four-dimensional because they consist of arrays (batches) of images with a width, height, and three color channels. Images are batched for processing efficiency. Convolution layers then expand the number of channels as filters are applied across a batch of images.
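To make those shape changes concrete, here's the arithmetic that takes a 227×227 input image down to the 9,216-element vector fed to fc6 (a plain-Swift sketch; 227×227 is the standard AlexNet input convention, and the 'same'-padded conv2 through conv5 layers leave the spatial size unchanged):

```swift
// Spatial output size of a 'valid'-padded convolution or pooling layer.
func validOutputSize(input: Int, kernel: Int, stride: Int) -> Int {
    return (input - kernel) / stride + 1
}

// Tracing the spatial reductions through the network:
let conv1Size = validOutputSize(input: 227, kernel: 11, stride: 4)      // 55
let pool1Size = validOutputSize(input: conv1Size, kernel: 3, stride: 2) // 27
let pool2Size = validOutputSize(input: pool1Size, kernel: 3, stride: 2) // 13 (conv2-conv5 are 'same', size unchanged)
let pool5Size = validOutputSize(input: pool2Size, kernel: 3, stride: 2) // 6
let flattenedSize = pool5Size * pool5Size * 256                         // 6 * 6 * 256 = 9216, fc6's input size
```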

The loss function is defined for this classification case using softmax cross entropy loss that compares the output of the classification network (an array of floats where the position of the largest value is the strongest classification guess by the network) against the one-hot ground truth determined from the labeled data. I've also added an accuracy measurement function to provide human-understandable measurements of how accurate the network is against its training images and how well it generalizes to validation images it has never seen before.
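For a single example, the loss and accuracy measurements described above boil down to a few lines. This plain-Swift sketch (function names are mine; the project uses the deep learning API's tensor operations) shows softmax cross entropy against the index of the true class, and the argmax rule used for accuracy:

```swift
import Foundation

// Softmax cross-entropy loss for one example: convert raw logits to
// probabilities with softmax, then take the negative log-probability of
// the true class.
func softmaxCrossEntropy(logits: [Float], trueClass: Int) -> Float {
    let maxLogit = logits.max()! // subtract the max for numerical stability
    let exponentials = logits.map { exp($0 - maxLogit) }
    let probability = exponentials[trueClass] / exponentials.reduce(0, +)
    return -log(probability)
}

// Accuracy: the network's guess is the position of the largest output value.
func isCorrect(logits: [Float], trueClass: Int) -> Bool {
    return logits.firstIndex(of: logits.max()!) == trueClass
}
```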

Eventually, I imagine the deep learning API's Dataset type will provide full image-loading capabilities, but at present it doesn't. Therefore, I wrote functions and a struct in ImageDataset.swift that use either Core Graphics or TensorFlow to load images from disk, resize them to the input size of the convolutional network, and convert them into Tensor<Float>s for use in training and validation. The TensorFlow functions use the Raw functions again to enable this, and these work across all supported platforms:

func loadImageUsingTensorFlow(from fileURL: URL, size: (Int, Int), byteOrdering: ImageDataset.ByteOrdering, pixelMeanToSubtract: Float) -> [Float]? {
    let loadedFile = Raw.readFile(filename: StringTensor(fileURL.absoluteString))
    let loadedJpeg = Raw.decodeJpeg(contents: loadedFile, channels: 3, dctMethod: "")
    let resizedImage = Raw.resizeBilinear(images: Tensor<UInt8>([loadedJpeg]), size: Tensor<Int32>([Int32(size.0), Int32(size.1)])) - pixelMeanToSubtract
    if (byteOrdering == .bgr) {
        let reversedChannelImage = Raw.reverse(resizedImage, dims: Tensor<Bool>([false, false, false, true]))
        return reversedChannelImage.scalars
    } else {
        return resizedImage.scalars
    }
}

The ImageDataset struct does make limited use of the Dataset type to perform randomized shuffling of images and labels during training. I found that randomized shuffling of the images during each pass through the dataset was crucial in order to get decent training performance.
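The key invariant when shuffling is that images and labels move together, so each image keeps its correct label. The actual implementation uses the Dataset type for this, but the idea can be illustrated in plain Swift with a shared permutation (the function name is mine):

```swift
// Shuffle images and labels with the same permutation so each image keeps
// its label. Generating a fresh permutation each epoch is what gives the
// network a different ordering on every pass through the dataset.
func shuffledTogether<Image, Label>(_ images: [Image], _ labels: [Label]) -> ([Image], [Label]) {
    let permutation = Array(images.indices).shuffled()
    return (permutation.map { images[$0] }, permutation.map { labels[$0] })
}
```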

One thing I’ll note is that some of these functions could have been performed by bridging over to Python using the new Swift->Python support (which I really like), but I wanted to see if I could make everything work without relying on Python libraries for any part of it.

Training and results

All of this is pulled together in the main.swift file, which defines the application. Our training dataset is loaded from images/train and a validation set from images/val. Weights are loaded from the weights/ directory and an AlexNet implementation is constructed using them. With all that, it's time to train a network.

Training is done using a stochastic gradient descent (SGD) optimizer, using learning parameters I've found to work with this network, although a little tuning there might help. For example, the learning rate is held constant here, whereas it is usually decreased over the length of training, and I've noticed occasional instability at points in the training process.
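The SGD update itself is a one-line rule: each weight is nudged against its gradient, scaled by the learning rate. Here's a minimal plain-Swift illustration on a one-parameter toy problem (this is not the repository's training loop, which uses the deep learning API's SGD optimizer over the full model):

```swift
// SGD update rule: weight -= learningRate * gradient, applied repeatedly.
// Shown here minimizing (w - 3)^2, whose gradient is 2 * (w - 3).
var weight: Float = 0.0
let learningRate: Float = 0.1
for _ in 0..<100 {
    let gradient = 2.0 * (weight - 3.0)
    weight -= learningRate * gradient
}
// weight has converged close to the minimum at 3.0
```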

The training loop works by retrieving a randomly shuffled batch of images, determining gradients across the network based on these inputs, and then updating the training weights via backpropagation. I log out the epoch (number of times through the whole dataset), loss, and accuracy against the training set on each pass, along with validation accuracy each tenth pass.

Running this on the CPU on the Mac takes a few minutes to reach ~100% accuracy on the training set and 80+% accuracy on the validation set. Using the same Swift code on my Ubuntu training computer with a GTX 1080 is about ten times faster, as you'd expect. Here's what typical graphs of training loss and accuracy look like for this application:


To dig deeper, I wanted to prove to myself that the calculations at each layer were being performed correctly, so I created functions that dump out activation heat maps from each neural network layer. This is triggered by setting dumpTensorImages to true in main.swift, and the tensor visualization functions are in TensorVisualization.swift. Below are images of network activations captured at multiple points in both Caffe (left) and Swift for TensorFlow (right), using identical weights and network architectures. The starting test image and its color split are at the top:

The output from each filter in a convolution is a square in the tensor visualization, with the relative strength of the activation (normalized by the minimum and maximum value in the activations at that layer) denoted by a blue-red heatmap. As you can see, aside from my slightly different heatmap colorspace, the neural network calculations match between Caffe (left) and Swift for TensorFlow (right). Thanks go to Delia for being my validation pug in these tests.

Next steps and learning more

It's still early days for the Swift for TensorFlow project, as can be seen by the components I needed to write to get this to work. Over time, it will get a lot easier to construct models as the framework is built out and these capabilities are provided for you. However, you now have enough of the building blocks to make this possible, if you’re willing to fill in some of the pieces yourself. The LRN layer above is a good example of how a capability was missing in the deep learning API, but we have the ability to code a replacement ourselves.

I'm excited about the prospect of building machine learning models in Swift. There's the ease and flexibility of writing model code in Xcode, a Swift Playground, or Jupyter, testing it out there, and then deploying without any changes to my Ubuntu training computers with powerful Nvidia GPUs. Swift's type system has been a tremendous help in catching bugs early for our robotics software, as well as for our custom in-house convolutional network inference framework, and I look forward to see how it can help prevent bugs that otherwise lead to wasted hours of training.

If you're interested in learning more, I've linked a number of references above for various aspects of Swift for TensorFlow and training image classification models. The main repository and mailing list are pretty active, and everyone involved is helpful and willing to answer questions. In particular, I'd like to thank Richard Wei and Dan Zheng for their help in getting me up to speed on how this works and even tweaking the API in response to suggestions.

If you have anything you'd like to discuss about this model, please feel free to contact me (Brad Larson) here at Perceptual Labs, or find me on Twitter.

Launching Pocket Agronomist

We're proud to announce that Pocket Agronomist is now available for download from the App Store (U.S.-only right now). It is an application that combines convolutional neural networks for realtime crop disease diagnosis with augmented reality tools for measuring crop statistics. It was developed by Agricultural Intelligence, a joint venture between Perceptual Labs and Ag AI.

Pocket Agronomist is currently free to download and use. You can read more about it on our product page.

We unveiled Pocket Agronomist a little under a year ago in this post, and have been refining the core technology and expanding capabilities since then. Read on for more about what we've added since then:


ARKit for stand count measurement


The largest addition is an augmented-reality-based tool for performing stand counts in the field. Stand counts are measures of how many plants are present per a set distance in a row within a field. If you then know the spacing of the rows, you can calculate crop density within a field.

Beyond simple density measurements, if you can measure the spacing between individual plants in a row, you can learn even more about a field. Uneven plant spacing (indicated by a high standard deviation in individual distance measurements) might mean a significantly lower yield for a field. This could be due to problems during planting or crop damage due to frost or hail. Being aware of issues early in the season might allow a farmer to quickly re-plant, and an accurate assessment after storm damage can make sure insurance policies pay the correct amount for lost yield.
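The statistics involved are straightforward: from the measured plant positions along a row, you get the per-plant spacings, and from those the mean and standard deviation. A sketch of that calculation in Swift (the function and its name are illustrative, not the application's actual code):

```swift
// Given measured plant positions along a row (e.g. in inches), compute the
// per-plant spacings and their mean and (population) standard deviation.
// A high standard deviation indicates uneven spacing, which can signal
// planting problems or crop damage.
func spacingStatistics(plantPositions: [Double]) -> (mean: Double, standardDeviation: Double) {
    let spacings = zip(plantPositions.dropFirst(), plantPositions).map { $0 - $1 }
    let mean = spacings.reduce(0, +) / Double(spacings.count)
    let variance = spacings.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / Double(spacings.count)
    return (mean, variance.squareRoot())
}
```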

At present, stand counts are gathered by hand using a labor-intensive process typically involving going to a field, laying out a tape measure, reading out the inch markers for each plant, writing down these values, and then transcribing them later into a spreadsheet which calculates the overall statistics. Many choose to simplify this process to cut down on time and only measure the number of plants in a set distance, losing the individual spacing statistics.

We found that this process could be simplified and automated using the latest augmented reality technologies being deployed to mobile devices. We conducted a series of experiments using Apple's ARKit and found that we could use it to identify the ground in a field reliably enough to perform stand count measurements that produce average plant distances matching hand measurements to within a third of an inch.

To do this, we use ARKit to identify the position of the ground relative to the device's camera in three-dimensional space. The user then aims a focus square along that virtual ground plane until it surrounds a plant. Tapping on the screen causes a line to be drawn from the center of the screen to the virtual plane of the ground, and the base of the plant is labeled at that intersecting point in 3-D. That point is tracked as the device and its camera move, and ideally should remain aligned with the base of the plant.
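Geometrically, placing that label is a ray-plane intersection: a ray from the camera through the screen center, intersected with the ground plane ARKit has identified. A self-contained Swift sketch of the math (the Vector3 type and function names here are illustrative; in practice ARKit and simd provide these primitives):

```swift
// Minimal 3-D vector with just the operations the intersection needs.
struct Vector3 {
    var x, y, z: Double
    static func - (a: Vector3, b: Vector3) -> Vector3 { Vector3(x: a.x - b.x, y: a.y - b.y, z: a.z - b.z) }
    static func + (a: Vector3, b: Vector3) -> Vector3 { Vector3(x: a.x + b.x, y: a.y + b.y, z: a.z + b.z) }
    func dot(_ other: Vector3) -> Double { x * other.x + y * other.y + z * other.z }
    func scaled(by s: Double) -> Vector3 { Vector3(x: x * s, y: y * s, z: z * s) }
}

// Where a ray from the camera meets the ground plane (given by a point on
// the plane and its normal). Returns nil if the ray is parallel to the
// plane or points away from it.
func intersectRay(origin: Vector3, direction: Vector3,
                  planePoint: Vector3, planeNormal: Vector3) -> Vector3? {
    let denominator = direction.dot(planeNormal)
    if abs(denominator) < 1e-9 { return nil } // ray parallel to plane
    let t = (planePoint - origin).dot(planeNormal) / denominator
    if t < 0 { return nil } // intersection is behind the camera
    return origin + direction.scaled(by: t)
}
```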

While that all sounds complicated, it's handled for the user automatically and all they have to do is point their phone camera at the bases of plants in a row and tap for each. Mistaken measurements can be undone easily, and all measurements are displayed as a 3-D overlay on the real world so that they can see what was measured down a row.

When a sufficient number of measurements have been taken, a press of a button will bring up all the statistics for a stand count, calculated automatically. No data entry is required, and all of these values can be emailed to anyone as a comma separated value file for later analysis.

We feel that this could be a significant time saver for farmers, insurance agents, and others, and wanted to get it into the hands of those people as soon as we could. This is why we are launching Pocket Agronomist now, so that people can make use of this throughout the remainder of the U.S. growing season.

Enhancing disease detection and expanding crop selection

Pocket Agronomist still features realtime disease detection using trained convolutional neural networks, and we have continued to improve this functionality since we unveiled it last year. However, we are currently labeling this capability as "experimental" while we assess its performance in the field and work to gather training data for some less common corn diseases.

Data collection is another reason we are launching the product now. By default, when Pocket Agronomist is used to diagnose a disease case in the wild, an image of that disease is uploaded to our repository to be used in training our neural networks. You can opt out of this at any point via a simple switch. Images are anonymized, but are stamped with location data so that we can see what region they were taken from (which can aid in separating similar strains of crop diseases). This location data can also be selectively disabled.

A number of people helped us collect data throughout the last growing season, and more people have offered to do so now that the application is formally released. We're hoping to use this throughout this growing season to increase our disease recognition accuracies and fully validate the use of this application for diagnosis in the field.

To further help data collection efforts, we've added a tab within the application where you can take and upload images of diseases appearing in other crops beyond the currently-supported corn diseases. We plan to add a series of additional crops over time as we build our training datasets.

We're all extremely excited to see how people use this in the field, and to grow its capabilities throughout this season. Feel free to download it from the U.S. App Store today and give it a try.

Pocket Agronomist: diagnosing plant diseases using convolutional neural networks


At Perceptual Labs, our machine vision work is driven by the needs of real world applications. I'd like to talk about an application that we're proud to unveil today. Called Pocket Agronomist, it is an iOS application for diagnosing common crop diseases using the live camera feed from your mobile device. It uses convolutional neural networks to perform realtime object detection, identifying and labeling regions of plant disease as you look through the camera.

Pocket Agronomist was developed by Agricultural Intelligence, a joint venture between Perceptual Labs and Ag AI. Ag AI brings significant agricultural experience, knows the needs of the agricultural industry, and has the means to acquire a dataset for detection of these various plant diseases. When combined with Perceptual Labs' software and expertise for training and deploying convolutional neural networks on mobile devices, a unique product is made possible.

Designing, building, and testing this application took a significant amount of work, and I'd like to describe the process. 

Identifying the problem

We are fortunate to be located in Madison, WI, an area surrounded by farm country and where the University of Wisconsin, one of the nation's leading research universities, performs leading-edge agricultural research. I can see three corn fields from my front porch, and I come from a stereotypical Wisconsin family featuring dairy farmers and cheesemakers. It's natural, then, that one of the first conversations we had about the use of machine vision involved agriculture.

After the growing season last year, the founders of Ag AI came to us with an interesting problem. Every year, farmers struggle to minimize crop damage done by disease. In 2016 alone, it's estimated that 817 million bushels of corn were lost to disease. Early identification and treatment of these diseases could be a tremendous help to farmers, but it's impractical to have trained specialists walking every field.

Both parties here recognized that by using machine vision, we might be able to provide that expertise to every farmer via the device they already have in their pocket. Our early proof-of-principle tests showed very promising results, so we created a joint venture to explore this. Ag AI would provide their extensive agricultural expertise, access to crop disease datasets for training and testing, and a network of initial users. Perceptual Labs would design and train convolutional neural networks for disease diagnosis and build an application around this. When I use "we" in this article, I'm referring to the joint partnership between our two companies.

Designing the application

The first step was to determine the shape and capabilities of the end application. We decided to focus on a single crop at first, with the ability to quickly expand to others once the process had been established. Corn was a natural choice for our first target, as it is the dominant crop in the Midwest and greater United States.

We identified 14 diseases and one instance of non-disease damage that we would be able to train the system to detect. Detection wasn't enough, though, so we wanted to provide encyclopedic information about a disease once it was detected, as well as clear indications of the danger it posed to a field and the steps to mitigate this danger.

The application had to work in areas with no or limited network connectivity, which is the case for many farms. Perceptual Labs' focus on performing machine vision on device, with no server-side component, meant that our technology was ideally suited for this. By training a convolutional neural network to detect these diseases, and then performing inference on live camera video, a farmer or contractor would be able to simply point their mobile device at a corn leaf and get an immediate readout of what it detects.

Once we had a good idea of what the application would need to do, we started building up the components required to make it work.

Aggregating a dataset for training

When training a convolutional neural network for classification or object detection using supervised learning, a sizable dataset of training images is needed. How large and diverse a dataset we needed was an early question, followed closely by where we could obtain one. Much of the work you'll see out there involving convolutional neural networks tends to be based on a few generic publicly available datasets, like ImageNet. That's fine if the things you want to detect are contained within those datasets, but for most applications you'll need something that is more targeted.

We had assumed that it would be fairly easy to find images for common corn diseases at land grant universities and others with strong agricultural programs, but until now most didn't have a need to capture hundreds of photos of specific diseases. A few images were good enough for educational and research purposes, and many of those were taken in artificial conditions. As we've found in other cases, good datasets were lacking because people couldn't have anticipated the needs of training machine learning systems.

Therefore, we quickly turned to acquiring training imagery ourselves. We aggregated what we could from universities and other agricultural partners, but needed to capture a lot more imagery from the field to make this viable. To address this, we built a system into our beta testing application: when disease was detected in the field, the application would automatically label it and upload an image to add to our dataset. At the same time, users could manually capture images and upload them with a single click, or directly inform us when something was misdiagnosed.

Distributing this among our beta testers let us gather hundreds of images from a single disease outbreak, and we could re-train our neural networks on a nearly daily basis with these additions, continually improving accuracy and eliminating false positives, one by one. Even though the corn growing season has come to an end in the Midwest, we're still processing all the images we captured right up to harvest.

Network design and training

While convolutional neural networks generalize very well to wide ranges of problems, we've found that a little tuning doesn't hurt when targeting specific real-world problems. Most published image classification convolutional networks have been built and benchmarked around the ImageNet ILSVRC 2012 dataset. While that provides a useful baseline, with a wide variety of images and classification categories, I'm of the opinion that people are maybe micro-optimizing for this dataset at the expense of other cases. For example, a recent study showed some of the problems I've seen in ILSVRC 2012, such as misclassified images, multiple classes in the same image, and so on.

When evaluating the performance of our neural networks, we wanted to make sure we were as rigorous as we could be, so we cultivated a diverse and challenging validation set of images for each disease category, as well as a large number of cases with no disease, no corn, or even no plants to test against. We made sure that every image was of something the network had never seen before, and they encompassed the wide range of lighting and environmental conditions you'd see in the field. We also made sure that all validation runs were performed on an iOS device running a modified version of our application, because differences in GPU floating point precision can cause subtle differences in classification.

As a result of this, our validation test accuracies provided a strong reflection of how our convolutional networks would perform out in the field.

During training, we continually evaluated runtime performance on a mobile device alongside accuracy. We tried to balance the accuracy of the network with speed, finding a sweet spot where we would only realize trivial accuracy gains by making the network much slower. We profiled and found bottlenecks in the network designs that slowed things down but didn't aid accuracy, and gradually thinned those out.

Testing in the field

We can perform all the validation we want on static images, but none of that matters if the product isn't usable in the real world. This particular application posed some unique testing challenges, because you literally had to go into the field to try it out. This outdoors environment also posed some interesting design challenges, as well.

Back before the launch of the iPhone App Store, I had a fascinating discussion with Craig Hockenberry of the Iconfactory, whose Twitter client Twitterrific had just won one of the first iPhone-oriented Apple Design Awards. He had commented that a driver for the dark background and light text of Twitterrific was that they had found it was more legible outdoors in bright sunlight. 

We originally had a more traditional iOS-7-style light interface, with a white background and black text, but farmers found that to be harder to read in bright sunlight when wandering rows of corn. Similarly to Craig's experience, we tested out a dark interface, and people immediately found it to be more legible in the field. We're also working to make sure that text is large enough and icons are clear enough to be read by the majority of our users who are outside of the under-20 demographic many iOS applications seem to be designed for.


Functionally, we originally built the application to perform image classification on a live video stream. Given a frame of video, it would tell us whether the camera sees undamaged corn, one of the corn diseases, corn that's too far to make out, a plant that isn't corn, or no plant at all. Our original design had detailed confidence percentages and alternative diagnoses that continually updated with incoming camera video.

While we thought this extra information would be useful, it turned out to be more of a distraction when used in the field, and confidence percentages didn't help in making actual diagnostic decisions. We simplified the readout to just show the current best diagnosis (if there was one), with any alternative above a certain confidence threshold shown below that. We also worked to prevent diagnoses from bouncing around from frame to frame when results were close.
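While the application itself runs on iOS, the smoothing behavior described above can be sketched in a few lines of Python. This is an illustrative reimplementation rather than the app's actual code; the class name, confidence threshold, and window size are all assumptions:

```python
from collections import deque

# Illustrative sketch of the two behaviors described above: only report
# a best diagnosis over a confidence threshold, and require the same
# winner for several consecutive frames before the displayed result
# changes, so the readout doesn't bounce between close diagnoses.

class DiagnosisSmoother:
    def __init__(self, threshold=0.6, hold_frames=5):
        self.threshold = threshold
        self.recent = deque(maxlen=hold_frames)
        self.displayed = None

    def update(self, frame_scores):
        """frame_scores: dict of label -> confidence for one video frame."""
        best_label, best_score = max(frame_scores.items(), key=lambda kv: kv[1])
        candidate = best_label if best_score >= self.threshold else None
        self.recent.append(candidate)
        # Only switch the displayed diagnosis once the same candidate
        # has won for every frame in the window.
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            self.displayed = self.recent[0]
        return self.displayed
```

A single noisy frame with a different top result leaves the displayed diagnosis unchanged, which is the de-bouncing effect described above.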


This worked reasonably well, but we found some drawbacks with this classification approach. As mentioned above, the text results could bounce back and forth between very close diagnoses, making it hard to read when moving the phone across a leaf. The text was at the bottom of the screen, below the camera view that you were using to line up the corn leaf. You received no information about where the system detected disease, so you had to scan around to find where the disease was. 

Finally, image classification does a terrible job with images that contain multiple classes, which in our case could mean multiple diseases. This impacted both training and testing, preventing us from using many images containing multiple diseases in training and validation. It then reduced accuracy in the field when these multiple disease cases (which are not uncommon) were encountered. A single leaf could exhibit two different diseases as well as damage from fertilizer application all at once, and a single-category classification can't sufficiently express that.


At about this time, Perceptual Labs had just gotten our object detection networks and training operational, so we decided to see if that could be a solution to these issues. Areas of disease aren't your traditional objects (like cars or people) that have well-defined shapes and boundaries. Disease lesions could take on multiple sizes and shapes. Would object detection even work for this? Turns out it does, and it works very well.

We then shifted our efforts from a product based around image classification convolutional neural networks to one using these newer object detection networks. This took a lot of work on the dataset side, requiring us to go back through and manually label areas of disease, a process we're still refining.

By using object detection, we were able to transform the application from providing a simple text readout of what it sees to labeled bounding boxes within the live camera video that show you exactly where disease was found. These boxes track and scale as you move the camera around the leaf. This both simplified the application interface and provided a lot more information to the user. It also significantly reduced our rates of false positives, while ultimately matching the accuracy of our previous image classification networks.

To our knowledge, this is the first case of object detection being used on live camera video to detect and localize disease, particularly on a mobile device. We're very proud of how this application has turned out, and in-the-field testing has been key in building a very useful product.

Looking forward

While development and data-gathering progressed, harvest loomed as a hard deadline. I'd drive past cornfields on my way to work and watch the plants grow as a kind of real-world progress bar. We used the entire growing season for data gathering and testing, right up to the day before harvest in many fields. We're still processing all the imagery and test results from the season, and the application is out in the hands of many beta testers as we enhance its capabilities.

We're very excited about the capabilities of this application, and we believe it will provide a unique solution to common agricultural problems. If you would like to see it in action, or are interested in talking with us about the use of this product or technology, feel free to contact us.

Again, to read more about the application, please visit the website.

Realtime object detection on iOS

As an example of the work we've been doing at Perceptual Labs, the following video shows realtime object detection and localization, captured live from an iPhone:



We designed a custom convolutional neural network that can recognize classes of objects and identify their location in an image. This network processes video frames at over 30 frames per second and identifies multiple objects and their locations in each frame. It's fast enough to recognize and track arbitrary objects on live video from an iPhone camera. You'll notice that it isn't limited to tracking a specific object, but it can detect broad classes of objects like people or cars.

The above examples did not use Core ML, and instead used a higher-performance custom inference framework that we've developed in-house. In fact, we've developed custom tools and code to assist with or enable every step of the process, from dataset aggregation through network design and training to deployment on device.

I'd like to talk about some of the work we did in order to create this example.

Object detection and localization

As I talked about previously, I'm extremely excited about the advancements being made in machine vision right now. In particular, technologies like Apple's Metal Performance Shaders and Core ML are making it relatively easy to deploy trained convolutional neural networks on mobile devices.

Almost all of the examples you will see of people applying convolutional neural networks to images will be performing image classification. Image classification is a process where a network takes in an image and attempts to determine the single best class out of a list to describe it, usually with an accompanying confidence score.
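As a toy sketch (the scores and class labels here are invented, not from any particular framework), the final step of classification just converts the network's raw per-class scores into confidences and reports the single best class:

```python
import numpy as np

# Toy illustration of image classification output: the network produces
# one raw score per class; softmax turns those into confidences, and the
# single highest-confidence class is reported.

def classify(scores, labels):
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return labels[best], float(probs[best])

label, confidence = classify(np.array([2.0, 1.0, 0.1]), ["cat", "dog", "corn"])
print(label, round(confidence, 2))
```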

Image classification, while still a complex problem to solve, is not quite as difficult as object detection and localization. For object detection, instead of a single class that describes an entire image, you are attempting to find any instances of one or more object types within an image and accurately describe their positions. There might be no objects of interest in a scene, or there might be hundreds of objects of varying sizes and shapes representing dozens of classes.

When designing a process using convolutional neural networks to perform object detection, there are several steps:

  • Convolutional network design
  • Dataset aggregation and formatting
  • Network training
  • Deployment (inference)

Convolutional network design

There are a few major ways that people have approached object detection using convolutional neural networks. Image classification can be performed within a sliding window over an image (such as with OverFeat). Alternatively, a second neural network can be trained to propose regions on which classification is performed (as performed by R-CNN, Fast R-CNN, and Faster R-CNN). 

Finally, recent research has demonstrated that a single convolutional neural network can be trained end-to-end to take in images and simultaneously localize objects and determine their class (examples of this are the SSD, YOLO and YOLOv2, SqueezeDet, and DetectNet architectures). These single-shot object detection networks are the only ones at present that are fast enough to run on a mobile device. They are also significantly easier to train than other approaches, because the entire network can be trained at one time.

For reasons that I'll describe later, I've been very impressed with the capabilities of Nvidia's DIGITS software for training networks using the Caffe framework. Nvidia created an object detection network called DetectNet and customized their fork of Caffe to allow for it to be trained on labeled images. It works by splitting images into a grid and calculating the likelihood that a bounding box for an object of a specific class is centered on a grid rectangle. In parallel, it calculates proposed bounding box offsets for objects in each grid rectangle. From there, the two can be combined to determine where objects of various types are located within an image.
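The decoding step described above can be sketched roughly as follows. This is a simplified illustration of the grid-based scheme, not DetectNet's actual implementation; the array shapes, stride, and threshold are assumptions:

```python
import numpy as np

# Simplified sketch of combining a DetectNet-style coverage map with
# per-cell bounding box offsets. For each cell of a GRID_H x GRID_W
# grid, the network emits:
#   - coverage[y, x]: likelihood that an object's box is centered there
#   - bbox[y, x]: (x1, y1, x2, y2) corner offsets relative to the cell

GRID_W, GRID_H = 16, 9   # grid cells across the input image
STRIDE = 16              # input pixels covered by each grid cell

def decode_detections(coverage, bbox, threshold=0.5):
    """Combine the coverage map and bbox offsets into detections."""
    detections = []
    for gy in range(GRID_H):
        for gx in range(GRID_W):
            if coverage[gy, gx] < threshold:
                continue
            cx, cy = gx * STRIDE, gy * STRIDE  # cell origin in pixels
            dx1, dy1, dx2, dy2 = bbox[gy, gx]
            detections.append((cx + dx1, cy + dy1, cx + dx2, cy + dy2,
                               float(coverage[gy, gx])))
    return detections

coverage = np.zeros((GRID_H, GRID_W))
bbox = np.zeros((GRID_H, GRID_W, 4))
coverage[4, 8] = 0.9
bbox[4, 8] = (-20, -10, 20, 10)  # a box extending around the cell origin
print(decode_detections(coverage, bbox))
```

In practice, a clustering or non-maximum-suppression pass then merges detections from neighboring cells that cover the same object.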

Unfortunately, the stock DetectNet design (which uses a GoogLeNet-style network at its core) is too computationally expensive to deploy directly to mobile devices, particularly for use with live video. We needed a different network design.

There hasn't been as much research into object detection networks as into image classification ones, but thankfully many of the lessons learned from the latter apply to the former. At their heart, most object detection networks have similar internal designs and perform similar functions to image classification networks.

Taking lessons we learned from designing custom image classification networks for mobile devices, we designed a more compact network that had similar inputs and outputs to DetectNet so that we could train it on custom datasets using DIGITS and Caffe. It uses a small fraction of the total calculations required for DetectNet, but still can identify and place broad classes of objects in images.

Matthijs Hollemans provides a great walkthrough of how he got a variant of the YOLO object detection network (Tiny YOLO) working on iOS, as well as how the YOLO network operates. YOLO uses a slightly different output architecture than the DetectNet base we worked from, but is also a single-shot object detector. Even the stripped-down Tiny YOLO network design only runs at 8-10 FPS on an iPhone 7, whereas with a little work we were able to design a custom architecture of similar accuracy that runs at more than 30 FPS on the same hardware. The YOLO architecture also relies on the custom Darknet training framework and required multiple steps to convert to a format usable on iOS, whereas we are able to drop our Caffe-trained models directly into our iOS, Mac, and (soon) Linux applications.

Dataset aggregation and formatting

Image classification datasets are relatively simple to structure, usually consisting of a directory for each class in which you place all example images for that class. Object detection training sets need a lot more information, because you somehow have to specify which objects are within each training image, along with the coordinates of a bounding box or other shape describing their positions. Each image might contain no objects at all, or an arbitrarily large number.

There isn't a generally accepted standard for this, so different training tools expect labels and images in different formats, and publicly available datasets have their own formats. In our case, we wanted to use the PASCAL Visual Object Classes dataset as a starting point to train our networks, but needed to write a tool to translate between their XML-based annotations and the text labels needed for Nvidia's DIGITS.

For your own custom datasets (such as the ones we've been developing internally for upcoming applications), you'll need a way to label classes and bounding boxes for your objects. There aren't a lot of great solutions out there for this, so this is also an area where we're investing in some bespoke in-house tools. I will say that of the tools that are out there, Ryo Kawamura's free RectLabel (on the Mac App Store) does a lot of what you'll need, and has a well-thought-out interface.

Network training

I mentioned it before, but my current favorite convolutional network training framework is Caffe driven by Nvidia's DIGITS. I prefer Caffe's domain-specific language for network design that abstracts away a lot of boilerplate code, and their binary format for trained networks nicely encapsulates network design and all of your trained weights. Nvidia's DIGITS is an excellent graphical tool for managing datasets, setting up and enqueuing training runs, and analyzing training and network performance.

If you want an introduction to how DIGITS works for training, Nvidia has plenty of training material online and Reza Shirazian provided a recent tutorial on how to use DIGITS with AWS to train a Caffe model for use with Core ML.

If you're at all serious about training your own convolutional networks, I would highly recommend investing in your own hardware to do so. You'll pay back your investment within weeks by avoiding AWS hourly rates, and be able to develop and test ideas faster. For our use, I built a system with two liquid-cooled Nvidia GTX 1080s and a liquid-cooled CPU:


The two GPUs gave us the ability to test out multiple network designs at the same time, or simultaneously train on different datasets. Continuously running at near 100% load on the CPU and GPU was causing thermal issues and instability in a previous computer, thus the liquid cooling. That's certainly not necessary with proper air cooling, but I wanted to make sure this system would be rock-solid under load (at one point, we'd run both GPUs and CPU at near 100% load for three months straight).

The conditions under which you perform training can have as much of an impact as your network design. Pick too high of a learning rate, and the network flies off the rails. Too low, and it never converges. Do you use batch normalization or not? What do you do for data augmentation? There are a huge number of variables here, and sometimes experimentation is the only way to find out what works. This is another reason why I like to have a multi-GPU training system of my own to quickly test modifications.
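To make some of those variables concrete, here's the general shape of a Caffe solver definition, which is where the learning rate and its decay schedule live; the specific values below are illustrative, not recommendations:

```
# solver.prototxt (illustrative values only)
net: "train_val.prototxt"
base_lr: 0.01          # too high and loss diverges; too low and it stalls
lr_policy: "step"      # multiply the learning rate by gamma every stepsize iterations
gamma: 0.1
stepsize: 33000
momentum: 0.9
weight_decay: 0.0005
max_iter: 100000
snapshot: 5000
solver_mode: GPU
```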

Deployment to device

The process of running a network against input data to provide results is called inference. Once you have a trained neural network, you need to be able to perform inference with it inside your application. Apple has provided an easy-to-use solution for Mac and iOS with Core ML, but we used our own framework for this. We did this partly because we had already developed our own framework months before Core ML was announced, but largely because our framework has some advantages over Core ML.

Our framework was written in Swift as a high-level abstraction that can parse common convolutional network file formats directly and build the structures needed to perform inference at runtime. No translation or preprocessing is needed, and network designs can be sent directly from a training computer to an iOS application without recompilation. Having to write any code that's custom to a specific network design makes it that much harder to experiment with new architectures, something I do on a regular basis.

Even with Core ML out there now as a way to do this, we're still working on our own custom framework for a few reasons. One is performance, where we've seen up to 30% faster inference times on our custom framework vs. Core ML. Another is flexibility, because we can add or implement different network layer types, inputs, or outputs as we need them instead of being locked into what Core ML supports. 

Finally, there's device flexibility. For iOS, Core ML only runs on iOS 11, and the Metal Performance Shaders it uses to perform GPU-accelerated convolutional network calculations only work on A8 devices or newer. This means that either your networks won't work on A7-class devices (iPad Air, iPhone 5S, etc.) or will be incredibly slow as they fall back to CPU-side calculations. This presents a problem when submitting an application to the App Store, because there currently is no way to require A8-class devices as a minimum specification. Also, I've so far talked only about Mac and iOS devices, but what about embedded Linux devices and other platforms?

To target a much broader range of devices, we've built an OpenGL-based inference engine that leverages the structures I created for GPUImage. This lets us run GPU-accelerated convolutional neural networks on many more Mac and iOS devices than those supported by Metal Performance Shaders, and we'll soon be extending that to embedded Linux devices and beyond.

Bringing it all together

For the video above, we started with the PASCAL VOC 2012 dataset and our custom DetectNet-compatible object detection network. We converted the dataset into the image and label format expected by Nvidia's DIGITS and used DIGITS and Caffe to train our network. We then built a sample application using our convolutional network framework and dropped the trained Caffe network file into it. The application just had to feed camera frames into the framework and get back for each frame a list of objects, their class, and the normalized bounding boxes for them within the video frame.

All the application had to do at that point was to pipe the video frames to the screen and draw the labeled bounding boxes over that feed. The result is what you see in the video. Unfortunately, QuickTime's screen recording couldn't keep up with the device's video display rate, so the above video isn't as smooth as it appeared on device.
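The coordinate math for that drawing step is simple scaling; here's a small sketch, where the (x, y, width, height) normalized-box convention is an assumption for illustration:

```python
# Scale a bounding box in normalized (0..1) coordinates, as returned
# per frame by the inference framework, to the pixel dimensions of the
# on-screen video view. The (x, y, w, h) convention is illustrative.

def normalized_to_pixels(box, view_width, view_height):
    x, y, w, h = box
    return (x * view_width, y * view_height, w * view_width, h * view_height)

# A box covering the center quarter of a 640x480 view:
print(normalized_to_pixels((0.25, 0.25, 0.5, 0.5), 640, 480))
```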

As you can tell, I'm very excited about the potential applications of realtime arbitrary object detection on live video in portable and embedded devices. This enables a range of capabilities that didn't exist before, and we're only starting to explore the use cases. We've been working on one such application that we think could have a large impact, and we should have more to say about that soon.

Introducing Perceptual Labs

At Perceptual Labs, we apply leading-edge machine vision techniques to solve real-world problems on mobile devices and embedded computers. Machine vision is currently undergoing a revolution as a result of advancements in convolutional neural networks (CNNs), and they are a large part of the technologies in use at Perceptual Labs. Huge advances are being made every day in problems previously considered intractable, all through the use of CNNs or related technologies.

Why are they so useful? To answer that, I'd like to talk about how people have traditionally approached machine vision challenges.

Human-defined machine vision

Machine vision is the process of having a computer extract useful information from an image, video, or camera feed. That information can range from the simple, like what the average color of an image is, to the complex, like how many people were standing in front of a camera and where they are. We often take for granted how easy it is for us to look at a scene and understand what's present and where. However, try to write code to replicate even a small part of what people do every second and you'll find that this is one of the most challenging areas in computer science.

For decades, programmers have directly designed algorithms to operate on images and pull information from them. Many techniques have been developed to detect edges, corners, circles, lines, and other shapes. Points of interest and descriptors created from looking at the area around those points were used to recognize, match, and track various objects.

I've been personally fascinated by the acceleration of these machine vision algorithms using graphics processing units (GPUs). GPUs are built to handle massive amounts of parallel calculations on independent data, such as pixels in an image. Once relatively fixed in function, they've become more and more programmable to the point where modern GPUs can run fairly complex custom programs on large quantities of data.

In particular, I believed that GPUs had the potential to make machine vision practical on low power mobile and embedded computers. These camera-equipped handheld computers most people carry with them every day have so much untapped potential for helping to understand the world around us.

To that end, back in 2010 I experimented with GPU-accelerated video processing on the iPhone. In 2012, I built an open source framework around this called GPUImage to simplify GPU-accelerated image processing on iOS (and later Mac and Linux). I never could have predicted how widely this framework would be adopted, and among the diverse applications of it were many attempts at accelerating machine vision operations.

Convolutional neural networks

For tasks like object recognition and tracking, people had first attempted to design algorithms and operations for the entire process from taking in an image to delivering a result. I did this myself in pursuit of several problems, with mixed results.

Others attempted to use machine learning in various ways to solve parts of the problem. In the case of image classification, a popular approach was to use human-designed feature extractors (edge detection, feature detectors like SIFT, SURF, FAST, etc.) and machine learning for the final classification stages. It was generally assumed that a human had to be guiding at least some part of the design to create a competitive solution.

At about the time that I was starting to work on GPUImage in 2012, this all began to change. The annual ImageNet image classification challenge is a central place to gauge the state of the art in image classification software. In 2012, the winning entry used a convolutional neural network as described by Krizhevsky, Sutskever, and Hinton in their seminal paper "ImageNet classification with deep convolutional neural networks". Instead of having a feature extractor or classifier defined by a human, the entire network was trained end to end on image data. It not only beat the human-designed entries; the others weren't even close.

Neural networks as a concept have been around about as long as computing itself. They've gone through waves of popularity, each time coming back down to Earth after something limited their practical application. The first time, it was because single-layer perceptrons provably couldn't solve certain problems (famously, the XOR function). In the 1990s, interest grew around multi-layer neural networks. However, in a traditional neural network design, every neuron in a layer is connected to every neuron in the previous layer, and each of those connections has a separate weight to be learned. This leads to a combinatorial explosion of weights, which proved impractical to train and had a tendency to overfit training data. Interest in neural networks faded again.

A convolutional neural network is designed to address these shortcomings. Rather than connect every neuron between layers, it was found that you could use a fixed number of weights and simply move a window across a previous layer (perform a convolution) to generate output for the next layer. This reduced the number of weights required by orders of magnitude and let you build very deep networks (ones composed of many layers). This also turned out to be an operation that GPUs were ideally suited to run. For example, all of the edge detection operations I've written in GPUImage use convolutions and they can run in sub-millisecond times on the latest iOS GPUs.
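The core operation can be sketched in a few lines of Python. This toy example slides a 3x3 Sobel kernel (the classic edge-detection convolution) across an image; note that the same nine weights are reused at every position, which is exactly why convolutional layers need so few parameters:

```python
import numpy as np

# A toy 2D convolution: one small set of shared weights (a 3x3 kernel)
# slides across the input, so the number of parameters is fixed
# regardless of image size. The Sobel kernel below responds strongly
# to vertical edges.

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # Multiply the window under the kernel elementwise and sum
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# A vertical edge: dark on the left half, bright on the right.
image = np.repeat([[0, 0, 0, 1, 1, 1]], 6, axis=0).astype(float)
print(convolve2d(image, sobel_x))  # large values only where the edge sits
```

In a convolutional layer, the kernel weights aren't hand-chosen like this Sobel filter; they're learned during training, and each layer applies many such kernels in parallel.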

For a great overview of the history of convolutional neural networks and their applications, I highly recommend the Nature review paper "Deep learning" by three of the pillars in the field: Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.

I'll admit that at first, I wasn't convinced of the value of these techniques. I'd witnessed the problems of the mid-90's wave of interest in neural networks and been frustrated by their limitations in projects I'd worked on then. It took trying to solve an object recognition problem using FAST, SIFT, and related feature extractors, having that succeed less than 5% of the time, then testing out a stock CNN and seeing it work more than 90% of the time for me to sit up and pay attention. When I found my hand-coded processes unable to beat the results of even simple networks on complex problems, I was sold on their value.

Convolutional neural networks and the general techniques of deep learning (the design and training of neural networks with many layers) have proven to be applicable to an incredibly diverse range of problems. At Perceptual Labs, our focus has been on image processing on mobile devices and embedded hardware. We've been taking the lessons I learned in optimizing traditional machine vision algorithms for mobile GPUs and applying them to the design and deployment of CNNs on these low-power devices.

We've been able to perform image classification and object recognition:

The above examples are on live video from an iPhone using convolutional neural networks that we've designed and trained ourselves, deployed via an in-house software framework that works across iOS, Mac, and soon Linux. Our first public application using this technology is launching soon, and we'll be talking more about that and our other work in this blog.

I'm tremendously excited about the current pace of innovation in machine vision. If your organization has a unique, ambitious application in mind that would be enabled by high performance machine vision on mobile or embedded devices, and is looking for a partner to work with on this, please contact us.