Realtime object detection on iOS

As an example of the work we've been doing at Perceptual Labs, the following video shows realtime object detection and localization, captured live from an iPhone:



We designed a custom convolutional neural network that can recognize classes of objects and identify their location in an image. This network processes video frames at over 30 frames per second and identifies multiple objects and their locations in each frame. It's fast enough to recognize and track arbitrary objects on live video from an iPhone camera. You'll notice that it isn't limited to tracking one specific object, and can instead detect broad classes of objects like people or cars.

The above examples did not use Core ML, and instead used a higher-performance custom inference framework that we've developed in-house. In fact, we've developed custom tools and code to assist with or enable every step in the process, from dataset aggregation through network design and training to deployment on device.

I'd like to talk about some of the work we did in order to create this example.

Object detection and localization

As I talked about previously, I'm extremely excited about the advancements being made in machine vision right now. In particular, technologies like Apple's Metal Performance Shaders and Core ML are making it relatively easy to deploy trained convolutional neural networks on mobile devices.

Almost all of the examples you will see of people applying convolutional neural networks to images will be performing image classification. Image classification is a process where a network takes in an image and attempts to determine the single best class out of a list to describe it, usually with an accompanying confidence score.
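To make the "single best class with a confidence score" idea concrete, here is a minimal Python sketch. It assumes the network's final layer emits one raw score per class and that a softmax turns those scores into confidences, which is common but not universal. The function name, scores, and class list are made up for illustration.

```python
import math

def classify(scores, class_names):
    """Return (best_class, confidence) from raw per-class scores via softmax."""
    # Subtract the largest score before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    # The predicted class is simply the one with the highest probability.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]

# Hypothetical scores for a three-class problem.
label, confidence = classify([1.0, 3.0, 0.5], ["person", "car", "dog"])
print(label, round(confidence, 3))
```

Note that the whole image gets exactly one answer, no matter how many objects are actually in it.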

Image classification, while still a complex problem to solve, is not quite as difficult as object detection and localization. For object detection, instead of a single class that describes an entire image, you are attempting to find any instances of one or more object types within an image and accurately describe their positions. There might be no objects of interest in a scene, or there might be hundreds of objects of varying sizes and shapes representing dozens of classes.

When designing a process using convolutional neural networks to perform object detection, there are several steps:

  • Convolutional network design
  • Dataset aggregation and formatting
  • Network training
  • Deployment (inference)

Convolutional network design

There are a few major ways that people have approached object detection using convolutional neural networks. Image classification can be performed within a sliding window over an image (such as with OverFeat). Alternatively, a second neural network can be trained to propose regions on which classification is then performed (as in R-CNN, Fast R-CNN, and Faster R-CNN).

Finally, recent research has demonstrated that a single convolutional neural network can be trained end-to-end to take in images and simultaneously localize objects and determine their class (examples of this are the SSD, YOLO and YOLOv2, SqueezeDet, and DetectNet architectures). These single-shot object detection networks are the only ones at present that are fast enough to run on a mobile device. They are also significantly easier to train than other approaches, because the entire network can be trained at one time.

For reasons that I'll describe later, I've been very impressed with the capabilities of Nvidia's DIGITS software for training networks using the Caffe framework. Nvidia created an object detection network called DetectNet and customized their fork of Caffe to allow for it to be trained on labeled images. It works by splitting images into a grid and calculating the likelihood that a bounding box for an object of a specific class is centered on a grid rectangle. In parallel, it calculates proposed bounding box offsets for objects in each grid rectangle. From there, the two can be combined to determine where objects of various types are located within an image.
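The combination step described above can be sketched roughly as follows. This is an illustration of the general grid idea only, not DetectNet's actual layer code. The 16-pixel stride, the 0.5 threshold, the offset layout (left, top, right, bottom in pixels relative to the cell center), and the function name are all assumptions chosen for the example.

```python
def decode_grid(coverage, offsets, stride=16, threshold=0.5):
    """Combine a per-cell coverage map with per-cell box offsets into boxes.

    coverage[row][col]: likelihood that an object is centered on that grid cell.
    offsets[row][col]:  (left, top, right, bottom) in pixels, relative to the
                        center of that cell (an assumed layout).
    Returns a list of (score, x_min, y_min, x_max, y_max) tuples.
    """
    detections = []
    for row, coverage_row in enumerate(coverage):
        for col, score in enumerate(coverage_row):
            if score < threshold:
                continue  # no confident object centered on this cell
            center_x = (col + 0.5) * stride
            center_y = (row + 0.5) * stride
            left, top, right, bottom = offsets[row][col]
            detections.append((score,
                               center_x + left, center_y + top,
                               center_x + right, center_y + bottom))
    return detections

# Hypothetical 2x2 grid for a single class: only the top-right cell fires.
coverage = [[0.1, 0.9],
            [0.2, 0.0]]
offsets = [[(-10, -6, 10, 6)] * 2 for _ in range(2)]
detections = decode_grid(coverage, offsets)
print(detections)
```

A real pipeline would also need to merge overlapping candidates from neighboring cells, which this sketch leaves out.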

Unfortunately, the stock DetectNet design (which uses a GoogLeNet-style network at its core) is too computationally expensive to deploy directly to mobile devices, particularly for use with live video. We needed a different network design.

There hasn't been as much research into object recognition networks as into image classification ones, but thankfully many of the lessons learned from the latter apply to the former. At their heart, most object recognition networks have similar internal designs to image classification networks and perform similar functions.

Taking lessons we learned from designing custom image classification networks for mobile devices, we designed a more compact network that had similar inputs and outputs to DetectNet so that we could train it on custom datasets using DIGITS and Caffe. It uses a small fraction of the total calculations required for DetectNet, but still can identify and place broad classes of objects in images.

Matthijs Hollemans provides a great walkthrough of how he got a variant of the YOLO object detection network (Tiny YOLO) working on iOS, as well as how the YOLO network operates. YOLO uses a slightly different output architecture than the DetectNet-based design we worked from, but is also a single-shot object detector. Even the stripped-down Tiny YOLO network design only runs at 8-10 FPS on an iPhone 7, whereas with a little work we were able to design a custom architecture of similar accuracy that runs at more than 30 FPS on the same hardware. The YOLO architecture also uses the custom Darknet training framework and required multiple steps to convert to a format usable on iOS, whereas we are able to drop our Caffe-trained models directly into our iOS, Mac, and (soon) Linux applications.

Dataset aggregation and formatting

Image classification datasets are relatively simple to structure, usually consisting of a directory for each class in which you place all example images for that class. Object detection training sets need a lot more information, because you somehow have to specify what objects are within each training image and the coordinates of a bounding box or other shape determining their position. Each image might have no objects within it or an arbitrarily large number.

There isn't a generally accepted standard for this, so different training tools expect labels and images in different formats, and publicly available datasets have their own formats. In our case, we wanted to use the PASCAL Visual Object Classes dataset as a starting point to train our networks, but needed to write a tool to translate between their XML-based annotations and the text labels needed for Nvidia's DIGITS.
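A translation of this kind might look something like the sketch below. It is not the tool described above, just an illustration. It assumes the standard PASCAL VOC XML layout (object, name, and bndbox elements) and the 15-field KITTI-style text labels that DIGITS reads for object detection. The function name is hypothetical, and filling the unused 3D fields with zeros is an assumption.

```python
import xml.etree.ElementTree as ET

def voc_to_kitti_lines(xml_text):
    """Convert one PASCAL VOC annotation into KITTI-style label lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for obj in root.iter("object"):
        name = obj.findtext("name").strip()
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Fields: type, truncated, occluded, alpha, bbox (4 values),
        # dimensions (3), location (3), rotation_y. Only the type and
        # bbox carry information here; the rest are zero placeholders.
        lines.append(
            f"{name} 0.0 0 0.0 {xmin:.2f} {ymin:.2f} {xmax:.2f} {ymax:.2f} "
            "0.0 0.0 0.0 0.0 0.0 0.0 0.0"
        )
    return lines

# A made-up, minimal VOC-style annotation with a single object.
sample = """<annotation>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>"""
label_lines = voc_to_kitti_lines(sample)
print("\n".join(label_lines))
```

An image with no objects of interest simply produces an empty list, which corresponds to an empty label file.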

For your own custom datasets (such as the ones we've been developing internally for upcoming applications), you'll need a way to label classes and bounding boxes for your objects. There aren't a lot of great solutions out there for this, so this is also an area where we're investing in some bespoke in-house tools. I will say that of the tools that are out there, Ryo Kawamura's free RectLabel (on the Mac App Store) does a lot of what you'll need, and has a well-thought-out interface.

Network training

I mentioned it before, but my current favorite convolutional network training framework is Caffe driven by Nvidia's DIGITS. I prefer Caffe's domain-specific language for network design that abstracts away a lot of boilerplate code, and their binary format for trained networks nicely encapsulates network design and all of your trained weights. Nvidia's DIGITS is an excellent graphical tool for managing datasets, setting up and enqueuing training runs, and analyzing training and network performance.

If you want an introduction to how DIGITS works for training, Nvidia has plenty of training material online and Reza Shirazian provided a recent tutorial on how to use DIGITS with AWS to train a Caffe model for use with Core ML.

If you're at all serious about training your own convolutional networks, I would highly recommend investing in your own hardware to do so. You'll pay back your investment within weeks by avoiding AWS hourly rates, and be able to develop and test ideas faster. For our use, I built a system with two liquid-cooled Nvidia GTX 1080s and a liquid-cooled CPU.


The two GPUs gave us the ability to test out multiple network designs at the same time, or simultaneously train on different datasets. Continuously running at near 100% load on the CPU and GPU was causing thermal issues and instability in a previous computer, thus the liquid cooling. That's certainly not necessary with proper air cooling, but I wanted to make sure this system would be rock-solid under load (at one point, we'd run both GPUs and CPU at near 100% load for three months straight).

The conditions under which you perform training can have as much of an impact as your network design. Pick too high of a learning rate, and the network flies off the rails. Too low, and it never converges. Do you use batch normalization or not? What do you do for data augmentation? There are a huge number of variables here, and sometimes experimentation is the only way to find out what works. This is another reason why I like to have a multi-GPU training system of my own to quickly test modifications.
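The learning rate tradeoff is easy to see even on a toy problem. The sketch below is not a neural network. It runs plain gradient descent on the one-parameter function f(w) = w², and the three learning rates are arbitrary values picked to show each failure mode.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w

too_high = descend(1.1)    # every step overshoots the minimum and grows in magnitude
too_low = descend(0.001)   # barely moves away from the starting point
reasonable = descend(0.3)  # shrinks quickly toward the minimum at 0

print(too_high, too_low, reasonable)
```

Real networks have millions of parameters and far less forgiving loss surfaces, which is why finding workable settings usually takes repeated training runs.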

Deployment to device

The process of running a network against input data to provide results is called inference. Once you have a trained neural network, you need to be able to perform inference with it inside your application. Apple has provided an easy-to-use solution for Mac and iOS with Core ML, but we used our own framework for this. We did this partly because we had already developed our own framework months before Core ML was announced, but largely because our framework has some advantages over Core ML.

Our framework was written in Swift as a high-level abstraction that can parse common convolutional network file formats directly and build the structures needed to perform inference at runtime. No translation or preprocessing is needed, and network designs can be sent directly from a training computer to an iOS application without recompilation. Having to write any code that's custom to a specific network design makes it that much harder to experiment with new architectures, something I do on a regular basis.

Even with Core ML out there now as a way to do this, we're still working on our own custom framework for a few reasons. One is performance, where we've seen up to 30% faster inference times on our custom framework vs. Core ML. Another is flexibility, because we can add or implement different network layer types, inputs, or outputs as we need them instead of being locked into what Core ML supports. 

Finally, there's device flexibility. For iOS, Core ML only runs on iOS 11, and the Metal Performance Shaders it uses to perform GPU-accelerated convolutional network calculations only work on A8 devices or newer. This means that either your networks won't work on A7-class devices (iPad Air, iPhone 5S, etc.) or will be incredibly slow as they fall back to CPU-side calculations. This presents a problem when submitting an application to the App Store, because there currently is no way to require A8-class devices as a minimum specification. Also, I've so far talked only about Mac and iOS devices, but what about embedded Linux devices and other platforms?

To target a much broader range of devices, we've built an OpenGL-based inference engine that leverages the structures I created for GPUImage. This lets us run GPU-accelerated convolutional neural networks on many more Mac and iOS devices than those supported by Metal Performance Shaders, and we'll soon be extending that to embedded Linux devices and beyond.

Bringing it all together

For the video above, we started with the PASCAL VOC 2012 dataset and our custom DetectNet-compatible object detection network. We converted the dataset into the image and label format expected by Nvidia's DIGITS and used DIGITS and Caffe to train our network. We then built a sample application using our convolutional network framework and dropped the trained Caffe network file into it. The application just had to feed camera frames into the framework and get back for each frame a list of objects, their class, and the normalized bounding boxes for them within the video frame.

All the application had to do at that point was to pipe the video frames to the screen and draw the labeled bounding boxes over that feed. The result is what you see in the video. Unfortunately, QuickTime's screen recording couldn't keep up with the device's video display rate, so the above video isn't as smooth as it appeared on device.
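As a rough illustration of that last drawing step, normalized bounding boxes have to be scaled to the frame or view size before they can be drawn. The sketch below is in Python for brevity rather than the Swift used in the actual application. The Detection type, its field names, and the (x, y, width, height) box layout are hypothetical, since only the fact that the boxes are normalized is stated above.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # assumed layout: (x, y, width, height), each in the range 0 to 1

def to_pixel_rect(detection, frame_width, frame_height):
    """Scale a normalized box to pixel coordinates for a given frame size."""
    x, y, width, height = detection.box
    return (round(x * frame_width), round(y * frame_height),
            round(width * frame_width), round(height * frame_height))

# Hypothetical detection on a 1280x720 video frame.
person = Detection(label="person", confidence=0.87, box=(0.25, 0.5, 0.5, 0.25))
rect = to_pixel_rect(person, 1280, 720)
print(person.label, rect)
```

Keeping the boxes normalized until the last moment means the same detection results can be drawn over a preview of any size.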

As you can tell, I'm very excited about the potential applications of realtime arbitrary object detection on live video in portable and embedded devices. This enables a range of capabilities that didn't exist before, and we're only starting to explore the use cases. We've been working on one such application that we think could have a large impact, and we should have more to say about that soon.