At Perceptual Labs, we apply leading-edge machine vision techniques to solve real-world problems on mobile devices and embedded computers. Machine vision is currently undergoing a revolution as a result of advancements in convolutional neural networks (CNNs), and they are a large part of the technologies in use at Perceptual Labs. Huge advances are being made every day in problems previously considered intractable, all through the use of CNNs or related technologies.
Why are they so useful? To answer that, I'd like to talk about how people have traditionally approached machine vision challenges.
Human-defined machine vision
Machine vision is the process of having a computer extract useful information from an image, video, or camera feed. That information can range from the simple, like what the average color of an image is, to the complex, like how many people are standing in front of a camera and where they are. We often take for granted how easy it is for us to look at a scene and understand what's present and where. However, try to write code to replicate even a small part of what people do every second and you'll find that this is one of the most challenging areas in computer science.
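To make the "simple" end of that range concrete, here is a minimal sketch of computing the average color of an image. It assumes the image is already decoded into rows of (r, g, b) tuples; the function name and the tiny 2x2 test image are made up for illustration.

```python
def average_color(pixels):
    """Return the mean (r, g, b) of an image given as rows of (r, g, b) tuples."""
    total_r = total_g = total_b = 0
    count = 0
    for row in pixels:
        for r, g, b in row:
            total_r += r
            total_g += g
            total_b += b
            count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return (total_r / count, total_g / count, total_b / count)


# Hypothetical 2x2 image: red, green, blue, and white pixels.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
print(average_color(image))
```

A few lines of arithmetic are enough here. Counting people and locating them has no comparably short recipe, which is the gap the rest of this post is about.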
For decades, programmers have directly designed algorithms to operate on images and pull information from them. Many techniques have been developed to detect edges, corners, circles, lines, and other shapes. Points of interest and descriptors created from looking at the area around those points were used to recognize, match, and track various objects.
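As an illustration of what a directly designed algorithm looks like, here is a rough sketch of Sobel edge detection, one classic hand-built technique. It is a plain-Python toy that assumes a grayscale image stored as a 2D list; it is not taken from any particular library, and the function name and test image are invented for the example.

```python
import math

# Sobel kernels: fixed, human-chosen weights that respond to horizontal
# and vertical changes in brightness.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def sobel_magnitude(gray):
    """Return the gradient magnitude for each interior pixel of a 2D grayscale list.

    Border pixels are left at 0.0 because the 3x3 window does not fit there.
    """
    height, width = len(gray), len(gray[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    value = gray[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * value
                    gy += SOBEL_Y[ky][kx] * value
            out[y][x] = math.hypot(gx, gy)
    return out


# Hypothetical 4x4 test image: dark left half, bright right half.
image = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)
```

Every number in the two kernels was picked by a person. Interior pixels next to the dark-to-bright boundary get a large response, and flat regions get zero.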
I've been personally fascinated by the acceleration of these machine vision algorithms using graphics processing units (GPUs). GPUs are built to handle massive amounts of parallel calculations on independent data, such as pixels in an image. Once relatively fixed in function, they've become more and more programmable to the point where modern GPUs can run fairly complex custom programs on large quantities of data.
In particular, I believed that GPUs had the potential to make machine vision practical on low-power mobile and embedded computers. The camera-equipped handheld computers most people carry with them every day have so much untapped potential for helping us understand the world around us.
To that end, back in 2010 I experimented with GPU-accelerated video processing on the iPhone. In 2012, I built an open source framework around this called GPUImage to simplify GPU-accelerated image processing on iOS (and later Mac and Linux). I never could have predicted how widely this framework would be adopted, and among the diverse applications of it were many attempts at accelerating machine vision operations.
Convolutional neural networks
For tasks like object recognition and tracking, people first attempted to design algorithms and operations for the entire process, from taking in an image to delivering a result. I did this myself for several problems, with mixed results.
Others attempted to use machine learning in various ways to solve parts of the problem. In the case of image classification, a popular approach was to use human-designed feature extractors (edge detection, feature detectors like SIFT, SURF, FAST, etc.) and machine learning for the final classification stages. It was generally assumed that a human had to be guiding at least some part of the design to create a competitive solution.
At about the time that I was starting to work on GPUImage in 2012, this all began to change. The annual ImageNet image classification challenge is a central place to gauge the state of the art in image classification software. In 2012, the winning entry used a convolutional neural network as described by Krizhevsky, Sutskever, and Hinton in their seminal paper "ImageNet classification with deep convolutional neural networks". Instead of having a feature extractor or classifier defined by a human, the entire network was trained from end to end on image data. It didn't just beat the human-designed entries; they weren't even close.
Neural networks as a concept have been around about as long as computing itself. They've gone through waves of popularity, each time coming back down to Earth after something limited their practical application. The first time, it was because single-layer perceptrons provably couldn't solve certain problems. In the 1990s, interest grew around multi-layer neural networks. However, in a traditional neural network design, every neuron in a layer is connected to every neuron in the previous layer, and each of those connections has a separate weight to be learned. The number of weights therefore grows with the product of the layer sizes, which made these networks impractical to train and prone to overfitting their training data. Interest in neural networks faded again.
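To put rough numbers on that growth, here is a back-of-the-envelope sketch. The layer sizes are illustrative assumptions (a 224x224 RGB input feeding 1,000 neurons), not figures from any specific network, and bias terms are ignored.

```python
def dense_weight_count(inputs, outputs):
    """Weights in a fully connected layer: one per (input, output) pair."""
    return inputs * outputs


# Assumed sizes, for illustration only.
input_values = 224 * 224 * 3   # every color value of a 224x224 RGB image
hidden_neurons = 1000

print(input_values)
print(dense_weight_count(input_values, hidden_neurons))
```

Under those assumptions a single fully connected layer already needs more than 150 million weights, before any further layers are added.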
A convolutional neural network is designed to address these shortcomings. Rather than connect every neuron between layers, it was found that you could use a fixed number of weights and simply move a window across a previous layer (perform a convolution) to generate output for the next layer. This reduced the number of weights required by orders of magnitude and let you build very deep networks (ones composed of many layers). This also turned out to be an operation that GPUs were ideally suited to run. For example, all of the edge detection operations I've written in GPUImage use convolutions and they can run in sub-millisecond times on the latest iOS GPUs.
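The sliding-window idea can be sketched in a few lines. This is a minimal single-channel version in plain Python with no padding and a stride of 1; the function names, the toy image, and the 32x32 comparison sizes are assumptions made for illustration. Like most CNN frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` across `image` (both 2D lists), no padding, stride 1.

    The same kernel weights are reused at every window position.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += kernel[ky][kx] * image[y + ky][x + kx]
            row.append(acc)
        output.append(row)
    return output


# Toy 4x4 image and a 2x2 averaging kernel (4 weights in total).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[0.25, 0.25],
          [0.25, 0.25]]
result = conv2d_valid(image, kernel)
for row in result:
    print(row)

# Weight comparison for an assumed 32x32 input and 30x30 output:
conv_weights = 3 * 3                   # one shared 3x3 kernel
dense_weights = (32 * 32) * (30 * 30)  # every input connected to every output
print(conv_weights, dense_weights)
```

In this toy comparison the shared kernel needs 9 weights where the fully connected version needs 921,600. Each output value also depends only on its own window, which is why the work spreads so naturally across a GPU.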
For a great overview of the history of convolutional neural networks and their applications, I highly recommend the Nature review paper "Deep learning" by three of the pillars in the field: Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.
I'll admit that at first, I wasn't convinced of the value of these techniques. I'd witnessed the problems of the mid-'90s wave of interest in neural networks and been frustrated by their limitations in projects I'd worked on then. It took trying to solve an object recognition problem using FAST, SIFT, and related feature extractors, having that succeed less than 5% of the time, then testing out a stock CNN and having that work more than 90% of the time for me to sit up and pay attention. When I found my hand-coded processes unable to beat the results of even simple networks on complex problems, I was sold on their value.
Convolutional neural networks and the general techniques of deep learning (the design and training of neural networks with many layers) have proven to be applicable to an incredibly diverse range of problems. At Perceptual Labs, our focus has been on image processing on mobile devices and embedded hardware. We've been taking the lessons I learned in optimizing traditional machine vision algorithms for mobile GPUs and applying them to the design and deployment of CNNs on these low-power devices.
We've been able to perform image classification and object recognition:
The above examples are on live video from an iPhone using convolutional neural networks that we've designed and trained ourselves, deployed via an in-house software framework that works across iOS, Mac, and soon Linux. Our first public application using this technology is launching soon, and we'll be talking more about that and our other work in this blog.
I'm tremendously excited about the current pace of innovation in machine vision. If your organization has a unique, ambitious application in mind that would be enabled by high-performance machine vision on mobile or embedded devices, and you're looking for a partner to work with on it, please contact us.