Object detection has undergone a dramatic and fundamental shift. I’m not talking about deep learning here – deep learning is really more about classification and specifically about feature learning. Feature learning has begun to play and will continue to play a critical role in machine vision. Arguably in a few years we’ll have a diversity of approaches for learning rich feature hierarchies from large amounts of data; it’s a fascinating topic. However, as I said, this post is not about deep learning.
Rather, perhaps an even more fundamental shift has occurred in object detection: the recent crop of top detection algorithms abandons sliding windows in favor of segmentation in the detection pipeline. Yes, you heard right, segmentation!
First some evidence for my claim. Arguably the three top performing object detection systems as of the date of this post (12/2013) are the following:
- Segmentation as Selective Search++ (UvA-Euvision)
- Regionlets for Object Detection (NEC-MU)
- Rich Feature Hierarchies for Accurate Object Detection (R-CNN)
The first two are the first- and second-place entries in the ImageNet13 detection challenge. The winning entry (UvA-Euvision) is an unpublished extension of Koen van de Sande’s earlier work (hence I added the “++”). The third paper is Ross Girshick et al.’s recent work, and while Ross did not compete in the ImageNet challenge due to lack of time, his results on PASCAL are superior to the NEC-MU numbers (as far as I know no direct comparison exists between Ross’s work and the winning ImageNet entry).
Now, here’s the kicker: all three detection algorithms shun sliding windows in favor of a segmentation pre-processing step, specifically the region generation method of Koen van de Sande et al., Segmentation as Selective Search.
Now this is not your father’s approach to segmentation: there are a number of notable differences that allow the current batch of methods to succeed where yesteryear’s methods failed. This is really a topic for another post, but the core idea is to generate around 1-2 thousand candidate object segments per image that, with high probability, coarsely capture most of the objects in the image. The candidate segments themselves may be noisy and overlapping and in general need not capture the objects perfectly. Instead they’re converted to bounding boxes and passed into various classification algorithms.
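To make the "converted to bounding boxes" step concrete, here is a minimal sketch in plain Python. It assumes each candidate segment arrives as a binary mask (a list of rows of 0/1 values); the names `mask_to_box` and `segments_to_boxes` are hypothetical, and this is not the Selective Search code itself, just one plausible way to turn segments into boxes for a downstream classifier.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (an assumed convention)
Box = Tuple[int, int, int, int]
Mask = List[List[int]]


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest box enclosing all nonzero pixels, or None for an empty mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


def segments_to_boxes(masks: List[Mask]) -> List[Box]:
    """Convert candidate segments to boxes, dropping empty masks and exact duplicates."""
    boxes: List[Box] = []
    for mask in masks:
        box = mask_to_box(mask)
        if box is not None and box not in boxes:
            boxes.append(box)
    return boxes


if __name__ == "__main__":
    noisy_segment = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    print(segments_to_boxes([noisy_segment, noisy_segment]))
```

Note that the segment does not need to be clean: a ragged or partially wrong mask still yields a usable box, which is part of why noisy, overlapping candidates are good enough here.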
Incidentally, Segmentation as Selective Search is just one of many recent algorithms for generating candidate regions / bounding boxes (objectness was perhaps the first). Again, though, this is a subject for another post…
So what advantage does region generation give over sliding window approaches? Sliding window methods perform best for objects with a fixed aspect ratio (e.g., faces, pedestrians). For more general object detection a search must be performed over position, scale and aspect ratio. The resulting 4-dimensional search space (two dimensions for position, one for scale, one for aspect ratio) is large and difficult to search exhaustively. One way to look at deformable part models is that they perform this search efficiently; however, this places severe restrictions on the models themselves. Thus we were stuck as a community: while we and our colleagues in machine learning derived increasingly sophisticated classification machinery for various problems, for object detection we were restricted to approaches able to handle variable aspect ratios efficiently.
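A rough back-of-the-envelope count shows how quickly that search space grows. The sketch below simply enumerates window sizes and positions. The stride, scale step, minimum size, and aspect ratios are illustrative assumptions of mine, not the settings of any particular detector, and `count_windows` is a hypothetical helper.

```python
from typing import Sequence


def count_windows(
    width: int,
    height: int,
    min_size: int = 32,        # smallest window height in pixels (assumed)
    stride: int = 8,           # step between window positions in pixels (assumed)
    scale_step: float = 1.25,  # multiplicative step between scales (assumed)
    aspect_ratios: Sequence[float] = (0.5, 1.0, 2.0),  # window width / height (assumed)
) -> int:
    """Count the windows a dense search over position, scale and aspect ratio would visit."""
    total = 0
    for ratio in aspect_ratios:
        win_h = float(min_size)
        while True:
            w, h = int(round(win_h * ratio)), int(round(win_h))
            if w > width or h > height:
                break  # larger scales at this aspect ratio no longer fit
            positions_x = (width - w) // stride + 1
            positions_y = (height - h) // stride + 1
            total += positions_x * positions_y
            win_h *= scale_step
    return total


if __name__ == "__main__":
    # A PASCAL-sized image, roughly 500 x 375 pixels.
    print(count_windows(500, 375))
```

Even with these coarse settings, a single aspect ratio at the smallest scale alone already yields more windows than the 1-2 thousand candidates a region generation method produces for the whole image, and each added aspect ratio multiplies the work.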
The new breed of segmentation algorithms allows us to bypass the need for efficiently searching over the space of all bounding boxes and lets us employ more sophisticated learning machinery to perform classification. The unexpected thing is they actually work!
This is a surprising development and goes counter to everything we’ve known about object detection. For the past ten years, since Viola and Jones popularized the sliding window framework, dense window classification has dominated detection benchmarks (e.g. PASCAL or Caltech Peds). While there have been other paradigms, based for example on interest points, none could match the reliability and robustness of sliding windows. Now all this has changed!
Object detectors have evolved rapidly and their accuracy has increased dramatically over the last decade. So what have we learned? A few lessons come to mind: design better features (e.g. histograms of gradients), employ efficient classification machinery (e.g. cascades), and use flexible models to handle deformation and variable aspect ratios (e.g. part based models). And of course use a sliding window paradigm.
Recent developments have shattered our collective knowledge of how to perform object detection. Take a look at this diagram from Ross’s recent work:
- sliding windows -> segmentation
- designed features -> deep learned features
- parts based models -> linear SVMs
Now ask yourself: a few years ago, would you have expected such a dramatic overturning of all our knowledge about object detection!?
Our field has jumped out of a local minimum and exciting times lie ahead. I expect progress to be rapid in the next few years – it’s an amazing time to be doing research in object detection 🙂