Image Captioning

21 Jan

Recently, a number of us interested in image captioning gathered at Berkeley to exchange ideas (many thanks to Trevor Darrell for hosting us). Present were many of the authors of the various recent image captioning papers as well as a few additional folks who have worked in the area. For a nice summary of the recent work in the image captioning space please see John Platt’s post. For reference, here is a list of the recent image/video captioning papers in no particular order:

Note: I worked on the last of these papers in my past life 🙂

The various teams presented their work, and while I thought there would be a lot of repetition in the talks / ideas, if anything I was surprised by the diversity of insights. Oriol Vinyals had a nice high level summary of the various approaches: the two main axes on which methods differ are whether they are “pipeline” or “end-to-end” systems and whether they perform “generation” or “retrieval”. Oriol advocated end-to-end generation systems. Personally, I think pipeline systems (where you first learn visual detectors, then a separate language module on top) have the major advantage that you can debug/develop/experiment with the modules independently, although I also prefer generation over retrieval. Regardless, there are a lot of subtleties in this space. I’ve heard a comment before that all these papers are the “same”; this is like saying all papers on deep learning are the same 😛

After presentations of the methods papers, the discussion focused on (1) datasets, (2) evaluation, and (3) next tasks. This was in many ways the most important part of the event imo.

(1) On the datasets front, COCO has been adopted as the dataset of choice in many of the papers in the current batch of work on image captioning (full disclosure: I am part of the COCO team). I believe most of the papers in the list above evaluated on COCO. That is not to say that COCO is without its shortcomings (more on this later). The Flickr 30K captioning dataset was also used in a number of papers.

(2a) As far as evaluation metrics, the captioning community is in a bit of disarray. While the BLEU metric was adopted by many of the groups, the results in the various papers are NOT comparable due to various subtleties/choices in the BLEU metric. Lame. The COCO team is working to remedy this by setting up an automatic evaluation server where authors upload captions and comparisons are automatically generated (very standard stuff). We should have this up and running in a few weeks; many of the teams seem interested in uploading and comparing results once it is live. Then we will know who is best!

(2b) The evaluation of captions needs improvement. The BLEU metric is deeply flawed in that automatic methods are now outperforming humans according to BLEU (!!). Devi Parikh gave a very nice talk about evaluating captioning; if you are interested in this space I strongly recommend her recent paper. The main takeaway is that while automatic evaluation of captions is noisy, given enough human reference captions for the same image (~50 or so), automatic metrics become well correlated with human judgement of caption quality. Well, BLEU still sucks, but there are others that do better (e.g. METEOR and the newly introduced CIDEr are well correlated with human judgement given enough reference captions). Given Devi’s findings, the COCO team is adopting the new metric (CIDEr) along with the old (BLEU, METEOR) and labeling a subset (5%) of the test images in COCO with 40 captions each.
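To make concrete why BLEU is so easy to game, here is a minimal Python sketch of modified n-gram precision, the core ingredient of BLEU (the real metric also averages over several n and applies a brevity penalty; the captions below are made up for illustration):

```python
from collections import Counter

def ngram_precision(candidate, references, n=2):
    """Modified n-gram precision: clip each candidate n-gram count by the
    maximum count of that n-gram across the reference captions."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref.split(), n).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

refs = ["a giraffe standing next to a tall tree",
        "a giraffe eating leaves from a tree"]
print(ngram_precision("a giraffe next to a tree", refs, n=1))  # 1.0
print(ngram_precision("a giraffe next to a tree", refs, n=2))  # 0.8
```

A generic caption assembled entirely from common reference words scores a perfect unigram precision here, which is exactly the failure mode that lets automatic methods “outperform” humans.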

(3a) Next tasks. Captioning is a potentially appealing task because in theory it requires (1) a detailed understanding of an image and (2) ability to communicate that information via natural language. Unfortunately, the consensus is that automatic captioning (at least on many images) can be done with only partial understanding of the image and rudimentary language skills. My collaborator Larry Zitnick coined this the “giraffe-tree” problem: the caption “a giraffe next to a tree” is a valid caption for a high percentage of images containing giraffes. In fact, *originality* of the generated captions is a big issue: depending on the approach a sizable percentage of the automatically generated captions on test images are exact duplicates of captions in the training set.
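The originality issue mentioned above is easy to quantify. Here is a minimal sketch (the captions and the function name are illustrative) that measures what fraction of generated captions are not verbatim copies of training captions:

```python
def novelty_rate(generated, train_captions):
    """Fraction of generated captions that do not appear verbatim in the
    training set (after lowercasing and whitespace normalization)."""
    seen = {" ".join(c.lower().split()) for c in train_captions}
    novel = [c for c in generated if " ".join(c.lower().split()) not in seen]
    return len(novel) / len(generated)

train = ["A giraffe next to a tree.", "Two dogs play in the grass."]
gen = ["a giraffe next to a tree.", "a man riding a horse."]
print(novelty_rate(gen, train))  # 0.5: half the generated captions are copies
```

A low novelty rate is a red flag that a captioning system is doing retrieval-by-memorization rather than genuine generation.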

(3b) A lot of discussion then centered on alternative or more challenging tasks compared to image captioning that would require deeper understanding of the image content and more sophisticated language models to communicate this information. Question-answering was perhaps the most popular suggestion; the issue with Q&A is that large scale Q&A datasets are difficult to define/gather (e.g. while the Q&A dataset presented at NIPS by Mario Fritz is a nice first step, it is relatively small scale and the diversity of the questions is low). Folks from MPI presented a very impressive new video-captioning dataset; although quite promising, it is still somewhat small (50K videos/captions) given the diversity of videos in the world. Tamara Berg gave a great talk about some alternative strategies for gathering information about an image; see her recent papers with “refer” in the title (e.g. ReferIt). Defining the right challenges is a wide open problem; however, that is not to say that image captioning is solved by any stretch of the imagination.

Having focused on the negatives of the image captioning task for most of this post, I did want to conclude by stating that in general there was a LOT of excitement in the room. While numerous groups converged on a relatively similar solution in a short period of time, and there was unfortunate attention by the popular media, this should not undermine how cool this topic really is. When we started this problem we had no idea automatic image captioning would work as well as it does! Anyway, clearly the current batch of work is just a first step, but many of us in the room were excited about the possibilities going forward.

Evaluating Object Proposals

18 Nov

[Update 05/19/2015: Please see our arXiv paper for an expanded version of the experiments described in this post.]

Object proposals have quickly become a popular pre-processing step for object detection. Some time ago I surveyed the state-of-the-art in generating object proposals, and the field has seen a lot of activity since that time! In fact, Larry Zitnick and I proposed the fast and quite effective Edge Boxes algorithm at ECCV 2014 (source code) (sorry for the self promotion).

So, How Good are Detection Proposals, Really? In their BMVC 2014 paper of the same name, Jan Hosang, Rodrigo Benenson and Bernt Schiele address this question, performing an apples-to-apples comparison of nearly all methods published to date. Highly recommended reading. While multiple evaluation metrics for object proposals have been proposed, ultimately object proposals are a pre-processing step for detection. In the BMVC paper, Jan and co-authors evaluated object proposals coupled with Deformable Part Models (DPMs). Surprisingly, some recent proposal methods which seemed quite promising based on recall plots resulted in rather poor detector performance.

So, I set out to see if there is a link between object proposal recall at various Intersection over Union (IoU) thresholds and detector average precision (AP). I asked the following: is proposal recall at a given IoU threshold (or thresholds) correlated with detector AP? Or do we need an alternate method for measuring the effectiveness of object proposals?
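For reference, here is a minimal Python sketch of the two quantities involved: IoU between boxes and proposal recall at a given IoU threshold. Boxes are (x1, y1, x2, y2); the numbers below are toy values, not from the experiments:

```python
def iou(a, b):
    """Intersection over Union of axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def recall_at(gt_boxes, proposals, thr=0.7):
    """Fraction of ground-truth boxes covered by at least one proposal
    with IoU >= thr (the standard proposal-recall metric)."""
    hit = sum(any(iou(g, p) >= thr for p in proposals) for g in gt_boxes)
    return hit / len(gt_boxes)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))          # ≈ 0.333
print(recall_at([(0, 0, 10, 10), (20, 20, 30, 30)],
                [(1, 0, 10, 10), (50, 50, 60, 60)]))  # 0.5
```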

For the analysis I used the PASCAL 2007 test AP scores for the DPM variant LM-LLDA. The AP numbers are obtained from Table 1 of Jan et al.’s BMVC paper. Recall values of each proposal method for each IoU threshold (at 1000 proposals) can be obtained from Figure 4b in Jan’s paper. However, while I used the boxes provided by Jan, I ran my own evaluation code (available here) to obtain the exact recall values (the curves/numbers differ slightly from Figure 4b). Source code for reproducing these results can be found at the end of this post.

Here are the recall curves for 12 of the 14 methods evaluated by Jan et al. I excluded “Uniform” and “Superpixels” from the analysis since their AP is extremely low (under 20%). This is similar to Figure 4b from Jan’s paper except, as mentioned, I used my own evaluation code which differs slightly.


In the first experiment I computed the Pearson correlation coefficient between recall at various IoU values and AP. Here is the result:

For IoU values between 0.7 and 0.8 the correlation is quite high, reaching an r value of .976 at IoU of 0.7. What this means is that knowing the recall of a proposal method at an IoU value between 0.7 and 0.8 very strongly predicts AP for the proposals. Conversely, recall below about 0.6 or above 0.9 is only weakly predictive of detector performance.
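The correlation computation itself is tiny; here is a Python sketch using the recall values at IoU=0.7 (the row of the R matrix from the Matlab listing at the end of this post corresponding to threshold 0.7) and the AP scores of the 12 methods:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# recall at IoU=0.7 and AP for the 12 proposal methods
recall_07 = [0.2743, 0.6507, 0.7941, 0.6547, 0.7241, 0.3877,
             0.6263, 0.6333, 0.5737, 0.7084, 0.5076, 0.2064]
ap = [21.8, 29.9, 31.8, 31.2, 32.4, 25.0, 29.6, 30.5, 30.7, 31.7, 27.3, 20.7]
print(round(pearson(recall_07, ap), 3))  # ≈ 0.976, the r value reported above
```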

Here is a scatter plot of recall at IoU of 0.7 versus AP for the 12 methods (a best fit line is shown):

As the high correlation coefficient suggests, there is a very strong linear relationship between recall at IoU=0.7 and AP.

I also analyzed correlation between recall across all IoU thresholds and AP by considering the area under the curve (AUC). However, correlation between AUC and AP is slightly weaker: r=0.9638. This makes sense as recall at low/high IoU values is not predictive of AP as discussed above. Instead, we can measure correlation between partial AUC (for a fixed threshold range) and AP. I found the optimal partial AUC for predicting AP is over the range of IoU thresholds 0.625 to 0.850. The correlation in this case is r=0.984, meaning the partial AUC between these recall values predicts AP quite well. Here is the plot:

Overall, this is a fairly strong result. First, it implies that we should be optimizing proposals for recall at IoU thresholds in the range of 0.625 to 0.850 (at least for DPMs). Note that this is despite the fact that in all the experiments AP was measured at an IoU of 0.5 for detector evaluation. Second, it means that we should be able to predict detector performance with reasonable reliability given recall in the given IoU range. One thing is clear: it is probably best to stop optimizing proposals for IoU of 0.5.

The main limitation of the analysis is that it is only for a single detector (DPM variant) on a single dataset (PASCAL 2007). It would be great to repeat for another detector, especially a modern detector based on convolutional neural networks such as Ross Girshick’s RCNN. Second, in the long term we definitely want to move toward having more accurate proposals. Presumably, as proposal accuracy improves, so will our ability to exploit the more accurate proposals.

I would like to thank Jan Hosang and Rodrigo Benenson for providing the output of all object proposal methods in a standardized format, and for Jan, Rodrigo and Ross Girshick for excellent discussions on the subject.

Matlab code for reproducing all results is below. Copying/pasting into Matlab should work. It is simple to swap in AP/recall for other detectors/datasets.

%% Recall matrix used for experiments nThresholds x nAlgorithms
R=[0.9196    0.8537 0.9197 0.8779 0.9116 0.8835 0.7724 0.8614 0.8107 0.8965 0.7737 0.4643
   0.8830 0.8339 0.9100 0.8566 0.8955 0.8669 0.7556 0.8407 0.7848 0.8831 0.7490 0.4391
   0.8276 0.8104 0.9013 0.8358 0.8769 0.8433 0.7409 0.8177 0.7564 0.8649 0.7218 0.4129
   0.7472 0.7875 0.8909 0.8125 0.8560 0.8108 0.7247 0.7924 0.7290 0.8438 0.6951 0.3881
   0.6483 0.7630 0.8801 0.7866 0.8349 0.7632 0.7089 0.7641 0.7026 0.8202 0.6643 0.3571
   0.5487 0.7360 0.8659 0.7575 0.8086 0.6973 0.6906 0.7322 0.6725 0.7957 0.6307 0.3216
   0.4450 0.7104 0.8488 0.7241 0.7811 0.6048 0.6727 0.7016 0.6425 0.7693 0.5936 0.2823
   0.3529 0.6818 0.8266 0.6884 0.7548 0.4953 0.6509 0.6699 0.6120 0.7388 0.5534 0.2423
   0.2743 0.6507 0.7941 0.6547 0.7241 0.3877 0.6263 0.6333 0.5737 0.7084 0.5076 0.2064
   0.2099 0.6178 0.7488 0.6160 0.6915 0.2879 0.6031 0.5964 0.5379 0.6740 0.4574 0.1711
   0.1587 0.5809 0.6855 0.5792 0.6560 0.2150 0.5724 0.5544 0.4961 0.6369 0.3985 0.1400
   0.1144 0.5396 0.6026 0.5342 0.6159 0.1578 0.5325 0.5106 0.4601 0.5913 0.3354 0.1109
   0.0850 0.4996 0.4982 0.4918 0.5713 0.1138 0.4599 0.4614 0.4175 0.5436 0.2685 0.0865
   0.0625 0.4490 0.3935 0.4417 0.5246 0.0812 0.3580 0.4154 0.3714 0.4931 0.1996 0.0662
   0.0456 0.3979 0.2886 0.3940 0.4728 0.0530 0.2369 0.3616 0.3256 0.4390 0.1368 0.0484
   0.0312 0.3430 0.1961 0.3424 0.4195 0.0328 0.1321 0.3020 0.2793 0.3791 0.0862 0.0352
   0.0225 0.2832 0.1258 0.2818 0.3542 0.0181 0.0648 0.2427 0.2245 0.3101 0.0480 0.0234
   0.0144 0.2149 0.0731 0.2135 0.2738 0.0086 0.0242 0.1786 0.1650 0.2363 0.0212 0.0141
   0.0091 0.1433 0.0358 0.1369 0.1834 0.0027 0.0051 0.1146 0.1022 0.1568 0.0076 0.0085
   0.0053 0.0621 0.0123 0.0623 0.0821 0.0004 0.0007 0.0498 0.0367 0.0696 0.0017 0.0046
   0.0001 0.0003 0.0003 0.0002 0.0002 0      0      0.0001 0.0001 0.0002 0      0 ];
save 'R' 'R';
%% define algorithm names, IoU thresholds, recall values and AP scores
T=.5:.025:1; load R; % R matrix defined above
AP=[21.8 29.9 31.8 31.2 32.4 25.0 29.6 30.5 30.7 31.7 27.3 20.7];
nms={'alg01','alg02','alg03','alg04','alg05','alg06','alg07','alg08',...
  'alg09','alg10','alg11','alg12'}; % placeholder labels; substitute method names
% plot correlation versus IoU and compute best threshold t
S=corrcoef([R' AP']); S=S(1:end-1,end); [s,t]=max(S);
figure(2); plot(T,S,'-or'); xlabel('IoU'); ylabel('corr'); grid on;
% plot AP versus recall at single best threshold t
figure(3); R1=R(t,:); plot(R1,AP,'dg'); grid on; text(R1+.015,AP,nms);
xlabel(sprintf('recall at IoU=%.3f',T(t))); axis([0 1 20 34])
title(sprintf('correlation=%.3f',s)); ylabel('AP'); hold on;
p=polyfit(R1,AP,1); line([0 1],[p(2),sum(p)],'Color',[1 1 1]/3); hold off
% compute correlation against optimal range of thrs(a:b)
n=length(T); S=zeros(n,n);
for a=1:n, for b=a:n, s=corrcoef(sum(R(a:b,:),1),AP); S(a,b)=s(2); end; end
[s,t]=max(S(:)); [a,b]=ind2sub([n n],t); R1=mean(R(a:b,:),1);
figure(4); plot(R1,AP,'dg'); grid on; text(R1+.015,AP,nms);
xlabel(sprintf('recall at IoU=%.3f-%.3f',T(a),T(b))); axis([0 1 20 34])
title(sprintf('correlation=%.3f',s)); ylabel('AP'); hold on;
p=polyfit(R1,AP,1); line([0 1],[p(2),sum(p)],'Color',[1 1 1]/3); hold off
% make figures nicer and save
for i=2:4, figure(i); hs=get(gca,{'title','xlabel','ylabel'});
  hs=[gca hs{:}]; set(hs,'FontSize',12); end
for i=2:4, figure(i); savefig(['corr' int2str(i)],'png'); end

Generating Object Proposals

22 Dec

This is a followup post to A Seismic Shift in Object Detection. In the earlier post I discussed the resurgence of segmentation for object detection, in this post I go into more technical detail about the algorithms for generating the segments and object proposals. If you haven’t yet, you should read my previous post first. 🙂

First, a brief historical overview. In classic segmentation the goal was to assign every pixel in an image to one of K labels such that distinct objects receive unique labels. So ideally an algorithm would generate separate segments for your cat and your couch and pass these on to the next stage of processing. Unfortunately, this is a notoriously difficult problem, and arguably, it’s simply the wrong problem formulation. Classic segmentation is exceedingly difficult: mistakes are irreversible and incorrect segments harm subsequent processing. If your kitty had its head chopped off or was permanently merged with the couch, well, tough luck. Thus classic segmentation rarely serves as a pre-processing step for detection and workarounds have been developed (e.g. sliding windows).


A major innovation in how we think about segmentation occurred around 2005. In their work on estimating geometric context, Derek Hoiem et al. proposed to use multiple overlapping segmentations; this was explored further by Bryan Russell and Tomasz Malisiewicz and their collaborators (see here and here). The idea is as simple as it sounds: generate multiple candidate segmentations, and while your kitty may be disfigured in many, hopefully at least one of the segmentations will contain her whole and unharmed. This was a leap in thinking because it shifts the focus to generating a diversity of segmentations as opposed to a single and perfect (but unachievable) segmentation.

The latest important leap, and the focus of this post, was first made in 2010 concurrently by three groups (Objectness, CPMC, Object Proposals). The key observation was this: since the unit of interest for subsequent processing in object detection and related tasks is a single object segment, why exhaustively (and uniquely) label every pixel in an image? Instead why not directly generate object proposals (either segments or bounding boxes) without attempting to provide complete image segmentations? Doing so is both easier (no need to label every pixel) and more forgiving (multiple chances to generate good proposals). Sometimes a slight problem reformulation makes all the difference!

Below I go over five of the arguably most important papers on generating object proposals (a list of additional papers can be found at the end of this post). Keep reading for details or skip to the discussion below.

Objectness: One of the earliest papers on generating object proposals. The authors sample and rank 100,000 windows per image according to their likelihood of containing an object. This `objectness’ score is based on multiple cues derived from saliency, edges, superpixels, color and location. The proposals tend to fit objects fairly loosely, but the first few hundred are of high quality (see Fig. 6 in this paper). The algorithm is fast (a few seconds per image) and outputs boxes.

CPMC: Published concurrently with objectness, the idea is to use graph cuts with different random seeds and parameters to obtain multiple binary foreground / background segmentations. Each generated foreground mask serves as an object proposal; the proposals are ranked according to a learned scoring function. The algorithm is slow (~8 min / image) as it relies on the gPb edge detector but it generates high quality segmentation masks (see also this companion paper).

Object Proposals: Similarly to CPMC (and published just a few months after), the authors generate multiple foreground / background segmentations and use these as object proposals. Same strengths and weaknesses as CPMC: high quality segmentation masks but long computation time due in part to reliance on gPb edges.

Selective Search: As discussed in my previous post, arguably the top three methods for object detection as of ICCV13 all used selective search in their detection pipelines. The key to the success of selective search is its fast speed (~8 seconds / image) and high recall (97% of objects detected given 10000 candidates per image). Selective search is based on computing multiple hierarchical segmentations using superpixels from Felzenszwalb and Huttenlocher (F&H) computed on different color spaces. Object proposals are the various segments in the hierarchies or bounding boxes surrounding them.

Randomized Prim’s (RP): a simple and fast approach to generating high quality proposal boxes (again based on F&H superpixels). The authors propose a randomized greedy algorithm for computing sets of superpixels that are likely to occur together. The quality of proposals is high and object coverage is good given a large number of proposals (1000 – 10000). RP is the fastest of the batch (<1 second per image) and is a promising pre-processing step for detection.
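To make the hierarchical grouping idea behind Selective Search concrete, here is a toy 1-D Python sketch. Regions are intervals with a mean color and similarity is just color difference; the real algorithm merges 2-D superpixels using several similarity measures over multiple color spaces, so treat this purely as an illustration of the "every intermediate region is a proposal" idea:

```python
def hierarchical_proposals(regions):
    """regions: list of (start, end, color) intervals. Repeatedly merge the
    most similar adjacent pair; every region ever created is a proposal."""
    proposals = list(regions)
    regs = list(regions)
    while len(regs) > 1:
        # find the most similar adjacent pair (smallest color difference)
        i = min(range(len(regs) - 1), key=lambda k: abs(regs[k][2] - regs[k+1][2]))
        a, b = regs[i], regs[i+1]
        merged = (a[0], b[1], (a[2] + b[2]) / 2)
        regs[i:i+2] = [merged]
        proposals.append(merged)
    return proposals

props = hierarchical_proposals([(0, 2, 0.1), (2, 5, 0.2), (5, 9, 0.8)])
print(len(props))  # 5: the 3 input regions plus 2 merged regions
```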

So what’s the best approach? It depends on the constraints and application. If speed is critical the only candidates are Objectness, Selective Search or RP. For high quality segments CPMC or Object Proposals are best; the bounding boxes returned by RP also appear promising. For object detection, recall is critical and thus generating thousands of candidates with Selective Search or RP is likely the best bet. For domains where a more aggressive pruning of windows is necessary, for example, weakly supervised or unsupervised learning, Objectness or CPMC are the most promising candidates.

One interesting observation is that all five algorithms described above utilize either gPb or F&H for input edges or superpixels. The quality and speed of the input edge detectors help determine the speed of the resulting proposals and their adherence to object boundaries. gPb is accurate but slow (multiple minutes per image) while F&H is fairly fast but of lower quality. So, this seems to be a perfect spot to drop a shameless plug for our own work: our recent edge detector presented at ICCV runs in real time (30 fps) and achieves edges of similar quality to gPb (and even somewhat higher). Read more here. 🙂


I expect that after its recent successes object proposal generation will continue to receive strong interest from the community. I hope to see the development of approaches that are faster, have higher recall with fewer proposals, and better adhere to object boundaries. Downstream it will be interesting to see more algorithms take advantage of the segmentation masks associated with the proposals (currently most but not all detection approaches discard the segmentation masks). And of course I have to wonder, will we experience yet another paradigm shift in segmentation? Let’s see where we end up…

Below is a list of additional papers on generating object proposals. If I missed anything relevant please email me or leave a comment and I’ll add a link!

A Seismic Shift in Object Detection

10 Dec

Object detection has undergone a dramatic and fundamental shift. I’m not talking about deep learning here – deep learning is really more about classification and specifically about feature learning. Feature learning has begun to play and will continue to play a critical role in machine vision. Arguably in a few years we’ll have a diversity of approaches for learning rich features hierarchies from large amounts of data; it’s a fascinating topic. However, as I said, this post is not about deep learning.

Rather, perhaps an even more fundamental shift has occurred in object detection: the recent crop of top detection algorithms abandons sliding windows in favor of segmentation in the detection pipeline. Yes, you heard right, segmentation!


First some evidence for my claim. Arguably the three top performing object detection systems as of the date of this post (12/2013) are the following:

  1. Segmentation as Selective Search++ (UvA-Euvision)
  2. Regionlets for Object Detection (NEC-MU)
  3. Rich Feature Hierarchies for Accurate Object Detection (R-CNN)

The first two are the winning and second place entries on the ImageNet13 detection challenge. The winning entry (UvA-Euvision) is an unpublished extension of Koen van de Sande’s earlier work (hence I added the “++”). The third paper is Ross Girshick et al.’s recent work, and while Ross did not compete in the ImageNet challenge due to lack of time, his results on PASCAL are superior to the NEC-MU numbers (as far as I know no direct comparison exists between Ross’s work and the winning ImageNet entry).

Now, here’s the kicker: all three detection algorithms shun sliding window in favor of a segmentation pre-processing step, specifically the region generation method of Koen van de Sande et al., Segmentation as Selective Search.


Now this is not your father’s approach to segmentation — there are a number of notable differences that allow the current batch of methods to succeed whereas yesteryear’s methods failed. This is really a topic for another post, but the core idea is to generate around 1-2 thousand candidate object segments per image that with high probability coarsely capture most of the objects in the image. The candidate segments themselves may be noisy and overlapping and in general need not capture the objects perfectly. Instead they’re converted to bounding boxes and passed into various classification algorithms.
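Converting a candidate segment to a bounding box is the simple part of that pipeline. A minimal sketch, with the segment mask given as a list of 0/1 rows:

```python
def mask_to_box(mask):
    """Tight bounding box (x1, y1, x2, y2) around a binary segment mask,
    given as a list of rows of 0/1 values; x2/y2 are exclusive."""
    ys = [i for i, row in enumerate(mask) if any(row)]
    xs = [j for row in mask for j, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
print(mask_to_box(mask))  # (1, 1, 3, 3)
```

Note that the box deliberately discards the mask's shape information, which is why (as discussed below) most detection approaches currently ignore the segmentation masks themselves.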

Incidentally, Segmentation as Selective Search is just one of many recent algorithms for generating candidate regions / bounding boxes (objectness was perhaps the first), however, again this is a subject for another post…

So what advantage does region generation give over sliding windows approaches? Sliding window methods perform best for objects with fixed aspect ratio (e.g., faces, pedestrians). For more general object detection a search must be performed over position, scale and aspect ratio. The resulting 4 dimensional search space is large and difficult to search over exhaustively. One way to look at deformable part models is that they perform this search efficiently, however, this places severe restrictions on the models themselves. Thus we were stuck as a community: while we and our colleagues in machine learning derived increasingly sophisticated classification machinery for various problems, for object detection we were restricted to using approaches able to handle variable aspect ratio efficiently.
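The size of that 4-dimensional search space is easy to appreciate with back-of-envelope numbers (all parameters below are illustrative, not from any particular detector):

```python
# Rough count of windows in the (x, y, scale, aspect ratio) search space
# for a 640x480 image; stride/scale/aspect counts are illustrative choices.
W, H = 640, 480
stride = 4        # spatial step in pixels
scales = 10       # number of scales sampled
aspects = 5       # aspect ratios needed to cover general objects
positions = (W // stride) * (H // stride)
print(positions * scales * aspects)  # 960000 windows
```

Nearly a million windows per image under even modest settings, versus the 1-2 thousand candidates produced by region generation: that is the pruning that frees up the classification budget for heavier machinery.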

The new breed of segmentation algorithms allows us to bypass the need for efficiently searching over the space of all bounding boxes and lets us employ more sophisticated learning machinery to perform classification. The unexpected thing is they actually work!

This is a surprising development and goes counter to everything we’ve known about object detection. For the past ten years, since Viola and Jones popularized the sliding window framework, dense window classification has dominated detection benchmarks (e.g. PASCAL or Caltech Peds). While there have been other paradigms, based for example on interest points, none could match the reliability and robustness of sliding windows. Now all this has changed!


Object detectors have evolved rapidly and their accuracy has increased dramatically over the last decade. So what have we learned? A few lessons come to mind: design better features (e.g. histograms of gradients), employ efficient classification machinery (e.g. cascades), and use flexible models to handle deformation and variable aspect ratios (e.g. part based models). And of course use a sliding window paradigm.

Recent developments have shattered our collective knowledge of how to perform object detection. Take a look at this diagram from Ross’s recent work:



  1. sliding windows -> segmentation
  2. designed features -> deep learned features
  3. parts based models -> linear SVMs

Now ask yourself: a few years ago, would you have expected such a dramatic overturning of all our knowledge about object detection!?

Our field has jumped out of a local minimum and exciting times lie ahead. I expect progress to be rapid in the next few years – it’s an amazing time to be doing research in object detection 🙂

Structured Random Forests

8 Mar


Peter Kontschieder and co-authors have written a number of interesting papers that hack random forests to accomplish tasks beyond standard classification / regression. In 2011 they published two papers at ICCV and BMVC (pdf, pdf) on structured learning for semantic image labeling and in 2012 a paper at NIPS (pdf) on context sensitive random forests for object detection. What these papers all have in common is that they exploit the strengths of random forests and the format of the training data to learn over images in an interesting manner. In this blog post I focus on the first two papers but the third is similar in spirit.

In semantic segmentation the goal is to assign each pixel a label (e.g. grass, sky, cow). The foundational work of Jamie Shotton used random forests to predict the label of each pixel independently (using a patch surrounding the pixel), turning the problem into a multi-class classification problem. The predictions obtained this way are reasonable but noisy, so a smoothing approach such as a CRF is employed as a second step. Peter and co-authors propose a very natural yet powerful variant of this idea: what if instead of predicting the label of the center pixel of a patch, the tree predicts the labels of a subset of the pixels or even the entire patch?

The cool thing is it’s not all that hard to alter a random forest to achieve structured prediction. The figure at the top of this post demonstrates the basic idea. For classification, during training the label of each training example (e.g. of the center pixel) is used to determine the split. Instead one can *sample* the label on which to split if each example contains multiple labels (e.g. for the whole patch). This works because individual splits in random forests are inherently noisy anyway – it’s the application of multiple consecutive splits that results in child nodes with relatively pure distributions of the labels. The idea can be generalized: so long as there’s a (noisy) way of deriving a multi-class label from a structured label the concept should apply.
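A minimal Python sketch of the key trick: deriving a (noisy) multi-class split label from a structured patch label by sampling one of its pixels. The labels and patch are toy values; the point is that sampling preserves the patch's label distribution, so repeated splits still drive nodes toward purity:

```python
import random

def sample_split_label(patch_labels, rng):
    """Map a structured label (per-pixel labels of a patch) to a single
    multi-class label by sampling one pixel uniformly at random."""
    flat = [l for row in patch_labels for l in row]
    return rng.choice(flat)

rng = random.Random(0)
patch = [["grass", "grass"],
         ["cow",   "grass"]]
counts = {}
for _ in range(1000):
    l = sample_split_label(patch, rng)
    counts[l] = counts.get(l, 0) + 1
# sampling reflects the 3:1 grass/cow ratio within the patch
print(counts["grass"] > counts["cow"])  # True
```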

For semantic segmentation, the structured forest predicts the patch labels for an entire neighborhood of pixels. Each tree in the forest at each pixel location predicts such a labeling. All these labels need to be combined into a single global solution for the image: the authors formulate this as a “label puzzle game” and offer an iterative approach guaranteed to converge to a local minimum. This step removes the need for CRFs and leads to better performance on standard datasets. Both papers cover this overall outline, with the ICCV paper focusing more on the structured forests and the BMVC paper more on solving for the global solution (note: the papers cover very similar material and should probably have been a single submission).

In each of the following rows the authors show (1) the original image, (2) baseline random forest output, (3) structured forest+puzzle solver output:


Predicting labels for the whole patch (and combining the results in a smart way) removes the “noisy pixel” problem. The approach is conceptually appealing and results are reasonable. Random forests are naturally well suited for structured learning in this manner: by design random forests are trained in a noisy manner, so introducing an alternative (noisy) splitting criteria works well in practice. Moreover random forests can “memorize” the structured labels observed at the leaves during training so at test time inference (per patch) is trivial.

This paper is a great example of why machine learning research greatly benefits when grounded in a real problem: the structure of the data defines a unique learning problem that could never be formulated on “black box” datasets where training and testing data are clearly defined and the real data abstracted away (e.g. feature vectors and labels).  Formulating novel learning paradigms for real datasets is among the most interesting directions in computer vision.

Discriminative Patches

20 Feb


Paper of the day: Unsupervised Discovery of Mid-level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. ECCV 2012. Website+code.

The authors describe their goal as learning a “fully unsupervised mid-level visual representation”. A good way to think about their approach is as an improvement over the standard “bag of words” model.  “Visual words” are typically created by clustering a large set of patches (sparsely detected using an interest point detector). While bag of words and extensions are reasonably effective, the “visual words” themselves tend to code fairly simple structures such as bars or corners. Instead, the goal here is to find “discriminative patches”, that while similar to “visual words”, satisfy two properties: they are frequent and discriminative.

The idea is to discover structures that occur frequently in a “discovery dataset” D and are distinctive from most patches in a massive “natural world dataset” N. This can be done via an iterative discriminative clustering scheme: cluster patches from D and for each cluster train a classifier (HOG+linear SVM) against all patches in N (using bootstrapping). Resulting classifiers that achieve high scores on patches from D and low scores on patches from N, and fire sufficiently densely on D, are considered good discriminative patches. I’m glossing over details (including a nice trick of splitting D and N into two to have validation sets), but the basic idea is quite intuitive and straightforward.
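A toy sketch of just the selection step at the end of that loop (the cluster names, scores, and thresholds below are all made up for illustration; the real pipeline trains HOG+linear SVMs with bootstrapping): keep clusters whose classifiers fire often on D but score low on N.

```python
def select_discriminative(clusters, min_fire=5, max_nat_score=0.2):
    """clusters: list of (name, scores_on_D, scores_on_N). Keep clusters that
    fire on enough D patches (score > 0.5) and score low on N on average."""
    keep = []
    for name, d_scores, n_scores in clusters:
        fires = [s for s in d_scores if s > 0.5]
        if len(fires) >= min_fire and sum(n_scores) / len(n_scores) < max_nat_score:
            keep.append(name)
    return keep

clusters = [
    ("doorway", [0.9, 0.8, 0.7, 0.9, 0.6, 0.8], [0.1, 0.05, 0.1]),
    ("corner",  [0.9, 0.9, 0.2, 0.1, 0.1, 0.2], [0.6, 0.7, 0.5]),  # rare on D, fires on N
]
print(select_discriminative(clusters))  # ['doorway']
```

The two conditions encode exactly the two desired properties: frequency (fires densely on D) and discriminativeness (low scores on N).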

Visualization of “visual words” versus “discriminative patches”:


The choice of discovery dataset D seems crucial since D must contain interesting patches with relatively high frequency. While the title includes the words “Unsupervised Discovery”, the more interesting discriminative patches are obtained when some weak supervision is used: specifically if the choice of D emphasizes particular object categories or scene structures. For example, using various categories from the MIT indoor-67 scene dataset for D gives the following discriminative patches:


This is pretty cool; these patches are definitely informative!

The high level goal of discovering representative and meaningful image structure is quite compelling. While the unsupervised aspect is interesting, the weakly supervised results are actually more promising. It would be great to see more applications of discriminative patches (the authors show a few in their paper, but it seems like they’re barely scratching the surface here) and alternate ways of exploiting weak supervision. In that vein the authors published a paper at SIGGRAPH 2012 that is quite fun: What Makes Paris Look like Paris? (similar to supervised discriminative patches, but the supervision comes from the geographic region in which the images were captured).

Discriminative patches certainly aren’t the last word in image representation (no pun intended) but they’re an interesting new direction!

Object Detection Grammars

15 Dec


Paper of the day: Object Detection with Grammar Models. Ross Girshick, Pedro Felzenszwalb, David McAllester. NIPS 2011.

The paper describes an approach that applies formal grammars to object recognition. While the idea of using grammars for computer vision has a long history, an approach that achieves compelling results on PASCAL (specifically on the person category) is quite new. The work is essentially an extension of Deformable Part Models (DPMs) that formalizes the notions of objects, parts, mixture models, etc., by defining a formal grammar for objects. The grammar allows for such things as optional parts, mixtures at the part level, explicit occlusion reasoning, and recursive parts (parts of parts). The grammar model itself is defined by hand, while all parameters (HOG template appearances, deformation parameters, scores of productions) are learned using latent SVMs (an extension of what was done for DPMs).
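To give a feel for the formalism, here is a toy sketch of how a grammar scores a derivation. It is illustrative only: the real model also optimizes over part placements and subtracts deformation costs, and all scores below are made-up numbers. A nonterminal expands via its best-scoring production; a production's score is its learned bias plus the scores of its child symbols; terminals return an appearance (HOG filter) score. An optional or occluded part is simply a symbol with a second, empty production.

```python
# Hypothetical filter responses for terminal symbols at some image location.
terminal_scores = {"head": 2.0, "torso": 1.5, "legs": -0.5}

# grammar[symbol] = list of productions; each production is (bias, children).
grammar = {
    "person": [(0.0, ["head", "torso", "lower"])],
    "lower":  [(0.0, ["legs"]),    # fully visible person
               (-0.2, [])],        # occluded: legs absent, small penalty
}

def best_score(symbol):
    """Best derivation score for a symbol (terminal or nonterminal)."""
    if symbol in terminal_scores:          # terminal: appearance/filter response
        return terminal_scores[symbol]
    return max(bias + sum(best_score(c) for c in children)
               for bias, children in grammar[symbol])

# Here the occlusion production wins, since the legs filter scores poorly.
score = best_score("person")
```

The max over productions is what lets one hand-written grammar cover visible and occluded people without a separate mixture component per visibility pattern.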

The NIPS paper is quite dense. To understand the model better I found it useful to first briefly skim the Wikipedia entry on Context-Free Grammars and to read the earlier tech report, which goes into far more detail on defining the model itself. Online talks by Ross (link), Pedro (link) and David (link) are also helpful.

The particular grammar used achieves 49.5 AP on the person category in PASCAL with contextual reasoning. For comparison, DPMs in their latest incarnation achieved 44.4 AP (in their raw form) and poselets (the state-of-the-art algorithm on this data) achieved 48.5 AP. My impression is that the largest gains come from the fact that the model performs explicit occlusion reasoning, which is very cool, and hence does not require a mixture model at the object level or bounding box prediction. Some example results:


The general idea behind this work is quite interesting. My impression, though, is that object grammars face major hurdles in practice when applied to a wide range of categories. I see three issues: (1) the model must be defined by hand for each category, (2) initialization is likely crucial and obtaining a good initialization is not trivial (for the person category the authors were able to use a trick for initialization, described in S4.3), and (3) performance seems to be quite sensitive to the particular grammar used. In his talk, David discusses how they tried typing in numerous grammars for person and it was finally the occlusion-style model that really helped. I suspect that if the grammars worked for a wider range of categories we would have seen results on all of PASCAL.

Regardless, the paper provides great food for thought and suggests future directions where DPMs can go.