Archive | paper RSS feed for this section

Evaluating Object Proposals

18 Nov

[Update 05/19/2015: Please see our arXiv paper for an expanded version of the experiments described in this post:]

Object proposals have quickly become a popular pre-processing step for object detection. Some time ago I surveyed the state-of-the-art in generating object proposals, and the field has seen a lot of activity since that time! In fact, Larry Zitnick and I proposed the fast and quite effective Edge Boxes algorithm at ECCV 2014 (source code) (sorry for the self promotion).

So, How Good are Detection Proposals, Really? In their BMVC 2014 paper of the same name, Jan Hosang, Rodrigo Benenson and Bernt Schiele address this question, performing an apples-to-apples comparison of nearly all methods published to date. Highly recommended reading. While multiple evaluation metrics for object proposals have been proposed, ultimately object proposals are a pre-processing step for detection. In the BMVC paper, Jan and co-authors evaluated object proposals coupled with Deformable Part Models (DPMs). Surprisingly, some recent proposal methods which seemed quite promising based on recall plots resulted in rather poor detector performance.

So, I set out to see if there is a link between object proposal recall at various Intersection over Union (IoU) thresholds and detector average precision (AP). I asked the following: is proposal recall at a given IoU threshold (or thresholds) correlated with detector AP? Or do we need an alternate method for measure effectiveness of object proposals?

For the analysis I used the AP Pascal 2007 test scores for the DPM variant LM-LLDA. The AP numbers are obtained from Table 1 of Jan et al.’s BMVC paper. Recall values of each proposal method for each IoU threshold (at 1000 proposals) can be obtained from Figure 4b in Jan’s paper. However, while I used boxes provided by Jan I ran my own evaluation code (available here) to obtain the exact recall values (and the curves/numbers differ slightly from Figure 4b). Source code for reproducing these results can be found at the end of this post.

Here are the recall curves for 12 of the 14 methods evaluated by Jan et al. I excluded “Uniform” and “Superpixels” from the analysis since there AP is extremely low (under 20%). This is similar to Figure 4b from Jan’s paper except as mentioned I used my own evaluation code which differs slightly.


In the first experiment I computed the Pearson correlation coefficient between recall at various IoU values and AP. Here is the result:corr2For IoU values between 0.7 and 0.8 the correlation is quite high, reaching an r value of .976 at IoU of 0.7. What this means is that knowing recall of a proposal method at IoU value between 0.7 and 0.8 very strongly predicts AP for the proposals. Conversely, recall below about 0.6 or above 0.9 is only weakly predictive of detector performance.

Here is a scatter plot of recall at IoU of 0.7 versus AP for the 12 methods (a best fit line is shown):

corr3As the high correlation coefficient suggests, there is a very strong linear relationship between recall at IoU=0.7 and AP.

I also analyzed correlation between recall across all IoU thresholds and AP by considering the area under the curve (AUC). However, correlation between AUC and AP is slightly weaker: r=0.9638. This makes sense as recall at low/high IoU values is not predictive of AP as discussed above. Instead, we can measure correlation between partial AUC (for a fixed threshold range) and AP. I found the optimal partial AUC for predicting AP is in the range of IoU thr of 0.625 to 0.850. The correlation in this case is r=0.984, meaning the partial AUC between these recall values predicts AP quite well. Here is the plot:

corr4Overall, this is a fairly strong result. First, it implies that we should be optimizing proposals for recall at IoU threshold in the range of 0.625 to 0.850 (at least for DPMs). Note that his is despite the fact that in all the experiments AP was measured at a IoU of 0.5 for detector evaluation. Second, it means that we should be able to predict with reasonable reliability detector performance given recall in the given IoU range. One thing is clear: it is probably best to stop optimizing proposals for IoU of 0.5.

The main limitation of the analysis is that it is only for a single detector (DPM variant) on a single dataset (PASCAL 2007). It would be great to repeat for another detector, especially a modern detector based on convolutional neural networks such as Ross Girshick’s RCNN. Second, in the long term we definitely want to move toward having more accurate proposals. Presumably, as proposal accuracy improves, so will our ability to exploit the more accurate proposals.

I would like to thank Jan Hosang and Rodrigo Benenson for providing the output of all object proposal methods in a standardized format, and for Jan, Rodrigo and Ross Girshick for excellent discussions on the subject.

Matlab code for reproducing all results is below. Copying/pasting into Matlab should work. It is simple to swap in AP/recall for other detectors/datasets.

%% Recall matrix used for experiments nThresholds x nAlgorithms
R=[0.9196    0.8537 0.9197 0.8779 0.9116 0.8835 0.7724 0.8614 0.8107 0.8965 0.7737 0.4643
   0.8830 0.8339 0.9100 0.8566 0.8955 0.8669 0.7556 0.8407 0.7848 0.8831 0.7490 0.4391
   0.8276 0.8104 0.9013 0.8358 0.8769 0.8433 0.7409 0.8177 0.7564 0.8649 0.7218 0.4129
   0.7472 0.7875 0.8909 0.8125 0.8560 0.8108 0.7247 0.7924 0.7290 0.8438 0.6951 0.3881
   0.6483 0.7630 0.8801 0.7866 0.8349 0.7632 0.7089 0.7641 0.7026 0.8202 0.6643 0.3571
   0.5487 0.7360 0.8659 0.7575 0.8086 0.6973 0.6906 0.7322 0.6725 0.7957 0.6307 0.3216
   0.4450 0.7104 0.8488 0.7241 0.7811 0.6048 0.6727 0.7016 0.6425 0.7693 0.5936 0.2823
   0.3529 0.6818 0.8266 0.6884 0.7548 0.4953 0.6509 0.6699 0.6120 0.7388 0.5534 0.2423
   0.2743 0.6507 0.7941 0.6547 0.7241 0.3877 0.6263 0.6333 0.5737 0.7084 0.5076 0.2064
   0.2099 0.6178 0.7488 0.6160 0.6915 0.2879 0.6031 0.5964 0.5379 0.6740 0.4574 0.1711
   0.1587 0.5809 0.6855 0.5792 0.6560 0.2150 0.5724 0.5544 0.4961 0.6369 0.3985 0.1400
   0.1144 0.5396 0.6026 0.5342 0.6159 0.1578 0.5325 0.5106 0.4601 0.5913 0.3354 0.1109
   0.0850 0.4996 0.4982 0.4918 0.5713 0.1138 0.4599 0.4614 0.4175 0.5436 0.2685 0.0865
   0.0625 0.4490 0.3935 0.4417 0.5246 0.0812 0.3580 0.4154 0.3714 0.4931 0.1996 0.0662
   0.0456 0.3979 0.2886 0.3940 0.4728 0.0530 0.2369 0.3616 0.3256 0.4390 0.1368 0.0484
   0.0312 0.3430 0.1961 0.3424 0.4195 0.0328 0.1321 0.3020 0.2793 0.3791 0.0862 0.0352
   0.0225 0.2832 0.1258 0.2818 0.3542 0.0181 0.0648 0.2427 0.2245 0.3101 0.0480 0.0234
   0.0144 0.2149 0.0731 0.2135 0.2738 0.0086 0.0242 0.1786 0.1650 0.2363 0.0212 0.0141
   0.0091 0.1433 0.0358 0.1369 0.1834 0.0027 0.0051 0.1146 0.1022 0.1568 0.0076 0.0085
   0.0053 0.0621 0.0123 0.0623 0.0821 0.0004 0.0007 0.0498 0.0367 0.0696 0.0017 0.0046
   0.0001 0.0003 0.0003 0.0002 0.0002 0      0      0.0001 0.0001 0.0002 0      0 ];
save 'R' 'R';
%% define algorithm names, IoU thresholds, recall values and AP scores
T=.5:.025:1; load R; %% see below for R matrix actually used in plots
AP=[21.8 29.9 31.8 31.2 32.4 25.0 29.6 30.5 30.7 31.7 27.3 20.7];
% plot correlation versus IoU and compute best threshold t
S=corrcoef([R' AP']); S=S(1:end-1,end); [s,t]=max(S);
figure(2); plot(T,S,'-or'); xlabel('IoU'); ylabel('corr'); grid on;
% plot AP versus recall at single best threshold t
figure(3); R1=R(t,:); plot(R1,AP,'dg'); grid on; text(R1+.015,AP,nms);
xlabel(sprintf('recall at IoU=%.3f',T(t))); axis([0 1 20 34])
title(sprintf('correlation=%.3f',s)); ylabel('AP'); hold on;
p=polyfit(R1,AP,1); line([0 1],[p(2),sum(p)],'Color',[1 1 1]/3); hold off
% compute correlation against optimal range of thrs(a:b)
n=length(T); S=zeros(n,n);
for a=1:n, for b=a:n, s=corrcoef(sum(R(a:b,:),1),AP); S(a,b)=s(2); end; end
[s,t]=max(S(:)); [a,b]=ind2sub([n n],t); R1=mean(R(a:b,:),1);
figure(4); plot(R1,AP,'dg'); grid on; text(R1+.015,AP,nms);
xlabel(sprintf('recall at IoU=%.3f-%.3f',T(a),T(b))); axis([0 1 20 34])
title(sprintf('correlation=%.3f',s)); ylabel('AP'); hold on;
p=polyfit(R1,AP,1); line([0 1],[p(2),sum(p)],'Color',[1 1 1]/3); hold off
% make figures nicer and save
for i=2:4, figure(i); hs=get(gca,{'title','xlabel','ylabel'});
  hs=[gca hs{:}]; set(hs,'FontSize',12); end
for i=2:4, figure(i); savefig(['corr' int2str(i)],'png'); end


Structured Random Forests

8 Mar


Peter Kontschieder and co-authors have written a number of interesting papers that hack random forests to accomplish tasks beyond standard classification / regression. In 2011 they published two papers at ICCV and BMVC (pdf, pdf) on structured learning for semantic image labeling and in 2012 a paper at NIPS (pdf) on context sensitive random forests for object detection. What these papers all have in common is that they exploit the strengths of random forests and the format of the training data to learn over images in an interesting manner. In this blog post I focus on the first two papers but the third is similar in spirit.

In semantic segmentation the goal is to assign each pixel a label (e.g. grass, sky, cow). The foundational work of Jamie Shotten used random forests to predict the label of each pixel independently (using a patch surrounding the pixel) turning the problem into a multi-class classification problem. The predictions obtained this way are reasonable but noisy so a smoothing approach such as a CRF is employed as a second step.  Peter and co-authors propose a very natural yet powerful variant of this idea: what if instead of predicting the label of the center pixel of a patch, the tree predicts the labels of a subset of the pixels or even the entire patch?

The cool thing is it’s not all that hard to alter a random forest to achieve structured prediction. The figure at the top of this post demonstrates the basic idea. For classification, during training the label of each training example (e.g. of the center pixel) is used to determine the split. Instead one can *sample* the label on which to split on if each example contains multiple labels (e.g. for the whole patch). This works because individual splits in random forests are inherently noisy anyway – it’s the application of multiple consecutive splits that results in child nodes with relatively pure distributions of the labels. The idea can be generalized: so long as there’s a (noisy) way of deriving a multi-class label from a structured label the concept should apply.

For semantic segmentation, the structured forest predicts the patch labels for an entire neighborhood of pixels. Each tree in the forest at each pixel location predicts such a labeling. All these labels need to be combined into a single global solution for the image: the authors formulate this as a “label puzzle game” and offer an iterative approach guaranteed to converge to a local minimum. This step removes the need for CRFs and leads to better performance on standard datasets. Both the ICCV and BMVC papers cover this overall outline, with the ICCV focusing more on the structured forests and the BMVC more on solving for the global solution (note: the papers cover very similar material and should probably have been a single submission).

In each of the following rows the authors show (1) the original image, (2) baseline random forest output, (3) structured forest+puzzle solver output:


Predicting labels for the whole patch (and combining the results in a smart way) removes the “noisy pixel” problem. The approach is conceptually appealing and results are reasonable. Random forests are naturally well suited for structured learning in this manner: by design random forests are trained in a noisy manner, so introducing an alternative (noisy) splitting criteria works well in practice. Moreover random forests can “memorize” the structured labels observed at the leaves during training so at test time inference (per patch) is trivial.

This paper is a great example of why machine learning research greatly benefits when grounded in a real problem: the structure of the data defines a unique learning problem that could never be formulated on “black box” datasets where training and testing data are clearly defined and the real data abstracted away (e.g. feature vectors and labels).  Formulating novel learning paradigms for real datasets is among the most interesting directions in computer vision.

Discriminative Patches

20 Feb


Paper of the day: Unsupervised Discovery of Mid-level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. ECCV 2012. Website+code.

The authors describe their goal as learning a “fully unsupervised mid-level visual representation”. A good way to think about their approach is as an improvement over the standard “bag of words” model.  “Visual words” are typically created by clustering a large set of patches (sparsely detected using an interest point detector). While bag of words and extensions are reasonably effective, the “visual words” themselves tend to code fairly simple structures such as bars or corners. Instead, the goal here is to find “discriminative patches”, that while similar to “visual words”, satisfy two properties: they are frequent and discriminative.

The idea is to discover structures that occur frequently in a “discovery dataset” D and are distinctive from most patches in a massive “natural world dataset” N. This can be done via an iterative discriminative clustering scheme: cluster patches from D and for each cluster train a classifier (HOG+linear SVM) against all patches in N (using bootstrapping). Resulting classifiers that achieve high scores on patches from D and low scores on patches from N, and fire sufficiently densely on D, are consider good discriminative patches. I’m glossing over details (including a nice trick of splitting D and N into two to have validation sets), but the basic idea is quite intuitive and straightforward.

Visualization of “visual words” versus “discriminative patches”:


The choice of discovery dataset D seems crucial since D must contain interesting patches with relatively high frequency. While the title includes the words “Unsupervised Discovery”, the more interesting discriminative patches are obtained when some weak supervision is used: specifically if the choice of D emphasizes particular object categories or scene structures. For example, using various categories from the MIT indoor-67 scene dataset for D gives the following discriminative patches:


This is pretty cool, these patches are definitely informative!

The high level goal of discovering representative and meaningful image structure is quite compelling. While the unsupervised aspect is interesting, the weakly supervised results are actually more promising. It would be great to see more application of discriminative patches (the authors show a few in their paper, but it seems like they’re barely scratching the surface here) and alternate ways of exploiting weak supervision. In that vein the authors published a paper at SIGGRAPH 2012 that is quite fun: What Makes Paris Look like Paris? (similar to supervised discriminative patches but supervision comes from geographic region in which the images were captured).

Discriminative patches certainly aren’t the last word in image representation (no pun intended) but they’re an interesting new direction!

Object Detection Grammars

15 Dec


Paper of the day: Object Detection with Grammar Models. Ross Girshick, Pedro Felzenszwalb, David McAllester. NIPS 2011.

The paper describes an approach using formal grammars applied to object recognition — and while the idea of using grammars for computer vision has a long history — an approach that achieves compelling results on PASCAL (specifically the person category) is quite new. The work is essentially an extension of Deformable Part Models (DPM) that tries to formalize the notion of objects, parts, mixture models, etc., by defining a formal grammar for objects. The grammar allows for such things as optional parts, mixtures at the part levels, explicit occlusion reasoning and recursive parts (parts of parts). The grammar model is defined by hand, while learning of all parameters (HOG template appearances, deformation parameters, scores of productions) is performed using latent SVMs (an extension of what was done for DPMs).

The NIPS paper is quite dense. To understand the model better I found it useful to first briefly skim the Wikipedia entry on Context Free Grammars first and to read the earlier tech report that goes into far more detail on defining the model itself. Online talks by Ross (link), Pedro (link) and David (link) are also helpful.

The particular grammar used achieves 49.5 AP on the person category in PASCAL with contextual reasoning. For comparison, DPMs in their latest incarnation achieved 44.4 (in their raw form) and poselets (the state-of-the-art algorithm on this data) had 48.5 AP. My impression is that the largest gains come from the fact that the model performs explicit occlusion reasoning — which is very cool — and hence does not require a mixture model at the object level or bounding box prediction. Some example results:


The general idea behind this work is quite interesting. My impression though is that making object grammars work in practice for a wide range of categories has major limitations. I see three issues: (1) the model must be defined by hand for each category, (2) initialization is likely crucial and obtaining a good initialization is not trivial (for the person category there was a trick the authors were able to use for initialization described in S4.3), and (3) performance seems to be quite sensitive to the model used. In David’s talk, David discusses how they tried typing in numerous grammars for person and it was finally the occlusion style model that really helped. I suspect that if the grammars worked for a wider range of models we would have seen results on all of PASCAL.

Regardless, the paper provides great food for thought and suggests future directions where DPMs can go.

Superpixels Part II

12 Dec

[This is a followup post to my original post on Superpixels.]

Today’s “paper of the day”: M. Van den Bergh, X. Boix, G. Roig, B. de Capitani and Luc Van Gool, SEEDS: Superpixels Extracted via Energy-Driven Sampling, ECCV 12. Website (with code).

The paper describes a very fast algorithm — “SEEDS” — for generating superpixels. SEEDS starts by partitioning an image into a grid of square superpixels, and then refines each superpixel by shifting pixels (or blocks of pixels) at boundaries from one superpixel to another. What’s cool about this approach is that it can be made very fast — each update is local and only requires very simple comparisons to reach a decision about transferring pixel ownership. Moreover, the algorithm can be stopped at any point in time (this is how the authors are able to achieve 30Hz processing time — running the algorithm longer results in better segmentations). While the presentation of the paper has some issues (convoluted descriptions, overly complex notation, tiny unreadable figures), the algorithm itself is very simple, fast, and effective. Good stuff.

Some superpixel results (taken from author website):

78098 309040

The quality of the superpixel segmentations is quite high according to the standard error metrics for superpixels (see this paper for an excellent description of the error metrics). In fact, the quality of the superpixels is much higher than that of SLIC, the subject of my previous post. One issue I realized about the SLIC paper is that it presented and compared results at only a single setting of the number of superpixels — the SEEDS paper does a much better job of evaluating performance as a function of the number of superpixels.

Overall, I’m a big fan of the work on superpixels. In my opinion segmentation has received a disproportionate amount of attention in the literature — but really superpixels are much more useful (and more commonly used) than complete segmentations. Superpixels are an acknowledgement that segmentation is not an end in itself but rather as a pre-processing step for further computer vision.

What I’d really like to see is superpixel algorithms that can achieve extremely high boundary recall (currently boundary recall is at around 90% for 600 superpixels or so, although the localization of the edges is a bit rough). While of course this is THE goal of superpixel algorithms, it’s surprising to me that achieving higher recall is as challenging as it is. Segmentation is notoriously difficult — but with superpixels one can cheat by being very liberal at reporting edges. I’d be curious to see an analysis of what edges are still being missed. While there’s some edges that have no gradient (illusory contours), this does not seem to be the dominant issue.. So, what would it take to get higher boundary recall?


1 Dec

SLIC Superpixels Compared to State-of-the-art Superpixel Methods, Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. PAMI12. Website.

Achanta et al. analyze the requirement for superpixels, provide reasonable targets and performance measures, and compare a few of the leading approaches for superpixels (including Felzenszwalb and Huttenlocher’s approach and normalized cuts from Berkeley). They propose a new approach termed “simple linear iterative clustering” (SLIC). SLIC starts by defining cluster centers essentially in a grid over the image (with some smart initialization), assigning all remaining pixels to one of the cluster centers, recomputing the centers, and iterating. L2 distance over color and position is used to assign pixels to clusters. Only nearby cluster centers are checked for each pixel to keep things efficient.. I’m omitting some details, but that’s the gist of the algorithm. Couldn’t ask for anything simpler!

The result is a very fast algorithm (fraction of a second per image) and results are visually pleasing. Examples of superpixels at different scales (taken from author website):


The superpixels are fairly regular, compact, and provide good boundary recall (qualitative results are provided). Essentially all you could ask for from superpixels!

Source code (GPL unfortunately — people please stop using GPL so code can be used by all — consider BSD or LGPL) is provided on the project website. This is definitely THE algorithm I’ll use if ever I use superpixels in my code. Nice.

A great example of a simple algorithm that outperforms much more complex and computationally expensive approaches. (Note: similar ideas were proposed by my colleagues Larry Zitnick and Sing Bing Kang in their IJCV07 paper as the authors note). These ideas weren’t presented at a computer vision conference (indeed papers of this form often have trouble getting into CV conferences due to some misguided notion of “novelty”) — which is unfortunate because it may not get as much exposure as it deserves.

Going home tonight and will be thinking about applications for superpixels in sliding window object detection…