Paper of the day: Unsupervised Discovery of Mid-level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. ECCV 2012. Website+code.
The authors describe their goal as learning a “fully unsupervised mid-level visual representation”. A good way to think about their approach is as an improvement over the standard “bag of words” model. “Visual words” are typically created by clustering a large set of patches (sparsely detected using an interest point detector). While bag of words and extensions are reasonably effective, the “visual words” themselves tend to code fairly simple structures such as bars or corners. Instead, the goal here is to find “discriminative patches”, that while similar to “visual words”, satisfy two properties: they are frequent and discriminative.
The idea is to discover structures that occur frequently in a “discovery dataset” D and are distinctive from most patches in a massive “natural world dataset” N. This can be done via an iterative discriminative clustering scheme: cluster patches from D and for each cluster train a classifier (HOG+linear SVM) against all patches in N (using bootstrapping). Resulting classifiers that achieve high scores on patches from D and low scores on patches from N, and fire sufficiently densely on D, are consider good discriminative patches. I’m glossing over details (including a nice trick of splitting D and N into two to have validation sets), but the basic idea is quite intuitive and straightforward.
Visualization of “visual words” versus “discriminative patches”:
The choice of discovery dataset D seems crucial since D must contain interesting patches with relatively high frequency. While the title includes the words “Unsupervised Discovery”, the more interesting discriminative patches are obtained when some weak supervision is used: specifically if the choice of D emphasizes particular object categories or scene structures. For example, using various categories from the MIT indoor-67 scene dataset for D gives the following discriminative patches:
This is pretty cool, these patches are definitely informative!
The high level goal of discovering representative and meaningful image structure is quite compelling. While the unsupervised aspect is interesting, the weakly supervised results are actually more promising. It would be great to see more application of discriminative patches (the authors show a few in their paper, but it seems like they’re barely scratching the surface here) and alternate ways of exploiting weak supervision. In that vein the authors published a paper at SIGGRAPH 2012 that is quite fun: What Makes Paris Look like Paris? (similar to supervised discriminative patches but supervision comes from geographic region in which the images were captured).
Discriminative patches certainly aren’t the last word in image representation (no pun intended) but they’re an interesting new direction!