Paper of the day: Object Detection with Grammar Models. Ross Girshick, Pedro Felzenszwalb, David McAllester. NIPS 2011.

The paper applies formal grammars to object recognition. The idea of using grammars for computer vision has a long history, but an approach that achieves compelling results on PASCAL (specifically the person category) is quite new. The work is essentially an extension of Deformable Part Models (DPM) that tries to formalize the notion of objects, parts, mixture models, etc., by defining a formal grammar for objects. The grammar allows for optional parts, mixtures at the part level, explicit occlusion reasoning and recursive parts (parts of parts). The grammar structure is defined by hand, while all parameters (HOG template appearances, deformation parameters, scores of productions) are learned using latent SVMs (an extension of what was done for DPMs).
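To make the grammar idea concrete, here is a minimal sketch in Python. It is not the grammar from the paper: the symbol names, productions and scores are all made up for illustration, and it ignores image positions, deformation costs and HOG filters entirely. It only shows how optional parts, part-level mixtures and occlusion can be written as scored productions, and how detection amounts to picking the highest-scoring derivation.

```python
# Hypothetical toy grammar: nonterminal -> list of (production_score, children).
# Any symbol that is not a key here is treated as a terminal (a part filter).
GRAMMAR = {
    "PERSON": [
        (0.0, ["HEAD", "TORSO", "LEGS"]),       # fully visible person
        (-0.5, ["HEAD", "TORSO", "OCCLUDER"]),  # legs hidden behind an occluder
        (-1.0, ["HEAD", "OCCLUDER"]),           # only the head is visible
    ],
    "HEAD": [
        (0.0, ["head_frontal"]),  # part-level mixture: two head subtypes
        (0.0, ["head_profile"]),
    ],
    "TORSO": [(0.0, ["torso", "ARM"])],  # a part with its own sub-part
    "ARM": [
        (0.0, ["arm"]),
        (-0.2, []),               # optional part: derive nothing, pay a penalty
    ],
    "LEGS": [(0.0, ["legs"])],
    "OCCLUDER": [(0.0, ["occluder"])],
}


def best_derivation(symbol, terminal_scores):
    """Return (score, terminals) for the highest-scoring derivation of symbol.

    terminal_scores maps each terminal name to an appearance score, standing
    in for a filter response at a fixed location (an assumption of this sketch).
    """
    if symbol not in GRAMMAR:
        return terminal_scores[symbol], [symbol]
    best = None
    for production_score, children in GRAMMAR[symbol]:
        total = production_score
        terminals = []
        for child in children:
            child_score, child_terminals = best_derivation(child, terminal_scores)
            total += child_score
            terminals += child_terminals
        if best is None or total > best[0]:
            best = (total, terminals)
    return best


# Made-up appearance scores for a fully visible person...
visible = {"head_frontal": 1.0, "head_profile": 0.2, "torso": 0.8,
           "arm": 0.5, "legs": 0.9, "occluder": 0.1}
# ...and for one whose legs and arm match poorly (e.g. hidden behind a desk).
occluded = dict(visible, legs=-2.0, arm=-1.0)

print(best_derivation("PERSON", visible))
print(best_derivation("PERSON", occluded))
```

With the first set of scores the fully visible production wins. With the second, the derivation that swaps the legs for an occluder and drops the arm scores higher, which is the kind of choice the occlusion reasoning is meant to capture.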

The NIPS paper is quite dense. To understand the model better, I found it useful to first skim the Wikipedia entry on context-free grammars and then read the earlier tech report, which goes into far more detail on defining the model itself. Online talks by Ross (link), Pedro (link) and David (link) are also helpful.

The particular grammar used achieves 49.5 AP on the person category in PASCAL with contextual reasoning. For comparison, DPMs in their latest incarnation achieved 44.4 AP (in their raw form), and poselets (the state-of-the-art algorithm on this data) had 48.5 AP. My impression is that the largest gains come from the fact that the model performs explicit occlusion reasoning, which is very cool, and hence does not require a mixture model at the object level or bounding box prediction.

The general idea behind this work is quite interesting. My impression, though, is that making object grammars work in practice for a wide range of categories faces major limitations. I see three issues: (1) the model must be defined by hand for each category, (2) initialization is likely crucial and obtaining a good initialization is not trivial (for the person category the authors were able to use a trick, described in Section 4.3), and (3) performance seems to be quite sensitive to the model used. In his talk, David discusses how they tried typing in numerous grammars for person, and it was finally the occlusion-style model that really helped. I suspect that if the grammars worked for a wider range of categories we would have seen results on all of PASCAL.

Regardless, the paper provides great food for thought and suggests future directions where DPMs can go.