Peter Kontschieder and co-authors have written a number of interesting papers that hack random forests to accomplish tasks beyond standard classification / regression. In 2011 they published two papers at ICCV and BMVC (pdf, pdf) on structured learning for semantic image labeling and in 2012 a paper at NIPS (pdf) on context sensitive random forests for object detection. What these papers all have in common is that they exploit the strengths of random forests and the format of the training data to learn over images in an interesting manner. In this blog post I focus on the first two papers but the third is similar in spirit.
In semantic segmentation the goal is to assign each pixel a label (e.g. grass, sky, cow). The foundational work of Jamie Shotten used random forests to predict the label of each pixel independently (using a patch surrounding the pixel) turning the problem into a multi-class classification problem. The predictions obtained this way are reasonable but noisy so a smoothing approach such as a CRF is employed as a second step. Peter and co-authors propose a very natural yet powerful variant of this idea: what if instead of predicting the label of the center pixel of a patch, the tree predicts the labels of a subset of the pixels or even the entire patch?
The cool thing is it’s not all that hard to alter a random forest to achieve structured prediction. The figure at the top of this post demonstrates the basic idea. For classification, during training the label of each training example (e.g. of the center pixel) is used to determine the split. Instead one can *sample* the label on which to split on if each example contains multiple labels (e.g. for the whole patch). This works because individual splits in random forests are inherently noisy anyway – it’s the application of multiple consecutive splits that results in child nodes with relatively pure distributions of the labels. The idea can be generalized: so long as there’s a (noisy) way of deriving a multi-class label from a structured label the concept should apply.
For semantic segmentation, the structured forest predicts the patch labels for an entire neighborhood of pixels. Each tree in the forest at each pixel location predicts such a labeling. All these labels need to be combined into a single global solution for the image: the authors formulate this as a “label puzzle game” and offer an iterative approach guaranteed to converge to a local minimum. This step removes the need for CRFs and leads to better performance on standard datasets. Both the ICCV and BMVC papers cover this overall outline, with the ICCV focusing more on the structured forests and the BMVC more on solving for the global solution (note: the papers cover very similar material and should probably have been a single submission).
In each of the following rows the authors show (1) the original image, (2) baseline random forest output, (3) structured forest+puzzle solver output:
Predicting labels for the whole patch (and combining the results in a smart way) removes the “noisy pixel” problem. The approach is conceptually appealing and results are reasonable. Random forests are naturally well suited for structured learning in this manner: by design random forests are trained in a noisy manner, so introducing an alternative (noisy) splitting criteria works well in practice. Moreover random forests can “memorize” the structured labels observed at the leaves during training so at test time inference (per patch) is trivial.
This paper is a great example of why machine learning research greatly benefits when grounded in a real problem: the structure of the data defines a unique learning problem that could never be formulated on “black box” datasets where training and testing data are clearly defined and the real data abstracted away (e.g. feature vectors and labels). Formulating novel learning paradigms for real datasets is among the most interesting directions in computer vision.