Recently, a number of us interested in image captioning gathered at Berkeley to exchange ideas (many thanks to Trevor Darrell for hosting us). Present were many of the authors of the various recent image captioning papers as well as a few additional folks who have worked in the area. For a nice summary of the recent work in the image captioning space please see John Platt’s post. For reference, here is a list of the recent image/video captioning papers in no particular order:
- Baidu/UCLA: Explain Images with Multimodal Recurrent Neural Networks
- Toronto: Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- Berkeley: Long-term Recurrent Convolutional Networks for Visual Recognition and Description
- Google: Show and Tell: A Neural Image Caption Generator
- Stanford: Deep Visual-Semantic Alignments for Generating Image Descriptions
- UML/UT: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
- Microsoft/CMU: Learning a Recurrent Visual Representation for Image Caption Generation
- Microsoft: From Captions to Visual Concepts and Back
Note: I worked on the last of these papers in my past life 🙂
The various teams presented their work, and while I thought there would be a lot of repetition in the talks / ideas, if anything I was surprised by the diversity of insights. Oriol Vinyals had a nice high-level summary of the various approaches: the two main axes on which methods differ are whether they are “pipeline” or “end-to-end” systems and whether they perform “generation” or “retrieval”. Oriol advocated end-to-end generation systems. Personally, I think pipeline systems (where you first learn visual detectors and then a separate language module on top) have the major advantage that you can debug/develop/experiment with the modules independently, although I also prefer generation over retrieval. Regardless, there are a lot of subtleties in this space. I’ve heard the comment before that all these papers are the “same”; this is like saying all papers on deep learning are the same 😛
After presentations of the methods papers, the discussion focused on (1) datasets, (2) evaluation, and (3) next tasks. This was in many ways the most important part of the event imo.
(1) On the datasets front, COCO (http://mscoco.org/) has been adopted as the dataset of choice in many of the papers in the current batch of work on image captioning (full disclosure: I am part of the COCO team). I believe most of the papers in the list above evaluated on COCO. That is not to say that COCO is without its shortcomings (more on this later). The Flickr captioning dataset (http://vision.cs.stonybrook.edu/~vicente/sbucaptions/; EDIT: corrected link for Flickr 30K is: http://shannon.cs.illinois.edu/DenotationGraph/) was also used in a number of papers.
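(As an aside, for anyone who hasn’t used COCO: below is a minimal sketch of pulling the reference captions for an image with the pycocotools API; the annotation path is just a placeholder, not anything from the papers above. The fact that every image comes with several independently written captions matters for the evaluation discussion below.)

```python
# Minimal sketch of pulling the human reference captions for one COCO image
# via the pycocotools API; the annotation path is a placeholder for a local copy.
from pycocotools.coco import COCO

coco = COCO('annotations/captions_val2014.json')  # caption annotations (assumed path)
img_id = coco.getImgIds()[0]                      # grab an arbitrary image
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    print(ann['caption'])                         # several independently written captions
```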
(2a) As far as evaluation metrics, the captioning community is in a bit of disarray. While the BLEU metric was adopted by many of the groups, the results in the various papers are NOT comparable due to various subtleties/choices in the BLEU metric. Lame. The COCO team is working to remedy this by setting up an automatic evaluation server where authors upload captions and comparisons are automatically generated (very standard stuff). We should have this up and running in a few weeks, and many of the teams seem interested in uploading and comparing results once it is live. Then we will know who is best!
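To give a concrete sense of how this happens, here is a small illustrative snippet (using NLTK purely for convenience; none of the groups necessarily used it). The “same” BLEU metric yields rather different numbers depending on the n-gram weights and smoothing you choose, and sentence-level vs corpus-level aggregation is yet another source of divergence:

```python
# Illustrative only: BLEU depends heavily on seemingly minor choices.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a giraffe standing next to a tall tree".split(),
    "a giraffe eats leaves from a tree".split(),
]
candidate = "a giraffe is standing by a tree".split()

smooth = SmoothingFunction().method1
print(sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))   # BLEU-1 (unigrams only)
print(sentence_bleu(references, candidate, weights=(0.25,) * 4))    # BLEU-4, no smoothing (~0 here)
print(sentence_bleu(references, candidate,
                    weights=(0.25,) * 4,
                    smoothing_function=smooth))                     # BLEU-4, smoothed
```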
(2b) The evaluation of captions needs improvement. The BLEU metric is deeply flawed in that automatic methods are now outperforming humans according to BLEU (!!). Devi Parikh gave a very nice talk about evaluating captioning. If you are interested in this space I strongly recommend her recent paper: http://arxiv.org/abs/1411.5726. The main takeaway is that while automatic evaluation of captions is noisy, given enough human reference captions for the same image (~50 or so), automatic metrics become well correlated with human judgement of caption quality. Well, BLEU still sucks, but there are other metrics that do better (e.g. METEOR and the newly introduced CIDEr are well correlated with human judgement given enough reference captions). Given Devi’s findings, the COCO team is adopting the new metric (CIDEr) along with the old (BLEU, METEOR) and labeling a subset (5%) of the test images in COCO with 40 captions each.
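For intuition on why consensus helps, here is a toy score in the spirit of CIDEr. It is emphatically not the official implementation (real CIDEr stems words, keeps separate vectors per n-gram length, and computes document frequencies over the full corpus); it just shows the basic idea of TF-IDF weighted n-gram agreement against a set of reference captions:

```python
# Toy consensus-style similarity, loosely inspired by CIDEr (not the real thing).
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vec(tokens, df, num_docs, n_max=4):
    counts = Counter(g for n in range(1, n_max + 1) for g in ngrams(tokens, n))
    # rare n-grams (low document frequency) get up-weighted, common ones down-weighted
    return {g: c * math.log(num_docs / max(1.0, df.get(g, 0.0))) for g, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def consensus_score(candidate, references, df, num_docs):
    c = tfidf_vec(candidate.lower().split(), df, num_docs)
    refs = [tfidf_vec(r.lower().split(), df, num_docs) for r in references]
    # the more (and more diverse) reference captions, the stabler this signal becomes
    return sum(cosine(c, r) for r in refs) / len(refs)

# toy "corpus": made-up reference captions for two images
all_refs = {
    "img1": ["a giraffe next to a tree", "a giraffe standing by a tall tree"],
    "img2": ["a man riding a wave on a surfboard", "a surfer rides a large wave"],
}
num_docs = len(all_refs)
df = Counter()
for refs in all_refs.values():
    seen = set()
    for r in refs:
        for n in range(1, 5):
            seen.update(ngrams(r.lower().split(), n))
    df.update(seen)  # document frequency: in how many images each n-gram appears

print(consensus_score("a giraffe is standing near a tree", all_refs["img1"], df, num_docs))
```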
(3a) Next tasks. Captioning is a potentially appealing task because in theory it requires (1) a detailed understanding of an image and (2) the ability to communicate that information via natural language. Unfortunately, the consensus is that automatic captioning (at least on many images) can be done with only partial understanding of the image and rudimentary language skills. My collaborator Larry Zitnick coined this the “giraffe-tree” problem: the caption “a giraffe next to a tree” is a valid caption for a high percentage of images containing giraffes. In fact, *originality* of the generated captions is a big issue: depending on the approach, a sizable percentage of the automatically generated captions on test images are exact duplicates of captions in the training set.
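Measuring this is at least easy; here is a minimal sketch (with made-up placeholder captions) of the duplicate-rate statistic:

```python
# Quantifying the originality issue: what fraction of generated test captions
# appear verbatim in the training set? (The captions below are placeholders.)
def duplicate_rate(generated_captions, train_captions):
    train_set = {c.strip().lower() for c in train_captions}
    dupes = sum(c.strip().lower() in train_set for c in generated_captions)
    return dupes / float(len(generated_captions))

train_captions = ["a giraffe next to a tree", "a man riding a horse"]
generated_captions = ["a giraffe next to a tree", "a dog laying on a couch"]
print(duplicate_rate(generated_captions, train_captions))  # -> 0.5
```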
(3b) A lot of discussion then centered on alternative or more challenging tasks than image captioning that would require deeper understanding of the image content and more sophisticated language models to communicate this information. Question answering was perhaps the most popular suggestion; the issue with Q&A is that large-scale Q&A datasets are difficult to define/gather (e.g. while the Q&A dataset presented at NIPS by Mario Fritz http://arxiv.org/abs/1410.0210 is a nice first step, it is relatively small scale and the diversity of the questions is low). Folks from MPI presented a very impressive new video-captioning dataset (http://arxiv.org/abs/1501.02530); although quite promising, it is still somewhat small (50K videos/captions) given the diversity of videos in the world. Tamara Berg (http://www.tamaraberg.com/) gave a great talk about some alternative strategies for gathering information about an image; see her recent papers with “refer” in the title (e.g. ReferIt). Defining the right challenges is a wide-open problem; however, that is not to say that image captioning is solved by any stretch of the imagination.
Having focused on the negatives of the image captioning task for most of this post, I did want to conclude by stating that in general there was a LOT of excitement in the room. While numerous groups converged on relatively similar solutions in a short period of time, and there was some unfortunate attention from the popular media, this should not undermine how cool this topic really is. When we started on this problem we had no idea automatic image captioning would work as well as it does! Anyway, clearly the current batch of work is just a first step, but many of us in the room were excited about the possibilities going forward.
The Flickr30K data set that is used in these papers is available from http://nlp.cs.illinois.edu/Denotations.html
Julia: thank you for pointing this out. I edited the post to reflect the correct link. Much appreciated. -Piotr
[And actually here is the link that worked for me: http://shannon.cs.illinois.edu/DenotationGraph/]
Nice summary of the workshop! After the meeting, another interesting aspect that Kate and I thought of was that none of the generation metrics captured the specificity or detail of the generated sentence. E.g. a description for an image could read “A woman is jumping on bars.” vs. “A gymnast is jumping on bars.” Although most people might just say man/woman (which is likely to increase CIDEr scores), saying gymnast and being more specific might be a lot more informative. It would be interesting to come up with a metric to capture that.
-Subhashini Venugopalan