Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this talk, I'll present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively.
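As a rough sketch of the training objective mentioned above, the maximum-likelihood criterion can be written as follows. The symbols are assumptions introduced here for illustration and do not appear in the abstract: $I$ is a training image, $S = (S_1, \dots, S_N)$ its target description, and $\theta$ the model parameters. The second line is the standard chain-rule factorization over words, which is the form a recurrent decoder would typically model.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I,\,S)} \log p(S \mid I;\, \theta),
\qquad
\log p(S \mid I;\, \theta) = \sum_{t=1}^{N} \log p\left(S_t \mid I, S_1, \dots, S_{t-1};\, \theta\right)
```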
If time permits, I'll also describe an improvement on our basic image captioning approach that considers the discrepancy between how we train these models and how we actually use them at inference time, and how adding some exploration during training mitigates this problem.
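The train/inference discrepancy described above can be illustrated with a minimal sketch. During training a decoder usually receives the true previous word, while at inference it receives its own previous prediction. One way to add exploration is to feed the model's own prediction with some probability during training. Everything named below is hypothetical and not taken from the talk: the function names, the linear decay schedule, and the `predict_next` callable that stands in for the model.

```python
import random


def teacher_forcing_prob(step, total_steps, floor=0.0):
    """Probability of feeding the ground-truth token at a given training step.

    Assumption: a simple linear decay from 1.0 down to `floor`.
    Other schedules are possible; this one is only for illustration.
    """
    frac = 1.0 - step / float(total_steps)
    return max(floor, frac)


def build_decoder_inputs(target_tokens, predict_next, teacher_prob, rng):
    """Choose, position by position, what the decoder is fed during training.

    target_tokens: ground-truth caption, with a start token at index 0.
    predict_next:  hypothetical callable that maps the inputs chosen so far
                   to the model's guess for the next token.
    teacher_prob:  probability of using the ground-truth token.
    rng:           a random.Random instance, passed in for reproducibility.
    """
    inputs = [target_tokens[0]]
    for t in range(1, len(target_tokens)):
        if rng.random() < teacher_prob:
            # Standard training behaviour: feed the true token.
            inputs.append(target_tokens[t])
        else:
            # Exploration: feed the model's own prediction instead,
            # which is what happens at inference time.
            inputs.append(predict_next(inputs))
    return inputs
```

With `teacher_prob` set to 1.0 this reduces to ordinary training on ground-truth words, and with 0.0 the decoder sees only its own predictions after the start token. Intermediate values mix the two regimes.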
Joint work with Oriol Vinyals, Alex Toshev, Dumitru Erhan, Navdeep Jaitly and Noam Shazeer.
Samy Bengio (PhD in computer science, University of Montreal, 1993) has been a research scientist at Google since 2007. Before that, he was a senior researcher in statistical machine learning at IDIAP Research Institute from 1999. His most recent research interests are in machine learning, in particular deep learning, large scale online learning, image ranking and annotation, and music and speech processing. He is action editor of the Journal of Machine Learning Research and on the editorial board of the Machine Learning Journal. He was associate editor of the journal of computational statistics, general chair of the Workshops on Machine Learning for Multimodal Interactions (MLMI'2004-2006), programme chair of the International Conference on Learning Representations (ICLR'2015-2016), programme chair of the IEEE Workshop on Neural Networks for Signal Processing (NNSP'2002), and chair of BayLearn (2012-2015), and he has served several times on the programme committees of international conferences such as NIPS, ICML, ECML and ICLR. More information can be found on his website: http://bengio.abracadoudou.com.