Toward a General Framework for Words & Pictures

This material is based upon work supported by the National Science Foundation under the Faculty Early Career Development (CAREER) Program: Award #1054133

PI: Tamara L. Berg

Funded Students: Kota Yamaguchi, Vicente Ordonez

"It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin--that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns" -- description of Scarlett O'Hara, Gone with the Wind.


Abstract

Pictures convey a visual description of the world directly to their viewers. Computer vision strives to design algorithms that extract the underlying world state captured in the camera's eye, with an overarching goal of general computational image understanding. To date, much vision research has approached image understanding by focusing on object detection, which is only one perspective on the image understanding problem. This project looks at an additional, complementary way to collect information about the visual world -- by directly analyzing the enormous amount of visually descriptive text on the web to reveal what information is useful to attach to, and extract from, pictures. This project presents a comprehensive research program geared toward modeling and exploiting the complementary nature of words and pictures. One main goal is studying the connection between text and images to learn about depiction -- the communication of meaning through pictures. This goal is addressed through three broad challenges: 1) developing a richer vocabulary to describe the information provided by depiction; 2) developing image representations that can visually capture this more nuanced vocabulary; and 3) constructing a comprehensive joint words-and-pictures framework.

This project has direct significance for many concrete tasks that access images on the internet, including image search, browsing, and organization, as well as commercial applications such as product search and societally important applications such as web assistance for the blind. Additionally, outputs of this project, including progress toward a natural vocabulary and structure for visual description, have great potential for cross-cutting impact in both the computer vision and natural language communities.


Projects and Publications Funded

  • Predicting Entry-Level Categories
    Vicente Ordonez, Wei Liu, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg,
    To appear in International Journal of Computer Vision (IJCV) 2015.
    Project Page

  • Learning to Name Objects
    Vicente Ordonez, Wei Liu, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg,
    To appear in Communications of the ACM (CACM) 2015.

  • Refer-to-as Relations as Semantic Knowledge
    Song Feng, Sujith Ravi, Ravi Kumar, Polina Kuznetsova, Wei Liu, Alexander C. Berg, Tamara L. Berg, Yejin Choi,
    AAAI Conference on Artificial Intelligence (AAAI) 2015.

  • ReferItGame: Referring to Objects in Photographs of Natural Scenes
    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, Tamara L. Berg,
    Empirical Methods in Natural Language Processing (EMNLP) 2014.
    Project Page, ReferItGame

  • TREETALK: Composition and Compression of Trees for Image Descriptions
    Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, Yejin Choi,
    Transactions of the Association for Computational Linguistics (TACL) - to be presented at EMNLP 2014.

  • Learning High-level Judgments of Urban Perception
    Vicente Ordonez, Tamara L. Berg,
    European Conference on Computer Vision (ECCV) 2014.
    Project Page

  • Chic or Social: Visual Popularity Analysis in Online Fashion Networks
    Kota Yamaguchi, Tamara L. Berg, Luis E. Ortiz,
    ACM Multimedia (ACM MM) 2014.

  • Materials Discovery: Fine-Grained Classification of X-ray Scattering Images
    Hadi Kiapour, Kevin Yager, Alexander C. Berg, Tamara L. Berg,
    Winter Conference on Applications of Computer Vision (WACV) 2014.

  • From Large Scale Image Categorization to Entry-Level Categories
    Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg,
    International Conference on Computer Vision (ICCV) 2013 (oral).
    Winner of the Marr prize

  • Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing
    Kota Yamaguchi, Hadi Kiapour, Tamara L. Berg,
    International Conference on Computer Vision (ICCV) 2013.
    Paperdoll parsing demo

  • Exploring the role of gaze behavior and object detection in scene understanding
    Kiwon Yun, Yifan Peng, Dimitris Samaras, Greg Zelinsky, Tamara L. Berg
    Frontiers in Psychology, Perception Science, Dec 2013.

  • Generalizing Image Captions for Image-Text Parallel Corpus
    Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi
    Association for Computational Linguistics (ACL) 2013.

  • Studying Relationships Between Human Gaze, Description, and Computer Vision
    Kiwon Yun, Yifan Peng, Greg Zelinsky, Dimitris Samaras, Tamara L. Berg
    Computer Vision and Pattern Recognition (CVPR) 2013.

  • BabyTalk: Understanding and Generating Simple Image Descriptions
    Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg
    Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.

  • Two-person Interaction Detection Using Body-Pose Features and Multiple Instance Learning [pdf]
    Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, Dimitris Samaras
    The 2nd International Workshop on Human Activity Understanding from 3D Data at the Conference on Computer Vision and Pattern Recognition (CVPR) 2012.

  • Collective Generation of Natural Image Descriptions [pdf]
    Polina Kuznetsova, Vicente Ordonez, Alex Berg, Tamara L. Berg, Yejin Choi
    Association for Computational Linguistics (ACL) 2012.

  • Midge: Generating Image Descriptions From Computer Vision Detections [pdf]
    Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara L. Berg, Hal Daume III
    European Chapter of the Association for Computational Linguistics (EACL) 2012.

  • Understanding and Predicting Importance in Images [pdf]
    Karl Stratos, Aneesh Sood, Alyssa Mensch, Xufeng Han, Margaret Mitchell, Kota Yamaguchi, Jesse Dodge, Amit Goyal, Hal Daume III, Alex Berg, Tamara L. Berg
    Computer Vision and Pattern Recognition (CVPR) 2012.

  • Detecting Visual Text [pdf]
    Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi, Yejin Choi, Hal Daume III, Alex C Berg, Tamara L Berg,
    North American Chapter of the Association for Computational Linguistics (NAACL) 2012.

  • Im2Text: Describing Images Using 1 Million Captioned Photographs [pdf]
    Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
    Neural Information Processing Systems (NIPS) 2011.
    Dataset: SBU Captioned Photo Dataset

  • Composing Simple Image Descriptions using Web-scale N-grams. [pdf]
    Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, Yejin Choi
    Computational Natural Language Learning (CoNLL) 2011.

  • Baby Talk: Understanding and Generating Simple Image Descriptions [pdf]
    Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2011 (oral).

  • High Level Describable Attributes for Predicting Aesthetics and Interestingness [pdf]
    Sagnik Dhar, Vicente Ordonez, Tamara L. Berg,
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2011.