Toward a General Framework for Words & Pictures
This material is based upon work supported by the National Science Foundation under the Faculty Early Career Development (CAREER) Program: Award #1054133
PI: Tamara L. Berg
Funded Students: Kota Yamaguchi, Vicente Ordonez
``It was an arresting face, pointed of chin,
square of jaw. Her eyes were pale green without a touch of hazel,
starred with bristly black lashes and slightly tilted at the ends.
Above them, her thick black brows slanted upward, cutting a
startling oblique line in her magnolia-white skin--that skin so
prized by Southern women and so carefully guarded with bonnets,
veils and mittens against hot Georgia suns'' -- description of
Scarlett O'Hara, Gone with the Wind.
Abstract
Pictures convey a visual description of the world directly to their viewers.
Computer vision strives to design algorithms to extract the underlying world
state captured in the camera's eye, with an overarching goal of general
computational image understanding. To date, much vision research has approached
image understanding by focusing on object detection, which is only one
perspective on the problem. This project examines an additional, complementary
way to collect information about the visual world -- by directly analyzing the
enormous amount of visually descriptive text on the web to reveal what
information is useful to attach to, and extract from, pictures. The project
presents a comprehensive research program geared toward modeling and exploiting
the complementary nature of words and pictures. One main goal is studying the
connection between text and images to learn about depiction -- the
communication of meaning through pictures. This goal is addressed through three
broad challenges: 1) developing a richer vocabulary to describe the information
provided by depiction; 2) developing image representations that can visually
capture this more nuanced vocabulary; and 3) constructing a comprehensive joint
words-and-pictures framework.
This project has direct significance for many concrete tasks that access images
on the internet, including image search, browsing, and organization; commercial
applications such as product search; and societally important applications such
as web assistance for the blind. Additionally, outputs of
this project, including progress toward a natural vocabulary and structure for
visual description, have great potential for cross-cutting impact in both the
computer vision and natural language communities.
Projects and Publications Funded
Predicting Entry-Level Categories
Vicente Ordonez,
Wei Liu,
Jia Deng,
Yejin Choi,
Alexander C. Berg,
Tamara L. Berg,
To appear in International Journal of Computer Vision (IJCV) 2015.
Project Page
Learning to Name Objects
Vicente Ordonez,
Wei Liu,
Jia Deng,
Yejin Choi,
Alexander C. Berg,
Tamara L. Berg,
To appear in Communications of the ACM (CACM) 2015.
Refer-to-as Relations as Semantic Knowledge
Song Feng,
Sujith Ravi,
Ravi Kumar,
Polina Kuznetsova,
Wei Liu,
Alexander C. Berg,
Tamara L. Berg,
Yejin Choi,
AAAI Conference on Artificial Intelligence (AAAI) 2015.
ReferItGame: Referring to Objects in Photographs of Natural Scenes
Sahar Kazemzadeh,
Vicente Ordonez,
Mark Matten,
Tamara L. Berg,
Empirical Methods in Natural Language Processing (EMNLP) 2014.
Project Page,
ReferItGame
TREETALK: Composition and Compression of Trees for Image Descriptions
Polina Kuznetsova,
Vicente Ordonez,
Tamara L. Berg,
Yejin Choi,
Transactions of the Association for Computational Linguistics (TACL); presented at EMNLP 2014.
Learning High-level Judgments of Urban Perception
Vicente Ordonez,
Tamara L. Berg,
European Conference on Computer Vision (ECCV) 2014.
Project Page
Chic or Social: Visual Popularity Analysis in Online Fashion Networks
Kota Yamaguchi,
Tamara L. Berg,
Luis E. Ortiz,
ACM Multimedia (ACM MM) 2014.
Materials Discovery: Fine-Grained Classification of X-ray Scattering Images
Hadi Kiapour,
Kevin Yager,
Alexander C. Berg,
Tamara L. Berg,
Winter Conference on Applications of Computer Vision (WACV) 2014.
From Large Scale Image Categorization to Entry-Level Categories
Vicente Ordonez,
Jia Deng,
Yejin Choi,
Alexander C. Berg,
Tamara L. Berg,
International Conference on Computer Vision (ICCV) 2013 (oral).
Winner of the Marr Prize
Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing
Kota Yamaguchi,
Hadi Kiapour,
Tamara L. Berg,
International Conference on Computer Vision (ICCV) 2013.
Paperdoll parsing demo
Exploring the role of gaze behavior and object detection in scene understanding
Kiwon Yun,
Yifan Peng,
Dimitris Samaras,
Greg Zelinsky,
Tamara L. Berg
Frontiers in Psychology, Perception Science, Dec 2013.
Generalizing Image Captions for Image-Text Parallel Corpus
Polina Kuznetsova,
Vicente Ordonez,
Alexander C. Berg,
Tamara L. Berg,
Yejin Choi
Association for Computational Linguistics (ACL) 2013.
Studying Relationships Between Human Gaze, Description, and Computer Vision
Kiwon Yun,
Yifan Peng,
Greg Zelinsky,
Dimitris Samaras,
Tamara L. Berg
Computer Vision and Pattern Recognition (CVPR) 2013.
BabyTalk: Understanding and Generating Simple Image Descriptions
Girish Kulkarni,
Visruth Premraj,
Vicente Ordonez,
Sagnik Dhar,
Siming Li,
Yejin Choi,
Alexander C. Berg,
Tamara L. Berg
Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
Two-person Interaction Detection Using Body-Pose Features and Multiple Instance Learning
[pdf]
Kiwon Yun,
Jean Honorio,
Debaleena Chattopadhyay,
Tamara L. Berg,
Dimitris Samaras
The 2nd International Workshop on Human Activity Understanding from 3D Data at the Conference on Computer Vision and Pattern Recognition (CVPR) 2012.
Collective Generation of Natural Image Descriptions
[pdf]
Polina Kuznetsova,
Vicente Ordonez,
Alexander C. Berg,
Tamara L. Berg,
Yejin Choi
Association for Computational Linguistics (ACL) 2012.
Midge: Generating Image Descriptions From Computer Vision Detections
[pdf]
Margaret Mitchell,
Jesse Dodge,
Amit Goyal,
Kota Yamaguchi,
Karl Stratos,
Xufeng Han,
Alyssa Mensch,
Alexander C. Berg,
Tamara L. Berg,
Hal Daume III
European Chapter of the Association for Computational Linguistics (EACL) 2012.
Understanding and Predicting Importance in Images
[pdf]
Karl Stratos,
Aneesh Sood,
Alyssa Mensch,
Xufeng Han,
Margaret Mitchell,
Kota Yamaguchi,
Jesse Dodge,
Amit Goyal,
Hal Daume III,
Alexander C. Berg,
Tamara L. Berg
Computer Vision and Pattern Recognition (CVPR) 2012.
Detecting Visual Text
[pdf]
Jesse Dodge,
Amit Goyal,
Xufeng Han,
Alyssa Mensch,
Margaret Mitchell,
Karl Stratos,
Kota Yamaguchi,
Yejin Choi,
Hal Daume III,
Alexander C. Berg,
Tamara L. Berg,
North American Chapter of the Association for Computational Linguistics (NAACL) 2012.
Im2Text: Describing Images Using 1 Million Captioned Photographs
[pdf]
Vicente Ordonez,
Girish Kulkarni,
Tamara L. Berg
Neural Information Processing Systems (NIPS), 2011.
Dataset: SBU Captioned Photo Dataset
Composing Simple Image Descriptions using Web-scale N-grams.
[pdf]
Siming Li,
Girish Kulkarni,
Tamara L. Berg,
Alexander C. Berg,
Yejin Choi
Computational Natural Language Learning (CoNLL), 2011.
Baby Talk: Understanding and Generating Simple Image Descriptions
[pdf]
Girish Kulkarni,
Visruth Premraj,
Sagnik Dhar,
Siming Li,
Yejin Choi,
Alexander C. Berg,
Tamara L. Berg
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2011 (ORAL)
High Level Describable Attributes for Predicting Aesthetics and Interestingness
[pdf]
Sagnik Dhar,
Vicente Ordonez,
Tamara L. Berg,
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2011