Visual Madlibs Q&A

Dataset Overview



  • 360,001 focused descriptions for 10,738 images.
  • 12 types of fill-in-the-blanks:
    • General scene
    • Emotional content
    • What happened before
    • What will happen next
    • The most interesting part
    • Appearance, activity and location of each person
    • Appearance, affordance and position of each object
    • Interaction between a person and an object
  • Collected using Madlibs-style fill-in-the-blank templates.
  • Two evaluation tasks:
    • Multiple-choice question-answering
    • Fill-in-the-blank image description

Visual Madlibs


A user is presented with an instruction, an image and a fill-in-the-blank template, and asked to fill in the blank.

Describe what happened immediately after this picture was taken.
- One or two seconds after this picture was taken, ____.
Describe the activity of the indicated person/people. 
- Person A is ____.
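
The template mechanism shown above can be sketched in a few lines of Python. This is only an illustration, not the dataset's own tooling: the names BLANK and fill_template are hypothetical, and the sketch assumes each template contains exactly one blank written as four underscores, as in the examples on this page.

```python
# Hypothetical helper, not part of any released Visual Madlibs code.
BLANK = "____"  # assumption: a blank is written as four underscores


def fill_template(template: str, answer: str) -> str:
    """Return the template with its single blank replaced by the answer."""
    if template.count(BLANK) != 1:
        raise ValueError("expected exactly one blank in the template")
    return template.replace(BLANK, answer.strip())


if __name__ == "__main__":
    print(fill_template("Person A is ____.", "standing around"))
    print(fill_template(
        "One or two seconds after this picture was taken, ____.",
        "they drove around",
    ))
```

Rejecting templates with zero or several blanks keeps a malformed prompt from silently producing a wrong sentence.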


Two Evaluation Tasks


Task 1: Fill-in-the-blank description


Mirroring how we collected the dataset, we ask your algorithm to automatically generate the fill-in-the-blank description for an image, given our Madlibs prompt.

Task 2: Multiple-choice question-answering


This is a new targeted multiple-choice question-answering task for images. Each question offers four choices: the correct answer and three distractors. The distractors are chosen from either random images or similar images, depending on the desired level of difficulty (easy or hard, respectively).
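
As a rough illustration of the task format described above, the sketch below builds one four-choice question and scores predictions by accuracy. It is not the official evaluation code. The names Question, build_question and accuracy are hypothetical, and the sketch assumes the caller supplies a pool of candidate distractor answers (taken from random images for the easy setting, or from similar images for the hard setting).

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Question:
    prompt: str
    choices: List[str]
    answer_index: int  # position of the correct answer in `choices`


def build_question(prompt: str, correct: str,
                   distractor_pool: Sequence[str],
                   rng: random.Random) -> Question:
    """Combine the correct answer with three distractors and shuffle them."""
    # Drop duplicates and anything identical to the correct answer.
    candidates = [d for d in dict.fromkeys(distractor_pool) if d != correct]
    if len(candidates) < 3:
        raise ValueError("need at least three distinct distractors")
    choices = rng.sample(candidates, 3) + [correct]
    rng.shuffle(choices)
    return Question(prompt, choices, choices.index(correct))


def accuracy(questions: Sequence[Question],
             predictions: Sequence[int]) -> float:
    """Fraction of questions whose predicted choice index is correct."""
    if not questions or len(questions) != len(predictions):
        raise ValueError("questions and predictions must match and be non-empty")
    hits = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return hits / len(questions)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    pool = ["standing around", "next to an elephant", "white", "hungry and hot"]
    q = build_question("Person A is ____.", "a balding male", pool, rng)
    print(q.choices, q.answer_index)
```

Only the source of the distractor pool differs between the easy and hard settings, so the same construction and scoring code serves both.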

12 types of fill-in-the-blanks


Type 1: image's scene

Describe the type of scene/place shown in this picture.
- The place is a(n) tennis court.

Type 2: image's emotion

Describe the emotional content of this picture.
- When I look at this picture, I feel hungry and hot.

Type 3: image's interesting

Describe the most interesting or unusual aspect of this picture.
- The most interesting aspect of this picture is the kites.

Type 4: image's past

Describe what happened immediately before this picture was taken.
- One or two seconds before this picture was taken, they slowed the horses.

Type 5: image's future

Describe what happened immediately after this picture was taken.
- One or two seconds after this picture was taken, they drove around.

Type 6: object's attribute

Describe the appearance of the indicated object.
- The car is white.

Type 7: object's affordance

Describe the function of the indicated object.
- People could relax on the couches.

Type 8: object's position

Describe the position of the indicated object.
- The bicycle is in front of the bus.

Type 9: person's attribute

Describe the appearance of the indicated person/people.
- Person A is a balding male.

Type 10: person's activity

Describe the activity of the indicated person/people.
- Person D is standing around.

Type 11: person's location

Describe the location of the indicated person/people.
- Person B is next to an elephant.

Type 12: pair's relationship

Describe the relationship between the indicated person and object.
- The person is putting food in the bowl.
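
For readers who want to work with the 12 types programmatically, one possible representation is a lookup table from type number to its name and instruction, copied from the list above. The table name MADLIBS_TYPES and the helper group_by_target are hypothetical, and the grouping simply takes the word before "'s" in each type name as its target.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Type number -> (type name, instruction shown to the annotator).
MADLIBS_TYPES: Dict[int, Tuple[str, str]] = {
    1: ("image's scene", "Describe the type of scene/place shown in this picture."),
    2: ("image's emotion", "Describe the emotional content of this picture."),
    3: ("image's interesting", "Describe the most interesting or unusual aspect of this picture."),
    4: ("image's past", "Describe what happened immediately before this picture was taken."),
    5: ("image's future", "Describe what happened immediately after this picture was taken."),
    6: ("object's attribute", "Describe the appearance of the indicated object."),
    7: ("object's affordance", "Describe the function of the indicated object."),
    8: ("object's position", "Describe the position of the indicated object."),
    9: ("person's attribute", "Describe the appearance of the indicated person/people."),
    10: ("person's activity", "Describe the activity of the indicated person/people."),
    11: ("person's location", "Describe the location of the indicated person/people."),
    12: ("pair's relationship", "Describe the relationship between the indicated person and object."),
}


def group_by_target(types: Dict[int, Tuple[str, str]]) -> Dict[str, List[int]]:
    """Group type numbers by what they describe (image, object, person, pair)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for type_id, (name, _instruction) in sorted(types.items()):
        target = name.split("'s", 1)[0]  # text before the possessive
        groups[target].append(type_id)
    return dict(groups)


if __name__ == "__main__":
    for target, ids in group_by_target(MADLIBS_TYPES).items():
        print(target, ids)
```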