Research Programs

One of the greatest mysteries of vision is the remarkable ability of the human brain to understand novel scenes and events rapidly and effortlessly. Whether walking down the street of an unfamiliar city, watching the fast cuts in movie trailers, or simply looking for our keys in an office, we experience rapid scene understanding, often without the slightest effort. The main focus of our research programs is the study of human and computational abilities in real-world scene understanding, including object, scene and place analysis, perception, recognition and memory, as well as the role of attentional mechanisms and learning in search tasks. In our research activities, we integrate tools and theories from image processing, image statistics, visual perception and cognition, cognitive science, computational vision, computer graphics and cognitive neuroscience (fMRI). Ultimately, characterizing human perceptual and cognitive abilities and limitations in natural settings holds promise for inspiring the next generation of artificial vision systems and interactive visual displays, and also gives invaluable insights into visual disorders, a new research direction for the laboratory.

Scene Understanding

Keywords: scene gist, basic-level category, spatial envelope, global feature

One remarkable aspect of visual recognition is that humans are able to recognize the meaning (or "gist"; for a review, see Oliva, 2005) of complex visual scenes within 1/20 of a second, independently of the quantity of objects in the image. This rapid understanding can be experienced while looking at rapid sequences in television advertisements and quick cuts in modern movie trailers. How is this remarkable feat accomplished? Research over the last decade has made substantial progress toward understanding the mechanisms underlying single object recognition, but less progress has been made toward understanding the recognition of scenes and natural environments. For example, computer systems fall well short of human performance in tasks that require recognizing the "gist" (semantic category) of a scene. In the lab, we have undertaken a novel approach to this challenging question by studying mechanisms of analysis that are global in nature, focusing on statistically robust features describing the spatial layout of the scene (e.g., its volume, its perspective, its level of clutter; cf. the spatial envelope model, Oliva & Torralba, 2001) and not merely its components (e.g., the objects in a scene). With National Science Foundation support, Professor Aude Oliva and her team will conduct a five-year study to examine how a global approach to image analysis can explain the remarkable human ability to recognize complex natural scenes (Greene & Oliva, 2006). Moreover, we will use this approach to define operational strategies for machine vision systems. This program of research combines a number of methodologies, including behavioral experiments (psychophysics, eye tracking), cognitive neuroscience methods and computational modeling. Applications of this work include, among others, scene understanding systems to assist drivers, and automatic systems that could provide semantic descriptions of the contents of large image databases.

Team: Timothy Brady, Michelle Greene, Aude Oliva, Michael Ross

Modeling Natural Human Vision: Integrating the Local and Global Structure of Natural Scenes

Keywords: natural images, segmentation, local and global analysis, depth cues

One of the fundamental problems in modeling human vision is understanding the visual cues and computations that underlie the perception of natural visual scenes. Recent studies have suggested that important aspects of scene perception do not depend on the recognition of objects in the scene and are more global or holistic in nature. The objective of this project is to use integrated theoretical and experimental approaches to gain insight into the information processing that underlies the representation of natural scenes and the computation of their global and spatial layout properties. The research will be driven by the theoretical hypothesis that visual system representations at both the local and global levels are adapted to the statistical structure of natural images and scenes. This project will investigate the local structure of natural images by developing hierarchical statistical models of local textures and testing to what extent human observers are sensitive to the same statistical features. The spatial structure of natural images will be investigated by developing statistical models that identify scene regions over which there are smooth changes in the local texture distribution and comparing the resulting segmentation to that of human observers. The global structure of natural scenes will be investigated by developing a statistical model that learns holistic, statistical representations, with the aim of evaluating scene depth and spatial layout information as human observers do. The broader impact of this work is that it will develop theoretical models that can be directly tested at a perceptual level and are also sufficiently detailed that they could lead to testable models of the underlying neural mechanisms. Furthermore, understanding the computational principles underlying human perception will be essential in order to emulate human perceptual behavior in machines and to better understand our own visual experience.

Team: Michael Lewicki (CMU), Aude Oliva, Michael Ross

Spatial Scale Perception

Keywords: spatial frequency, hybrid image, perceptual grouping

Another research program is concerned with the human visual perception of spatial scales and spatial frequencies and their applications for visual displays and clinical rehabilitation for patients with low vision. In order to determine the respective roles of spatial frequency bands in human image analysis, we create visual stimuli termed "hybrid images", composed of two different images at two different spatial resolutions. A hybrid image is a static image with two (or more) interpretations, whose meaning varies with viewing distance (e.g., one of the best-known hybrids is Dr. Angry and Mr. Smile, showing the face of a very angry man in low spatial frequencies and the face of a neutral woman in higher spatial frequencies). A series of papers evaluated to what extent the percept of one interpretation over the other varies with temporal frequency [Psyc Sci, 94]; familiarity [Cognition, 99]; task constraints [Cog.Psy, 97] and viewing distance [Siggraph 06]. Beyond the study of perceptual mechanisms of human image analysis, the hybrid visual illusion has generated notable interest in communities beyond visual perception. Our current research directions related to human image analysis are concerned with applications of the hybrid image concept in the Human-Computer Interaction community, evaluation of the performance of patient populations with low vision (particularly, age-related macular degeneration), and the interaction between spatial frequency scales and rules of perceptual grouping.
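The construction of a hybrid image described above can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists of floats and uses a Gaussian blur as the low-pass filter; the filter choice, the function names and the sigma parameters are illustrative assumptions, not the procedure used in the cited papers.

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel with radius ceil(3 * sigma)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    """Swap rows and columns of a nested-list image."""
    return [list(col) for col in zip(*image)]


def low_pass(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    rows_done = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(rows_done), kernel))


def hybrid_image(coarse_source, fine_source, sigma_low, sigma_high):
    """Add the low frequencies of one image to the high frequencies of another.

    The high-pass component is approximated as the image minus its blur.
    Both images must have the same dimensions.
    """
    low = low_pass(coarse_source, sigma_low)
    fine_blur = low_pass(fine_source, sigma_high)
    return [
        [l + (f - fb) for l, f, fb in zip(low_row, fine_row, blur_row)]
        for low_row, fine_row, blur_row in zip(low, fine_source, fine_blur)
    ]
```

In this sketch, the image passed as coarse_source would dominate the percept when the result is viewed from far away or at small size, and the image passed as fine_source would dominate up close. Output values can fall outside the input range and would need rescaling before display.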

Team: Timothy Brady, Aude Oliva

Models of Contextual Guidance on Object Search

Keywords: object, natural image, eye movements, categorical priors, identity priors, learning

Behavioral studies have shown that human observers make extensive use of contextual scene information during object search in natural images. In this research program, we investigate the influence of top-down contextual priors in object search by monitoring eye movements as participants search real-world scenes for objects (e.g., a pedestrian, a painting, a cup). In Torralba, Oliva et al. (2006), we show that a computational model that relies on top-down categorical priors (the identification of the scene as a street, a park, etc.) can predict the location of human eye movements when the target object is small and camouflaged in the scene. In Hidalgo-Sotelo et al. (2005, in preparation), we evaluate how the strength of contextual priors (e.g., the probability of the association between a specific scene and the object's presence or location) influences the different stages of object search: the initial glance at the scene, the search process itself and the processing of the target object. In the real world, co-occurrence may be built at a global level (e.g., a kitchen will predict the presence of a stove) or at a local level (e.g., a nightstand will predict the presence of an alarm clock); some contextual associations are definite, others probabilistic; observers may or may not act upon an object in a consistent manner, and may or may not choose to rely on memory instead of vision when searching for objects in familiar scenes (Oliva et al., 2004). The respective roles of all of these factors in explaining contextual influences constitute a challenging area for future investigation.
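As a rough illustration of how a categorical prior could guide search, the sketch below weights a bottom-up saliency map by a prior over image rows that depends on the scene category. This is a toy under stated assumptions, not the published model: the prior values, the four-row grid, the multiplicative combination and all function names are hypothetical.

```python
# Hypothetical priors over image rows (top to bottom) for where a target
# such as a pedestrian tends to appear, given the scene category.
ROW_PRIORS = {
    "street": [0.05, 0.15, 0.60, 0.20],
    "park": [0.10, 0.30, 0.40, 0.20],
}


def normalize(grid):
    """Scale a 2-D map of non-negative values so that it sums to 1."""
    total = sum(sum(row) for row in grid)
    if total <= 0:
        raise ValueError("map must contain positive mass")
    return [[value / total for value in row] for row in grid]


def guidance_map(saliency, scene_category, row_priors=ROW_PRIORS):
    """Weight each row of the saliency map by the category's row prior."""
    prior = row_priors[scene_category]
    if len(prior) != len(saliency):
        raise ValueError("prior and saliency map must have the same row count")
    weighted = [[p * s for s in row] for p, row in zip(prior, saliency)]
    return normalize(weighted)


def most_likely_fixation(guidance):
    """Return the (row, column) of the largest value in the guidance map."""
    best = max(
        (value, r, c)
        for r, row in enumerate(guidance)
        for c, value in enumerate(row)
    )
    return best[1], best[2]


# A salient spot near the top of the image (row 0) competes with a weaker
# spot at street level (row 2); the street prior favors the latter.
saliency = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.5],
    [0.1, 0.1, 0.1, 0.1],
]
guidance = guidance_map(saliency, "street")
fixation = most_likely_fixation(guidance)
```

The point of the example is that a location that is less salient in purely bottom-up terms can still become the most likely fixation target once the scene category constrains where the object is expected.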

Team: Barbara Hidalgo-Sotelo, Aude Oliva, Antonio Torralba

Human Memory Capacity and Fidelity

Keywords: memory capacity, memory fidelity, visual object representation, long-term memory

A novel research endeavor in the lab is concerned with the capacity and fidelity of long-term visual memory. One of the major lessons of memory research in the past 50 years has been that human memory is fallible, imprecise, and subject to interference. In particular, memory for visual details is thought to be exceptionally poor, suggesting that memory stores only abstract representations of visual images. Evidence that memory systems usually do not store the details of events often leads to the inference that memory cannot store the details. Contrary to this view, recent behavioral results and modeling in the lab show that, under favorable conditions, human long-term memory is capable of storing a massive number of visual images with remarkable detail, even after a single exposure. This indicates that far more visual detail can be stored in long-term memory than previously believed and increases the current estimate of long-term capacity by an order of magnitude. These results present a great challenge to neural models of memory storage and retrieval, as well as models of object recognition, which must be able to account for such a large and detailed storage capacity.

Team: George Alvarez, Timothy Brady, Talia Konkle, Aude Oliva

Space Understanding

Space is a material substance like stone and wood

Real-world scenes and places are inherently three-dimensional spaces that we act within. The concept of navigation (e.g., moving our hand on a desk or going from one place to the next) depends on a reliable representation of the three-dimensional spatial layout (with a sense of the distribution of both "mass" and "holes"). Humans and computer algorithms use a variety of cues to estimate mean depth and distances to objects in the world, resulting in fairly robust but computationally very expensive methods. Even for the human brain, estimating where surfaces are in complex natural scenes is time-consuming and demanding of (attentional) resources: it cannot be reliably performed within a glance. At which resolution does the brain represent the spatial layout of a novel place? How does the 3D layout unfold over the time of a glance? Do we necessarily need to "segment" or parse the scene into subregions to infer its three-dimensional layout? How do the level of clutter of a scene and properties of spatial regularity (e.g., symmetry) interfere with or benefit the building of a 3D layout? Preliminary data from the lab (Konkle et al., VSS 2006) indicate that the human brain seems to build a low-resolution 3D "frame" or cube of the space the scene subtends, first separating the closest and furthest surface planes in the image. Additionally, statistical regularities found in the texture layout of a scene image (Torralba & Oliva, 2002) are reliably correlated with the mean depth of the scene in the world. These statistics are relatively cheap to compute and could constrain the finer resolution that a 3D layout representation requires. By studying how the human brain unfolds surfaces and planes in complex natural scenes to build the "gist of the space", we aim to discover novel heuristics for potential applications in domains such as self-propelled systems and assistive systems for the visually impaired population.

Team: Timothy Brady, Talia Konkle, Aude Oliva

Scene Understanding in Visually Impaired Observers

Keywords: panoramic viewing, environmental scene, space memory, object recognition in context, central and peripheral visual loss, low vision (age-related macular degeneration, Stargardt disease, glaucoma, retinitis pigmentosa)

Vision lies at the center of how we interact with the world. When portions of the visual field cease to provide information, communication with a changing environment must be maintained despite the difference in interface. Peripheral vision and its interaction with central vision are therefore crucial in enabling humans to perceive and act, seemingly effortlessly, with and within the world. However, the integration of centrally and peripherally derived information remains relatively unexplored and is often neglected in psychophysics and visual cognition studies. The approach proposed by the PI, Dr. Oliva (USA), and her collaborator, Dr. Boucart (France), involves testing performance on ecological visual recognition and spatial navigation tasks, situating normal and visually impaired observers (patients with age-related macular degeneration, Stargardt disease, glaucoma and retinitis pigmentosa) in simulated environments displayed in a panoramic format on a 180-degree screen. Our goals are (1) to explore the capacities and limits of peripheral vision for various natural scene recognition tasks; (2) to evaluate how well peripheral vision guides object recognition in complex visual environments; and (3) to identify key principles of spatial layout that govern navigation and object search in patients with severe visual deficits. Understanding the visual competence that is spared in the presence of visual impairment is an enterprise that we believe will push the development of fast and reliable rehabilitation strategies. We seek to identify likely targets for technology-based therapies and ways to create physical environments (indoors and outdoors) that will help compensate for visual losses in these populations, aiding in nearly all aspects of daily life.

Team MIT: Emmanuelle Boloix, Talia Konkle, Aude Oliva

Team: Muriel Boucart, Emmanuelle Boloix, Sabine Defoort, Bernard Puech, Daniele Basset, Pascal Despretz

virtual scenes

Illustration of a virtual scene environment (left) created by our lab and projected on the hemispheric screen (right). Observers feel immersed in the scene while performing various perception and memory tasks. The natural peripheral distortion applied by the visual system is compensated for by the projection system, so that objects seen in the periphery look normal to the observer sitting in the middle of the panoramic environment.