Scene Understanding: The Spatial Envelope Theory

One remarkable aspect of visual recognition is that humans can recognize the meaning (or "gist") of a complex visual scene within 1/20 of a second, independently of the number of objects in the image (for a short review of scene perception, see Oliva, 2009). This rapid understanding can be experienced while watching the fast sequences in television advertisements and the quick cuts in modern movie trailers. How is this feat accomplished?

Traditional theories framed scene recognition as a chicken-and-egg problem: what comes first, the forest (scene context) or the trees (objects)? The work of Aude Oliva has shifted the focus of the question in a simple yet fundamental way, by positing a global level of information that can support recognition of the trees or of the forest independently of each other. In the group, we have taken a novel approach to scene understanding by studying mechanisms of perception that are global in nature, focusing on statistically robust features that describe the spatial envelope of the scene (e.g. its volume, its perspective, its openness, its level of naturalness, its level of clutter; see schema below and Oliva and Torralba, IJCV, 2001; for reviews, see Oliva and Torralba, 2006; Oliva et al., 2011).

This program of research combines a number of methodologies, including behavioral experiments (psychophysics, eye tracking), cognitive neuroscience (fMRI) and computational modeling. Over the years, this research program in scene understanding has been realized via the following projects:

  • In a paper published in Psychological Science, 1994, as part of my PhD thesis, I showed that humans can recognize the semantic category of a scene at very low image resolution, corresponding to image sizes of 16 to 32 pixels (see also Cognitive Psychology 2000). At such low resolutions, the objects that compose the image are simple blobs and cannot be recognized in isolation. This work was the first to demonstrate that a coarse description of the input (oriented "blobs" in a particular spatial organization, layout of colored regions) could initiate scene recognition before the identity of the objects (available in high spatial frequencies) is processed.
  • In my Cognitive Psychology paper of 1997, additional experiments showed that high spatial frequencies were also perceived by the visual system at the beginning of image analysis, but not necessarily used for performing a task. These results show that the initial image representation covers the full range of spatial frequencies. High spatial frequency features, however, might not be well localized in space in the absence of focused attention.
  • Inspired by these results, in Int. Journal of Computer Vision 2001, I proposed a computational model of scene recognition based on a very low-dimensional description of the scene: the Spatial Envelope (see PAMI 2002; BMCV 2002; Network 2003; Progress in Brain Research 2006; TICS 2007; Journal of Vision 2010). Based on experimental data, I identified a set of global properties that represent the 3D layout of a scene (properties of spatial boundaries and properties of content) and make it possible to calculate similarities between images. At the image level, each scene property, or the whole image, can be represented by a low-dimensional vector that encodes the distribution of orientations and scales in the image along with a coarse description of the spatial layout. The spatial resolution of this global feature is therefore coarse (like a sketch of the image), but it covers the full range of spatial frequencies, from low to high. In the computer vision community, this vector is referred to as the GIST descriptor and is now widely used in the field. The spatial envelope representation, which provides semantic attributes about the image, offers a way of computing high-level scene and space similarities between 2D images. The model is a proof of concept that object shape or identity might not be a requirement for scene categorization.
  • In a paper published in Cognitive Psychology 2009, we showed that representing scene images by their global properties predicts human scene categorization performance, and that human errors follow a pattern like the one predicted by the model. Global scene properties are perceived at the beginning of the time course of image analysis (Psychological Science, 2009) and are prone to adaptation (JEP:HPP 2010), suggesting a strong representational role of global properties in rapid scene categorization.
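
The global feature described in the list above (a coarse spatial layout of orientation and scale energy) can be illustrated with a short sketch. This is not the published GIST implementation: the filter shapes, bandwidths, the 4 x 4 pooling grid, and the function names `gabor_bank` and `gist_like` are all assumptions chosen for illustration. The sketch filters a square grayscale image with oriented band-pass filters at several scales, then averages the response magnitude within each cell of a coarse grid.

```python
import numpy as np

def gabor_bank(size, n_scales=4, n_orients=8):
    """Oriented band-pass transfer functions in the Fourier domain.

    Filter shapes and bandwidths are illustrative assumptions,
    not the parameters of the published model.
    """
    fy = np.fft.fftfreq(size)[:, None]    # vertical frequency (cycles/pixel)
    fx = np.fft.fftfreq(size)[None, :]    # horizontal frequency (cycles/pixel)
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0                    # placeholder to avoid log(0) at DC
    angle = np.arctan2(fy, fx)

    filters = []
    for s in range(n_scales):
        f0 = 0.25 / (2 ** s)              # centre frequency halves at each scale
        radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * 0.55 ** 2))
        radial[0, 0] = 0.0                # discard mean luminance
        for o in range(n_orients):
            theta = np.pi * o / n_orients
            # angular distance to the preferred orientation, wrapped to (-pi, pi]
            d = np.angle(np.exp(1j * (angle - theta)))
            angular = np.exp(-(d ** 2) / (2 * (np.pi / n_orients) ** 2))
            filters.append(radial * angular)
    return np.stack(filters)

def gist_like(image, grid=4, n_scales=4, n_orients=8):
    """Return a low-dimensional global descriptor of a square grayscale image."""
    img = np.asarray(image, dtype=float)
    if img.ndim != 2 or img.shape[0] != img.shape[1]:
        raise ValueError("expected a square 2D array")
    size = img.shape[0]
    if size % grid != 0:
        raise ValueError("image size must be divisible by the grid size")

    bank = gabor_bank(size, n_scales, n_orients)
    spectrum = np.fft.fft2(img)
    # Magnitude of each filtered image: local energy per scale and orientation.
    energy = np.abs(np.fft.ifft2(spectrum[None] * bank))
    # Average the energy inside each cell of a coarse grid x grid layout.
    cell = size // grid
    pooled = energy.reshape(len(bank), grid, cell, grid, cell).mean(axis=(2, 4))
    return pooled.ravel()

# Usage: compare three synthetic 64 x 64 "scenes" by descriptor distance.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]
horizontal = np.sin(2 * np.pi * yy / 8.0)               # stripes varying along y
vertical = np.sin(2 * np.pi * xx / 8.0)                 # stripes varying along x
noisy = horizontal + 0.1 * rng.standard_normal((64, 64))

g_h, g_v, g_n = (gist_like(im) for im in (horizontal, vertical, noisy))
print(g_h.shape)                                        # (512,)
print(np.linalg.norm(g_h - g_n), np.linalg.norm(g_h - g_v))
```

With these assumed defaults the vector has 4 scales x 8 orientations x 16 grid cells = 512 entries. The Euclidean distance between two such vectors is one simple way to compare images: here the noisy copy of the horizontal pattern should land much closer to the original than the vertical pattern does, even though no object is ever segmented or identified.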

The spatial envelope model makes concrete predictions about the neural underpinnings of scene and space recognition. Specifically, the model suggests the existence of two separable levels of representation: spatial boundary (i.e., the shape and volume or size of the space the scene represents) and content (the type of elements and how cluttered the space is). In recent neuroimaging work (Journal of Neuroscience, 2011), we found that real-world scene images are analyzed in a distributed and complementary manner in high-level brain regions, testifying to distinct neural pathways for representing the spatial boundaries and the content of a visual scene. Our behavioral and computational work so far suggests the existence of a property-based neural representation of scenes and objects in the brain, a hypothesis the group is currently exploring.


  • Park, S., Brady, T.F., Greene, M.R., & Oliva, A. (2011). Disentangling scene content from its spatial boundary: Complementary roles for the PPA and LOC in representing real-world scenes. Journal of Neuroscience, 31(4), 1333-1340.
  • Oliva, A., Park, S., & Konkle, T. (2011). Representing, perceiving and remembering the shape of visual space. Vision in 3D Environments, ed. L.R. Harris and M. Jenkin. Cambridge University Press.
  • Xiao, J., Hayes, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN Database: Large-scale Scene Recognition from Abbey to Zoo. Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition (pp. 3485-3492), IEEE Computer Society.
  • Greene, M.R. & Oliva, A. (2010). High-Level Aftereffects to Global Scene Property. Journal of Experimental Psychology: Human Perception & Performance, 36(6), 1430-1442.
  • Greene, M.R., & Oliva, A. (2009). The briefest of glances: the time course of natural scene understanding. Psychological Science, 20(4), 464-472.
  • Greene, M.R., & Oliva, A. (2009). Recognition of Natural Scenes from Global Properties: Seeing the Forest Without Representing the Trees. Cognitive Psychology, 58(2), 137-179.
  • Oliva, A. (2009). Visual Scene Perception. In Encyclopaedia of Perception, Ed: Bruce Goldstein. Sage Edition.
  • Oliva, A. & Torralba, A. (2007). The Role of Context in Object Recognition. Trends in Cognitive Sciences, 11(12), 520-527.
  • Oliva, A. & Torralba, A. (2006). Building the Gist of a Scene: The Role of Global Image Features in Recognition. Progress in Brain Research: Visual perception, 155, 23-36.
  • Goffaux, V., Jacques, C., Mouraux, A., Oliva, A., Rossion, B., & Schyns, P.G. (2005). Diagnostic colors contribute to early stages of scene categorization: behavioral and neurophysiological evidences. Visual Cognition, 12, 878-892.
  • Oliva, A. (2005). Gist of the scene. In the Encyclopedia of Neurobiology of Attention. L. Itti, G. Rees, and J.K. Tsotsos (Eds.), Elsevier, San Diego, CA (pages 251-256).
  • Torralba, A., & Oliva, A. (2003). Statistics of Natural Image Categories. Network: Computation in Neural Systems, 14, 391-412.
  • Torralba, A., & Oliva, A. (2002). Depth estimation from image structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1226-1238.
  • Oliva, A., & Torralba, A. (2002). Scene-centered description from spatial envelope properties. Lecture Notes in Computer Science Series, Proc. Second International Workshop on Biologically Motivated Computer Vision, Eds: H. Bulthoff, S.W. Lee, T. Poggio, & C. Wallraven. Springer-Verlag, Tuebingen, Germany (pp. 263-272).
  • Oliva, A., & Torralba, A. (2001). Modeling the Shape of the Scene: a Holistic Representation of the Spatial Envelope. International Journal of Computer Vision, 42, 145-175.
  • Oliva, A., & Schyns, P.G. (2000). Diagnostic colors mediate scene recognition. Cognitive Psychology, 41, 176-210.
  • Oliva, A. & Schyns, P.G. (1997). Coarse blobs or fine edges? Evidence that information diagnosticity changes the perception of complex visual stimuli. Cognitive Psychology, 34, 72-107.
  • Schyns, P.G. & Oliva, A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5, 195-200.