Image of a series of saccades and fixations on human faces, from Alfred Yarbus.
An illustrative example of human vision, from experiments by Alfred Yarbus in the 1960s. Vision is a dynamic process composed of a series of saccades (movements) and fixations (pauses) by the eyes.

The mammalian vision system is highly efficient. Let’s take a page out of Mother Nature’s book and mimic her design.

Humans and many other mammals have eyes that support high-resolution vision near the center of the visual field (in humans, the region served by the fovea) and lower-resolution vision outside of this region. We move our eyes about a scene, capturing brief glimpses and mentally stitching them together to form a more comprehensive understanding of that scene, without using precious brain resources to neurally process the entire visual field at high resolution. We do not “scroll” our eyes over a scene. Instead, we fixate on a location, pause briefly, then quickly (nearly instantaneously) reposition our eyes onto a new fixation location. This rapid movement is referred to as a saccade.
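To make the idea of a foveated glimpse concrete, here is a minimal Python sketch. It is not taken from our models: the function names (`crop`, `downsample`, `foveated_glimpse`), the 8 x 8 fovea, and the 4x peripheral downsampling are all illustrative assumptions. It pairs a small full-resolution patch with a wider, average-pooled view around the same fixation point.

```python
def crop(image, row, col, size):
    """Return a size x size window centred on (row, col), zero-padded at the borders."""
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row - half + size):
        patch_row = []
        for c in range(col - half, col - half + size):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(image[r][c] if inside else 0.0)
        patch.append(patch_row)
    return patch


def downsample(patch, factor):
    """Average-pool a square patch by an integer factor."""
    n = len(patch) // factor
    pooled = []
    for i in range(n):
        pooled_row = []
        for j in range(n):
            block = [patch[i * factor + a][j * factor + b]
                     for a in range(factor) for b in range(factor)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled


def foveated_glimpse(image, row, col, fovea_size=8, periphery_factor=4):
    """Hypothetical glimpse: a sharp central patch plus a coarse wider view."""
    fovea = crop(image, row, col, fovea_size)
    wide = crop(image, row, col, fovea_size * periphery_factor)
    periphery = downsample(wide, periphery_factor)
    return fovea, periphery


# Synthetic 64 x 64 image whose pixel value encodes its position.
image = [[float(r * 64 + c) for c in range(64)] for r in range(64)]
fovea, periphery = foveated_glimpse(image, 32, 32)
glimpse_values = len(fovea) * len(fovea[0]) + len(periphery) * len(periphery[0])
print(glimpse_values, "values per glimpse vs", 64 * 64, "in the full image")
```

Under these assumed sizes, each glimpse hands only a small fraction of the image's values to whatever processing follows, which is the economy the eye exploits.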

We are building deep learning models that use a saccade-and-fixate methodology to search an image, either locating a target object or identifying multiple types of objects in the image. Deep reinforcement learning is used to train the models to execute desired tasks. Critically, because such a model processes a sequence of small glimpses rather than the entire image at high resolution, it can complete a task with orders of magnitude fewer hardware operations, thereby accelerating execution or lowering hardware energy requirements. See the video for a conceptual demonstration.
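The control loop behind a saccade-and-fixate search can be sketched in a few lines of Python. This is a toy under loud assumptions, not our trained models: the synthetic image brightens toward the target, and a hand-written greedy rule ("jump to the brightest pixel in the current glimpse") stands in for the policy that deep reinforcement learning would learn. The names `make_image`, `glimpse`, and `saccade_search` are hypothetical.

```python
def make_image(size, target):
    """Synthetic image: brightness rises toward the target pixel, which equals 1.0."""
    tr, tc = target
    return [[1.0 / (1 + abs(r - tr) + abs(c - tc)) for c in range(size)]
            for r in range(size)]


def glimpse(image, row, col, size=9):
    """Return (r, c, value) for each in-bounds pixel in a window centred on (row, col)."""
    half = size // 2
    height, width = len(image), len(image[0])
    return [
        (r, c, image[r][c])
        for r in range(max(0, row - half), min(height, row + half + 1))
        for c in range(max(0, col - half), min(width, col + half + 1))
    ]


def saccade_search(image, start, target_value=1.0, max_fixations=100):
    """Fixate, inspect the glimpse, then saccade; stop once the target is seen."""
    row, col = start
    pixels_read = 0
    for fixation in range(1, max_fixations + 1):
        pixels = glimpse(image, row, col)
        pixels_read += len(pixels)
        # Stand-in policy (assumption): pick the brightest pixel in view.
        r, c, value = max(pixels, key=lambda p: p[2])
        if value >= target_value:
            return (r, c), fixation, pixels_read
        row, col = r, c  # saccade to the new fixation point
    return None, max_fixations, pixels_read


image = make_image(64, target=(40, 50))
found, fixations, pixels_read = saccade_search(image, start=(0, 0))
print("found", found, "after", fixations, "fixations,",
      pixels_read, "pixels read of", 64 * 64)
```

Even this crude rule reads only a fraction of the image before it finds the target. A learned policy on real images would have no such convenient brightness gradient to follow, which is why the models are trained with reinforcement learning rather than hand-coded.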