Models of visual categorization

Visual categorization refers to our ability to organize objects and visual scenes into discrete categories. It is an essential skill, as it allows us to distinguish friend from foe or edible from poisonous food. Understanding how the visual system categorizes objects and scenes is a challenge because it requires bridging the gap between different levels of understanding, from the level of neural circuits and neural networks to the level of information processing and, ultimately, behavior. Computational models have become powerful tools for integrating knowledge across these levels of analysis. We review recent progress in our understanding of the computational mechanisms underlying visual categorization and discuss some of the remaining challenges. WIREs Cogn Sci 2016, 7:197–213. doi: 10.1002/wcs.1385

Computational models of visual categorization. Visual categorization has traditionally been described as a two‐stage process: (a) Visual features must be computed to build a visual representation of an input stimulus x_i. It is desirable for the representation to be both tolerant to the many factors that can affect the appearance of an object and selective enough to capture subtle differences between exemplars across the category boundary. Different computational models of feature computation vary in their degree of invariance and specificity. For illustration purposes, two features x_i^k are computed (superscripts are used as feature indexes and subscripts as stimulus indexes), but more generally the total number of features N used to represent visual stimuli can be quite large (N ≈ 10^2–10^4). Visual stimuli can thus be thought of as N‐dimensional feature vectors (also called data points) x_i = (x_i^1, …, x_i^k, …, x_i^N) in this representational space, whereby the kth coordinate of x_i corresponds to the response of the kth feature detector x_i^k. (b) A categorization process associates these data points x_i with category labels y_i through a learned function f such that f(x_i) ≈ y_i. Here, we consider a binary classification task with a positive (target) and a negative (distractor) category label (y_i ∈ {−1, 1}). Shown in red is a linear classification function f that separates the positive and negative examples. This function is parametrized by the vector w = (w_1, w_2), which is the vector normal to the underlying decision boundary. In practice, these functions are learned from training examples. For instance, supervised learning algorithms learn this mapping from the presentation of (x_i, y_i) exemplar‐label pairs. After learning, the algorithm tries to predict the category label of a new stimulus x_* by considering whether the stimulus, projected into the feature space, falls on the right or left side of the boundary. This can be done by computing the dot product between the input stimulus and the normal vector and subtracting a fixed threshold θ: f(x_*) = sign(w · x_* − θ) = sign(∑_k w_k x_*^k − θ).
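Below is a minimal sketch of this two‐stage scheme in code, assuming NumPy. The perceptron‐style update, learning rate, toy Gaussian clusters, and helper names (train_linear_classifier, predict) are illustrative assumptions, not the specific model discussed in the article; any learning rule that recovers a suitable w and θ would serve for the linear case shown in panel (b).

    import numpy as np

    def train_linear_classifier(X, y, lr=0.1, epochs=100):
        """Learn w and theta from (x_i, y_i) exemplar-label pairs with a
        perceptron-style update. X: (n_samples, N) features, y: labels in {-1, +1}."""
        n_samples, N = X.shape
        w = np.zeros(N)
        theta = 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (w @ xi - theta) <= 0:   # misclassified (or on the boundary)
                    w += lr * yi * xi            # rotate the boundary toward the exemplar
                    theta -= lr * yi             # adjust the threshold
        return w, theta

    def predict(w, theta, x_new):
        """f(x_*) = sign(w . x_* - theta): which side of the boundary x_* falls on."""
        return np.sign(w @ x_new - theta)

    # Toy data: two Gaussian clouds standing in for target and distractor categories.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+1.0, 0.5, (50, 2)), rng.normal(-1.0, 0.5, (50, 2))])
    y = np.r_[np.ones(50), -np.ones(50)]
    w, theta = train_linear_classifier(X, y)
    print(predict(w, theta, np.array([0.8, 1.2])))   # expected output: 1.0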
Levels of categorization. One can distinguish between three levels of categorization. Shown are hypothetical examples related to the categorization of animal stimuli; many alternatives are possible. (a) The superordinate level corresponds to categorization between animal/animate versus non‐animal/inanimate objects (i.e., any visual scene that does not contain an animal). (b) The basic level (also referred to as the generic level of categorization) requires discrimination between various species (e.g., dog vs. non‐dog animals). (c) Last, the subordinate level requires discrimination between various dog types (e.g., dalmatian vs. non‐dalmatian). Of course, this classification is not unique, and many other classification tasks can be performed, such as cats versus dogs. Binary classification tasks are very general: it can be shown formally that any multiclass classification task (e.g., which animal is it? dog vs. cat vs. bird, etc.) can be decomposed into multiple binary classification problems, as sketched below.
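As a small illustration of that last point, the Python/NumPy sketch below shows how a three‐way task can be re‐expressed as three one‐vs‐rest binary problems; the class names and the helper to_binary_labels are hypothetical choices for illustration.

    import numpy as np

    # A 3-way task (dog vs. cat vs. bird) decomposed into three binary
    # one-vs-rest problems, each asking "this class vs. everything else".
    classes = ["dog", "cat", "bird"]

    def to_binary_labels(y, target):
        """Relabel a multiclass label vector as {-1, +1} for one binary sub-task."""
        return np.where(y == target, 1, -1)

    y_multi = np.array(["dog", "cat", "bird", "dog", "bird"])
    for c in classes:
        print(c, to_binary_labels(y_multi, c))
    # dog  [ 1 -1 -1  1 -1]
    # cat  [-1  1 -1 -1 -1]
    # bird [-1 -1  1 -1  1]
    # At test time, the class whose binary classifier responds most strongly wins.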
Representational complexity. Not all feature representations are created equal. Here we compare three hypothetical visual representations of the same set of stimuli and how they impact the subsequent classification process. Individual category exemplars are shown as dots (blue and green corresponding to the two classes) and classification functions as a red line. Representation (a) is the best representation for the categorization problem considered because the two classes can be separated by one of the simplest classification functions (i.e., a linear function). The complexity of the required classification function increases from left to right. Representations (b) and (c) will, in principle, require more training examples to generalize properly to new stimuli or, equivalently, will tend to underperform representation (a) in regimes where relatively few training examples are available.
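The following toy example (Python/NumPy; the concentric‐ring stimuli and the distance‐from‐origin feature are hypothetical choices, not taken from the figure) illustrates how a change of representation can turn a problem that no linear function solves into one that a single threshold solves.

    import numpy as np

    # The same 200 stimuli under two feature representations.
    rng = np.random.default_rng(1)
    angles = rng.uniform(0, 2 * np.pi, 200)
    radius = np.r_[np.full(100, 1.0), np.full(100, 3.0)]  # class +1 on an inner ring, class -1 on an outer ring
    X_raw = np.c_[radius * np.cos(angles), radius * np.sin(angles)]
    y = np.r_[np.ones(100), -np.ones(100)]

    # Raw (x, y) coordinates: no single straight line separates the two rings,
    # so a more complex classification function (and more training data) is needed.
    # A feature detector tuned to distance from the origin makes the same stimuli
    # separable by one of the simplest functions: a single threshold.
    x_good = np.linalg.norm(X_raw, axis=1)
    print(np.all((x_good < 2.0) == (y == 1)))   # True: one linear cut now suffices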
Decision boundaries implemented by three different cognitive models of categorization: (a) prototype‐based, (b) exemplar‐based, and (c) linear perceptron. Note that only the perceptron learning algorithm explicitly computes a decision boundary. A decision boundary can, however, be recovered for instance‐based algorithms by assigning a class label to every point of the feature space according to its distance to the closest prototype (the mean of all exemplars of each class) in the prototype‐based approach, or to the closest stored exemplar in the exemplar‐based approach.
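A compact sketch of the prototype‐based and exemplar‐based decision rules described here, in Python with NumPy; the toy exemplars and function names are illustrative assumptions.

    import numpy as np

    def prototype_predict(X_train, y_train, x_new):
        """Assign x_new to the class whose mean (prototype) is closest."""
        labels = np.unique(y_train)
        prototypes = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
        return labels[np.argmin(np.linalg.norm(prototypes - x_new, axis=1))]

    def exemplar_predict(X_train, y_train, x_new):
        """Assign x_new the label of the single closest stored exemplar (1-NN)."""
        return y_train[np.argmin(np.linalg.norm(X_train - x_new, axis=1))]

    # Toy exemplars in a 2D feature space, labels in {-1, +1}.
    X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
    y_train = np.array([-1, -1, 1, 1])
    x_new = np.array([0.9, 0.8])
    print(prototype_predict(X_train, y_train, x_new),
          exemplar_predict(X_train, y_train, x_new))   # both predict: 1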
Simple categorization unit. A perceptron‐like categorization unit reweights the responses of individual feature detectors x_k, drawn from a population of N feature detectors x = (x_1, …, x_k, …, x_N), by the corresponding vector of synaptic strengths w = (w_1, …, w_k, …, w_N), sums them (∑_k w_k x_k), and subtracts a threshold θ. This is followed by a rectification stage to obtain a binary class label {−1, 1}. Formally, this model unit is able to implement the linear classification boundary described above.
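A minimal sketch of this unit in Python (NumPy assumed); the example feature responses, synaptic strengths, and threshold are arbitrary illustrative values.

    import numpy as np

    def categorization_unit(x, w, theta):
        """Perceptron-like unit: reweight feature responses, sum, subtract the
        threshold, then rectify to a binary class label in {-1, 1}."""
        drive = np.sum(w * x) - theta        # sum_k w_k x_k  -  theta
        return 1 if drive >= 0 else -1

    # A hypothetical unit reading out N = 3 feature detectors.
    x = np.array([0.2, 0.9, 0.4])    # feature detector responses
    w = np.array([1.0, 2.0, -0.5])   # synaptic strengths
    print(categorization_unit(x, w, theta=1.5))   # 1: weighted sum 1.8 exceeds theta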
Sketch of the hierarchical (HMAX) model of visual processing. Acronyms: V1, V2, and V4 correspond to the primary, secondary, and quaternary visual areas; PIT and AIT to the posterior and anterior inferotemporal areas, respectively (a tentative mapping onto areas of the visual cortex is shown in color; some areas of the parietal cortex and dorsal stream are not shown). The model relies on two types of computations: a max operation (shown in the dashed circles, also called invariance pooling) over similar features at different positions and scales, to gradually build tolerance to position and scale, and a bell‐shaped tuning operation (shown in the plain circles, also called selectivity pooling) over multiple features, to increase the complexity of the underlying representation; see Ref and text for details.
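The sketch below (Python/NumPy) illustrates, in highly simplified form, the two operations named in the caption: bell‐shaped (Gaussian) tuning of units to stored templates, and a max operation pooled over units tuned to the same feature. The patch, templates, Gaussian width, and pooling size are assumptions for illustration, not the parameters of the HMAX model.

    import numpy as np

    def tuning_layer(x, templates, sigma=1.0):
        """Bell-shaped (Gaussian) tuning: each unit responds most strongly when
        the local input pattern matches its stored template (selectivity pooling)."""
        dists = np.linalg.norm(templates - x, axis=1)
        return np.exp(-dists ** 2 / (2 * sigma ** 2))

    def max_pooling_layer(responses, pool_size=2):
        """Max over units tuned to the same feature at nearby positions/scales
        (invariance pooling): the strongest response survives, its location is discarded."""
        n = len(responses) // pool_size * pool_size
        return responses[:n].reshape(-1, pool_size).max(axis=1)

    # Toy forward pass: one local patch, four stored templates, pooling in pairs.
    patch = np.array([0.1, 0.8, 0.3])
    templates = np.array([[0.1, 0.8, 0.3],
                          [0.9, 0.1, 0.2],
                          [0.0, 0.7, 0.4],
                          [0.5, 0.5, 0.5]])
    s = tuning_layer(patch, templates)      # selectivity: best-matching template responds ~1
    c = max_pooling_layer(s, pool_size=2)   # tolerance: max over each pair of units
    print(s.round(2), c.round(2))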
The neural basis of visual categorization. Shown are areas involved in visual categorization: areas involved in the computation of visual features are shown in red, and areas involved in categorization in cyan. Some subcortical areas known to play a role in categorization, including the striatum, are not shown. (Adapted from Ref )
