TL;DR: A neural network model for a mechanism of visual pattern recognition that is self-organized by “learning without a teacher”, and acquires an ability to recognize stimulus patterns based on the geometrical similarity of their shapes without affected by their positions.
Abstract: A neural network model for a mechanism of visual pattern recognition is proposed in this paper. The network is self-organized by “learning without a teacher”, and acquires an ability to recognize stimulus patterns based on the geometrical similarity (Gestalt) of their shapes without affected by their positions. This network is given a nickname “neocognitron”. After completion of self-organization, the network has a structure similar to the hierarchy model of the visual nervous system proposed by Hubel and Wiesel. The network consits of an input layer (photoreceptor array) followed by a cascade connection of a number of modular structures, each of which is composed of two layers of cells connected in a cascade. The first layer of each module consists of “S-cells”, which show characteristics similar to simple cells or lower order hypercomplex cells, and the second layer consists of “C-cells” similar to complex cells or higher order hypercomplex cells. The afferent synapses to each S-cell have plasticity and are modifiable. The network has an ability of unsupervised learning: We do not need any “teacher” during the process of self-organization, and it is only needed to present a set of stimulus patterns repeatedly to the input layer of the network. The network has been simulated on a digital computer. After repetitive presentation of a set of stimulus patterns, each stimulus pattern has become to elicit an output only from one of the C-cell of the last layer, and conversely, this C-cell has become selectively responsive only to that stimulus pattern. That is, none of the C-cells of the last layer responds to more than one stimulus pattern. The response of the C-cells of the last layer is not affected by the pattern's position at all. Neither is it affected by a small change in shape nor in size of the stimulus pattern.
TL;DR: Recognition-by-components (RBC) provides a principled account of the heretofore undecided relation between the classic principles of perceptual organization and pattern recognition.
Abstract: The perceptual recognition of objects is conceptualized to be a process in which the image of the input is segmented at regions of deep concavity into an arrangement of simple geometric components, such as blocks, cylinders, wedges, and cones. The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N £ 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensiona l image: curvature, collinearity, symmetry, parallelism, and cotermination. The detection of these properties is generally invariant over viewing position an$ image quality and consequently allows robust object perception when the image is projected from a novel viewpoint or is degraded. RBC thus provides a principled account of the heretofore undecided relation between the classic principles of perceptual organization and pattern recognition: The constraints toward regularization (Pragnanz) characterize not the complete object but the object's components. Representational power derives from an allowance of free combinations of the geons. A Principle of Componential Recovery can account for the major phenomena of object recognition: If an arrangement of two or three geons can be recovered from the input, objects can be quickly recognized even when they are occluded, novel, rotated in depth, or extensively degraded. The results from experiments on the perception of briefly presented pictures by human observers provide empirical support for the theory. Any single object can project an infinity of image configurations to the retina. The orientation of the object to the viewer can vary continuously, each giving rise to a different two-dimensional projection. The object can be occluded by other objects or texture fields, as when viewed behind foliage. The object need not be presented as a full-colored textured image but instead can be a simplified line drawing. Moreover, the object can even be missing some of its parts or be a novel exemplar of its particular category. But it is only with rare exceptions that an image fails to be rapidly and readily classified, either as an instance of a familiar object category or as an instance that cannot be so classified (itself a form of classification).
TL;DR: A functional model is proposed in which structural encoding processes provide descriptions suitable for the analysis of facial speech, for analysis of expression and for face recognition units, and it is proposed that the cognitive system plays an active role in deciding whether or not the initial match is sufficiently close to indicate true recognition.
Abstract: The aim of this paper is to develop a theoretical model and a set of terms for understanding and discussing how we recognize familiar faces, and the relationship between recognition and other aspects of face processing. It is suggested that there are seven distinct types of information that we derive from seen faces; these are labelled pictorial, structural, visually derived semantic, identity-specific semantic, name, expression and facial speech codes. A functional model is proposed in which structural encoding processes provide descriptions suitable for the analysis of facial speech, for analysis of expression and for face recognition units. Recognition of familiar faces involves a match between the products of structural encoding and previously stored structural codes describing the appearance of familiar faces, held in face recognition units. Identity-specific semantic codes are then accessed from person identity nodes, and subsequently name codes are retrieved. It is also proposed that the cognitive system plays an active role in deciding whether or not the initial match is sufficiently close to indicate true recognition or merely a ‘resemblance’; several factors are seen as influencing such decisions.
This functional model is used to draw together data from diverse sources including laboratory experiments, studies of everyday errors, and studies of patients with different types of cerebral injury. It is also used to clarify similarities and differences between processes responsible for object, word and face recognition.
TL;DR: A new hierarchical model consistent with physiological data from inferotemporal cortex that accounts for this complex visual task and makes testable predictions is described.
Abstract: Visual processing in cortex is classically modeled as a hierarchy of increasingly sophisticated representations, naturally extending the model of simple to complex cells of Hubel and Wiesel. Surprisingly, little quantitative modeling has been done to explore the biological feasibility of this class of models to explain aspects of higher-level visual processing such as object recognition. We describe a new hierarchical model consistent with physiological data from inferotemporal cortex that accounts for this complex visual task and makes testable predictions. The model is based on a MAX-like operation applied to inputs to certain cortical neurons that may have a general role in cortical function.
TL;DR: Perceptual experiments can be designed to ask which subdivisions of the system are responsible for particular visual abilities, such as figure/ground discrimination or perception of depth from perspective or relative movement--functions that might be difficult to deduce from single-cell response properties.
Abstract: Anatomical and physiological observations in monkeys indicate that the primate visual system consists of several separate and independent subdivisions that analyze different aspects of the same retinal image: cells in cortical visual areas 1 and 2 and higher visual areas are segregated into three interdigitating subdivisions that differ in their selectivity for color, stereopsis, movement, and orientation. The pathways selective for form and color seem to be derived mainly from the parvocellular geniculate subdivisions, the depth- and movement-selective components from the magnocellular. At lower levels, in the retina and in the geniculate, cells in these two subdivisions differ in their color selectivity, contrast sensitivity, temporal properties, and spatial resolution. These major differences in the properties of cells at lower levels in each of the subdivisions led to the prediction that different visual functions, such as color, depth, movement, and form perception, should exhibit corresponding differences. Human perceptual experiments are remarkably consistent with these predictions. Moreover, perceptual experiments can be designed to ask which subdivisions of the system are responsible for particular visual abilities, such as figure/ground discrimination or perception of depth from perspective or relative movement--functions that might be difficult to deduce from single-cell response properties.