
An empirical explanation: objective and perceived motion
Feature-based explanations of perceived motion
Given these explanations for a range of motion percepts, it is important to compare the wholly empirical approach we have taken with other explanations that have been offered for some of these phenomena, and indeed for motion perception generally.
With respect to the apparent speed of moving objects, most accounts have focused on the flash-lag effect. Other theories of this effect that have been offered are: 1) that the visual system compensates for neuronal latencies by extrapolating the expected position of a moving object from information in the stimulus latencies (Nijhawan, 1994; Khurana & Nijhawan, 1995; Nijhawan, 1997; Khurana, Watanabe, & Nijhawan, 2000); 2) that vision employs information about the immediate future to “postdict” the position of moving objects (Eagleman & Sejnowski, 2000a, 2000b, 2000c); 3) that the flash-lag effect is the consequence of “anticipation” in early retinal processing (Berry, Brivanlou, Jordan, & Meister, 1999); and 4) that the effect occurs because stimulus processing entails shorter latencies for moving stimuli than for static flashes (Purushothaman, Patel, Bedell, & Ogmen, 1998; Whitney & Murakami, 1998; Whitney, Murakami, & Cavanagh, 2000). These proposals have at least two deficiencies. First, for technical and historical reasons, they have assumed on the basis of limited data that the psychophysical function to be explained is linear. Thus none of these theories addresses the actual non-linear function shown in Figure 4B. Second, and more important, these interpretations assume that the perceived lag derives directly from the features of image sequences on the retina and their subsequent processing by the visual system. This assumption ignores the fact that any direct analysis of images, moving or otherwise, is inevitably meaningless for visually guided behavior because of the inverse problem.
With respect to rationalizing the apparent direction of image sequences, most explanations have focused on aperture effects. One popular approach to rationalizing these phenomena has been to suppose that the visual system calculates image features such as the local velocity vector field in an image sequence (e.g., Adelson & Movshon, 1982; Hildreth, 1984; Horn & Schunck, 1981). Like Helmholtz’s idea about “unconscious inferences” (see Chapter 1), the claim is that whereas the visual system analyzes 2-D image features, help in the face of ambiguity is provided by “knowledge” about the 3-D world gleaned from individual experience. The problem with the idea that aperture effects arise as a result of visual computations that minimize variation in local vectors is two-fold. First, as is generally acknowledged (Hildreth, 1984), solutions that match the perceptions of human observers require ad hoc assumptions to constrain the solution space, including the assumption that moving objects produce a constant velocity field (Limb & Murphy, 1975; Fennema & Thompson, 1979; Marr & Ullman, 1981), that object motion is “rigid” (Ullman, 1979; Bennett et al., 1989; Yuille & Ullman, 1990), or that object motion is “smooth” (Hildreth, 1984; Yuille & Ullman, 1990). Although these assumptions are reasonable, they fail to capture the biases that are actually elicited by various apertures. For example, Adelson and Movshon’s (1982) explanation based on a common vector constraint wrongly predicts that the perceived direction in a triangular aperture should always be normal to the line orientation (because all possible local direction vectors at each point on a line are normal to the line orientation).
Similarly, Hildreth’s (1984) calculation of a vector field under the assumption of least variation (i.e., the smoothness of motion assumption) does not account for the increasing biases as the angle of orientation increases in a vertical slit (see Figure 8; the Hildreth model predicts that the perceived direction should be strictly vertical). The second problem is simply the absence of any biological reason why the visual system should generate such percepts, which are simply regarded as anomalies.
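To make the local ambiguity behind these feature-based accounts concrete, the following minimal sketch illustrates the standard geometry of the aperture problem: for a featureless line, only the velocity component normal to the line's orientation can be measured locally, and combining two differently oriented lines by intersecting their constraints yields a single velocity. This is an illustration in the spirit of the velocity-constraint approach, not the implementation of any of the cited models; the function names, angle convention, and units are assumptions made here.

```python
import math


def normal_component(velocity, line_angle_deg):
    """Return the part of a 2-D velocity that is normal to a line.

    velocity: (vx, vy), the line's true image-plane motion (arbitrary units).
    line_angle_deg: line orientation, measured counterclockwise from the x-axis.
    """
    theta = math.radians(line_angle_deg)
    # Unit normal to a line whose direction is (cos theta, sin theta).
    nx, ny = -math.sin(theta), math.cos(theta)
    speed_along_normal = velocity[0] * nx + velocity[1] * ny
    return (speed_along_normal * nx, speed_along_normal * ny)


def intersection_of_constraints(angle1_deg, normal_speed1, angle2_deg, normal_speed2):
    """Solve v . n1 = s1 and v . n2 = s2 for the one velocity v that is
    consistent with the normal speeds measured on two non-parallel lines."""
    t1, t2 = math.radians(angle1_deg), math.radians(angle2_deg)
    n1 = (-math.sin(t1), math.cos(t1))
    n2 = (-math.sin(t2), math.cos(t2))
    det = n1[0] * n2[1] - n1[1] * n2[0]
    if abs(det) < 1e-12:
        raise ValueError("parallel lines leave the velocity unconstrained")
    # Cramer's rule for the 2 x 2 linear system.
    vx = (normal_speed1 * n2[1] - n1[1] * normal_speed2) / det
    vy = (n1[0] * normal_speed2 - normal_speed1 * n2[0]) / det
    return (vx, vy)
```

For a line oriented at 45 degrees, `normal_component((1.0, 0.0), 45.0)` and `normal_component((0.0, -1.0), 45.0)` return the same vector, which is why a purely local analysis cannot distinguish rightward from downward motion of that line.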
Understanding aperture effects in a wholly empirical framework deals with both these problems. Thus an empirical framework does away with the need to calculate the local features of image sequences such as velocity vectors (e.g., Adelson & Movshon, 1982; Hildreth, 1984; Fennema and Thompson, 1979) or spatiotemporal energy (e.g., Adelson & Bergen, 1985). By the same token, an empirical understanding of motion perception abrogates the need for unconscious inferences or even computations of Bayesian priors (e.g., Knill & Richards, 1996; Stocker & Simoncelli, 2006; see below for discussion of this latter point). In an empirical framework, perceived direction simply reflects the distribution of directional biases in the 2-D projections of 3-D sources experienced in the past. The mechanisms underlying these percepts are presumably instantiated in the evolved connectivity of the visual system, abetted by the influences of individual experience on synaptic development. Finally, as already emphasized, there is a clear biological rationale for generating motion percepts in this way: vision on an empirical basis circumvents the inverse problem, generating successful visually guided interactions with the world despite the unknowable nature of image sources.
Explanations of motion perception in Bayesian terms
Another issue that needs to be considered here (as well as in all other aspects of visual perception) is the pros and cons of formulating any empirical explanation of motion perception in terms of Bayesian decision theory. An empirical framework for understanding perceived speed or direction can be formulated in Bayesian terms, and it is important to understand why we have chosen not to use this approach. (This issue is considered in more detail in Howe et al., 2006; see Knill and Richards, 1996, for a review of Bayesian applications in vision.)
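Consider predicting the perception of object speed. The pertinent variables are the 3-D object speed, 2-D image speed, and 2-D image distance traveled (see above). In Bayesian terms, the predictive relationship of these parameters can be expressed as
$$P(\text{3D speed} \mid \text{2D speed}, \text{2D distance}) = \frac{P(\text{3D speed})\, P(\text{2D speed}, \text{2D distance} \mid \text{3D speed})}{P(\text{2D speed}, \text{2D distance})}$$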
where P is probability. The first term on the right side of the equation, P(3D speed), is the prior probability distribution, which describes the experience of human observers with 3-D speeds. The second term on the right, P(2D speed, 2D distance | 3D speed), is the likelihood function, which describes the probability that a given 3-D speed will have generated any specific 2-D image speed and distance. The product of the prior and relevant likelihood function divided by a normalization constant, P(2D speed, 2D distance), generates the posterior probability distribution, P(3D speed | 2D speed, 2D distance). The posterior distribution is therefore a subset of the prior, indicating the relative probabilities of the possible 3-D object speeds that could have produced the specific 2-D image sequence in question. Since specific motion percepts must be predicted by a particular value in the posterior distribution, a basis for choosing this value is needed. Typically, the criterion used for this choice is biological usefulness; under the assumption that the most useful percept would be the most frequently occurring source in past experience, an index such as the mean, median, or mode of the posterior distribution is used to generate the value that determines the percept. For instance, to predict the psychophysical observations in Figure 9B in Bayesian terms, posterior probability distributions for stimuli traversing different projected distances at a given speed would be calculated. A problem is that a Bayesian formulation predicts that observers should perceive the speed in such circumstances as being approximately the same, regardless of the projected distance traveled. However, the psychophysical data in Figure 9B show that perceived speed decreases progressively as the projected distance increases (see Wojtach et al., 2008).
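The calculation described above (prior times likelihood, divided by a normalization constant, followed by a choice of the mean, median, or mode) can be sketched for a discrete set of candidate 3-D speeds. This is only an illustration of the general Bayesian recipe; the function names and any probability values supplied to them are hypothetical and are not taken from the data discussed in the text.

```python
def posterior(prior, likelihood):
    """Combine a prior and a likelihood over candidate 3-D speeds.

    prior: dict mapping each candidate 3-D speed to P(3D speed).
    likelihood: dict mapping each candidate 3-D speed to
        P(observed 2D speed, 2D distance | 3D speed).
    Returns a dict mapping each candidate to its posterior probability.
    """
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    evidence = sum(unnormalized.values())  # the normalization constant
    if evidence == 0:
        raise ValueError("the observed image has zero probability under the model")
    return {s: p / evidence for s, p in unnormalized.items()}


def posterior_mode(post):
    """Candidate speed with the highest posterior probability."""
    return max(post, key=post.get)


def posterior_mean(post):
    """Probability-weighted average of the candidate speeds."""
    return sum(s * p for s, p in post.items())


def posterior_median(post):
    """Smallest candidate speed at which the cumulative posterior reaches 0.5."""
    ordered = sorted(post)
    cumulative = 0.0
    for s in ordered:
        cumulative += post[s]
        if cumulative >= 0.5:
            return s
    return ordered[-1]  # guards against floating-point shortfall
```

With a hypothetical prior `{1: 0.5, 2: 0.3, 4: 0.2}` and likelihood `{1: 0.1, 2: 0.4, 4: 0.5}`, the mode and median both fall on the candidate speed 2 while the mean lies between 2 and 4, which shows that the choice of index can matter for the predicted percept.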
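The empirical framework used to explain aperture effects can also be formulated in terms of Bayesian decision theory as
$$P(\text{3D direction} \mid \text{2D direction}, \text{2D orientation}) = \frac{P(\text{3D direction})\, P(\text{2D direction}, \text{2D orientation} \mid \text{3D direction})}{P(\text{2D direction}, \text{2D orientation})}$$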
In this instance, the prior describes the probability distribution of the 3-D directions independent of the current projected image. The likelihood function, P(2D direction, 2D orientation | 3D direction), is the probability distribution of 3-D to 2-D transforms that could underlie a particular projected image. The product of the prior and likelihood function divided by a normalization constant, P(2D direction, 2D orientation), generates the posterior probability distribution, P(3D direction | 2D direction, 2D orientation), which again describes the probability distribution of different states of the world, given a particular image. Under the assumption that the most frequently occurring source in past experience would provide the greatest benefit to the observer, the decision is typically made by selecting the mode of the posterior. Why, then, did we not use a Bayesian formulation to explain aperture effects?
As before, a Bayesian formulation takes percepts to be estimates of particular states of the world, the implication being that the goal of perception is to discover the actual 3-D directions of moving objects. However, the inverse optics problem prevents the visual system from ever specifying the properties of objects as such; thus the directions we see cannot (and do not) map onto the actual 3-D directions of the sources of stimuli. In the wholly empirical framework we have used, visual percepts gain meaning only operationally, according to the evolutionary consequences of the relative success of past behavior, and are therefore not tied to particular object properties or features. Therefore, the predicted percepts do not conform to physical reality, nor would they be expected to. In this conception the response to an image is simply reflexive.
The reason for the different outcome predicted by a Bayesian framework compared to the predictions made by empirical ranking is thus how each approach conceptualizes the goal of vision. A Bayesian framework assumes that motion percepts are determined by the most likely 3-D speed and direction that generated the stimulus, the implied goal being to link percepts with the specific physical characteristics of sources in the world underlying a stimulus sequence. As indicated in Figure 1, however, the inverse optics problem precludes direct access to the properties of the physical world, making the explicit goal of a Bayesian framework impossible to achieve, at least as it is typically formulated. An empirical approach, in contrast, predicts motion percepts that are based on the full range of past experience rather than a particular state of the world. Indeed, in a wholly empirical framework, percepts do not bear a relationship to the world that can be thought of as realistic or correct; percepts simply provide a mapping function that is operationally successful.
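The contrast with a ranking-based prediction can be illustrated with a minimal sketch. It assumes, as an interpretation made here rather than a statement of the authors' procedure, that "empirical ranking" can be read as the percentile rank of the current image value within the accumulated values experienced in the past; the function name and any sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(past_values, value):
    """Percentile rank (0 to 100) of `value` within `past_values`.

    past_values: previously experienced image values (e.g., image speeds).
    Ties are counted as half below and half above the value.
    """
    if not past_values:
        raise ValueError("past experience must contain at least one value")
    ordered = sorted(past_values)
    below = bisect_left(ordered, value)            # values strictly smaller
    equal = bisect_right(ordered, value) - below   # values tied with `value`
    return 100.0 * (below + 0.5 * equal) / len(ordered)
```

Unlike the posterior-based estimate, the output here is not a candidate 3-D value at all: it is a position on a scale defined entirely by the distribution of past image values, so the same image value receives a different rank whenever that distribution differs.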
Physiological explanations of motion processing
Finally, one needs to ask how an empirical theory of motion perception is likely to be related to what is known about the physiology of motion processing. The prevailing physiological models of motion processing are based on a hierarchy in which the lower-order receptive field properties of motion sensitive neurons in V1 are used to progressively construct the more complex responses of higher-order cortical regions such as areas MT and MST in the primate brain and hMT+ in humans, the culmination of this process presumably being the motion perceived (Hubel and Wiesel, 1962, 1974, 1977; De Valois et al., 1982a, 1982b; Livingstone and Hubel, 1987, 1988). Although this “bottom-up” approach has been amended by two-stage (Braddick, 1974) or three-stage (Lu and Sperling, 1995) processing schemes, as well as by the addition of “component cells” and “pattern cells” that could explain further details of visual motion perception (Movshon et al., 1985; Rust et al., 2006), it remains a popular conception of how motion percepts are generated. Explaining perceived motion on this basis is problematic, however, since the physical properties of moving stimuli routinely fail to correlate with perception. Although motion sensitive neurons are of course necessary causes of motion percepts, they are clearly not sufficient for explaining the directions and speeds that we actually see.










