A review paper on audiovisual speech perception that Mitch Sommers and I wrote (Peelle & Sommers, 2015) is now in press at Cortex, as part of a forthcoming special issue on predictive processes in speech comprehension. In this review Mitch and I have tried to start unifying two separate lines of research. The first is that ongoing oscillations in auditory cortex affect perceptual sensitivity. There is continued interest in the role of cortical oscillations in speech perception, even for auditory-only speech, where there is evidence that these oscillations entrain to the ongoing speech signal (Giraud & Poeppel, 2012; Peelle & Davis, 2012). Aligning cortical oscillations to perceptual input can increase sensitivity (i.e., faster or more accurate detection of near-threshold inputs). Entrainment is amplified by visual input, making multimodal integration in auditory cortex a viable mechanism for audiovisual processing (Schroeder et al., 2008); the toy sketch below illustrates the basic idea of phase-dependent sensitivity.
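To make the oscillatory account a bit more concrete, here is a toy simulation (my own illustration, not anything from the review): detection probability for a near-threshold stimulus is modulated by the phase of an ongoing oscillation, so stimuli arriving aligned with the high-excitability phase are detected more often than stimuli arriving at random phases. The 4 Hz frequency, modulation depth, and baseline detection rate are arbitrary values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
freq_hz = 4.0      # theta-range oscillation, roughly the rate of syllables in speech
n_trials = 10_000
baseline_p = 0.5   # chance of detecting a near-threshold stimulus with no phase effect
modulation = 0.3   # how strongly oscillation phase modulates excitability

def detection_rate(stimulus_times):
    """Simulate detection of near-threshold stimuli presented at the given times."""
    phase = 2 * np.pi * freq_hz * stimulus_times   # oscillation phase at each stimulus onset
    p = baseline_p + modulation * np.cos(phase)    # excitability is highest near phase 0
    return float(np.mean(rng.random(len(stimulus_times)) < p))

# Stimuli aligned with the high-excitability phase of every cycle ("entrained")
aligned_times = np.arange(n_trials) / freq_hz
# Stimuli arriving at random times relative to the oscillation (no entrainment)
random_times = rng.uniform(0, 10, n_trials)

print("phase-aligned detection rate:", detection_rate(aligned_times))  # ~0.80
print("random-phase detection rate: ", detection_rate(random_times))   # ~0.50
```

In this toy model the only difference between the two conditions is the timing of the stimuli relative to the oscillation, which is enough to produce a sizable difference in detection performance.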
The second line of research concerns how, alongside this increased perceptual sensitivity, visual information restricts the set of possible speech sounds (and thus words). For example, when trying to make a "cat"/"cap" distinction, seeing that the speaker's lips stay open at the end of the word indicates that the final sound is not a bilabial /p/, and therefore that "cap" is not correct. This perspective is described within the intersection density framework, which is a straightforward extension of unimodal lexical competition to audiovisual speech: speech perception is constrained to lexical items that are compatible with both the auditory and the visual input (a minimal sketch of this idea follows below).
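At its core, the intersection density idea can be thought of as a set intersection over lexical candidates. The sketch below uses hypothetical word sets and a made-up viseme coding, invented purely for illustration (it is not code or data from the paper).

```python
# Toy illustration of the intersection density idea (hypothetical word sets and
# viseme coding, invented for this example; not taken from the paper).

# Lexical candidates still consistent with an ambiguous auditory signal like "ca_"
auditory_candidates = {"cat", "cap", "can", "cab"}

# Candidates consistent with the visual signal: the speaker's lips stay open at
# word offset, which is incompatible with a final bilabial (/p/ or /b/)
visual_candidates = {"cat", "can"}

# The intersection: only items compatible with BOTH streams remain in competition
remaining = auditory_candidates & visual_candidates
print(remaining)  # contains 'cat' and 'can' -- visual input has pruned 'cap' and 'cab'
```

The point of the sketch is simply that the visual stream does real work: it removes candidates that the auditory signal alone cannot rule out, shrinking the competition set before selection happens.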
We discuss these complementary types of integration in the context of schematic models of audiovisual speech processing. Although it seems like a basic point, from our perspective the available evidence suggests that multisensory processing influences perception at multiple levels (and in neuroanatomically dissociable regions).
Finally, one very important aspect worth emphasizing: as with all speech processing (Peelle, 2012), the details of audiovisual speech processing are likely heavily influenced by the stimuli and task involved. Connected speech (sentences) may provide visual information that aids processing but is simply unavailable in single words or phonemes. Similarly, phoneme studies (say, with a token of /da/) will not require the lexical competition and selection processes involved in word perception. This is not to say that any of these levels is more or less valid to study; however, we have to be cautious when making generalizations, and sensitive to differences in visual information as a function of linguistic level (phoneme, word, sentence).
There are still many unresolved questions regarding the representation of visual-only speech and audiovisual integration during speech processing. Hopefully the suggestions Mitch and I have made will be useful, and we look forward to having more data in the coming years that speak to these issues.
References:
Giraud A-L, Poeppel D (2012) Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience 15:511-517. doi:10.1038/nn.3063
Peelle JE (2012) The hemispheric lateralization of speech processing depends on what "speech" is: A hierarchical perspective. Frontiers in Human Neuroscience 6:309. doi:10.3389/fnhum.2012.00309
Peelle JE, Davis MH (2012) Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology 3:320. doi:10.3389/fpsyg.2012.00320
Peelle JE, Sommers MS (2015) Prediction and constraint in audiovisual speech perception. Cortex. doi:10.1016/j.cortex.2015.03.006
Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A (2008) Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences 12:106-113. doi:10.1016/j.tics.2008.01.002