Introduction

Speech serves as a vital medium for real-time communication, where listening and speaking are closely interlinked. Humans are innately capable of discriminating phonetic features across languages1,2, a capacity that is refined through speech production experience3,4. In daily conversations, speaking and listening often occur simultaneously, fostering a continuous feedback loop between the production and perception systems5,6. Theories and neuroimaging studies suggest that speech-production-related motor areas, including the ventral premotor and motor cortex in the precentral gyrus (preCG), supplementary motor area (SMA), and Broca’s area in the inferior frontal gyrus (IFG), also contribute to auditory speech perception7,8, although their exact role remains unclear.

Speech perception involves the precise analysis and differentiation of distinctive auditory or articulatory features. However, variability in acoustic signals complicates the categorization of speech sounds into phonemes or words. One predominant perspective proposes that speech is processed as invariant articulatory gestures within motor areas to facilitate recognition, aligning with the motor theory of speech perception9,10. Research has shown that listening to speech activates the frontal motor areas involved in producing these sounds11,12,13,14,15,16, suggesting potential motoric representations of articulatory gestures within these speech-motor areas during speech perception. Such production-perception mapping has been observed even in pre-babbling infants with limited articulatory abilities2. Moreover, studies on songbirds17,18 and non-human primates19 indicate a role for vocal production systems in voice/speech perception. This evidence implies that motoric encoding for auditory language perception has an early-developmental or evolutionary basis, possibly reflected in speech-motor areas20.

An alternative perspective argues that the motoric representations within speech-motor areas play a supplementary role in aiding speech discrimination during challenging situations, while fine-grained auditory processing underpins the core of speech perception21,22,23,24. Extending this, recent theories propose that speech-motor areas might also encode acoustic features, complementing the auditory cortex5,6,25. Consistent with this idea, an electrocorticography (ECoG) study demonstrated that ventral precentral cortex activity is organized by acoustic features during perception26. Additionally, a functional magnetic resonance imaging (fMRI) study indicated that articulatory and acoustic information was co-represented in the ventral precentral cortex and IFG during speech perception27. Research on speech production further illustrates this complex motor-sensory interaction. Zhang et al.28 found that during overt and silent articulation, articulatory motoric information was represented not only in speech-motor regions such as the left anterior insula and IFG, but also in somatosensory and auditory regions like the superior temporal gyrus (STG), suggesting an interaction between motor regions and sensory pathways. In that study, however, acoustic information remained confined to auditory regions, particularly the STG, and was not detected in motor areas28. It therefore remains an open question whether motoric, auditory, or both types of representations are present in speech-motor areas during perception, and how motor areas interact with sensory regions.

When language is presented visually (as in reading), do production-perception interactions occur similarly to those in auditory processing? Unlike speech, reading is a skill acquired later through intensive training and is closely intertwined with handwriting during literacy development29. This reading-writing coupling refines the spatial representations of visual word forms30,31,32,33 and supports motor memory for written words34,35,36,37. Consequently, it has been proposed that the brain regions involved in both reading and writing may jointly contribute to visual word recognition38,39,40,41.

Recent studies have extended our understanding of the role of production-related motor regions in reading. It has been observed that visual word recognition involves not only the ventral occipitotemporal system for recognizing word shape, but also the dorsal writing-associated motor regions, such as Exner’s area in the posterior part of the left superior frontal sulcus or the dorsal left middle frontal gyrus42,43,44. Exner’s area is regarded as the graphemic center45, crucial for associating orthographic representations with handwriting motor commands46,47,48, and it plays a significant role in facilitating visual word processing38,42,43,44. Another important handwriting-related motor region involved in reading is the left superior parietal lobule/intraparietal sulcus (SPL/IPS), which is suggested to engage in the visual-motor sequence processing of written strokes and also contributes to reading30,49.

However, the precise nature of neural representations within these writing-motor areas during visual word perception remains underexplored. The prevailing view is that the writing-motor regions encode motoric writing gestures during reading40,41,43, as evidenced by the increased activation within Exner’s area when viewing words or characters in stroke-by-stroke motion43,49,50, or following handwriting training30,31,51. Nevertheless, most of these studies might be influenced by confounding factors, such as explicit sequence processing. In contrast to motoric encoding, recent behavioral studies underscore the significance of visual-spatial information over writing-motoric knowledge in affecting reaction times for letter or character discrimination33,52, suggesting a greater role of visual-form encoding53. Despite these insights, it remains unclear whether writing-motor areas represent motoric, sensory, or both features, and how they associate with visual processing regions.

For a clearer picture of the encoding and integration mechanisms in writing-motor areas, representational similarity analysis (RSA) offers a promising approach. RSA characterizes the information represented in a brain region by quantifying the similarity between multi-voxel neural response patterns and the feature parameters across different stimuli, using representational dissimilarity matrices (RDMs)54,55. Rothlein and Rapp (2014) applied RSA to differentiate the stroke-motoric and visual features of English letters in passive reading. They found motoric encoding in the left IPS and visual encoding in the posterior occipitotemporal and precuneus regions, suggesting that motoric and visual representations occupy distinct regions during English letter processing56. Nevertheless, their use of simple letter stimuli may not reflect the typical processing of words or morphemes in reading. Chinese characters offer a unique window onto motoric/sensory representations in visual word processing, as single characters are monosyllabic units that can form meaningful words on their own and serve as basic units in natural reading. Moreover, the complex structure of Chinese characters likely ties reading acquisition in Chinese more tightly to handwriting practice43,44, providing a valuable context for investigating the interaction between the representations of the writing-motor and visual processing systems.

Taken together, substantial evidence suggests that language-motor areas contribute to both auditory and visual modalities of language perception, though the precise nature of these representations is not yet fully understood. Notably, these motor areas exhibit a modality-bound encoding pattern, consistent with the sensorimotor hypothesis of embodied cognition57,58,59,60, which suggests that motor regions mirror the sensorimotor experience of specific language processing modalities in language perception. Specifically, speech-motor areas likely encode articulatory or acoustic information linked to speaking and listening, while writing-motor areas are suggested to encode writing programs or visual features of written words related to handwriting and reading. Nevertheless, most studies have focused on individual modalities, with little exploration of how these language-motor regions function across auditory and visual modalities. Existing studies have explored neural activation patterns between speech and reading, emphasizing distinctions between unimodal versus multimodal regions, and modality-specific effects in language comprehension61,62,63,64,65. Despite differences in functional localization or activation strength, speech perception and reading may share consistent representational patterns, such as in processing high-level semantic information65. However, no study has systematically compared how language-motor areas represent sensory and motoric features in auditory and visual language processing in a comparable experimental framework.

The primary aim of this study was to characterize the neural representations in speech-motor areas during auditory speech perception and in writing-motor areas during visual word perception. We focused on the phonetic and orthographic sensorimotor features tied to the articulatory and handwriting motor programs involved in producing these stimuli, as well as their auditory and visual forms. These features are fundamental to higher-order language processes like lexical and semantic processing66. To ensure comparability, we employed consistent experimental designs and analysis protocols for both modalities, including a mixed block and event-related design for robust item-based analysis67,68, and RSA to evaluate encoding patterns within language-motor regions. Using fMRI, we measured blood-oxygen-level dependent (BOLD) responses in language-motor areas during perception tasks in two modalities: auditory speech syllable perception (SP) and visual character perception (CH). To identify speech-motor and writing-motor areas, participants also performed language production localizer tasks for syllable articulation (AR) and finger writing (WR). Speech and character stimuli were categorized along motoric and sensory dimensions, enabling the construction of Representational Dissimilarity Matrices (RDMs). RSA was then applied to assess correlations between neural activity and motoric/sensory features of language stimuli55, reflecting motoric or sensory encoding.

More importantly, a cross-modality comparison of results allowed us to explore whether the sensorimotor system generally encodes modality-bound motoric (production-related) or sensory (perception-related) features across modalities, or whether it exhibits distinct representational patterns for each modality. One hypothesis posits that language-motor areas share general sensorimotor encoding patterns across modalities to support recognition39, consistent with the sensorimotor hypothesis of embodied cognition60. Alternatively, motor areas may exhibit modality-specific encoding patterns to meet the unique demands of auditory and visual perception. We hypothesize that speech-motor areas may exhibit stronger motoric or sensory encoding effects during speech perception than writing-motor areas do during visual character perception. Two key factors likely support this hypothesis: (1) Speech perception involves strengthened motor-sensory interaction due to evolutionary or innate production-perception-coupling mechanisms2,19,20. This coupling might be further reinforced by the more frequent co-occurrence of perception and production in speech compared to reading-writing (especially in mature adult readers); (2) Speech perception typically imposes stricter temporal demands, necessitating rapid sensorimotor encoding27,69,70,71,72, while reading generally allows more processing time and may exhibit subtler sensorimotor encoding effects.

Our multivariate RSA results revealed shared motoric representational patterns in language-motor regions across modalities (i.e., the right preCG in the SP task and the left SFG in the CH task), supporting the involvement of language-motor areas in encoding production-related features during both auditory and visual language perception. Additionally, motoric representations were also observed in sensory areas in both modalities (left STG and HG in the SP task, and right SOG in the CH task); this motoric encoding within sensory regions was correlated with that in motor regions, highlighting a dynamic sensory-motor interaction that facilitates integrative language processing across modalities. However, sensory encoding demonstrated a modality-specific representational pattern, as it was observed only in the auditory task, within both motor and sensory regions (left IFG and STG), but not in the visual task. These findings align with the embodied hypothesis of language processing57,58,59,60 and suggest that motor areas may play differential roles in sensory encoding during auditory versus visual tasks, potentially shaped by early developmental or genetic factors and the distinct demands of each sensory modality.

Results

Language-motor areas were identified using two block-designed localizer tasks: a syllable articulation (AR) task to localize speech-motor areas (for experimental paradigm see Fig. 1a, c) and a finger writing (WR) task to localize writing-motor areas (for experimental paradigm see Fig. 1b, d; for material details see Table S1). Participants’ neural (BOLD) responses and task performance were examined in two modality-specific language perception tasks. In the auditory speech perception (SP) task, participants listened to syllables or reversed speech sounds (auditory control) spoken by a male speaker, and responded to target female voices (Fig. 1e). In the visual character perception (CH) task, participants viewed Chinese characters or scrambled visual stimuli (visual control) in black font, and responded to infrequent gray-colored targets (Fig. 1f). These perception tasks utilized a mixed block and event-related design67,68, enabling univariate analysis of sensorimotor regions involved in perception, as well as item-based representational similarity analysis (RSA) to investigate neural representational patterns (for analysis steps see Fig. 2). Stimuli in each task were systematically categorized based on motoric and sensory properties to ensure comparisons across modalities (for experimental paradigm see Figs. 1g, h and 2a, b; for material and value details see Tables S2–S9). This design provided a framework to examine modality-general and modality-specific encoding of motoric and sensory features in language-related motor and sensory regions (for details, see Methods).

Fig. 1: Experimental design and example stimuli.
figure 1

a Experimental design for articulation (AR) localizer task. Participants were instructed to repeatedly articulate the prompted syllables or move their lips up and down during the asterisk display. The functional localization hypotheses for the activated speech-motor areas are shown with orange circles. b Experimental design for finger writing (WR) localizer task. Participants were instructed to repeatedly write the prompted characters (listed in Table S1) or draw simple shapes with their right index finger during the asterisk display. The functional localization hypotheses for the activated writing-motor areas are shown with orange circles. c Example stimuli for AR task. d Example stimuli for WR task. e Experimental design for auditory speech perception (SP) task. Participants heard intelligible syllables or unintelligible reversed speech signals, responding to detect a female voice in 1/8 of all trials. The functional localization hypotheses for the activated speech-motor regions and auditory processing regions are shown with orange and blue circles, respectively. f Experimental design for visual character perception (CH) task. Participants viewed characters or scrambled characters, responding to detect gray-colored stimuli in 1/8 of all trials. The functional localization hypotheses for the activated writing-motor regions and visual processing regions are shown with orange and blue circles, respectively. g Example stimuli and motoric/sensory categorization for SP task. h Example stimuli and motoric/sensory categorization for CH task (listed in Table S3). Abbreviations: SOA = stimulus onset asynchrony; preCG = precentral gyrus; IFG = inferior frontal gyrus; HG = Heschl’s gyrus; STG = superior temporal gyrus; SFG = superior frontal gyrus; MFG = middle frontal gyrus; SPL/IPS = superior parietal lobule/intraparietal sulcus; ITG = inferior temporal gyrus. See also the abbreviations of brain regions in Table S14.

Fig. 2: Representational similarity analysis (RSA) steps for motor and sensory regions of interest (ROIs).
figure 2

a Representational similarity analysis (RSA) steps for the auditory speech perception (SP) task. For the key speech-motor regions and auditory processing regions (orange and blue circles illustrating the hypothesized locations), neural activity patterns were assessed using item-based general linear model (GLM) t-maps. Activity patterns were compared across items using Pearson correlation, and the resulting values were subtracted from 1 to generate neural representational dissimilarity matrices (RDMs). These neural RDMs were then correlated with the speech-motoric RDM, the high-level acoustic feature RDM, and the low-level acoustic spectrogram RDM, respectively, to explore the contents of neural representations. Detailed values are provided in Tables S4–S6. b RSA steps for the visual character perception (CH) task. For the key writing-motor regions and visual processing regions (orange and blue circles illustrating the hypothesized locations), neural activity patterns were assessed using item-based GLM t-maps. Activity patterns were compared across items using Pearson correlation, and the resulting values were subtracted from 1, resulting in neural RDMs. These neural RDMs were correlated with the stroke-motoric RDM, the high-level visual feature RDM, and the low-level visual pixel RDM. Detailed values are provided in Tables S7–S9. Abbreviations: preCG = precentral gyrus; IFG = inferior frontal gyrus; HG = Heschl’s gyrus; STG = superior temporal gyrus; SFG = superior frontal gyrus; MFG = middle frontal gyrus; SPL/IPS = superior parietal lobule/intraparietal sulcus; ITG = inferior temporal gyrus. Also see the abbreviations of brain regions in Table S14.

Behavioral performance of perception tasks

The average accuracy was 94.17% (SD = 6.34%) for auditory detection and 92.22% (SD = 6.54%) for visual detection, indicating sustained attention during both tasks.

Brain activation in articulation and finger writing tasks

To explore brain activation patterns in the language production localizer tasks, we employed a general linear model (GLM) with boxcar regressors for each task. Syllable articulation and character writing were contrasted with their respective motor controls (lip movement and drawing) to generate individual-level contrast images. These contrast images were then used in group-level analyses to identify speech-motor and writing-motor areas (see Methods for the preprocessing and modeling details). Results were reported at p < 0.05, FDR corrected, at the whole-brain level.

The results showed that the AR task elicited syllable-production-related activation (vs. lip movement) in typical frontal speech-motor regions, including the left precentral gyrus (preCG) and the triangular part of the left inferior frontal gyrus (IFGtri), as shown in Fig. 3a (voxel-level: n = 29, t = 4.084, p < 0.05 FDR-corrected). The WR task showed character-writing-related activation (vs. drawing) predominantly in parietal and prefrontal regions, encompassing the bilateral superior frontal gyrus (SFG), left supplementary motor area (SMA), bilateral superior parietal lobules (SPL), left inferior parietal lobule (IPL), and right cerebellum, as shown in Fig. 3b (voxel-level: n = 29, t = 3.432, p < 0.05 FDR-corrected). The activation patterns were in line with previous studies of both speech production7,11,49,69,73 and writing tasks47,74,75,76,77. Articulation activated more ventral frontal brain regions, whereas writing activated more dorsal frontal and parietal brain regions (refer to Table S10 for peak activation coordinates and effect sizes).

Fig. 3: Univariate results for language production localizers and perception tasks.
figure 3

a Univariate general linear model (GLM) results for speech production (syllable articulation contrasted with lip movement). b Univariate GLM results for character production (finger writing contrasted with drawing). For both a and b, brain activations (t-value maps) for language production are displayed with superior, lateral and medial surface-rendered views. These results were thresholded at voxel-corrected p < 0.05 with FDR correction for multiple comparisons (cluster extent threshold = 10 voxels). The peak coordinate results and effect sizes for each cluster (Cohen’s d and Hedges’ g) are reported in Table S10. c Univariate GLM results for auditory speech perception (spoken syllables contrasted with reversed speech or rest/fixation). d Univariate GLM results for visual character perception (visual characters contrasted with scrambled characters or rest/fixation). For both c and d, brain activations (t-value maps) for language perception are displayed with superior, lateral and medial surface-rendered views. These results were thresholded at uncorrected p < 0.01 at the voxel level (cluster extent threshold = 10 voxels) for the contrast between speech/character and control conditions, and at p < 0.05 with FDR correction for contrasts between speech/character and rest/fixation. The peak coordinate results and effect sizes for each cluster (Cohen’s d and Hedges’ g) are reported in Tables S11–S12. Abbreviations: L = left; R = right.

Brain activation in speech syllable and visual character perception tasks

Likewise, GLMs were conducted to investigate brain activation patterns in the perception tasks. At the individual level, spoken syllable perception was contrasted with reversed speech perception (auditory control) and rest (fixation) in the SP task, and visual character perception was contrasted with scrambled character perception (visual control) and rest (fixation) in the CH task. Participant-specific contrast images were then used in group-level analyses to identify regions involved in auditory and visual language perception (see Methods). Results were reported at p < 0.01 (uncorrected) for the “speech/character vs. control” contrasts to address weaker motor activations in passive perception tasks11,12,27,38, and at p < 0.05 (FDR corrected) for the “speech/character vs. rest/fixation” contrasts.

As shown in Fig. 3c, in the SP task, listening to intelligible speech syllables elicited greater activation mainly in typical ventral frontal sensorimotor regions and the superior temporal regions (spoken syllable vs. reversed speech perception: n = 29, t = 2.462, p < 0.01 uncorrected; spoken syllable vs. rest/fixation: n = 29, t = 2.832, p < 0.05 FDR-corrected; refer to Table S11 for peak activation coordinates and effect sizes). Regions that were significantly activated in auditory syllable perception included the bilateral precentral and postcentral gyrus (preCG/postCG), right superior frontal gyrus (SFG), the opercular part of the bilateral inferior frontal gyrus (IFGoper), bilateral insula (Ins), bilateral supplementary motor area (SMA), bilateral superior temporal gyrus/Heschl’s gyrus (STG/HG), and the right supramarginal gyrus (SMG).

As shown in Fig. 3d, in the CH task, viewing visual characters activated a broader network covering frontal, temporal, parietal, and occipital regions (visual character vs. scrambled character perception: n = 29, t = 2.462, p < 0.01 uncorrected; visual character vs. rest/fixation: n = 29, t = 3.396, p < 0.05 FDR-corrected; refer to Table S12 for peak activation coordinates and effect sizes). The brain activations for visual character perception were mainly found in the bilateral precentral and postcentral gyrus (preCG/postCG), left superior/inferior frontal gyrus (SFG/IFG), bilateral middle frontal gyrus (MFG), left insula (Ins), bilateral supplementary motor area (SMA), bilateral superior/middle/inferior temporal gyrus (STG/MTG/ITG), left superior occipital gyrus (SOG), bilateral inferior occipital gyrus (IOG), bilateral lingual and fusiform gyri (Ling/Fusi), bilateral inferior parietal lobule (IPL), and bilateral supramarginal gyrus (SMG).

The brain activation patterns for perception tasks indicate the involvement of both sensory and motor systems in language perception across auditory and visual modalities, consistent with earlier research on the neural correlates of auditory speech perception7,11,27 and visual character/word perception43,44,50.

Representational patterns in speech syllable and visual character perception tasks

To investigate the motoric and sensory representational patterns during language perception, RSA was performed in key regions of interest (ROIs) identified from the production and perception task results. These ROIs included speech-motor areas (bilateral preCG, IFG, Insula, SMA), writing-motor areas (bilateral SFG, MFG, preCG, SPL/IPS), auditory processing areas (bilateral HG, STG), and visual processing areas (bilateral ITG-Fusiform, IOG, SOG). Neural RDMs were constructed from the top 100 most active voxels within each ROI (defined by AAL anatomical templates), based on activation patterns from the corresponding perception task. Predictor RDMs captured motoric and sensory features of the stimuli, including speech-motoric, high-level acoustic, and low-level acoustic features for the SP task, and stroke-motoric, high-level visual, and low-level visual features for the CH task (Fig. 2a, b). For each ROI, the relationship between neural and feature RDMs was assessed using a two-stage permutation test78,79. At the individual level, feature RDMs were shuffled 100 times to create a null distribution of randomized correlations with neural RDMs. At the group level, one randomized correlation was sampled from each participant’s null distribution, and the group mean was calculated. This process was repeated 10,000 times to generate a null distribution of group-level correlations. P-values were corrected for multiple comparisons using the FDR method. The RSA results are shown in Fig. 4.
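
As an illustration of this pipeline, the following MATLAB sketch implements the item-based RSA and the two-stage permutation test for a single ROI and feature RDM. The variable names, the use of Pearson correlation for comparing RDMs, and the aggregation of subject-level results are illustrative assumptions rather than the original analysis code.

```matlab
% Minimal sketch of the item-based RSA with a two-stage permutation test.
% Assumptions (not from the original code): roiPatterns is an
% [nItems x 100] matrix of item-wise t-values for the top-100 voxels of
% one ROI; featureRDM is the [nItems x nItems] model dissimilarity matrix;
% RDMs are compared with Pearson correlation.
nItems = size(roiPatterns, 1);
lt     = tril(true(nItems), -1);                   % lower-triangle item pairs

neuralRDM = 1 - corr(roiPatterns');                % 1 - Pearson r between item patterns
zObs = atanh(corr(neuralRDM(lt), featureRDM(lt))); % Fisher z-transformed RSA effect

% Stage 1 (per participant): 100 correlations with item-shuffled feature RDMs
nPermSubj = 100;
zNullSubj = zeros(nPermSubj, 1);
for p = 1:nPermSubj
    idx  = randperm(nItems);
    shuf = featureRDM(idx, idx);
    zNullSubj(p) = atanh(corr(neuralRDM(lt), shuf(lt)));
end

% Stage 2 (group): draw one null value per participant and average, 10,000 times.
% zObsAll [nSubj x 1] and zNullAll [nSubj x 100] collect the subject-level results.
nSubj      = size(zNullAll, 1);
nPermGroup = 10000;
groupNull  = zeros(nPermGroup, 1);
for g = 1:nPermGroup
    pick = randi(nPermSubj, nSubj, 1);
    groupNull(g) = mean(zNullAll(sub2ind(size(zNullAll), (1:nSubj)', pick)));
end
pValue = mean(groupNull >= mean(zObsAll));   % one-sided group-level p, before FDR correction
```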

Fig. 4: Representational similarity analysis (RSA) results within motor and sensory regions of interest (ROIs) for perception tasks.
figure 4

a Representational similarity analysis (RSA) results for auditory speech perception (SP) task. The localization of speech-motor and auditory processing regions of interest (ROIs) is shown with the AAL anatomical template, where the top 100 active voxels for each ROI were selected for item-based RSA. Bar plots (and error bars) indicate mean values (and standard errors) of Fisher z-transformed correlation effects between neural representational dissimilarity matrices (RDMs) and the speech-motoric RDM, high-level acoustic feature RDM, and low-level acoustic spectrogram RDM across participants, in each ROI in the left and right hemispheres. b RSA results for the visual character perception (CH) task. The localization of writing-motor and visual processing ROIs is shown with the AAL anatomical template, where the top 100 active voxels for each ROI were selected for item-based RSA. Bar plots (and error bars) indicate mean values (and standard errors) of Fisher z-transformed correlation effects between neural RDMs and the stroke-motoric RDM, high-level visual feature RDM, and low-level visual pixel RDM across participants, in each ROI in the left and right hemispheres. For both a and b, a half violin plot was overlaid behind each bar plot to depict the distribution of the data, and individual data points are represented by light-colored circles to the right of the bar plots. Asterisks indicate significant correlation effects (*p < 0.05 with FDR correction). Abbreviations: Hemi. = hemisphere; preCG = precentral gyrus; IFG = inferior frontal gyrus; SMA = supplementary motor area; HG = Heschl’s gyrus; STG = superior temporal gyrus; SFG = superior frontal gyrus; MFG = middle frontal gyrus; SPL/IPS = superior parietal lobule/intraparietal sulcus; ITG-Fusi = inferior temporal gyrus and fusiform gyrus; IOG = inferior occipital gyrus; SOG = superior occipital gyrus. Also see the abbreviations of brain regions in Table S14.

When listening to auditory syllables (Fig. 4a), speech-motoric dissimilarity predicted neural dissimilarity in right preCG (z = 0.0990, p = 0.009), left HG (z = 0.0697, p = 0.0401), and left STG (z = 0.0775, p = 0.0227). Moreover, the effect of high-level speech-acoustic dissimilarity was observed in both left IFG (z = 0.0638, p = 0.0461) and left STG regions (z = 0.0783, p = 0.0248), while the effect of low-level acoustic spectrogram dissimilarity was only found in right HG (z = 0.0610, p = 0.0309). No correlation effects were observed in other ROIs (ps > 0.057).

When viewing visual characters (Fig. 4b), significant correlations between stroke-motoric dissimilarity and neural dissimilarity were found in left SFG (z = 0.0196, p = 0.0297) and right SOG (z = 0.0175, p = 0.0414). No motoric effects were observed in other ROIs (ps > 0.122). Neither the high-level visual feature RDM nor the low-level visual pixel RDM predicted the neural RDMs in any writing-motor or visual processing ROIs (ps > 0.267).

To further validate our findings, we examined the representational connectivity80 between the motor and sensory ROIs that demonstrated significant feature encoding during the language perception tasks (Figure S1). We computed the correlations between motor-ROI neural RDMs and sensory-ROI neural RDMs using the top 100 most active voxels. Group-level significance was assessed with a permutation approach similar to that used for the RSA. Specifically, group-mean correlations between neural RDMs from different ROIs were compared against a null distribution generated from 50,000 random correlations (derived from 100 shuffled correlations per participant) for all ROI pairs. P-values were corrected using the Bonferroni method for multiple comparisons. For the SP task, areas involved in articulatory motoric processing exhibited significant cross-regional correlations, including left STG vs. right preCG (z = 0.166, p = 0.0001), left HG vs. right preCG (z = 0.201, p = 0.0001), and left STG vs. left HG (z = 0.290, p = 0.0001). Additionally, for high-level acoustic processing, a significant correlation was observed between left IFG and left STG (z = 0.223, p = 0.0001). Similarly, for the CH task, areas involved in stroke-motoric processing also showed significant correlations (left SFG vs. right SOG: z = 0.094, p = 0.0005). These results further support the notion of shared motoric/sensory representations between motor and sensory regions in language perception.
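
The representational connectivity computation can be sketched along the same lines. The code below assumes rdmMotor and rdmSensory are the neural RDMs of one motor ROI and one sensory ROI (e.g., right preCG and left STG), built as above; it is illustrative rather than the original implementation.

```matlab
% Sketch of the cross-regional (representational connectivity) correlation.
% rdmMotor and rdmSensory are [nItems x nItems] neural RDMs; names are illustrative.
n  = size(rdmMotor, 1);
lt = tril(true(n), -1);
connZ = atanh(corr(rdmMotor(lt), rdmSensory(lt)));   % Fisher z of RDM-to-RDM correlation

% One null sample is obtained as in the RSA step, by shuffling the item labels
% of one RDM; pooling 100 shuffles per participant yields the permutation
% distribution against which the group-mean connZ is compared.
idx   = randperm(n);
shuf  = rdmSensory(idx, idx);
zNull = atanh(corr(rdmMotor(lt), shuf(lt)));
```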

Discussion

The present study investigated brain activity during auditory speech and visual character perception, corroborating existing literature on the involvement of language-motor and sensory regions in these processes, in both auditory7,11,14,27 and visual modalities43,44,50,81. Using RSA, we observed that the language-motor areas, along with some sensory regions, exhibited a general motoric representation pattern across both modalities, while sensory representation was observed solely during auditory perception.

In speech perception, the right preCG exhibited representational patterns associated with the place of articulation features of spoken syllables. While previous research has reported potential motoric encoding in the left preCG11,12,26 or bilaterally14,27,82,83, our findings provide direct evidence for motoric representations within speech-motor areas during speech perception using RSA84. This result aligns with theories proposing that motor mechanisms are integral to speech perception9,10,85. Notably, the motoric representation effect in the right preCG contrasts with earlier reports emphasizing left-hemisphere dominance. A plausible explanation is that the left speech-motor area assumes a dominant role in high-demand contexts69,73, while the right area might compensate in low-demand conditions, such as passive listening or simple tasks. This hypothesis is consistent with prior findings that the right speech-motor area encodes phoneme-specific information in quiet but not noisy situations69. Additionally, even during passive listening, participants might engage in implicit imitation of the heard speech, activating bilateral sensory-motor mechanisms, as demonstrated in Cogan et al.’s ECoG study14 on overt tasks like listen-speak and listen-imitation. This bilateral sensory-motor transformation potentially provides a unified interface essential for speech production, acquisition, and self-monitoring14. Future research should explore whether this lateralization pattern reflects distinct processes of sensory-motor integration in speech perception.

In visual word perception, motoric encoding was observed in the left SFG, likely corresponding to Exner’s area, a critical center for handwriting46,47. The SFG/Exner’s area is believed to link orthographic representations and handwriting motor commands30,43,44,50,74,75. Moreover, this region has shown sensitivity to orthographic irregularities, suggesting its role in complex orthographic-motor transformations86. The correlation between SFG activity and stroke-motoric features provides direct evidence for motoric representation in this area. Interestingly, such motoric representation was not evident in the left SPL/IPS, which has been suggested to process motoric writing-sequence information in visual word reading38,42,50,56,87. Previous RSA research on English letters reported stroke-motoric sensitivity in the left IPS during reading56, while our findings highlight SFG/Exner’s area for processing stroke-motoric features in Chinese characters. This divergence may reflect differences in representational contents encoded within these writing-motor areas74, or distinct neural mechanisms underpinning different writing systems, such as alphabetic letters (English) versus logographic characters (Chinese)44. It is important to explore how sensorimotor neural networks adapt to the varied demands of diverse languages and writing systems to further clarify the functional roles of these regions in visual word perception.

Remarkably, motoric representations were also identified in sensory processing regions. In speech perception, motoric representations were observed in the left STG and HG, regions known for fine-grained acoustic processing during speech recognition27,88,89. These findings are consistent with research indicating that the auditory cortex encodes articulatory motoric information during both speech perception27 and speech production28. In visual character perception, motoric representation was also detected in the right SOG. The right SOG is typically associated with intensive visual-spatial analysis in orthographic processing90,91,92. Our result suggests that visual processing regions may encode motoric information related to handwriting, potentially facilitating the visual-spatial analysis required for processing written characters50.

Grounded in the above findings, our study suggests that both auditory and visual language perception engage motoric representations tied to language production39. These representations, observed in both motor and sensory regions, reflect a shared motoric encoding mechanism within a cooperative sensorimotor network. Furthermore, the results of cross-region representational connectivity suggest that motor areas (e.g., right preCG in SP task, or left SFG in CH task) work closely with sensory regions (e.g., left STG and HG in SP task, or right SOG in CH task) in encoding motoric features, possibly supporting the integrative/interactive processing through reciprocal connections27,50. Earlier studies have demonstrated robust connections between motor and sensory systems during language perception across both auditory15,70,93 and visual modalities40. Moreover, in speech perception, motor areas (e.g., IFG and preCG) have been observed to exert top-down modulatory effects on temporal auditory regions (such as STG)69,70. These interactions suggest that motoric representations in sensory areas may originate from motor regions, supporting a sensorimotor integration framework rather than a purely bottom-up model of language processing.

Sensory areas may encode motoric features that are either identical or complementary to those represented in the motor areas26,38,43,69,73. While motor areas possibly prioritize the processing of movement planning features, sensory areas likely play a sensory-motor transformation role, encoding abstract acoustic or visual counterparts of these movements. For example, during speech imitation, the auditory cortex encodes both acoustic formant and articulatory motoric features, indicating a bimodal representation pattern that bridges sensory input and motor output28. This mechanism, rooted in feedback between motor and sensory systems during natural language communication, likely enhances both language perception and production94. Although less explored in visual word processing, similar feedback mechanisms are thought to exist between the visual and motor systems, as frequent transformations between visual shapes and writing programs could strengthen sensorimotor connections30,41. This may be supported by findings that early-blind adults and patients who undergo surgical restoration of vision demonstrate remarkable plasticity in the visual temporal-occipital cortex, which adapts to process tactile and motoric information, compensating for visual deficits and possibly relating to multimodal feedback between visual, tactile and motor systems95,96.

Moreover, this sensory-motor representation/transformation mechanism may become more pronounced under challenging conditions to facilitate the recognition of syllables/characters. For instance, Du et al.69 found that both speech-motor regions and the auditory cortex demonstrated phoneme discriminability, with motor areas enhancing discrimination in noisy conditions to help with speech recognition. These enhanced motoric representations may also play a predictive role in aiding perception69,73. Similarly, viewing handwritten words more strongly engages both visual processing and writing-motor areas compared to printed words36,97. In this study, while the use of simple stimuli and passive listening minimized explicit motor engagement, it potentially reduced the need for motoric encoding or prediction. Future studies should explore the effects of task complexity and investigate the spatiotemporal connectivity between motor and sensory regions to elucidate the interplay between motoric and sensory representations under varying conditions.

The observed motoric representations within language-related motor and sensory regions align with the sensorimotor hypothesis proposed by the theory of embodied cognition57,58,59,60. This theory suggests that language cognition is deeply intertwined with the bodily sensory and motor experiences in the environment, such as learning words through speaking or handwriting. Consequently, the internal representations of language stimuli may comprise production-related motoric information, which could be retrieved and activated within the language-motor system during both auditory11,12,26,27 and visual language perception43,50. Understanding this relationship may also illuminate how children acquire language in different modalities, as they often develop perceptual language skills concurrently with motor skills, including speaking3,4,98 and handwriting30,35.

In contrast to motoric representations, our findings reveal distinct patterns of sensory representations in motor areas for auditory and visual modalities. During speech perception, significant sensitivity was observed in the left IFG and left STG to high-level acoustic features (categorized by the manner of articulation), and in the right HG to low-level acoustic spectrogram information, consistent with previous research27. However, no analogous sensory encoding effect was observed in either sensory or motor regions for visual character perception, despite previous evidence of visual encoding in occipitotemporal regions43,56. Additionally, the visual task demonstrated weaker activation and feature-brain correlations compared to the auditory task.

This divergence likely arises from the inherent distinctions between auditory and visual language processing. In speech perception, sensory encoding plays a fundamental role, integrating closely with production processes11,12,27 from early language acquisition stages. Sensorimotor integration is evident even in pre-babbling infants, where speech perception and production systems mutually reinforce each other to support the development of robust speech perceptual abilities1,2,3,4,14. Furthermore, auditory speech comprehension often demands fine sensory discrimination to overcome environmental noise, strengthening the sensorimotor network during auditory tasks69.

In contrast, visual language processing, particularly reading comprehension, appears to rely less on sensorimotor coupling or detailed visual encoding, as evidenced by the absence of sensory representation effects and the weaker activation/representation effects for visual character perception in our study. Interestingly, motoric representations were more prominent during visual character perception, contrasting with prior behavioral findings emphasizing the role of visual analysis over motoric information in word recognition33,52,53, but consistent with research showing increased motor area involvement in individuals with extensive handwriting experience30,31,43,44,51. However, the overall reduction in sensorimotor encoding may reflect the gradual development of sensorimotor integration in the visual modality, which becomes more pronounced with handwriting practice, compared to the more innate production-perception coupling in speech.

Nevertheless, the observed modality differences could also be influenced by additional factors. For example, the relative simplicity of the visual task may have lowered the reliance on detailed sensory encoding, particularly given the absence of visually similar distractors or high visual interference. Moreover, stroke-motoric representations might facilitate visual-spatial processing, potentially reducing the cognitive load required for visual analysis36,87. Additionally, differences in stimulus presentation between tasks could also contribute to the observed disparities. The auditory task utilized a smaller stimulus set with more repetitions, likely leading to more stable sensory and motoric representations. In contrast, the visual task involved a larger stimulus set with fewer repetitions, potentially resulting in less robust estimates of these representations. These factors suggest that while our findings highlight differences in the neural representations of motoric and sensory features across modalities, they should be interpreted within a broader context of task complexity and stimulus characteristics.

Our results contribute to the growing body of literature on the multimodal nature of language perception, emphasizing the role of the motor system in supporting motoric and sensory representations across modalities26,27,43,50. The motor system’s engagement in language perception reflects a general mechanism of sensorimotor collaboration, yet its involvement is adaptive, likely varying with the degree of sensory-motor coupling demanded by auditory versus visual processing modalities. The pronounced sensory encoding in both motor and sensory regions during speech perception, contrasted with its absence in visual character perception, suggests modality-dependent reliance on sensorimotor integration networks. Auditory perception appears to depend more heavily on sensorimotor interactions, possibly due to its evolutionary basis or the need for precise auditory discrimination, especially under challenging conditions like noisy environments2,19,20. These findings advance our understanding of language comprehension within sensorimotor networks and offer practical implications for developing interventions for language disorders or optimizing language learning strategies. Additionally, our study employed a modality-comparable paradigm and used unified motoric and sensory features to explore sensorimotor encoding, providing a foundation for future cross-modality investigations.

While our study sheds light on motor regions’ automatic engagement under natural perception conditions, its scope might be limited by the use of simple stimuli (monosyllabic sounds and single-radical characters) and passive perception tasks. These conditions likely lessened sensory processing demands, resulting in weaker sensorimotor encoding effects, especially for visual characters with minimal visual interference. Motor regions may engage selectively in language processing under diverse demanding conditions22,23,24. Future research could explore how sensorimotor representations adapt to varying perceptual difficulties, such as noisy speech or handwritten words, which could intensify motor area engagement and enhance motor-sensory connectivity69,97. Additionally, the current design does not address higher-order language processes, such as semantic, syntactic, or prosodic processing, where motor areas might demonstrate significant involvement99,100,101,102. Exploring how basic sensorimotor representations in language-motor areas influence higher-order language processes remains an exciting direction for future research. Moreover, while fMRI provides insights into the spatial patterns of sensory-motor integration, its limited temporal resolution precludes analysis of the dynamic interplay between language-motor and sensory systems. Future studies using high temporal resolution techniques like MEG/EEG or multimodal MEG-fMRI could capture the fine-grained temporal-spatial dynamics of sensory-motor integration across modalities, which would further enhance our understanding of how motor and sensory systems coordinate during real-time language comprehension.

Methods

Participants

Thirty-six native Mandarin-Chinese-speaking participants (mean age = 19.75 years, SD = 2.63, range 18–30 years; 19 females, 17 males) were recruited for the current study. All participants were right-handed, reported normal hearing and normal or corrected-to-normal vision, and were neurologically healthy. The study was approved by the Ethics and Human Protection Committee of Shenzhen University. Participants provided informed written consent before the formal experiment. All ethical regulations relevant to human research participants were followed.

Experimental Design and Procedure

Participants completed four tasks: speech perception (SP), visual-character perception (CH), articulation localizer (AR) and finger writing localizer (WR) tasks. To minimize task order effects, participants were randomly assigned to one of four sequences (SP-CH-AR-WR; CH-SP-WR-AR; AR-WR-SP-CH; WR-AR-CH-SP). Response hands for SP and CH tasks were counterbalanced among participants. Each task included a brief practice to ensure comprehension of task instructions. Stimulus presentation was controlled using custom-written MATLAB scripts with Psychtoolbox103.

Production Localizer Tasks

The language production localizer tasks aimed to identify language-motor regions of interest (ROIs). Both AR and WR tasks were performed using a block design, each comprising eight language-motor blocks and eight control blocks presented in an interleaved manner. Each block began with a 2-s visual cue prompting the stimulus to be spoken or written, followed by a 1-s blank period and then a 16-s response period during which an asterisk was displayed on the screen. Inter-block intervals followed a repeating pattern of 4 s, 4 s, 4 s, and 19 s across every four blocks. A fixation cross was displayed throughout the intervals (Fig. 1a, b).

The AR localizer included eight speech-motor blocks and eight lip-movement (control) blocks. In each speech-motor block, participants saw a cue of a written consonant-vowel (CV) syllable (e.g., /ba/) on the screen and silently repeated the syllable with minimal movement during the asterisk display. Eight CV syllables were used (i.e., /ba/, /pa/, /da/, /ta/, /ga/, /ka/, /sa/, /ca/), identical to those in SP task (see stimuli details in SP task). In each lip-movement (control) block, participants viewed a cue of a line-drawn mouth and repeatedly moved their lips up and down minimally without articulating syllables (Fig. 1a, c).

The WR localizer included eight writing blocks and eight drawing (control) blocks. In each writing block, participants repeatedly wrote the prompted Chinese characters with their right index finger in the air during the asterisk display. Although this differs slightly from natural handwriting, it was designed to isolate neural regions associated with orthographic and graphomotor processing unique to handwriting (e.g., tracing stroke trajectories) while excluding non-specific sensorimotor components (e.g., pen-holding, visual feedback)75,104,105. The eight characters were selected for familiarity and ease of writing (see details in Tables S1, S2), excluding those used in CH task. In each drawing (control) block, participants repeatedly drew circles or triangles with their right index finger in the air (Fig. 1b, d).

Perception Tasks

Participants completed four SP runs and four CH runs, using a mixed block/event-related design67,68. This experimental design ensured robust statistical power for detecting activation during speech and visual word perception and allowed detailed analysis of sensory-motor representations in language-motor areas with item-based RSA. To focus on motoric/sensory encoding during passive perception, a detection task (in 1/8 of trials, excluded from the main analysis) was applied to maintain participants’ attention on stimuli while minimizing additional motor responses and rehearsal.

Each SP run consisted of two syllable perception blocks and two auditory control blocks, interleaved, each lasting 50 s with 15-s inter-block intervals. Ten syllable/control stimuli were presented per block in a random, temporally jittered manner following an event-related design (SOA 3–6 s with a uniform distribution). Each syllable/control stimulus was repeated twenty times across the four runs. Trial numbers were optimized to balance participant engagement and data quality within fMRI session limits. More trials were used for speech stimuli to counteract scanner noise and achieve a higher signal-to-noise ratio. Participants listened to syllables spoken by a male speaker and their time-reversed control signals binaurally through earphones. They performed an auditory detection task, pressing a button with their left/right hand as quickly and accurately as possible when hearing a female voice in 1/8 of the trials (Fig. 1e).

Each CH run consisted of three character perception blocks and three visual control blocks, interleaved, each lasting 36 s with 15-s intervals. Eight character/control stimuli were presented per block in a random, jittered order (SOA 3–7 s with a uniform distribution). Each stimulus was displayed for 500 ms, followed by a black fixation cross on a light gray background. Each stimulus was repeated four times, once in each of the four runs. Participants focused on the visual stimuli and pressed a button upon detecting dark-gray character/control stimuli among black stimuli (in 1/8 of all trials) (Fig. 1f).

Stimuli for Perception Tasks

To isolate fundamental sensorimotor representations, simple spoken CV syllables and visually-presented Chinese characters were used in SP and CH tasks. Action-related words were excluded to avoid confounding motor-related effects from higher-level lexical/semantic processing99,101.

Speech Perception Task (SP)

In the SP task, eight CV syllables (/ba/, /pa/, /da/, /ta/, /ga/, /ka/, /sa/, /ca/) were selected, similar to Cheung et al.26. We substituted the English consonant /ʃ/ with the Mandarin Chinese /c/ because /c/ more closely resembles the other alveolar consonants (/d/, /t/, /s/), which are articulated with the anterior part of the tongue, than the fricative /ʃ/ does. The consonants represented distinct phonetic features based on the place of articulation (bilabial: /b/, /p/; alveolar: /d/, /t/, /s/, /c/; velar: /g/, /k/) and the manner of articulation (voiced plosives: /b/, /d/, /g/; voiceless plosives: /p/, /t/, /k/; fricatives: /s/, /c/). These distinctions enabled classification of syllables based on both articulator-bound acoustic parameters (for place of articulation) and articulator-free acoustic characteristics (for manner of articulation)88,106. The vowel /a/ was chosen for its better noise resistance in the fMRI scanner. We categorized the three place of articulation features as more indicative of speech-motoric (articulatory) characteristics, while the three manner of articulation features were considered more representative of higher-order acoustic characteristics (Fig. 1g).

Syllables were recorded by a male and a female Chinese speaker in a soundproof chamber at a sampling rate of 22,050 Hz. Post-processing in Adobe Audition and custom MATLAB code included bandpass filtering (80–10,500 Hz), removal of acoustic transients (clicks) at onset/offset with 10-ms raised cosine ramps, length equalization to 300 ms, and matching of the average root-mean-square (RMS) sound pressure level. Time-reversed versions of these speech stimuli served as auditory control stimuli (Fig. 1g).
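
For illustration, these post-processing steps can be sketched in MATLAB as below. The filter design (a Butterworth bandpass), the truncation used for length equalization, and the variable names are assumptions; only the parameter values (80–10,500 Hz band, 10-ms ramps, 300-ms duration, RMS matching) follow the text.

```matlab
% Illustrative sketch of the syllable post-processing chain.
% x is a mono recording (column vector) at fs = 22050 Hz; targetRMS is the
% common RMS level. Filter type/order and truncation are assumptions.
fs = 22050;
[b, a] = butter(4, [80 10500] / (fs/2), 'bandpass');    % 80-10,500 Hz bandpass
x = filtfilt(b, a, x);                                   % zero-phase filtering

rampLen = round(0.010 * fs);                             % 10-ms raised-cosine ramps
ramp = 0.5 * (1 - cos(pi * (0:rampLen-1)' / (rampLen-1)));
x(1:rampLen)         = x(1:rampLen)         .* ramp;     % fade in
x(end-rampLen+1:end) = x(end-rampLen+1:end) .* flipud(ramp);   % fade out

x = x(1:round(0.300 * fs));                              % equalize length to 300 ms
x = x * (targetRMS / rms(x));                            % match average RMS level
```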

For RSA, item-based similarity matrices were computed in the motoric and auditory domains (Fig. 2a). Speech-motoric similarity was binary-coded (1 = similar, 0 = dissimilar) based on the place of articulation features in pairs of syllables. Auditory similarity was defined along two dimensions. The first was high-level acoustic feature similarity, binary-coded for the manner of articulation26. The second was low-level acoustic spectrogram similarity, which quantified the perceptual similarity of speech sounds. The spectrogram similarity between signals was assessed by computing the Pearson correlation coefficient between the absolute values of their spectrograms. Each signal’s spectrogram was computed using a Short-Time Fourier Transform (STFT) with a moving Hamming window (1024 points, 75% overlap), providing detailed temporal and frequency resolution for the analysis of short syllables. The resulting RDMs were obtained by subtracting the similarity matrices from 1 (Tables S4–S6). Pearson correlation analysis showed that the articulatory speech-motoric RDM was not significantly correlated with either the high-level auditory feature RDM (r = −0.182, p = 0.352) or the low-level acoustic spectrogram RDM (r = −0.172, p = 0.931), whereas a moderate correlation was observed between the two acoustic feature RDMs (r = 0.394, p = 0.038).
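
As an example of how these similarity values could be derived, the sketch below codes the binary speech-motoric dissimilarity and the spectrogram-based dissimilarity for a syllable pair. The numeric place-of-articulation codes and waveform variables are illustrative assumptions, while the STFT parameters follow the text.

```matlab
% Binary speech-motoric dissimilarity over the eight syllables, assuming
% place is an 8 x 1 numeric vector of place-of-articulation codes
% (e.g., 1 = bilabial, 2 = alveolar, 3 = velar).
motorRDM = double(place ~= place');        % 1 = different place, 0 = same place

% Low-level acoustic spectrogram dissimilarity for one syllable pair,
% assuming x1 and x2 are the 300-ms waveforms (column vectors) at fs = 22050 Hz.
win   = hamming(1024);                     % 1024-point Hamming window
nover = round(0.75 * 1024);                % 75% overlap
S1 = abs(spectrogram(x1, win, nover));
S2 = abs(spectrogram(x2, win, nover));
specDis = 1 - corr(S1(:), S2(:));          % 1 - Pearson r between spectrogram magnitudes
```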

Visual Character Perception Task (CH)

The CH task used twelve pairs of simple Chinese characters (twenty-four characters), chosen based on their varying levels of stroke-motoric similarity and visual-form similarity (Table S3). These characters were selected from a set of 211 simple characters with a single radical and no more than six strokes. Characters with multiple radicals were excluded to avoid confounding effects caused by high overlap in both the visual and stroke-motoric domains, which could obscure distinctions between the targeted features of interest.

Motoric representation of characters refers to the abstract, effector-independent movement sequences required to produce the character shapes56,107,108. To account for stroke count, direction, and order, we quantified stroke-motoric similarity using the Levenshtein distance. This metric measures the minimum number of single-stroke edits (insertions, deletions, or substitutions; see the eleven fundamental strokes of Chinese characters in Table S2) needed to transform one character’s stroke sequence into another’s, normalized by the length of the longer sequence52. For instance, the Levenshtein distance between the character “止” (丨一丨一) and “正” (一丨一丨一) is 1 (adding 一 to the stroke string of “止” yields “正”), which is divided by 5 (the longer sequence length), so the stroke-motoric similarity between these two characters was 1 − 1/5 = 0.8.

For visual form similarity, we quantified both high-level, font-invariant visual features and low-level, font-specific pixel features. High-level visual feature similarity was defined as the ratio of overlapping visual features to the total features of both characters. Based on the existing literature and the decomposition of Chinese character forms/strokes, twenty fundamental visual features were categorized in the current study. Nineteen features are shared among scripts in different languages, including four line types (horizontal, vertical, slanted right, and slanted left), four curve types (open to the top, bottom, left, and right), three intersection types (L, T or X), eight termination types (top, bottom, left, right, top-left, top-right, bottom-left, bottom-right), and the number of disconnected components109,110,111. In addition, dots were included as another type of visual feature, as dots are one of the basic components of Chinese characters. These features abstract away from specific font details, providing insight into structural and compositional visual similarities. Low-level visual similarity was assessed with a pixel-based measure capturing the spatial and geometric overlap between characters: the two character images were shifted across various alignments, and the similarity value was taken at the point of maximum pixel overlap. This measure aligns closely with early-stage visual processing, emphasizing the fine-grained distinctions critical for character recognition54,112,113, and is sensitive to the specific font used in the experiment (e.g., the KaiTi font in our study).
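
A minimal MATLAB sketch of this normalized stroke-motoric similarity is given below, assuming each character’s stroke order is coded as a numeric vector of stroke-type indices (an illustrative encoding of the eleven fundamental strokes in Table S2); it is not the original scoring script.

```matlab
function sim = strokeSimilarity(s1, s2)
% Normalized stroke-motoric similarity between two stroke sequences:
% 1 - (Levenshtein distance / length of the longer sequence).
    m = numel(s1); n = numel(s2);
    D = zeros(m + 1, n + 1);               % dynamic-programming edit-distance table
    D(:, 1) = (0:m)';
    D(1, :) = 0:n;
    for i = 1:m
        for j = 1:n
            cost = double(s1(i) ~= s2(j)); % substitution cost
            D(i+1, j+1) = min([D(i, j+1) + 1, ...   % deletion
                               D(i+1, j) + 1, ...   % insertion
                               D(i, j) + cost]);    % substitution
        end
    end
    sim = 1 - D(m+1, n+1) / max(m, n);
end

% Example from the text, coding 丨 as 1 and 一 as 2:
% strokeSimilarity([1 2 1 2], [2 1 2 1 2]) returns 1 - 1/5 = 0.8 ("止" vs. "正")
```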

Pairwise stroke-motoric and visual similarities were calculated among the full set of 211 simple characters. High similarity was defined as values exceeding the mean similarity across all character pairs by at least half a standard deviation, and low similarity as values below the mean minus half a standard deviation. For stroke-motoric, visual feature, and visual pixel similarities, the thresholds for high values were >0.38, >0.54, and >0.29, respectively, and the thresholds for low values were <0.21, <0.39, and <0.22, respectively. Three pairs of characters were then selected for each combination of high/low visual-similarity and high/low stroke-motoric-similarity conditions (twenty-four characters in total). To avoid semantic processing effects in the motor areas99, no characters semantically related to body movement were included. Examples include “止” and “正” (high similarity in both dimensions), “术” and “永” (high visual similarity but low stroke-motoric similarity), “午” and “白” (low visual similarity but high stroke-motoric similarity), and “皮” and “斗” (low similarity in both measures) (see Tables S3-S9 and Figs. 1h & 2a). The resulting RDMs were computed by subtracting the similarity matrices from 1 (Tables S7-S9). Pearson correlation analysis indicated a moderate correlation between the writing stroke-motoric RDM and both the high-level visual-feature RDM (r = 0.406, p < 0.001) and the low-level visual-pixel RDM (r = 0.472, p < 0.001), as well as between the two visual RDMs (r = 0.392, p < 0.001) (Fig. 1h).
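A minimal sketch of the high/low cut-off rule, assuming sim holds the pairwise similarity values of one measure (e.g., stroke-motoric similarity) across all character pairs; the placeholder values are illustrative:

% Minimal sketch: define high/low similarity thresholds for one measure.
sim = rand(1, 211*210/2);          % placeholder for the 211-character pairwise values

mu    = mean(sim);
sigma = std(sim);

isHigh = sim > mu + 0.5 * sigma;   % high-similarity pairs
isLow  = sim < mu - 0.5 * sigma;   % low-similarity pairs
% Candidate pairs for the 2 x 2 design are those jointly high/low on the
% stroke-motoric and visual measures.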

The character stimuli were presented as black line drawings (approximately 120 × 120 pixels) in the KaiTi font. Scrambled versions of these characters served as visual control stimuli, created by dividing each character image into several 30 × 30 pixel sections and randomizing their positions to produce unintelligible images that retained the overall pixel density and low-level visual features of the original stimuli114 (Fig. 1h).
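A minimal sketch of how such scrambled controls could be generated; the random placeholder image and exact block handling are illustrative assumptions:

% Minimal sketch: scramble a character image by shuffling 30 x 30 pixel blocks.
img = rand(120, 120);                        % placeholder for a 120 x 120 character image

blockSize = 30;
nBlocks   = size(img, 1) / blockSize;        % 4 blocks per dimension

% Cut the image into blocks, shuffle their positions, and reassemble.
blocks    = mat2cell(img, repmat(blockSize, 1, nBlocks), ...
                          repmat(blockSize, 1, nBlocks));
blocks    = reshape(blocks(randperm(numel(blocks))), nBlocks, nBlocks);
scrambled = cell2mat(blocks);                % preserves overall pixel density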

Data Acquisition and Preprocessing

Magnetic resonance imaging (MRI) data were collected using a 3 T Siemens Prisma scanner at Shenzhen University in China. Functional images were acquired using an interleaved multi-slice echo planar imaging (EPI) sequence (TR = 1000 ms, TE = 30 ms, flip angle = 35°, voxel size = 2 × 2 × 2 mm3, FOV = 1728 mm × 1728 mm, slice thickness = 2 mm, slice number = 78), providing whole-head coverage. The slices were acquired in the axial plane. Auditory stimuli were presented through an MRI-compatible pneumatic headset. Visual stimuli were projected onto a translucent screen via an LCD projector, and participants viewed them through a mirror mounted on the head coil. Anatomical images were obtained using a T1-weighted Magnetization Prepared Rapid Gradient Echo (MPRAGE) sequence (TR = 2300 ms, TE = 2.26 ms, flip angle = 8°, voxel size = 1 × 1 × 1 mm3, FOV = 232 mm × 256 mm, slice thickness = 1 mm).

Functional images were preprocessed and analyzed using SPM12 (http://www.fil.ion.ucl.ac.uk/spm) in MATLAB (version R2018a, MathWorks). Initial dummy scans acquired for signal equilibration were excluded from the analysis. The remaining images underwent slice-timing correction and were realigned for motion correction by registration to the mean image. The images were then coregistered with the T1-weighted 3D images and normalized to MNI space with cubic voxels at 2 × 2 × 2 mm3 spatial resolution. Functional images were spatially smoothed with an 8 mm full width at half maximum (FWHM) isotropic Gaussian kernel. Data from six participants were discarded from further analysis due to excessive head motion during fMRI, with exclusion criteria of >3 mm translation or >3° rotation. These criteria were relatively lenient, given the two production localizer tasks involved in the present study.

Statistics and reproducibility

All statistical analyses in this study were conducted using custom MATLAB scripts. Brain activation patterns associated with the production and perception tasks were examined by modeling each participant’s data at the first level and then comparing the individual results at the group level to identify consistent brain activity patterns. A paired t-test was used to determine statistical significance for contrasts between conditions. For the multivariate representational similarity analysis (RSA), Spearman’s correlation was applied to assess the relationship between neural activity within sensorimotor ROIs and the motoric/sensory features of the language stimuli. Statistical significance was assessed through one-tailed permutation testing, comparing the observed neural-feature correlations with random permutations of the data.

Univariate analysis

To identify brain activation patterns during the language production and perception tasks, a general linear model (GLM) approach was employed, with experimental regressors modeled as boxcar functions convolved with the canonical hemodynamic response function (HRF). A high-pass filter with a cut-off of 128 s was applied to remove low-frequency drifts. Six head motion parameters (translation and rotation) were entered as nuisance regressors to account for variance caused by head movement. All brain activation maps were visualized with BrainNet Viewer115 (http://www.nitrc.org/projects/bnv/). Effect sizes for the activated brain regions were computed using Cohen’s d and Hedges’ g. Cohen’s d was calculated by dividing the mean effect at each cluster’s peak voxel in the group analysis by the standard deviation across participants. Hedges’ g, which corrects for the small sample sizes typical of fMRI group analyses116, was computed using SPM12 and the Measures of Effect Size (MES) toolbox117 (https://github.com/hhentschke/measures-of-effect-size-toolbox).
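For a one-sample (paired) group contrast, these effect sizes can be approximated as follows; this is a minimal sketch in which con stands for the participant-wise contrast estimates at a peak voxel (placeholder values below), and the small-sample correction uses the standard approximate J factor rather than the exact MES toolbox routine:

% Minimal sketch: Cohen's d and Hedges' g at a peak voxel of a group-level
% one-sample/paired contrast.
con = randn(30, 1) + 0.5;            % placeholder participant-wise contrast estimates
n   = numel(con);

d = mean(con) / std(con);            % Cohen's d for a one-sample contrast

% Approximate small-sample correction factor; the paper uses the MES toolbox
% for the exact value.
J = 1 - 3 / (4 * (n - 1) - 1);
g = J * d;                           % Hedges' g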

For the localizer tasks, block-based GLMs were conducted to define the critical speech- and writing-motor areas, independently of the main perception tasks. At the first level, the GLMs included the following regressors: (1) syllable articulation, lip movement (control), and rest/fixation periods for the AR task; and (2) character writing, drawing (control), and rest/fixation periods for the WR task. Speech-motor areas were identified by contrasting syllable articulation with lip movement, and writing-motor areas by contrasting character writing with drawing. Participant-specific contrast images were then used to create group-level activation maps, using a voxel-level threshold of p < 0.05 (FDR corrected for multiple comparisons) with a cluster extent threshold of k ≥ 10 contiguous voxels.

For the perception tasks, block-based GLMs were employed to explore the key regions involved in auditory and visual language perception. At the first level, the GLMs included three regressors per task: (1) spoken syllable perception, auditory control (reversed speech) perception, and rest/fixation for the SP task; and (2) visual character perception, visual control (scrambled character) perception, and rest/fixation for the CH task. For auditory speech perception, activation maps were obtained by contrasting spoken syllables with the auditory control and with rest (fixation). For visual character perception, activation maps were derived by contrasting visual characters with the visual control and with rest (fixation). Group-level activation maps were generated from participant-specific contrasts using second-level GLM analysis. For the “speech/character > control” contrasts, we used a voxel-level threshold of p < 0.01 uncorrected (cluster extent ≥10 voxels) to capture the subtle neural activations in language-motor regions, which are known to exhibit weaker responses during language perception tasks11,12,27,38. For the “speech/character > rest (fixation)” contrasts, we applied a stricter voxel-level threshold of p < 0.05 FDR corrected (cluster extent ≥10 voxels). These univariate results guided region of interest (ROI) selection for the subsequent RSA, which focused on multivoxel representational patterns rather than activation strength and was independently validated with permutation-based significance testing, ensuring that the RSA results did not depend on the univariate statistical thresholds.

Representational similarity analysis (RSA)

The RSA was performed to investigate whether neural patterns in the language-motor cortex during speech/character perception could be predicted by the motoric or sensory features of the language stimuli. Functional images were smoothed with a 4 mm FWHM kernel to enhance sensitivity in the multivariate analyses78,118. Separate GLMs were constructed for the two perception tasks. The preprocessed data were analyzed using an event-related design with delta-function regressors. For the SP task, the regressors included eight CV syllables and one auditory control condition; for the CH task, twenty-four Chinese characters and one visual control condition. High-pass filtering at 128 s and motion covariates were applied in all analyses. T-contrasts were generated for each stimulus type relative to the control condition, and the resulting t-maps were used for RSA to enhance the detection of task-dependent effects55.

RSA was implemented using the CoSMoMVPA toolbox119 (https://www.cosmomvpa.org/) and custom MATLAB code. ROI-based analyses focused on speech-motor regions for the SP task and writing-motor regions for the CH task. Additionally, representational activity in the language-sensory areas of each modality was examined to compare motoric and sensory feature representations across language-motor and sensory regions. To address individual variability in the locations of motor and sensory regions, we adopted a dual-step ROI-voxel selection procedure for RSA (following Patel et al.78). First, anatomical ROIs were defined with guidance from the group-level univariate results and created from the AAL templates on each participant’s brain using the WFU PickAtlas Toolbox120,121, covering the key sensorimotor areas engaged by the auditory and visual language production and perception tasks. Specifically, motor ROIs were defined from the AR and WR tasks (articulation/writing vs. control), while auditory and visual ROIs were defined from the group-level univariate contrasts of the SP and CH tasks (speech/character vs. control). Next, for each participant, the 100 most active voxels within each anatomical region were selected based on their t-values in the “speech/character vs. control” contrast of the perception tasks. This approach ensured that RSA focused on the voxels most likely to carry information about the motoric or sensory features of the language stimuli, while accounting for individual differences in activation strength and mitigating voxel-count biases across ROIs. Neural RDMs were constructed as 1 minus the Pearson correlation between activation patterns for all stimulus pairs. For the SP task, the predictor RDMs were the speech-motoric RDM, the high-level acoustic feature RDM, and the low-level acoustic spectrogram RDM (Fig. 2a); for the CH task, they were the stroke-motoric RDM, the high-level visual feature RDM, and the low-level visual pixel RDM (Fig. 2b) (see also Tables S4-S9 for the RDM values).
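A minimal sketch of the voxel-selection and neural-RDM step for one participant and one ROI; the variable names, placeholder data, and dimensions are illustrative assumptions:

% Minimal sketch: select the top 100 voxels of an anatomical ROI and build
% the neural RDM from condition-wise t-patterns (one participant).
% tmaps : nConditions x nVoxels t-values (e.g., 8 syllables vs. control)
% tLoc  : 1 x nVoxels t-values of the "speech/character vs. control" contrast
tmaps = randn(8, 500);                       % placeholder data
tLoc  = randn(1, 500);                       % placeholder selection statistic

[~, idx] = sort(tLoc, 'descend');
sel      = idx(1:min(100, numel(idx)));      % top 100 most active voxels

patterns  = tmaps(:, sel);                   % condition x voxel patterns
neuralRDM = 1 - corrcoef(patterns');         % 1 - Pearson r between condition patterns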

To test the correspondence between the neural and feature RDMs, we conducted a two-stage permutation test to determine the significance of the neural-motoric and neural-sensory correlations in each ROI78,79. The procedure involved the following steps. (1) At the individual level, the observed correlation (r₀) between the neural RDM (top 100 voxels per ROI) and a given motoric/sensory RDM was calculated using Spearman correlation and transformed to a Fisher z-score. To construct a null distribution, the motoric/sensory RDM values were shuffled 100 times per ROI, and the correlation with the neural RDM was recalculated for each shuffle, yielding 100 randomized correlations (r₁) per ROI for each participant. (2) At the group level, the observed correlations were averaged across participants to obtain a group-mean correlation (mean r₀). A null distribution of group-level correlations was created using a Monte Carlo approach: one r₁ value was randomly sampled from each participant’s set of 100 shuffled correlations, and the mean was calculated across participants (mean r₁). This process was repeated 10,000 times to generate a distribution of group-mean null correlations. (3) The significance (p-value) was determined by comparing the observed group-mean correlation (mean r₀) with the null distribution (mean r₁) and calculating its percentile rank. For example, if the observed correlation ranked among the top 100 values out of 10,000 permutations, the p-value would be approximately 100/(10,000 + 1) ≈ 0.01. This approach provided a robust evaluation of whether the observed neural-motoric or neural-sensory correlations exceeded those expected by chance. P-values were corrected for multiple comparisons using an FDR procedure.
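The group-level stage of this procedure can be sketched as follows; this is a minimal MATLAB sketch in which the per-participant observed and shuffled Fisher-z correlations are placeholder values:

% Minimal sketch: group-level Monte Carlo permutation test for one ROI and
% one feature RDM. Per participant: z0(s) is the Fisher-z of the observed
% Spearman correlation; z1(s, :) holds 100 Fisher-z correlations obtained
% after shuffling the feature RDM.
nSubj = 30; nShuf = 100; nPerm = 10000;
z0 = randn(nSubj, 1) * 0.05 + 0.03;          % placeholder observed values
z1 = randn(nSubj, nShuf) * 0.05;             % placeholder shuffled (null) values

groupObs = mean(z0);                         % observed group-mean correlation

groupNull = zeros(nPerm, 1);
for p = 1:nPerm
    % Sample one shuffled value per participant and average across the group.
    pick = randi(nShuf, nSubj, 1);
    groupNull(p) = mean(z1(sub2ind(size(z1), (1:nSubj)', pick)));
end

% One-tailed p-value from the rank of the observed mean within the null
% distribution; p-values across ROIs/RDMs would then be FDR-corrected.
pval = (sum(groupNull >= groupObs) + 1) / (nPerm + 1);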

Sample size justification

The formal analysis included 30 participants (another six were excluded due to excessive head movement). This sample size aligns with standard practice in fMRI research, which typically involves 20 to 40 participants, and is generally sufficient to detect medium effect sizes in task-based analyses122. To further justify the sample size, we computed effect sizes for each peak significant voxel in the group-level GLMs using Cohen’s d and Hedges’ g116. Additionally, a post-hoc power analysis was performed using G*Power software123 (t-tests - Means: Difference between two dependent means, matched pairs). The observed effect sizes for all tasks were moderate to large (Cohen’s d > 0.48, Hedges’ g > 0.47), with power exceeding 70%. Detailed results are provided in Tables S10-S13. These results indicate that our findings were reliable at this sample size.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.