Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2306.06954 (eess)

[Submitted on 12 Jun 2023]

Title: Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition

Authors: Belen Alastruey, Lukas Drude, Jahn Heymann, Simon Wiesler

Abstract: Convolutional frontends are a typical choice for Transformer-based automatic speech recognition: they preprocess the spectrogram, reduce its sequence length, and combine local information in time and frequency in the same way. However, the width and height of an audio spectrogram denote different information. For example, due to reverberation as well as the articulatory system, the time axis has a clear left-to-right dependency. In contrast, vowels and consonants show very different patterns and occupy almost disjoint frequency ranges. We therefore hypothesize that global attention over frequencies is more beneficial than local convolution. Replacing the convolutional neural network frontend of a production-scale Conformer transducer with the proposed F-Attention module yields a 2.4% relative word error rate reduction (rWERR) on Alexa traffic. To demonstrate generalizability, we validate this on public LibriSpeech data with a long short-term memory-based Listen, Attend and Spell architecture, obtaining 4.6% rWERR, and demonstrate robustness to (simulated) noisy conditions.
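
The abstract reports its gains as relative word error rate reduction (rWERR). As a minimal sketch of how that metric is conventionally computed, assuming the standard definition (baseline WER minus candidate WER, divided by baseline WER): the function name and the WER values below are hypothetical and are not taken from the paper, which reports only the relative figures.

```python
def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Return the relative WER reduction of a candidate over a baseline, in percent.

    Both inputs are absolute word error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - candidate_wer) / baseline_wer


# Hypothetical absolute WERs, for illustration only (not values from the paper).
baseline = 10.0   # e.g. a model with a convolutional frontend
candidate = 9.5   # e.g. the same model with a different frontend

print(f"{relative_wer_reduction(baseline, candidate):.1f}% rWERR")
```

Because the metric is relative, the same absolute improvement gives a larger rWERR on a stronger baseline, so relative figures from different test sets are not directly comparable.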

Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2306.06954 [eess.AS]
	(or arXiv:2306.06954v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2306.06954

Submission history

From: Lukas Drude
[v1] Mon, 12 Jun 2023 08:37:36 UTC (1,706 KB)
