Chen et al., 2025 - Google Patents
3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and DiarizationChen et al., 2025
View PDF- Document ID
- 5983358091608000997
- Author
- Chen Y
- Zheng S
- Wang H
- Cheng L
- Zhu T
- Huang R
- Deng C
- Chen Q
- Zhang S
- Wang W
- Li X
- Publication year
- Publication venue
- ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
External Links
Snippet
We introduce 3D-Speaker-Toolkit, an open-source toolkit for multimodal speaker verification and diarization, designed for meeting the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic …
- 238000012795 verification 0 title abstract description 14
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3074—Audio data retrieval
- G06F17/30743—Audio data retrieval using features automatically derived from the audio content, e.g. descriptors, fingerprints, signatures, MEP-cepstral coefficients, musical score, tempo
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition | |
Chung et al. | Spot the conversation: speaker diarisation in the wild | |
Hazarika et al. | Self-attentive feature-level fusion for multimodal emotion detection | |
Xu et al. | Ava-avd: Audio-visual speaker diarization in the wild | |
Friedland et al. | The ICSI RT-09 speaker diarization system | |
Kim et al. | St-bert: Cross-modal language model pre-training for end-to-end spoken language understanding | |
JP2017016131A (en) | Speech recognition apparatus and method, and electronic device | |
Chen et al. | 3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization | |
Ram et al. | Sparse subspace modeling for query by example spoken term detection | |
Santoso et al. | Speech emotion recognition based on self-attention weight correction for acoustic and text features | |
Bluche et al. | Predicting detection filters for small footprint open-vocabulary keyword spotting | |
US20230169988A1 (en) | Method and apparatus for performing speaker diarization based on language identification | |
Du et al. | Expressive voice conversion: A joint framework for speaker identity and emotional style transfer | |
Prajwal et al. | Visual keyword spotting with attention | |
Mangalam et al. | Learning spontaneity to improve emotion recognition in speech | |
Soboleva et al. | Replacing human audio with synthetic audio for on-device unspoken punctuation prediction | |
Nwe et al. | Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data | |
Wang et al. | A DNN-HMM-DNN hybrid model for discovering word-like units from spoken captions and image regions | |
Cheng et al. | Multi-input multi-output target-speaker voice activity detection for unified, flexible, and robust audio-visual speaker diarization | |
Jiang et al. | Target speech diarization with multimodal prompts | |
WO2020110815A1 (en) | Keyword extraction device, keyword extraction method, and program | |
Kebe et al. | Bridging the gap: Using deep acoustic representations to learn grounded language from percepts and raw speech | |
Flemotomos et al. | Multimodal clustering with role induced constraints for speaker diarization | |
US20240221721A1 (en) | Systems and methods for audio transcription switching based on real-time identification of languages in an audio stream | |
CN112395414A (en) | Text classification method and training method, device, medium and equipment of classification model |