CN106713818A - Speech processing system and method during video call

Info

Publication number: CN106713818A
Application number: CN201710093114.9A
Authority: CN
Inventors: 陈天武
Original assignee: Fujian Jiangxia University
Current assignee: Fujian Jiangxia University
Priority date: 2017-02-21
Filing date: 2017-02-21
Publication date: 2017-05-24

Abstract

The invention provides a system and method for voice processing in a video call. Video call terminals are interconnected through a basic communication network. The video call involves an online server providing external enhanced call functions, which includes an online speech-to-text module and an online call atmosphere module; the online speech-to-text module includes a speech recognition unit. The user makes a call through a video call terminal. The terminal's local speech-to-text module, or the speech recognition unit of the online speech-to-text module, performs speech recognition on the other party's audio data, converts it into text, and stores the text in the text-to-subtitle storage module; the recognized text is superimposed on the terminal's video picture for display. The terminal's local call atmosphere module, or the online call atmosphere module of the external enhanced call function server, is then invoked: based on the recognized text, the overall atmosphere of the call is rendered into image and text effects, which are composited with the video image and rendered on the terminal.

Description

Speech processing system and method in video call

Technical field

The invention relates to a system and method for voice processing in a video call.

Background

With the advancement of technology, remote communication between people has developed from letters and telegrams to voice calls and then to video calls. A video call must transmit video data and audio data at the same time. Although both are compressed, the data volume is still one to two orders of magnitude larger than that of voice-only communication. Video calls therefore place considerably higher demands on the underlying network and on the terminal's hardware configuration.

A video call transmits audio and video simultaneously, but technological progress allows video calls to carry more content, which improves the user experience and increases user stickiness.

Summary of the invention

The purpose of the invention is to provide a method and system for voice processing in a video call that add features to the video call, make it more engaging, and increase user stickiness of the video call function.

The invention is implemented by the following technical solution:

A system for voice processing in a video call, characterized in that it comprises a hardware driver and operating system module, a video call middleware module, a local speech-to-text module, a local call atmosphere module, a text-to-subtitle storage module, a text effect user setting module, a call atmosphere user setting module, and an external enhanced call function online server. The external enhanced call function online server includes an online speech-to-text module and an online call atmosphere module; the online speech-to-text module includes a speech recognition unit. The video call middleware module receives the audio and video data of the other party's video call and demultiplexes it to obtain video data and audio data. The local speech-to-text module or the online speech-to-text module passes the audio data to the voice-to-file interface to obtain the user's text content. The local call atmosphere module or the online call atmosphere module renders the overall atmosphere of the call into an image, which is composited with the video image and rendered on the terminal.

The invention also provides a method for voice processing in a video call, characterized in that it comprises the following steps. S1: video call terminals are interconnected through a basic communication network; an external enhanced call function online server is provided, which includes an online speech recognition server and an online call atmosphere server. S2: the user makes a call through a video call terminal; the video call middleware module receives the audio and video data of the other party's video call and demultiplexes it to obtain video data and audio data; the terminal's local speech-to-text module, or the speech recognition unit of the online speech-to-text module, performs speech recognition on the other party's audio data, converts it into text stored in the text-to-subtitle storage module, and the recognized text is superimposed on the terminal's video picture for display. S3: the terminal's local call atmosphere module, or the online call atmosphere module of the external enhanced call function server, is invoked; based on the text recognized in S2, the overall atmosphere of the call is rendered into image and text effects, which are composited with the video image and rendered on the terminal.
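Step S2 can be illustrated with a minimal Python sketch. The patent does not specify any data formats or interfaces, so the `Packet` type, the `SubtitleStore` class, and the `fake_recognizer` function below are hypothetical stand-ins introduced only for illustration; a real terminal would use an actual container demultiplexer and speech recognizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Packet:
    """Hypothetical unit of the received stream: kind is 'audio' or 'video'."""
    kind: str
    payload: bytes


@dataclass
class SubtitleStore:
    """Stand-in for the text-to-subtitle storage module."""
    lines: List[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.lines.append(text)


def demultiplex(packets: List[Packet]) -> Tuple[List[bytes], List[bytes]]:
    """Split the mixed stream into (video payloads, audio payloads)."""
    video = [p.payload for p in packets if p.kind == "video"]
    audio = [p.payload for p in packets if p.kind == "audio"]
    return video, audio


def process_call(packets: List[Packet],
                 recognize: Callable[[bytes], str],
                 store: SubtitleStore) -> List[Tuple[bytes, str]]:
    """Demultiplex, recognize the other party's audio, store the text,
    and pair every video frame with the caption to superimpose on it."""
    video, audio = demultiplex(packets)
    for chunk in audio:
        text = recognize(chunk)
        if text:
            store.add(text)
    caption = store.lines[-1] if store.lines else ""
    return [(frame, caption) for frame in video]


def fake_recognizer(chunk: bytes) -> str:
    # Placeholder only: treats the audio bytes as already being UTF-8 text.
    return chunk.decode("utf-8")
```

The recognizer is passed in as a parameter so that either a local or an online speech-to-text module could be supplied without changing the rest of the flow.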

Further, the user chooses, as needed, whether to invoke the local or the online call atmosphere module.

Further, multiple templates for superimposing text on the video picture are stored in advance, and the user selects among them.
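One way to picture pre-stored overlay templates is a small lookup table keyed by template name. The patent does not describe what a template contains, so the fields (`position`, `font_size`, `color`), the template names, and the fallback-to-default behavior in this sketch are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OverlayTemplate:
    """Hypothetical description of how text is laid over the video picture."""
    name: str
    position: str   # where the text sits on the frame
    font_size: int
    color: str


# Pre-stored templates; names and values are illustrative only.
TEMPLATES = {
    "bottom_subtitle": OverlayTemplate("bottom_subtitle", "bottom", 24, "white"),
    "top_banner": OverlayTemplate("top_banner", "top", 20, "yellow"),
    "speech_bubble": OverlayTemplate("speech_bubble", "center", 18, "black"),
}
DEFAULT_TEMPLATE = "bottom_subtitle"


def select_template(user_choice: Optional[str] = None) -> OverlayTemplate:
    """Return the user's chosen template, or the default if unset or unknown."""
    return TEMPLATES.get(user_choice, TEMPLATES[DEFAULT_TEMPLATE])
```

For example, `select_template("top_banner")` returns the banner template, while an unknown or missing choice falls back to the bottom subtitle.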

Further, the data communication process between video call terminals includes a user authentication process.

Further, the method also includes S4: when the online call atmosphere module of the external enhanced call function server is invoked in S3, the terminal transmits the audio and video data to the external enhanced call function online server; after processing by the online server, text data and atmosphere data are obtained and transmitted to the other party together with the terminal's audio and video data.
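The bundle described in S4 (text data and atmosphere data sent along with the audio and video data) could be represented as a single message. The patent does not define a wire format; the `EnhancedPayload` type and the JSON-plus-Base64 encoding below are assumptions chosen only to make the idea concrete.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class EnhancedPayload:
    """Hypothetical message combining A/V data with the server's results."""
    av_data: bytes      # the terminal's audio and video data
    text: str           # text data produced by speech recognition
    atmosphere: str     # atmosphere data produced by the atmosphere module


def encode_payload(payload: EnhancedPayload) -> str:
    """Serialize the payload to JSON; binary A/V data is Base64-encoded."""
    return json.dumps({
        "av_data": base64.b64encode(payload.av_data).decode("ascii"),
        "text": payload.text,
        "atmosphere": payload.atmosphere,
    })


def decode_payload(message: str) -> EnhancedPayload:
    """Inverse of encode_payload, as run by the receiving terminal."""
    obj = json.loads(message)
    return EnhancedPayload(
        av_data=base64.b64decode(obj["av_data"]),
        text=obj["text"],
        atmosphere=obj["atmosphere"],
    )
```

A production system would more likely carry the text and atmosphere data as side channels of the media stream rather than re-encoding the media itself; the sketch only shows that the three pieces travel together.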

Compared with the prior art, the invention has the following advantages: it extends the functionality of video calls (speech-to-text), increasing user stickiness of the function; and it enhances call atmosphere rendering (additional effects on the displayed text), likewise increasing user stickiness.

Description of the drawings

Figure 1 is an overall structural diagram of the voice processing system in a video call.

Figure 2 is a block diagram of the core modules of the voice processing system in a video call.

Figure 3 is an operation sequence diagram of voice processing in a video call.

Detailed description

The invention is further explained below with reference to the drawings and specific embodiments.

Figure 1 shows the overall structure of the voice processing system in a video call. Video call terminals are interconnected through a basic communication network (the Internet, etc.). The video call involves online servers that provide external enhanced call functions, such as an online speech recognition server and an online call atmosphere server. The division of server functions is logical rather than physical; that is, the online voice and video server and the online call atmosphere server may reside on the same server host.

The video call terminal is connected to the online voice and video server and the online call atmosphere server through the basic communication network, and data communication between them is bidirectional. The data communication process may include any necessary user authentication.

Figure 2 shows the core modules of the voice processing system in a video call. The system comprises a hardware driver and operating system module, a video call middleware module, a local speech-to-text module, an online speech-to-text module, a text-to-subtitle storage module, a text effect user setting module, a local call atmosphere module, an online call atmosphere module, and a call atmosphere user setting module. The video call middleware module receives the audio and video data of the other party's video call and demultiplexes it to obtain video data and audio data. The local speech-to-text module or the online speech-to-text module passes the audio data to the voice-to-file interface to obtain the user's text content. The local call atmosphere module or the online call atmosphere module renders the overall atmosphere of the call into an image, which is composited with the video image and rendered on the terminal.

The invention also provides a method for voice processing in a video call, comprising the following steps. S1: video call terminals are interconnected through a basic communication network; an external enhanced call function online server is provided, which includes an online speech recognition server and an online call atmosphere server. S2: the user makes a call through a video call terminal; the video call middleware module receives the audio and video data of the other party's video call and demultiplexes it to obtain video data and audio data; the terminal's local speech-to-text module, or the speech recognition unit of the online speech-to-text module, performs speech recognition on the other party's audio data, converts it into text stored in the text-to-subtitle storage module, and the recognized text is superimposed on the terminal's video picture for display. S3: the terminal's local call atmosphere module, or the online call atmosphere module of the external enhanced call function server, is invoked; based on the text recognized in S2, the overall atmosphere of the call is rendered into image and text effects, which are composited with the video image and rendered on the terminal.
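The patent does not say how the atmosphere is derived from the recognized text in S3. As one possible reading, the sketch below uses simple keyword counting; the mood names, keyword sets, and effect names are all invented for illustration and are not taken from the patent.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical keyword sets per mood; a real module might use a trained model.
ATMOSPHERE_KEYWORDS = {
    "festive": {"birthday", "congratulations", "party"},
    "warm": {"love", "miss", "thanks"},
    "tense": {"angry", "late", "problem"},
}

# Hypothetical visual effect to composite with the video for each mood.
EFFECTS = {
    "festive": "confetti",
    "warm": "soft_glow",
    "tense": "none",
    "neutral": "none",
}


def classify_atmosphere(lines: Iterable[str]) -> str:
    """Count keyword hits across all recognized lines and pick the top mood."""
    counts = Counter()
    for line in lines:
        words = {w.strip(".,!?").lower() for w in line.split()}
        for mood, keywords in ATMOSPHERE_KEYWORDS.items():
            counts[mood] += len(words & keywords)
    if not counts or max(counts.values()) == 0:
        return "neutral"
    # Ties are resolved by the order of moods in ATMOSPHERE_KEYWORDS.
    return max(ATMOSPHERE_KEYWORDS, key=lambda mood: counts[mood])


def atmosphere_effect(lines: Iterable[str]) -> Tuple[str, str]:
    """Return (mood, effect name) for the recognized text of the call."""
    mood = classify_atmosphere(lines)
    return mood, EFFECTS[mood]
```

Because the classification looks at all stored lines rather than only the latest one, it reflects the overall atmosphere of the call, which matches the wording of S3.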

Figure 3 shows the operation sequence of voice processing in a video call. After the audio and video data of the other party's video call is received, the speech-to-text and call atmosphere functions can be performed either on the server or on the video call terminal. The specific interaction process is shown in the operation sequence diagram.
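Since processing may happen either on the server or on the terminal, a terminal needs some rule for picking a backend. The sketch below assumes a user preference flag plus a fallback to whichever backend is available; the fallback behavior and the function name `choose_backend` are assumptions, not something the patent specifies.

```python
from typing import Callable, Optional

Recognizer = Callable[[bytes], str]


def choose_backend(prefer_online: bool,
                   local: Optional[Recognizer],
                   online: Optional[Recognizer]) -> Recognizer:
    """Pick the preferred speech-to-text backend, falling back to the other.

    A backend passed as None is treated as unavailable (for example, no
    network connection, or no local recognizer installed on the terminal).
    """
    order = [online, local] if prefer_online else [local, online]
    for backend in order:
        if backend is not None:
            return backend
    raise RuntimeError("no speech-to-text backend available")
```

The same selection rule could be applied to the call atmosphere module, which the source likewise allows to run locally or online.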

The above are preferred embodiments of the invention. Any change made according to the technical solution of the invention falls within the scope of protection of the invention, provided the resulting function and effect do not exceed the scope of the technical solution.

Claims

1. A system for voice processing in a video call, characterized in that it comprises a hardware driver and operating system module, a video call middleware module, a local speech-to-text module, a local call atmosphere module, a text-to-subtitle storage module, a text effect user setting module, a call atmosphere user setting module, and an external enhanced call function online server; the external enhanced call function online server includes an online speech-to-text module and an online call atmosphere module; the online speech-to-text module includes a speech recognition unit; the video call middleware module is used to receive the audio and video data of the other party's video call and to demultiplex the audio and video data to obtain video data and audio data; the local speech-to-text module or the online speech-to-text module passes the audio data to the voice-to-file interface to obtain the user's text content; the local call atmosphere module or the online call atmosphere module renders the overall atmosphere of the call into an image, which is composited with the video image and rendered on the terminal.

2. A method for voice processing in a video call, characterized in that it comprises the following steps:

S1: Video call terminals are interconnected through a basic communication network; an external enhanced call function online server is provided; the external enhanced call function online server includes an online speech recognition server and an online call atmosphere server;

S2: The user makes a call through a video call terminal; the video call middleware module receives the audio and video data of the other party's video call and demultiplexes it to obtain video data and audio data; the terminal's local speech-to-text module, or the speech recognition unit of the online speech-to-text module, performs speech recognition on the other party's audio data, converts it into text stored in the text-to-subtitle storage module, and the recognized text is superimposed on the terminal's video picture for display;

S3: The terminal's local call atmosphere module, or the online call atmosphere module of the external enhanced call function server, is invoked; based on the text recognized in S2, the overall atmosphere of the call is rendered into image and text effects, which are composited with the video image and rendered on the terminal.

3. The method for voice processing in a video call according to claim 2, characterized in that the user chooses, as needed, whether to invoke the local call atmosphere module or the online call atmosphere module of the external enhanced call function server.

4. The method for voice processing in a video call according to claim 2, characterized in that multiple templates for superimposing text on the video picture are stored in advance, and the user selects among them.

5. The method for voice processing in a video call according to claim 2, characterized in that the data communication process between video call terminals includes a user authentication process, and the data communication process between a video call terminal and the external enhanced call function online server includes a user authentication process.

6. The method for voice processing in a video call according to claim 2, characterized in that it further includes S4: when the online call atmosphere module of the external enhanced call function server is invoked in S3, the terminal transmits the audio and video data to the external enhanced call function online server; after processing by the online server, text data and atmosphere data are obtained and transmitted to the other party together with the terminal's audio and video data.