CN107205131A

CN107205131A - Method, device and system for implementing video calls

Info

Publication number: CN107205131A
Application number: CN201610161286.0A
Authority: CN
Inventors: 程岑
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2017-09-26
Also published as: WO2017157168A1

Abstract

A method, device and system for implementing video calls, including: a first terminal separately captures a digital audio signal and a digital video signal; the first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet; the first terminal sends the text packet, the audio packet and the video packet separately to a second terminal.

Description

A method, device and system for implementing video calls

Technical Field

This document relates to, but is not limited to, the field of video calls, and in particular to a method, device and system for implementing video calls.

Background

With the rapid development of mobile and Internet broadband technology, value-added visual communication services have spread rapidly among home users. Technology built on these services enables face-to-face communication, online video teaching and similar value-added services. Adding synchronized subtitles to the audio of a visual communication service not only serves hearing-impaired users better, but also usefully supplements the actual audio when the network is poor.

In the related art, a method for adding voice subtitles to a video call roughly includes:

The first terminal separately captures a digital audio signal and a digital video signal; performs speech encoding on the captured digital audio signal and encapsulates the speech-encoded digital audio signal into audio packets; converts the captured digital audio signal into text information through speech recognition, superimposes the text information on the captured digital video signal, performs video encoding on the composite, and encapsulates the video-encoded digital video signal into video packets; and sends the audio packets and the video packets separately to the second terminal;

The second terminal receives the audio packets and the video packets, speech-decodes the speech-encoded digital audio signal in the audio packets to obtain the digital audio signal and plays it, and video-decodes the video-encoded digital video signal in the video packets to obtain the digital video signal and displays it.

In the above method, when network conditions are poor, video packets are more likely to suffer loss and jitter because they are relatively large. The text information is then lost together with the video packets, so information is lost during the video call.

Summary of the Invention

Embodiments of the present invention provide a method, device and system for implementing video calls that can reduce information loss during a video call when network conditions are poor.

An embodiment of the present invention provides a method for implementing a video call, including:

The first terminal separately captures a digital audio signal and a digital video signal;

The first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet;

The first terminal sends the text packet, the audio packet and the video packet separately to the second terminal.

Optionally, before encapsulating the digital audio signal into an audio packet, the method further includes: the first terminal performs speech encoding on the digital audio signal;

Encapsulating the digital audio signal into an audio packet includes: the first terminal encapsulates the speech-encoded digital audio signal into the audio packet.

Optionally, before encapsulating the digital video signal into a video packet, the method further includes: the first terminal performs video encoding on the digital video signal;

Encapsulating the digital video signal into a video packet includes: the first terminal encapsulates the video-encoded digital video signal into the video packet.

An embodiment of the present invention further provides a method for implementing a video call, including:

The second terminal receives a text packet from the first terminal;

The second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, and displays the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

Optionally, when the second terminal determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further includes:

The second terminal buffers the received text packet.

Optionally, when the second terminal receives no audio packet and no video packet within a preset time after receiving the text packet, the method further includes:

The second terminal displays the text information in the buffered text packets.

Optionally, after the second terminal receives the text packet from the first terminal and before the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further includes:

The second terminal determines that the subtitle display function is turned on.

An embodiment of the present invention further provides a first terminal, including:

A capture module, configured to separately capture a digital audio signal and a digital video signal;

A first processing module, configured to convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet;

A sending module, configured to send the text packet, the audio packet and the video packet separately to the second terminal.

Optionally, the first processing module is specifically configured to:

Convert the digital audio signal into text information, perform speech encoding on the digital audio signal, encapsulate the text information into a text packet, encapsulate the speech-encoded digital audio signal into the audio packet, perform video encoding on the digital video signal, and encapsulate the video-encoded digital video signal into the video packet.

An embodiment of the present invention further provides a second terminal, including:

A receiving module, configured to receive a text packet from the first terminal;

A second processing module, configured to determine that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, and to display the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

Optionally, the second processing module is further configured to:

Buffer the received text packet when it determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

Display the text information in the buffered text packets when no audio packet and no video packet are received within a preset time after the text packet is received.

An embodiment of the present invention further provides a system for implementing video calls, including:

A first terminal, configured to separately capture a digital audio signal and a digital video signal; convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet; and send the text packet, the audio packet and the video packet separately to the second terminal;

A second terminal, configured to receive the text packet from the first terminal; determine that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed; and display the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

Optionally, the second terminal is further configured to:

Compared with the related art, the technical solution of the embodiments of the present invention includes: the first terminal separately captures a digital audio signal and a digital video signal; the first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet; the first terminal sends the text packet, the audio packet and the video packet separately to the second terminal. Because the text packet, the audio packet and the video packet are sent separately, losing a video packet under poor network conditions does not cause the text to be lost, which reduces information loss during the video call.

Brief Description of the Drawings

The drawings of the embodiments of the present invention are described below. They are provided for further understanding of the present invention and, together with the description, serve to explain the present invention; they do not limit its scope of protection.

FIG. 1 is a flowchart of a method for implementing a video call at the sending end according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for implementing a video call at the receiving end according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a first terminal according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a second terminal according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a system for implementing video calls according to an embodiment of the present invention.

Detailed Description

To help those skilled in the art understand the present invention, it is further described below with reference to the drawings; this description is not intended to limit the scope of protection of the present invention. It should be noted that, where no conflict arises, the embodiments in this application and the various manners within the embodiments may be combined with one another.

Referring to FIG. 1, an embodiment of the present invention provides a method for implementing a video call, including:

Step 100: the first terminal separately captures a digital audio signal and a digital video signal.

In this step, the first terminal may capture the digital audio signal at the capture interval specified in G.711 (an audio coding standard defined by the International Telecommunication Union) and capture the digital video signal at a preset video frame rate. For example, the digital audio signal is captured once every 10 milliseconds (ms) and the digital video signal once every 40 ms.
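The example intervals above can be sketched as follows. This is only an illustration of the two capture cadences; the function name `capture_times` and the 80 ms window are assumptions introduced here, not part of the described method.

```python
AUDIO_INTERVAL_MS = 10  # example audio capture period from the text
VIDEO_INTERVAL_MS = 40  # example video capture period from the text (25 frames per second)


def capture_times(interval_ms, duration_ms):
    """Return the capture start times (in ms) that fall within [0, duration_ms)."""
    return list(range(0, duration_ms, interval_ms))


# Hypothetical 80 ms window: several audio captures occur per video frame.
audio_ticks = capture_times(AUDIO_INTERVAL_MS, 80)
video_ticks = capture_times(VIDEO_INTERVAL_MS, 80)
print(len(audio_ticks), "audio captures,", len(video_ticks), "video captures")
```

With these example values, four audio captures fall within each video frame period, which is one reason the streams are naturally packetized on different schedules.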

Step 101: the first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet.

In this step, the first terminal may use speech recognition to convert the digital audio signal into text information.

In this step, the text packet, the audio packet or the video packet may be encapsulated according to the Real-time Transport Protocol (RTP) packet specification.

The format of the RTP packet header is shown in Table 1.

Table 1

In Table 1, V is the protocol version, 2 bits.

P is the padding bit, 1 bit. When P is set, the RTP packet contains additional padding bytes at its end.

X is the extension bit, 1 bit. When X is set, an extension header follows the RTP packet header.

CC is the number of contributing source (CSRC) identifiers.

M is the marker bit, 1 bit.

PT is the payload type, 7 bits. For a text packet, a type not used in the related art, for example 20, may be used.

Sequence number, 16 bits. The sequence number increases by 1 for each RTP packet sent. In the embodiments of the present invention, text packets, audio packets and video packets are numbered independently.

Timestamp, 32 bits, records the sampling instant of the first byte in the RTP packet. For audio packets and video packets, the timestamp is the time at which capture started; for a text packet, the timestamp is the time at which capture of the corresponding audio packet started.

Synchronization source identifier (SSRC), 32 bits, identifies the source of the RTP packet. Two identical SSRC values cannot exist in the same RTP session.

CSRC, 0 to 15 items of 32 bits each. This field is not mandatory in the RTP packet header.

The timestamp field in the text packet is the time at which the digital audio signal or digital video signal was captured, and the payload type is the voice-text information type (an undefined value such as 20 may be used).
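The header fields described above can be illustrated with a minimal sketch that packs a 12-byte RTP fixed header in front of a UTF-8 text payload. The helper names (`build_rtp_header`, `parse_rtp_header`, `build_text_packet`), the UTF-8 payload encoding and the SSRC value are assumptions made for illustration; payload type 20 is only the example value mentioned in the text.

```python
import struct

TEXT_PAYLOAD_TYPE = 20  # example value from the text; not a registered payload type


def build_rtp_header(payload_type, seq, timestamp, ssrc, marker=0):
    """Pack a 12-byte RTP fixed header with V=2, P=0, X=0, CC=0 (no CSRC list)."""
    byte0 = (2 << 6) | (0 << 5) | (0 << 4) | 0              # V, P, X, CC
    byte1 = ((marker & 0x1) << 7) | (payload_type & 0x7F)   # M, PT
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(data):
    """Unpack the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "padding": (byte0 >> 5) & 1,
        "extension": (byte0 >> 4) & 1,
        "csrc_count": byte0 & 0x0F,
        "marker": byte1 >> 7,
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def build_text_packet(text, seq, audio_capture_ts, ssrc):
    """Text packet: header stamped with the matching audio capture time, then the text."""
    header = build_rtp_header(TEXT_PAYLOAD_TYPE, seq, audio_capture_ts, ssrc)
    return header + text.encode("utf-8")  # UTF-8 payload is an assumption


packet = build_text_packet("hello", seq=7, audio_capture_ts=1600, ssrc=0x1234)
fields = parse_rtp_header(packet)
```

Because each stream keeps its own sequence counter, a sender following this sketch would hold one `seq` value per packet type and reuse the audio capture time as the text packet timestamp.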

Step 102: the first terminal sends the text packet, the audio packet and the video packet separately to the second terminal.

In this step, different types of packets may be sent separately according to different strategies. For example, audio packets are sent at the audio coding sampling frequency, video packets are sent at the agreed frame-rate interval, and text packets are sent at the audio coding sampling frequency.

Optionally, before encapsulating the digital audio signal into an audio packet, the method further includes: the first terminal performs speech encoding on the digital audio signal; correspondingly,

Encapsulating the digital audio signal into an audio packet includes: the first terminal encapsulates the speech-encoded digital audio signal into the audio packet.

Optionally, before encapsulating the digital video signal into a video packet, the method further includes: the first terminal performs video encoding on the digital video signal; correspondingly,

Encapsulating the digital video signal into a video packet includes: the first terminal encapsulates the video-encoded digital video signal into the video packet.

With the solution of the embodiments of the present invention, the first terminal sends the text packet, the audio packet and the video packet separately to the second terminal, so that losing a video packet under poor network conditions does not cause the text to be lost, which reduces information loss during the video call.
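The difference from the related art can be illustrated with a small, deliberately simplified comparison. The sketch assumes that only the listed video packets are lost and that text packets arrive intact; in practice text packets can be lost too. All function names are hypothetical.

```python
def related_art_stream(captions):
    """Related art: each caption is overlaid on the video packet with the same index."""
    return [("video", i, text) for i, text in enumerate(captions)]


def separate_stream(captions):
    """Scheme described above: captions travel in their own text packets."""
    packets = []
    for i, text in enumerate(captions):
        packets.append(("video", i, None))  # video carries no caption
        packets.append(("text", i, text))   # caption sent independently
    return packets


def received_captions(packets, lost_video_indices):
    """Drop the listed video packets and return the captions that still arrive."""
    arrived = []
    for kind, index, text in packets:
        if kind == "video" and index in lost_video_indices:
            continue  # this video packet was lost in transit
        if text is not None:
            arrived.append(text)
    return arrived


captions = ["a", "b", "c"]
lost = {1}  # hypothetical: the second video packet is lost
print(received_captions(related_art_stream(captions), lost))
print(received_captions(separate_stream(captions), lost))
```

In the first case the caption carried by the lost video packet disappears with it; in the second case every caption still reaches the receiver.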

Referring to FIG. 2, an embodiment of the present invention further provides a method for implementing a video call, including:

Step 200: the second terminal receives a text packet from the first terminal.

Step 201: the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, and displays the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp corresponds to a time less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

In this step, the text information may be displayed according to a preset display area and/or font size. Specifically, the number of characters that can be shown on the screen at one time is determined from the display area and/or font size, the number of displays needed for the text information of one text packet is calculated, the dwell time of each display is determined from the capture frequency of the audio packet corresponding to the text packet, and the text is displayed according to that dwell time.

For example, if the audio packet corresponding to a text packet is captured once every 20 ms, the text packet contains 100 characters in total, and 10 characters can be displayed at a time, then 10 displays are needed and each display has a dwell time of 2 ms (20 ms divided by 10 displays).
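The dwell-time calculation in this example can be written out as a short sketch. The function name `subtitle_schedule` and the rule of dividing the audio capture interval evenly across the displays are taken from the worked example; the placeholder text is hypothetical.

```python
def subtitle_schedule(text, chars_per_screen, audio_interval_ms):
    """Split text into screen-sized pages and compute the dwell time per page."""
    pages = [text[i:i + chars_per_screen]
             for i in range(0, len(text), chars_per_screen)]
    # Spread the audio capture interval evenly over the pages.
    dwell_ms = audio_interval_ms / len(pages) if pages else 0
    return pages, dwell_ms


# Values from the example: 100 characters, 10 per screen, 20 ms capture interval.
pages, dwell_ms = subtitle_schedule("x" * 100, chars_per_screen=10, audio_interval_ms=20)
print(len(pages), dwell_ms)
```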

In this step, the text information may be displayed on the graphics layer of the screen, that is, superimposed on the video layer that displays the digital video signal.

Optionally, between step 200 and step 201 the method further includes:

The second terminal determines that the subtitle display function is turned on.

When the second terminal determines that the subtitle display function is turned off, the procedure ends.

The method further includes:

The second terminal determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, and buffers the received text packet.

The method further includes:

The second terminal receives no audio packet and no video packet within the preset time, and displays the text information in the buffered text packets.
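The receiving-side decisions described above (subtitle switch, timestamp comparison, buffering, and the timeout fallback) can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the class `SubtitleReceiver`, its method names, the use of a local millisecond clock passed in as `now_ms`, and the choice to show all buffered text on timeout are illustrative, not the terminal's actual implementation.

```python
class SubtitleReceiver:
    def __init__(self, timeout_ms, subtitles_enabled=True):
        self.timeout_ms = timeout_ms            # the "preset time"
        self.subtitles_enabled = subtitles_enabled
        self.playback_ts = None                 # timestamp of media being played/displayed
        self.last_media_ms = None               # local arrival time of latest media packet
        self.buffer = []                        # (timestamp, text, arrival_ms)

    def on_media(self, timestamp, now_ms):
        """Record the audio or video packet that is now being played or displayed."""
        self.playback_ts = timestamp
        self.last_media_ms = now_ms

    def on_text(self, timestamp, text, now_ms):
        """Handle a text packet; return the list of texts to display now."""
        if not self.subtitles_enabled:
            return []                           # subtitle function off: end the procedure
        self.buffer.append((timestamp, text, now_ms))
        if self.playback_ts is None:
            return []
        due = sorted((p for p in self.buffer if p[0] <= self.playback_ts),
                     key=lambda p: p[0])
        self.buffer = [p for p in self.buffer if p[0] > self.playback_ts]
        return [t for _, t, _ in due]

    def on_tick(self, now_ms):
        """Timeout fallback: no media since a buffered text packet arrived."""
        expired = [p for p in self.buffer
                   if now_ms - p[2] >= self.timeout_ms
                   and (self.last_media_ms is None or self.last_media_ms < p[2])]
        if not expired:
            return []
        shown = [t for _, t, _ in sorted(self.buffer, key=lambda p: p[0])]
        self.buffer = []
        return shown


receiver = SubtitleReceiver(timeout_ms=500)
receiver.on_media(100, now_ms=0)
first = receiver.on_text(100, "hi", now_ms=5)      # timestamp <= playback: shown
second = receiver.on_text(200, "later", now_ms=10)  # timestamp > playback: buffered
```

Following the description literally, buffered text is re-examined when the next text packet arrives; a real terminal might also re-check the buffer whenever playback advances.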

In the above method, after receiving an audio packet and/or a video packet, the second terminal may play or display it according to the rules specified in the audio and video decoding protocol standards.

After receiving an audio packet, the second terminal may play it according to the rules specified in an audio decoding protocol standard (such as G.711); after receiving a video packet, the second terminal may display it according to the rules specified in a video decoding protocol (such as H.264).

Referring to FIG. 3, an embodiment of the present invention further provides a first terminal, including:

In the first terminal of the embodiment of the present invention, the first processing module is specifically configured to:

Convert the digital audio signal into text information, perform speech encoding on the digital audio signal, encapsulate the text information into a text packet, encapsulate the speech-encoded digital audio signal into an audio packet, perform video encoding on the digital video signal, and encapsulate the video-encoded digital video signal into a video packet.

Referring to FIG. 4, an embodiment of the present invention further provides a second terminal, including:

In the second terminal of the embodiment of the present invention, the second processing module is further configured to:

Buffer the received text packet when it determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

Display the text information in the buffered text packets when no audio packet and no video packet are received within the preset time after the text packet is received.

Referring to FIG. 5, an embodiment of the present invention further provides a system for implementing video calls, including:

A second terminal, configured to receive a text packet from the first terminal; determine that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed; and display the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

In the system of the embodiment of the present invention, the second terminal is further configured to:

It should be noted that the embodiments described above are provided only to help those skilled in the art understand the present invention and are not intended to limit its scope of protection. Any obvious substitution or improvement made by those skilled in the art without departing from the inventive concept of the present invention falls within the scope of protection of the present invention.

Claims

1. A method for implementing a video call, characterized by including:

a first terminal separately capturing a digital audio signal and a digital video signal;

the first terminal converting the digital audio signal into text information, encapsulating the text information into a text packet, encapsulating the digital audio signal into an audio packet, and encapsulating the digital video signal into a video packet;

the first terminal sending the text packet, the audio packet and the video packet separately to a second terminal.

2. The method according to claim 1, characterized in that, before encapsulating the digital audio signal into an audio packet, the method further includes: the first terminal performing speech encoding on the digital audio signal;

encapsulating the digital audio signal into an audio packet includes: the first terminal encapsulating the speech-encoded digital audio signal into the audio packet.

3. The method according to claim 1, characterized in that, before encapsulating the digital video signal into a video packet, the method further includes: the first terminal performing video encoding on the digital video signal;

encapsulating the digital video signal into a video packet includes: the first terminal encapsulating the video-encoded digital video signal into the video packet.

4. A method for implementing a video call, characterized by including:

a second terminal receiving a text packet from a first terminal;

the second terminal determining that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, and displaying the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

5. The method according to claim 4, characterized in that, when the second terminal determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further includes:

the second terminal buffering the received text packet.

6. The method according to claim 5, characterized in that, when the second terminal receives no audio packet and no video packet within a preset time after receiving the text packet, the method further includes:

the second terminal displaying the text information in the buffered text packets.

7. The method according to claim 4, characterized in that, after the second terminal receives the text packet from the first terminal and before the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further includes:

the second terminal determining that the subtitle display function is turned on.

8. A first terminal, characterized by including:

a capture module, configured to separately capture a digital audio signal and a digital video signal;

a first processing module, configured to convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet;

a sending module, configured to send the text packet, the audio packet and the video packet separately to a second terminal.

9. The first terminal according to claim 8, characterized in that the first processing module is specifically configured to:

convert the digital audio signal into text information, perform speech encoding on the digital audio signal, encapsulate the text information into a text packet, encapsulate the speech-encoded digital audio signal into the audio packet, perform video encoding on the digital video signal, and encapsulate the video-encoded digital video signal into the video packet.

10. A second terminal, characterized by including:

a receiving module, configured to receive a text packet from a first terminal;

a second processing module, configured to determine that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, and to display the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

11. The second terminal according to claim 10, characterized in that the second processing module is further configured to:

buffer the received text packet when it determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

12. The second terminal according to claim 11, characterized in that the second processing module is further configured to:

display the text information in the buffered text packets when no audio packet and no video packet are received within a preset time after the text packet is received.

13. A system for implementing video calls, characterized by including:

a first terminal, configured to separately capture a digital audio signal and a digital video signal; convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet; and send the text packet, the audio packet and the video packet separately to a second terminal;

a second terminal, configured to receive the text packet from the first terminal; determine that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed; and display the text information of those text packets, among the received text packet and the buffered text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

14. The system according to claim 13, characterized in that the second terminal is further configured to:

15. The system according to claim 14, characterized in that the second terminal is further configured to: