CN118487712B - Conference audio processing method and system based on voice intelligence - Google Patents
Conference audio processing method and system based on voice intelligence
- Publication number
- CN118487712B (application number CN202410594348.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- data
- transmission mode
- network
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
- H04L1/0001—Systems modifying transmission characteristics according to link quality, e.g. power backoff
- H04L1/0006—Systems modifying transmission characteristics according to link quality, e.g. power backoff by adapting the transmission format
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
- H04L1/0001—Systems modifying transmission characteristics according to link quality, e.g. power backoff
- H04L1/0015—Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the adaptation strategy
- H04L1/0017—Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the adaptation strategy where the mode-switching is based on Quality of Service requirement
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/403—Arrangements for multi-party communication, e.g. for conferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/60—Network streaming of media packets
- H04L65/75—Media network packet handling
- H04L65/752—Media network packet handling adapting media to network capabilities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/80—Responding to QoS
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The application relates to a conference audio processing method and system based on voice intelligence. The method is applied to a source end, where the source end represents the terminal device of the speaker in an audio-video conference. The method comprises: determining a transmission mode for each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference; obtaining the audio data of the speaker; determining the target data corresponding to each receiving end based on the audio data and the transmission mode of each receiving end, where for each receiving end the target data represents the data obtained by processing the audio data according to the transmission mode of that receiving end; and sending the target data corresponding to each receiving end to the corresponding receiving end. The application improves the accuracy of audio data transmission in an audio-video conference under network disturbance.
Description
Technical Field
The invention relates to the technical field of audio and video conference calls, in particular to a conference audio processing method and system based on voice intelligence.
Background
With the rapid development of information technology, audio-video conference systems have become an important tool for remote communication in modern society. They are widely used in scenarios such as enterprise meetings, online education and telemedicine, and greatly improve communication efficiency and the reach of information. However, fluctuations in network transmission quality often have a significant impact on the effectiveness of an audio-video conference.
In existing audio-video conference systems, degraded network transmission quality, such as network jitter, insufficient bandwidth or packet loss, often causes loss of audio-video synchronization, added noise, and unclear or distorted speech. These problems not only worsen the communication experience of the participants, but may also prevent key information in the conference from being conveyed accurately, thereby affecting conference decisions and follow-up work.
Disclosure of Invention
The invention aims to solve at least one of the above technical problems by providing a conference audio processing method and system based on voice intelligence.
The technical scheme for solving the technical problems is as follows:
In a first aspect, the present application provides a conference audio processing method based on voice intelligence, which is applied to a source end, and adopts the following technical scheme:
The conference audio processing method based on voice intelligence is applied to a source terminal, wherein the source terminal characterizes terminal equipment where a speaker is in an audio-video conference, and the method comprises the following steps:
In an audio-video conference, acquiring the first network transmission quality of a source terminal and the second network transmission quality of each receiving terminal;
Determining a transmission mode for each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference, wherein for each receiving end, the transmission mode characterizes a transmission mode of data between the source end and the receiving end;
acquiring audio data of a speaker, determining target data corresponding to each receiving end based on the audio data and transmission modes of the receiving ends, wherein for each receiving end, the target data represents data obtained by processing the audio data through a transmission mode of the receiving end;
And sending the target data corresponding to each receiving terminal to a corresponding receiving terminal.
The beneficial effects of the method are as follows: the network transmission quality of the source end and of each receiving end is acquired in real time, and the transmission mode is determined based on the first network transmission quality of the source end, the second network transmission quality of each receiving end and the conference type of the audio-video conference. The data transmission mode can therefore be adjusted to the network conditions and adapted to different conference types. The audio data of the speaker is processed and the target data is determined based on the transmission mode of each receiving end, which realizes intelligent processing of the audio data, keeps the audio data at high quality during transmission, and ensures that every receiving end in the audio-video conference obtains clear and smooth audio. This improves the accuracy of audio data transmission in the audio-video conference under network disturbance and reduces the communication barriers such disturbance causes.
On the basis of the technical scheme, the invention can be improved as follows.
Further, the transmission mode is any one of an audio transmission mode and a text transmission mode, and the determining the transmission mode for each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference includes:
based on the conference type of the audio-video conference, determining a network quality requirement corresponding to the conference type;
When the first network transmission quality does not meet the network quality requirement, determining that the transmission mode of the source end to each receiving end is a text transmission mode;
When the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, determining that a transmission mode between the source end and the receiving end which does not meet the network quality requirement is a text transmission mode;
when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end meets the preset network quality requirement, determining that the transmission mode of the source end to the receiving end meeting the preset network quality requirement is an audio transmission mode.
The beneficial effect of adopting this further scheme is that the transmission mode is flexibly adjusted according to the network conditions of the source end and the receiving ends and the conference type, which enhances the adaptability and flexibility of the conference. Different types of conferences may have different requirements for audio quality, and such dynamic adjustment can meet the requirements of different conference scenarios. Adjusting the transmission mode according to the network quality also optimizes the use of network resources. When the network quality is good, the audio transmission mode provides a better conference experience; when the network quality is poor, switching to the text transmission mode still conveys the information accurately, prevents the audio from degrading or breaking up due to network delay and packet loss, reduces the occupied network bandwidth, and frees network resources for other applications or services.
Further, the determining, based on the audio data and the transmission modes of the respective receiving ends, the target data corresponding to each receiving end includes:
when the transmission mode is a text transmission mode, converting the audio data into text data based on a preset voice intelligent recognition technology;
Checking the text data, generating a check code corresponding to the text data, packaging the check code and the text data, and taking the packaged check code and text data as target data corresponding to the receiving end;
when the transmission mode is an audio transmission mode, the audio data is used as target data corresponding to the receiving end.
The further scheme has the advantages that when the transmission mode is a text transmission mode, the voice intelligent recognition technology is used for converting the audio data into text data, so that the problem of audio distortion or interruption possibly occurring in network transmission can be reduced, and the accuracy and reliability of information transmission are improved. And the text data is checked and a check code is generated, so that the integrity and the accuracy of the data are further ensured, and the possible errors or losses of the data in the transmission process are reduced. Because the text data and the check code are transmitted, compared with the audio data, the data size is smaller, the occupation of bandwidth can be obviously reduced, and the transmission quality of the target data is improved.
When the transmission mode is the audio transmission mode, the network conditions are good, and the audio transmission mode is selected to provide a better conference experience.
Further, after determining that the transmission mode of the source end to each receiving end is a text transmission mode when the network transmission quality of the source end does not meet the preset network quality requirement, the method further includes:
closing an audio transmission channel of the source end;
after determining that the transmission mode between the source end and the receiving end which does not meet the network quality requirement is the text transmission mode when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, the method further comprises:
And marking the audio transmission path of the receiving end which does not meet the preset network quality requirement as an unavailable state.
The beneficial effect of this further scheme is that closing the audio transmission channel of the source end, or marking the audio transmission channel of a receiving end that does not meet the condition as unavailable, stops the transmission of audio data immediately and thus markedly reduces the occupied network bandwidth. If the network conditions are not sufficient to support audio transmission, continuing to attempt it wastes resources such as CPU, memory and network bandwidth. Closing the channel or marking it unavailable avoids this unnecessary waste. Under poor network conditions, attempts to transmit audio may also result in degraded sound quality, stuttering, and garbled or overlapping sound, which affects the user experience.
Further, the sending the target data corresponding to each receiving terminal to the corresponding receiving terminal includes:
compressing the target data corresponding to each receiving terminal based on a preset compression algorithm, and sending each compressed target data to a corresponding receiving terminal.
The adoption of the further scheme has the beneficial effects that the volume of the target data can be obviously reduced by compressing the target data, and the network bandwidth required in the transmission process is reduced, so that the data transmission speed is increased. For the audio-video conference with higher real-time requirement, the method can ensure timely transmission of information and improve conference efficiency.
Further, after sending the target data corresponding to each receiving terminal to the corresponding receiving terminal, the method further includes:
and receiving the playing completion state identifiers sent by the receiving ends, and displaying the playing completion state identifiers.
The further scheme has the beneficial effects that by displaying the playing completion state, a speaker of the conference can instantly know which receiving ends have successfully received the data and which receiving ends possibly have problems. The method is convenient to quickly identify and solve the potential transmission problem, and ensures that conference information can be accurately and timely transmitted to all participants, thereby improving the communication efficiency of the conference.
In a second aspect, the present application provides a conference audio processing method based on voice intelligence, which is applied to a receiving end, and adopts the following technical scheme:
the conference audio processing method based on voice intelligence is applied to a receiving end, wherein the receiving end characterizes a terminal device for receiving audio data of a terminal device where a speaker is located in an audio-video conference, and the method comprises the following steps:
In an audio-video conference, acquiring target data sent by a source terminal based on audio data and a transmission mode, wherein the source terminal represents terminal equipment where a speaker is located in the audio-video conference, the transmission mode is a transmission mode of data between the source terminal and a receiving terminal, and the target data represents data obtained by processing the audio data through the transmission mode of the receiving terminal;
preprocessing the target data to obtain processed voice data;
and playing the voice data, and after the voice data is played, sending a playing completion state identifier to a source terminal.
The method has the advantages that the target data is converted into voice data through preprocessing, and after the voice data is played, the receiving end sends the playing completion status to the source end. Through this timely feedback mechanism the source end can conveniently learn how the receiving end verified and played the data, and can make corresponding adjustments or notify other participants as needed. This feedback enhances the real-time responsiveness and interactivity of the conference.
Further, the preprocessing the target data to obtain processed voice data includes:
when the target data are packaged check codes and text data, checking the text data based on a preset check rule and the check codes;
if the text data passes the verification, converting the text data into voice data based on a preset conversion rule;
The converting the text data into voice data based on the preset conversion rule includes:
acquiring the identification information and the audio feature information of the source terminal, wherein the audio feature information characterizes the voice characteristics of a speaker of the source terminal in an audio-video conference;
determining a voice characteristic value corresponding to the source end in the characteristic value library based on the identification information;
comparing the audio characteristic information with the voice characteristic value;
if the comparison between the audio characteristic information and the voice characteristic value passes, the voice characteristic value is used as an input parameter of a preset text conversion model, and the text data is converted into voice data through a voice synthesis technology.
The further scheme has the beneficial effects that the accuracy and the integrity of the received text data are ensured by receiving the packaged check code and the text data and checking at the receiving end. The checking mechanism is helpful for reducing errors and packet loss phenomena in the data transmission process and improving the communication quality of the whole conference. After receiving the correct text data, the text data is converted into voice data through a preset conversion rule for playing, so that the receiving end user can still clearly hear the content of the speaker even if the audio transmission quality is reduced due to poor network conditions. The processing mode avoids the degradation or interruption of sound quality caused by network problems, thereby optimizing the conference experience of users.
And determining the voice characteristics of the speaker corresponding to the source terminal by comparing the audio characteristic information of the source terminal with the voice characteristic values stored in the characteristic value library. Therefore, when the text data are converted into voice data to be played, the voice characteristics of the speaker can be simulated, and personalized voice playing is realized. The voice characteristic value matched with the voice characteristic of the speaker is used as the input parameter of the text conversion model, the text data is converted into the voice data through the voice synthesis technology and is played, and the generated voice data is ensured to be consistent with the original voice of the speaker in the aspects of tone, intonation, speed and the like. The accuracy and naturalness of voice playing are improved, so that a receiving end user can better understand and accept the content of a speaker. In case of poor network conditions, the dependency on the network bandwidth can be reduced by receiving and playing the converted speech data instead of the original audio data.
In a third aspect, the present application provides a conference audio processing system based on voice intelligence, which is applied to a source end, and adopts the following technical scheme:
A conference audio processing system based on voice intelligence, applied to a source, the source characterizes a terminal device where a speaker is located in an audio-video conference, the system comprises:
the network quality identification module is used for acquiring the first network transmission quality of the source terminal and the second network transmission quality of each receiving terminal in the audio-video conference;
The mode management module is used for determining a transmission mode of each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference, and for each receiving end, the transmission mode represents a transmission mode of data between the source end and the receiving end;
The first data receiving and processing module is used for acquiring audio data of a speaker, determining target data corresponding to each receiving end based on the audio data and the transmission modes of the receiving ends, and for each receiving end, representing data obtained by processing the audio data through the transmission mode of the receiving end;
and the transmission module is used for transmitting the target data corresponding to each receiving terminal to the corresponding receiving terminal.
The beneficial effects of the system are as follows: the network transmission quality of the source end and of each receiving end is acquired in real time, and the transmission mode is determined based on the network transmission quality of the source end, the network transmission quality of each receiving end and the conference type of the audio-video conference. The data transmission mode can therefore be adjusted to the network conditions and adapted to different conference types. The audio data of the speaker is processed and the target data is determined based on the transmission mode of each receiving end, which realizes intelligent processing of the audio data, keeps the audio data at high quality during transmission, and ensures that every receiving end in the audio-video conference obtains clear and smooth audio. This improves the accuracy of audio data transmission in the audio-video conference under network disturbance and reduces the communication barriers such disturbance causes.
In a fourth aspect, the present application provides a conference audio processing system based on voice intelligence, which is applied to a receiving end, and adopts the following technical scheme:
A conference audio processing system based on voice intelligence, applied to a receiving end, the receiving end characterizing a terminal device that receives the audio data of the terminal device where a speaker is located in an audio-video conference, the system comprising:
The second data receiving and processing module is used for acquiring target data sent by a source terminal based on audio data and a transmission mode in an audio-video conference, wherein the source terminal represents terminal equipment where a speaker is located in the audio-video conference, the transmission mode is a transmission mode of data between the source terminal and a receiving terminal, and the target data represents data obtained by processing the audio data through the transmission mode of the receiving terminal;
The preprocessing module is used for preprocessing the target data to obtain processed voice data;
And the feedback module is used for playing the voice data and sending a playing completion state identifier to the source end after the voice data is played.
The system has the advantages that the target data is converted into voice data through preprocessing, and after the voice data is played, the receiving end sends the playing completion status to the source end. Through this timely feedback mechanism the source end can conveniently learn how the receiving end verified and played the data, and can make corresponding adjustments or notify other participants as needed. This feedback enhances the real-time responsiveness and interactivity of the conference.
Drawings
Fig. 1 is a schematic flow chart of a conference audio processing method based on voice intelligence applied to a source end according to an embodiment of the present invention;
fig. 2 is a flow chart of a conference audio processing method based on voice intelligence applied to a receiving end according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating a voice-based intelligent conference audio processing system applied to a source according to an embodiment of the present invention;
fig. 4 is a block diagram illustrating a voice-based intelligent conference audio processing system applied to a receiving end according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides a conference audio processing method based on voice intelligence, which can be executed by an electronic device. The electronic device may be a mobile terminal device, and the mobile terminal device may be a notebook computer, a desktop computer, a mobile phone and the like, but is not limited thereto.
Fig. 1 is a flow chart of a conference audio processing method based on voice intelligence applied to a source end according to an embodiment of the present invention.
The source end characterizes the terminal equipment where the speaker is in the audio-video conference, as shown in fig. 1, the main flow of the conference audio processing method based on voice intelligence applied to the source end comprises the following steps:
Step S11, acquiring a first network transmission quality of a source terminal and a second network transmission quality of each receiving terminal in an audio-video conference, wherein the receiving terminals represent terminal equipment for receiving audio of the source terminal in the audio-video conference;
In the embodiment of the application, the network transmission quality can comprise key indexes such as network delay, bandwidth utilization rate, packet loss rate and the like, and the network bottleneck or problem possibly existing in the terminal equipment can be identified.
The source device (i.e. the terminal device used by the speaker of the audio-video conference) first performs a self-diagnosis, detecting key indicators such as the speed, stability, delay and packet loss rate of its own network connection through a built-in network testing tool, so as to obtain the first network transmission quality of the source device. These tests may include Ping tests (for measuring network delay), Traceroute or Tracert tests (for tracing the transmission path of data packets in the network), and network bandwidth tests, among others.
For each receiving end device (namely, a terminal device for receiving the audio of the source end in the audio-video conference), the source end device sends a network diagnosis request to acquire the second network transmission quality of each receiving end.
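As an illustration of how the key indicators above (delay, jitter, packet loss) might be collected in practice, the following Python sketch times short TCP connections to a peer. It is only a stand-in for the Ping / Traceroute / bandwidth tests named in the description; the `NetworkQuality` structure, the probe port and the sampling strategy are assumptions, not part of the patent.

```python
import socket
import statistics
import time
from dataclasses import dataclass


@dataclass
class NetworkQuality:
    """Key indicators named in the embodiment: delay, jitter and loss rate."""
    avg_delay_ms: float
    jitter_ms: float
    loss_rate: float


def probe_link(host: str, port: int = 443, samples: int = 10,
               timeout: float = 1.0) -> NetworkQuality:
    """Estimate link quality by timing short TCP connects to the peer."""
    delays, lost = [], 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                delays.append((time.monotonic() - start) * 1000.0)
        except OSError:
            lost += 1
    if not delays:  # every probe timed out or was refused
        return NetworkQuality(float("inf"), float("inf"), 1.0)
    jitter = statistics.pstdev(delays) if len(delays) > 1 else 0.0
    return NetworkQuality(statistics.fmean(delays), jitter, lost / samples)
```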
Step S12, determining a transmission mode of each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference, wherein for each receiving end, the transmission mode represents a transmission mode of data between the source end and the receiving end;
In the embodiment of the application, the conference type of the audio-video conference affects how the network quality requirement is set, so that different types of conferences have different network quality requirements for audio. The conference type may include a formal conference, a common conference, a discussion conference and an emergency conference, but is not limited thereto.
The transmission mode is any one of an audio transmission mode and a text transmission mode, the audio transmission mode is a process of transmitting audio data of a speaker to each receiving end through a network in an audio-video conference, and the text transmission mode is a process of converting the audio data of the speaker into text data and transmitting the text data to each receiving end through the network in the audio-video conference.
Specifically, step S12 includes the following sub-steps:
step S121, based on the conference type of the audio-video conference, determining the network quality requirement corresponding to the conference type, and presetting different network quality requirements for different conference types;
Step S122, when the first network transmission quality does not meet the network quality requirement, determining that the transmission mode of the source end to each receiving end is a text transmission mode;
Step S123, when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, determining that the transmission mode between the source end and the receiving end which does not meet the network quality requirement is a text transmission mode;
Step S124, when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end meets the preset network quality requirement, determining that the transmission mode of the source end to the receiving end meeting the preset network quality requirement is an audio transmission mode.
The transmission mode, whether audio transmission mode or text transmission mode, is flexibly adjusted according to the network conditions of the source end and the receiving ends and the conference type, which enhances the adaptability and flexibility of the conference. Different types of conferences may have different requirements for audio quality, and such dynamic adjustment can meet the requirements of different conference scenarios. Adjusting the transmission mode according to the network quality also optimizes the use of network resources. When the network quality is good, the audio transmission mode provides a better conference experience; when the network quality is poor, switching to the text transmission mode still conveys the information accurately, prevents the audio from degrading or breaking up due to network delay and packet loss, reduces the occupied network bandwidth, and frees network resources for other applications or services.
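A minimal sketch of the per-receiver decision logic of steps S121–S124 is shown below. The quality thresholds attached to each conference type are purely illustrative assumptions (the patent only states that different conference types preset different network quality requirements), and `NetworkQuality` is the structure from the earlier probing sketch.

```python
from enum import Enum


class Mode(Enum):
    AUDIO = "audio"
    TEXT = "text"


# Assumed per-conference-type requirements: (max mean delay in ms, max loss rate).
QUALITY_REQUIREMENTS = {
    "formal": (150.0, 0.01),
    "common": (250.0, 0.03),
    "discussion": (300.0, 0.05),
    "emergency": (100.0, 0.01),
}


def meets(q: NetworkQuality, requirement: tuple[float, float]) -> bool:
    max_delay, max_loss = requirement
    return q.avg_delay_ms <= max_delay and q.loss_rate <= max_loss


def choose_modes(source_q: NetworkQuality,
                 receiver_q: dict[str, NetworkQuality],
                 conference_type: str) -> dict[str, Mode]:
    """Steps S121-S124: decide one transmission mode per receiving end."""
    requirement = QUALITY_REQUIREMENTS[conference_type]            # S121
    if not meets(source_q, requirement):                           # S122
        return {rid: Mode.TEXT for rid in receiver_q}
    return {rid: Mode.AUDIO if meets(q, requirement) else Mode.TEXT  # S123/S124
            for rid, q in receiver_q.items()}
```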
Further, when the first network transmission quality does not meet the network quality requirement, after determining that the transmission mode of the source end to each receiving end is a text transmission mode, the method further comprises closing an audio transmission channel of the source end;
after determining that the transmission mode between the source end and the receiving end which does not meet the network quality requirement is the text transmission mode when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, the method further comprises:
and marking the audio transmission path of the receiving end which does not meet the network quality requirement as an unavailable state.
In the embodiment of the application, the audio transmission channel of whichever end fails to meet the network quality requirement is blocked: when the network transmission quality of the speaker's terminal device does not meet the requirement, the voice data on the speaker's side is blocked, and when the network transmission quality of a receiving end does not meet the requirement, the voice data toward that receiving end is blocked.
Closing the audio transmission path of the source end, or marking the audio transmission path of a receiving end that does not meet the condition as unavailable, stops the transmission of audio data immediately and thus markedly reduces the occupied network bandwidth. If the network conditions are not sufficient to support audio transmission, continuing to attempt it wastes resources such as CPU, memory and network bandwidth. Closing the path or marking it unavailable avoids this unnecessary waste. Under poor network conditions, attempts to transmit audio may also result in degraded sound quality, stuttering, and garbled or overlapping sound, which affects the user experience.
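The channel handling described above can be as simple as a small state table kept at the source end. The sketch below is an assumed illustration; the patent does not prescribe how the "unavailable" marking is stored.

```python
class AudioChannelRegistry:
    """Tracks which audio paths may still carry audio (source-end bookkeeping)."""

    def __init__(self, receiver_ids: list[str]) -> None:
        self.source_channel_open = True
        self.available = {rid: True for rid in receiver_ids}

    def close_source_channel(self) -> None:
        # Source-side quality failed: stop all audio uplink immediately.
        self.source_channel_open = False

    def mark_unavailable(self, receiver_id: str) -> None:
        # Receiver-side quality failed: only this path falls back to text.
        self.available[receiver_id] = False

    def audio_allowed(self, receiver_id: str) -> bool:
        return self.source_channel_open and self.available.get(receiver_id, False)
```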
Step S13, obtaining audio data of a speaker, and determining target data corresponding to each receiving end based on the audio data and the transmission modes of the receiving ends, wherein the target data represents data obtained by processing the audio data through the transmission mode of the receiving end;
in an embodiment of the present application, the determining, based on the audio data and the transmission modes of the respective receiving ends, the target data corresponding to each receiving end includes:
When the transmission mode is a text transmission mode, converting the audio data into text data based on a preset voice intelligent recognition technology, wherein the preset voice intelligent recognition technology can be an ASR (Automatic Speech Recognition) technology;
Checking the text data, generating a check code corresponding to the text data, packaging the check code and the text data, and taking the packaged check code and text data as target data corresponding to the receiving end, wherein the check algorithm can be a CRC (Cyclic Redundancy Check) algorithm, and the check code is used for carrying out integrity verification on the text data at the receiving end;
When the transmission mode is an audio transmission mode, the audio data is used as target data corresponding to the receiving end, and in the audio transmission mode, the audio data does not need to be subjected to additional conversion. But the audio data may be encoded and compressed appropriately to accommodate the bandwidth and delay requirements of the network transmission.
The voice data is converted into text data through the voice intelligent recognition technology, so that the problem of audio distortion or interruption possibly occurring in network transmission can be reduced, and the accuracy and reliability of information transmission are improved. And the text data is checked and a check code is generated, so that the integrity and the accuracy of the data are further ensured, and the possible errors or losses of the data in the transmission process are reduced. Because the text data and the check code are transmitted, compared with the audio data, the data size is smaller, the occupation of bandwidth can be obviously reduced, and the transmission quality of the target data is improved.
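The following sketch shows one way the check-code generation and packaging just described could look. The ASR engine is left as an injected callable because the description only names ASR generically, and CRC-32 from `zlib` plus a JSON envelope are assumed stand-ins for the patent's CRC algorithm and packaging format; `Mode` is the enum from the earlier decision sketch.

```python
import json
import zlib
from typing import Callable


def build_target_data(audio_pcm: bytes, mode: Mode,
                      transcribe: Callable[[bytes], str]) -> bytes:
    """Step S13: derive one receiver's target data from the speaker's audio."""
    if mode is Mode.TEXT:
        text = transcribe(audio_pcm)                  # ASR step (engine unspecified)
        body = text.encode("utf-8")
        envelope = {"text": text,
                    "crc32": zlib.crc32(body) & 0xFFFFFFFF}  # check code
        return json.dumps(envelope, ensure_ascii=False).encode("utf-8")
    return audio_pcm                                  # audio mode: forward as-is
```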
And step S14, the target data corresponding to each receiving terminal is sent to the corresponding receiving terminal.
In the embodiment of the application, the target data corresponding to each receiving terminal is converted into a low-bandwidth form and sent to the corresponding receiving terminal.
Specifically, step S14 includes compressing, based on a preset compression algorithm, target data corresponding to each receiving terminal, and sending each compressed target data to a corresponding receiving terminal.
The volume of the target data is obviously reduced by compressing the target data, and the network bandwidth required in the transmission process is reduced, so that the data transmission speed is increased.
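As a sketch of step S14 under the assumption that zlib stands in for the "preset compression algorithm", each receiver's target data could be compressed and handed to whatever transport the conference system uses; the `send` callable is a placeholder, not part of the patent.

```python
import zlib
from typing import Callable


def send_to_receivers(targets: dict[str, bytes],
                      send: Callable[[str, bytes], None],
                      level: int = 6) -> None:
    """Compress each receiver's target data, then hand it to the transport layer."""
    for receiver_id, data in targets.items():
        send(receiver_id, zlib.compress(data, level))
```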
As another optional implementation manner of the embodiment of the present application, after sending the target data corresponding to each receiving terminal to a corresponding receiving terminal, the method further includes:
The source end receives the playing completion state identifiers sent by the receiving ends and displays the playing completion state identifiers.
By displaying the play completion status, a speaker of the conference can instantly know which receiving ends have successfully received the data, and which receiving ends may have problems. The method is convenient to quickly identify and solve the potential transmission problem, and ensures that conference information can be accurately and timely transmitted to all participants, thereby improving the communication efficiency of the conference.
The conference audio processing method based on voice intelligence applied to the source end determines the transmission mode by acquiring the network transmission quality of the source end and each receiving end in real time and based on information such as the first network transmission quality of the source end, the second network transmission quality of each receiving end and the conference type of the audio-video conference. The data transmission mode can therefore be adjusted to the network conditions and adapted to different conference types. The audio data of the speaker is processed and the target data is determined based on the transmission mode of each receiving end, which realizes intelligent processing of the audio data, keeps the audio data at high quality during transmission, and ensures that every receiving end in the audio-video conference obtains clear and smooth audio. This improves the accuracy of audio data transmission in the audio-video conference under network disturbance and reduces the communication barriers such disturbance causes.
Fig. 2 is a flow chart of a conference audio processing method based on voice intelligence applied to a receiving end according to an embodiment of the present invention.
As shown in fig. 2, a main flow of a conference audio processing method based on voice intelligence applied to a receiving end includes:
Step S21, in an audio-video conference, acquiring target data sent by a source terminal based on audio data and a transmission mode, wherein the source terminal represents terminal equipment where a speaker is located in the audio-video conference, the transmission mode is a transmission mode of data between the source terminal and a receiving terminal, and the target data represents data obtained by processing the audio data through the transmission mode of the receiving terminal;
In an embodiment of the present application, the receiving end decompresses the target data to obtain decompressed target data. When the transmission mode is the audio transmission mode, the target data is audio data; when the transmission mode is the text transmission mode, the target data is the packaged text data and the check code corresponding to the text data.
Step S22, preprocessing the target data to obtain processed voice data;
In the embodiment of the present application, step S22 specifically includes, when the target data is a packaged check code and text data, checking the text data based on a preset check rule and the check code;
if the text data passes the verification, converting the text data into voice data based on a preset conversion rule;
when the target data is audio data, the audio data is directly used as voice data.
In the embodiment of the application, the receiving end applies a preset check rule (CRC check) to check the text data.
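A receiving-end counterpart to the packaging sketch given for the source end might look as follows; it assumes the same JSON-plus-CRC-32 envelope, which is an illustrative choice rather than the patent's prescribed format.

```python
import json
import zlib
from typing import Optional


def verify_text_payload(packet: bytes) -> Optional[str]:
    """Unpack a text-mode payload and check it; return the text, or None on failure."""
    envelope = json.loads(packet.decode("utf-8"))
    body = envelope["text"].encode("utf-8")
    if zlib.crc32(body) & 0xFFFFFFFF != envelope["crc32"]:
        return None  # verification failed; the receiver may request retransmission
    return envelope["text"]
```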
Further, based on a preset conversion rule, converting the text data into voice data for playing comprises the following sub-steps:
step S221, obtaining the identification information and the audio feature information of the source terminal, wherein the audio feature information characterizes the voice characteristics of a speaker of the source terminal in an audio-video conference;
Step S222, determining a voice characteristic value corresponding to the source end in the characteristic value library based on the identification information;
Step S223, comparing the audio characteristic information with the voice characteristic value;
Step S224, if the comparison between the audio characteristic information and the voice characteristic value passes, the voice characteristic value is used as an input parameter of a preset text conversion model, and the text data is converted into voice data by a voice synthesis technology and played.
In the above embodiments, the identification information may be a unique ID of the source end, a user name, or another identifier that identifies a particular speaker. The audio characteristic information is extracted by the source end from the speaker's most recent speech in the audio-video conference and includes pitch, timbre, speech rate and the like.
The receiving end accesses a pre-stored characteristic value library based on the identification information of the source end, wherein the characteristic value library comprises mapping relations between a plurality of source ends (namely terminal equipment where different speakers are located) and respective voice characteristic values.
In the characteristic value library, the receiving end retrieves the corresponding voice characteristic value according to the identification information of the source end. These speech feature values are pre-stored and can be obtained by recording and analyzing the speaker at a certain stage before the audio-video conference starts.
The receiving end then compares the acquired audio characteristic information with the voice characteristic value retrieved from the characteristic value library. If the similarity between them reaches a preset threshold, the comparison passes, and the receiving end uses the voice characteristic value as one of the input parameters of the text conversion model to convert the text data into voice data, which ensures that the converted voice data is closer to the source-end speaker in voice characteristics.
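The comparison of steps S221–S223 can be pictured as a similarity check between an observed voice profile and the one stored in the feature value library, with the matched profile then handed to the speech-synthesis engine as its speaker parameters (step S224). The profile fields, the cosine similarity measure and the 0.85 threshold below are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceProfile:
    """Assumed contents of one entry in the feature value library."""
    pitch_hz: float
    speech_rate: float
    timbre_embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_profile(source_id: str, observed: VoiceProfile,
                  library: dict[str, VoiceProfile],
                  threshold: float = 0.85) -> Optional[VoiceProfile]:
    """Steps S221-S223: retrieve the stored profile and accept it only if the
    observed audio features are similar enough to it."""
    stored = library.get(source_id)
    if stored is None:
        return None
    if cosine_similarity(observed.timbre_embedding, stored.timbre_embedding) < threshold:
        return None
    return stored  # step S224 would pass this profile to the TTS model
```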
Step S23, playing the voice data, and after the voice data is played, sending a playing completion status identifier to a source end.
In the embodiment of the application, after the voice data is played, the receiving end generates a playing completion status identifier message and sends it back to the source end over the network. The message may contain information such as the identifier of the receiving end, the timestamp of the playing completion and the verification result, so that the source end can confirm that the speech content was transmitted successfully. Under poor network conditions, receiving and playing the converted voice data instead of the original audio data reduces the dependency on network bandwidth.
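The playback-completion status identifier could be a small structured message such as the one sketched below; the field names are illustrative, since the description only says the message may carry the receiver's identifier, a completion timestamp and verification information.

```python
import json
import time


def build_completion_message(receiver_id: str, verification_passed: bool) -> bytes:
    """Step S23: status identifier the receiving end returns to the source end."""
    return json.dumps({
        "status": "play_complete",
        "receiver_id": receiver_id,
        "completed_at": time.time(),
        "verification_passed": verification_passed,
    }).encode("utf-8")
```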
The conference audio processing method based on voice intelligence applied to the receiving end ensures the accuracy and the integrity of the received text data by receiving the packaged check code and the text data and checking at the receiving end. The checking mechanism is helpful for reducing errors and packet loss phenomena in the data transmission process and improving the communication quality of the whole conference. After receiving the correct text data, the text data is converted into voice data through a preset conversion rule for playing, so that the receiving end user can still clearly hear the content of the speaker even if the audio transmission quality is reduced due to poor network conditions. The processing mode avoids the degradation or interruption of sound quality caused by network problems, thereby optimizing the conference experience of users. After the voice data is played, the receiving end sends the playing completion state to the source end. The source end can conveniently know the data check and play condition of the receiving end through a timely feedback mechanism, so that corresponding adjustment is carried out or other participants are notified according to the needs. This interactivity enhances the real-time and interactivity of the conference.
Fig. 3 is a block diagram of a conference audio processing system based on voice intelligence applied to a source according to an embodiment of the present invention.
As shown in fig. 3, a conference audio processing system 300 based on voice intelligence is applied to a source end, and mainly includes:
The network quality identification module 301 is configured to obtain, in an audio/video conference, a first network transmission quality of a source end and a second network transmission quality of each receiving end;
a mode management module 302, configured to determine, based on the first network transmission quality, each second network transmission quality, and a conference type of the audio-video conference, a transmission mode for each receiving end, where, for each receiving end, the transmission mode characterizes a transmission mode of data between the source end and the receiving end;
The first data receiving and processing module 303 is configured to obtain audio data of a speaker, determine, based on the audio data and a transmission mode of each receiving end, target data corresponding to each receiving end, where for each receiving end, the target data represents data obtained by processing the audio data in a transmission mode of the receiving end;
and the transmission module 304 is configured to send the target data corresponding to each receiving terminal to a corresponding receiving terminal.
Optionally, the transmission mode is any one of an audio transmission mode and a text transmission mode, and the mode management module 302 is specifically configured to:
based on the conference type of the audio-video conference, determining a network quality requirement corresponding to the conference type;
When the first network transmission quality does not meet the network quality requirement, determining that the transmission mode of the source end to each receiving end is a text transmission mode;
When the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, determining that a transmission mode between the source end and the receiving end which does not meet the network quality requirement is a text transmission mode;
when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end meets the preset network quality requirement, determining that the transmission mode of the source end to the receiving end meeting the preset network quality requirement is an audio transmission mode.
Optionally, the first data receiving and processing module 303 includes:
The first voice conversion sub-module is used for converting the audio data into text data based on a preset voice intelligent recognition technology when the transmission mode is a text transmission mode;
The data verification sub-module is used for verifying the text data, generating a verification code corresponding to the text data, packaging the verification code and the text data, and taking the packaged verification code and the packaged text data as target data corresponding to the receiving end;
And the second voice conversion sub-module is used for taking the audio data as target data corresponding to the receiving end when the transmission mode is an audio transmission mode.
Optionally, the conference audio processing system 300 based on voice intelligence applied to the source end further comprises an audio blocking module, wherein the audio blocking module is specifically configured to close an audio transmission channel of the source end after determining that the transmission mode of the source end to each receiving end is a text transmission mode when the first network transmission quality does not meet the network quality requirement;
after determining that the transmission mode between the source end and the receiving end which does not meet the network quality requirement is a text transmission mode when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, marking an audio transmission path of the receiving end which does not meet the network quality requirement as an unavailable state.
Optionally, the transmission module 304 is specifically configured to:
compressing the target data corresponding to each receiving terminal based on a preset compression algorithm, and sending each compressed target data to a corresponding receiving terminal.
Optionally, the conference audio processing system 300 based on voice intelligence applied to the source end further includes a display module, where the display module is specifically configured to:
and receiving the playing completion state identifiers sent by the receiving ends, and displaying the playing completion state identifiers.
Fig. 4 is a block diagram of a conference audio processing system based on voice intelligence applied to a receiving end according to an embodiment of the present invention.
As shown in fig. 4, a conference audio processing system 400 based on voice intelligence applied to a receiving end mainly includes:
The second data receiving and processing module 401 is configured to obtain, in an audio-video conference, target data sent by a source end based on audio data and a transmission mode, where the source end represents a terminal device where a speaker is located in the audio-video conference, the transmission mode is a transmission mode of data between the source end and a receiving end, and the target data represents data obtained by processing the audio data in the transmission mode of the receiving end;
The preprocessing module 402 is configured to pre-process the target data to obtain processed voice data;
and the feedback module 403 is configured to play the voice data, and after the playing of the voice data is completed, send a playing completion status identifier to the source end.
Optionally, the preprocessing module 402 includes:
The data verification sub-module is used for verifying the text data based on a preset verification rule and the verification code when the target data are the packaged verification code and the text data;
The data synthesis sub-module is used for converting the text data into voice data based on a preset conversion rule if the text data passes the verification;
The data synthesis submodule is specifically used for:
acquiring the identification information and the audio feature information of the source terminal, wherein the audio feature information characterizes the voice characteristics of a speaker of the source terminal in an audio-video conference;
determining a voice characteristic value corresponding to the source end in the characteristic value library based on the identification information;
comparing the audio characteristic information with the voice characteristic value;
If the audio characteristic information is consistent with the voice characteristic value, the voice characteristic value is used as an input parameter of a preset text conversion model, and the text data is converted into voice data through a voice synthesis technology and played.
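Under stated assumptions (MD5 as the check rule, a simple per-dimension tolerance for the feature comparison, and a generic tts_engine callable, none of which are fixed by the patent), the two sub-modules could be sketched as follows.

```python
import hashlib


def verify_text(text: str, check_code: str) -> bool:
    """Recompute the check code over the text and compare it with the packaged one."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() == check_code


def synthesize(text: str, source_id: str, audio_features: list, feature_db: dict,
               tts_engine, tolerance: float = 0.1) -> bytes:
    """Convert verified text to speech using the stored voice profile of the source end."""
    stored = feature_db[source_id]  # voice characteristic value looked up by identification information
    # Use the stored profile only if the live audio features are consistent with it.
    if all(abs(a - b) <= tolerance for a, b in zip(audio_features, stored)):
        return tts_engine(text, voice_profile=stored)
    raise ValueError("audio feature information is not consistent with the stored voice profile")
```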
In one example, a module in any of the above systems may be one or more integrated circuits configured to implement the above methods, for example one or more application specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more field programmable gate arrays (FPGAs), or a combination of at least two of these integrated circuit forms.
For another example, when a module in the system is implemented by a processing element scheduling a program, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking a program. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Various objects, such as messages, information, devices, network elements, systems, apparatuses, actions, operations, processes and concepts, may be named in the present application. It should be understood that these specific names do not limit the related objects, and the names may change with the scenario, context or usage habit; the technical meaning of technical terms in the present application should be determined mainly from the functions and technical effects they embody or perform in the technical solution.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
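Pulling the source-end steps described above together (speech recognition, check code generation, packaging, compression, sending), a hedged end-to-end sketch is given below; recognize_speech and send are hypothetical placeholders for the recognition and transport interfaces, which the patent does not name.

```python
import hashlib
import json
import zlib


def build_and_send(audio: bytes, modes: dict, recognize_speech, send) -> None:
    """For each receiving end, package the audio according to its mode, compress, and send."""
    for rid, mode in modes.items():
        if mode == "text":
            text = recognize_speech(audio)                      # voice intelligent recognition step
            check_code = hashlib.md5(text.encode("utf-8")).hexdigest()
            package = {"type": "text", "text": text, "check_code": check_code}
        else:
            package = {"type": "audio", "audio": audio.hex()}   # audio data passed through unchanged
        send(rid, zlib.compress(json.dumps(package).encode("utf-8")))
```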
Claims (8)
1. The conference audio processing method based on voice intelligence is characterized by being applied to a source terminal, wherein the source terminal characterizes the terminal device where a speaker is located in an audio-video conference, and the method comprises the following steps:
In an audio-video conference, acquiring the first network transmission quality of a source terminal and the second network transmission quality of each receiving terminal;
Determining a transmission mode for each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference, wherein for each receiving end, the transmission mode characterizes a transmission mode of data between the source end and the receiving end;
acquiring audio data of a speaker, determining target data corresponding to each receiving end based on the audio data and transmission modes of the receiving ends, wherein for each receiving end, the target data represents data obtained by processing the audio data through a transmission mode of the receiving end;
Transmitting the target data corresponding to each receiving terminal to a corresponding receiving terminal;
The transmission mode is any one of an audio transmission mode and a text transmission mode, and the determining the transmission mode for each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference includes:
based on the conference type of the audio-video conference, determining a network quality requirement corresponding to the conference type;
When the first network transmission quality does not meet the network quality requirement, determining that the transmission mode of the source end to each receiving end is a text transmission mode;
When the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, determining that a transmission mode between the source end and the receiving end which does not meet the network quality requirement is a text transmission mode;
when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end meets the preset network quality requirement, determining that the transmission mode of the source end to the receiving end meeting the preset network quality requirement is an audio transmission mode.
2. The conference audio processing method based on voice intelligence according to claim 1, wherein the determining the target data corresponding to each receiving terminal based on the audio data and the transmission mode of each receiving terminal comprises:
when the transmission mode is a text transmission mode, converting the audio data into text data based on a preset voice intelligent recognition technology;
Checking the text data, generating a check code corresponding to the text data, packaging the check code and the text data, and taking the packaged check code and text data as target data corresponding to the receiving end;
when the transmission mode is an audio transmission mode, the audio data is used as target data corresponding to the receiving end.
3. The conference audio processing method based on voice intelligence according to claim 1, wherein after determining that the transmission mode of the source terminal to each of the receiving terminals is a text transmission mode when the first network transmission quality does not meet the network quality requirement, further comprising:
closing an audio transmission channel of the source end;
after determining that the transmission mode between the source end and the receiving end which does not meet the network quality requirement is the text transmission mode when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, the method further comprises:
and marking the audio transmission path of the receiving end which does not meet the network quality requirement as an unavailable state.
4. The conference audio processing method based on voice intelligence according to claim 2, wherein the sending the target data corresponding to each receiving terminal to the corresponding receiving terminal includes:
compressing the target data corresponding to each receiving terminal based on a preset compression algorithm, and sending each compressed target data to a corresponding receiving terminal.
5. The conference audio processing method based on voice intelligence according to claim 1, further comprising, after sending the target data corresponding to each of the receiving terminals to the corresponding receiving terminal:
and receiving the playing completion state identifiers sent by the receiving ends, and displaying the playing completion state identifiers.
6. A conference audio processing method based on voice intelligence, which is applied to a receiving end, wherein the receiving end characterizes a terminal device that receives the audio data of the terminal device where a speaker is located in an audio-video conference, and the method comprises:
In an audio-video conference, acquiring target data sent by a source terminal based on audio data and a transmission mode, wherein the source terminal represents terminal equipment where a speaker is located in the audio-video conference, the transmission mode is a transmission mode of data between the source terminal and a receiving terminal, and the target data represents data obtained by processing the audio data through the transmission mode of the receiving terminal;
preprocessing the target data to obtain processed voice data;
playing the voice data, and after the voice data is played, sending a playing completion state identifier to a source end;
The preprocessing the target data to obtain processed voice data comprises the following steps:
when the target data are packaged check codes and text data, checking the text data based on a preset check rule and the check codes;
if the text data passes the verification, converting the text data into voice data based on a preset conversion rule;
The converting the text data into voice data based on the preset conversion rule includes:
acquiring the identification information and the audio feature information of the source terminal, wherein the audio feature information characterizes the voice characteristics of a speaker of the source terminal in an audio-video conference;
determining a voice characteristic value corresponding to the source end in the characteristic value library based on the identification information;
comparing the audio characteristic information with the voice characteristic value;
if the audio characteristic information is consistent with the voice characteristic value, the voice characteristic value is used as an input parameter of a preset text conversion model, and the text data is converted into voice data through a voice synthesis technology.
7. The conference audio processing system based on voice intelligence is characterized by being applied to a source end, wherein the source end characterizes terminal equipment where a speaker is in an audio-video conference, and the system comprises:
the network quality identification module is used for acquiring the first network transmission quality of the source terminal and the second network transmission quality of each receiving terminal in the audio-video conference;
The mode management module is used for determining a transmission mode of each receiving end based on the first network transmission quality, each second network transmission quality and the conference type of the audio-video conference, wherein for each receiving end, the transmission mode represents a transmission mode of data between the source end and the receiving end, and the transmission mode is any one of an audio transmission mode and a text transmission mode;
The first data receiving and processing module is used for acquiring audio data of a speaker and determining target data corresponding to each receiving end based on the audio data and the transmission modes of the receiving ends, wherein for each receiving end, the target data represents data obtained by processing the audio data through the transmission mode of the receiving end;
The transmission module is used for transmitting the target data corresponding to each receiving terminal to the corresponding receiving terminal;
the mode management module is specifically configured to:
based on the conference type of the audio-video conference, determining a network quality requirement corresponding to the conference type;
When the first network transmission quality does not meet the network quality requirement, determining that the transmission mode of the source end to each receiving end is a text transmission mode;
When the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end does not meet the network quality requirement, determining that a transmission mode between the source end and the receiving end which does not meet the network quality requirement is a text transmission mode;
when the first network transmission quality meets the network quality requirement and the second network transmission quality of the receiving end meets the preset network quality requirement, determining that the transmission mode of the source end to the receiving end meeting the preset network quality requirement is an audio transmission mode.
8. A conference audio processing system based on speech intelligence, applied to a receiving end, the receiving end characterizing a terminal device that receives the audio data of the terminal device where a speaker is located in an audio-video conference, the system comprising:
The second data receiving and processing module is used for acquiring target data sent by a source terminal based on audio data and a transmission mode in an audio-video conference, wherein the source terminal represents terminal equipment where a speaker is located in the audio-video conference, the transmission mode is a transmission mode of data between the source terminal and a receiving terminal, and the target data represents data obtained by processing the audio data through the transmission mode of the receiving terminal;
The preprocessing module is used for preprocessing the target data to obtain processed voice data;
The voice data playing module is used for playing the voice data and sending a playing completion state identifier to the source end after the playing of the voice data is completed; and the preprocessing module is specifically used for:
when the target data are packaged check codes and text data, checking the text data based on a preset check rule and the check codes;
if the text data passes the verification, converting the text data into voice data based on a preset conversion rule;
The converting the text data into voice data based on the preset conversion rule includes:
acquiring the identification information and the audio feature information of the source terminal, wherein the audio feature information characterizes the voice characteristics of a speaker of the source terminal in an audio-video conference;
determining a voice characteristic value corresponding to the source end in the characteristic value library based on the identification information;
comparing the audio characteristic information with the voice characteristic value;
if the audio characteristic information is consistent with the voice characteristic value, the voice characteristic value is used as an input parameter of a preset text conversion model, and the text data is converted into voice data through a voice synthesis technology.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410594348.1A CN118487712B (en) | 2024-05-14 | 2024-05-14 | Conference audio processing method and system based on voice intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410594348.1A CN118487712B (en) | 2024-05-14 | 2024-05-14 | Conference audio processing method and system based on voice intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118487712A CN118487712A (en) | 2024-08-13 |
CN118487712B true CN118487712B (en) | 2024-12-17 |
Family
ID=92187221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410594348.1A Active CN118487712B (en) | 2024-05-14 | 2024-05-14 | Conference audio processing method and system based on voice intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118487712B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103888713A (en) * | 2014-02-25 | 2014-06-25 | 广州市保伦电子有限公司 | Video conference communication method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10938725B2 (en) * | 2018-09-27 | 2021-03-02 | Farm & Home Cooperative | Load balancing multimedia conferencing system, device, and methods |
CN112672099B (en) * | 2020-12-31 | 2023-11-17 | 深圳市潮流网络技术有限公司 | Subtitle data generating and presenting method, device, computing equipment and storage medium |
CN116312579A (en) * | 2022-09-07 | 2023-06-23 | 阿里巴巴(中国)有限公司 | Audio data processing method, storage medium and electronic device |
CN117294805A (en) * | 2023-10-25 | 2023-12-26 | 好信云(北京)网络通信有限公司 | Video conference cloud recording method and device, electronic equipment and storage medium |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103888713A (en) * | 2014-02-25 | 2014-06-25 | 广州市保伦电子有限公司 | Video conference communication method |
Also Published As
Publication number | Publication date |
---|---|
CN118487712A (en) | 2024-08-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||