CN110288077B

CN110288077B - Method and related device for synthesizing speaking expression based on artificial intelligence

Info

Publication number: CN110288077B
Application number: CN201910745062.8A
Authority: CN
Inventors: 李广之; 陀得意; 康世胤
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2022-12-16
Anticipated expiration: 2038-11-14
Also published as: CN109447234A; CN109447234B; CN110288077A

Abstract

The embodiment of the application discloses a method and a related device for synthesizing speaking expressions based on artificial intelligence, which at least relate to a plurality of technologies in artificial intelligence, determine text characteristics corresponding to text contents and duration of pronunciation elements identified by the text characteristics aiming at the text contents sent by a terminal, and obtain target expression characteristics corresponding to the text characteristics and the duration of the identified pronunciation elements through an expression model; and returning the target expression characteristics to the terminal. The expression model can determine different sub-expression characteristics for the same pronunciation element with different durations in the text characteristics, the change patterns of the speaking expression are increased, the speaking expression generated according to the target expression characteristics determined by the expression model is matched with the expression of a speaker, and the speaking expression has different change patterns for the same pronunciation element, so that the excessive unnatural situation of the change of the speaking expression is improved, and the immersion of a user is improved.

Description

Method and related device for synthesizing speaking expression based on artificial intelligence

The application provides divisional application for Chinese patent application with application number of 201811354206.9, application date of 2018, 11 and 14 months, and invention name of 'a model training method, a method for synthesizing speaking expression and a related device'.

Technical Field

The present application relates to the field of data processing, and in particular, to a method and related apparatus for synthesizing speech expressions based on artificial intelligence.

Background

With the development of computer technology, human-computer interaction is common, but is mostly simple voice interaction, for example, the interactive device may determine reply content according to characters or voice input by a user, and play virtual sound synthesized according to the reply content.

The user immersion caused by the man-machine interaction of the type is difficult to meet the interaction requirements of the current user, and in order to improve the user immersion, a virtual object with expression change capability, such as a mouth shape, can be changed and is produced as an interaction object with the user. The virtual object can be a cartoon, a virtual human and other virtual images, when the human-computer interaction is carried out with a user, not only can the virtual sound for the interaction be played, but also the corresponding expression can be displayed according to the virtual sound, and the feeling that the virtual object sends the virtual voice is provided for the user.

At present, the expression of the virtual object is mainly determined according to the currently played pronunciation element, so that the expression change style of the virtual object is limited when the virtual voice is played, the expression change is excessive and unnatural, the feeling provided for a user is not good, and the effect of improving the immersion feeling of the user is difficult to play.

Disclosure of Invention

In order to solve the technical problems, the application provides a model training method for synthesizing speaking expression, a method for synthesizing speaking expression and a related device, the change patterns of the speaking expression are increased, the speaking expression generated according to the target expression characteristics determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, and thus the excessive unnatural situation of the change of the speaking expression is improved to a certain extent

The embodiment of the application discloses the following technical scheme:

in a first aspect, an embodiment of the present application provides a model training method for synthesizing spoken expressions, including:

acquiring a video containing facial action expressions and corresponding voices of a speaker;

obtaining the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice according to the video; the acoustic feature comprises a plurality of sub-acoustic features;

determining the time interval and the duration of the pronunciation element identified by the text feature according to the text feature and the acoustic feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval of a sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature in the video, and the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element;

determining a first corresponding relation according to the time interval and the time length of the pronunciation element identified by the text feature and the expression feature, wherein the first corresponding relation is used for showing the corresponding relation between the time length of the pronunciation element and the corresponding sub-expression feature of the time interval of the pronunciation element in the expression feature;

training an expression model according to the first corresponding relation; the expression model is used for determining corresponding target expression characteristics according to the undetermined text characteristics and the duration of the pronunciation elements identified by the undetermined text characteristics.

In a second aspect, an embodiment of the present application provides a model training apparatus for synthesizing spoken expressions, where the apparatus includes an obtaining unit, a first determining unit, a second determining unit, and a first training unit:

the acquisition unit is used for acquiring a video containing the facial action expression and the corresponding voice of the speaker;

the acquisition unit is further used for acquiring the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice according to the video; the acoustic feature comprises a plurality of sub-acoustic features;

the first determining unit is used for determining the time interval and the duration of the pronunciation element identified by the text feature according to the text feature and the acoustic feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval of a sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature in the video, and the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element;

the second determining unit is configured to determine a first corresponding relationship according to the time interval and the duration of the pronunciation element identified by the text feature and the expression feature, where the first corresponding relationship is used to represent a corresponding relationship between the duration of the pronunciation element and a sub-expression feature corresponding to the time interval of the pronunciation element in the expression feature;

the first training unit is used for training an expression model according to the first corresponding relation; the expression model is used for determining corresponding target expression characteristics according to the undetermined text characteristics and the duration of the pronunciation elements identified by the undetermined text characteristics.

In a third aspect, an embodiment of the present application provides a model training device for synthesizing spoken expressions, where the device includes a processor and a memory:

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute the model training method for synthesizing spoken expressions according to any one of the first aspect according to instructions in the program code.

In a fourth aspect, an embodiment of the present application provides a method for synthesizing a speaking expression, where the method includes:

determining text features corresponding to text content and duration of pronunciation elements identified by the text features; the text features comprise a plurality of sub-text features;

obtaining target expression characteristics corresponding to the text content through the text characteristics, the duration of the identified pronunciation element and the expression model; the target expression features comprise a plurality of sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features, the sub-expression features corresponding to the target pronunciation element are determined according to the sub-expression features corresponding to the target pronunciation element in the text features and the duration of the target pronunciation element.

In a fifth aspect, an embodiment of the present application provides an apparatus for synthesizing a speaking expression, where the apparatus includes a determining unit and a first obtaining unit:

the determining unit is used for determining text features corresponding to the text content and the duration of the pronunciation elements identified by the text features; the text features comprise a plurality of sub-text features;

the first obtaining unit is configured to obtain a target expression feature corresponding to the text content through the text feature, the duration of the identified pronunciation element, and an expression model; the target expression features comprise a plurality of sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features, the sub-expression features corresponding to the target pronunciation element are determined according to the sub-expression features corresponding to the target pronunciation element in the text features and the duration of the target pronunciation element.

In a sixth aspect, an embodiment of the present application provides an apparatus for synthesizing speaking expressions, where the apparatus includes a processor and a memory:

the processor is configured to execute the method for synthesizing a spoken expression according to any one of the fourth aspects according to instructions in the program code.

In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium for storing a program code for executing the model training method for synthesizing a spoken expression according to any one of the first aspect or the method for synthesizing a spoken expression according to any one of the fourth aspect.

According to the technical scheme, in order to determine the speaking expressions with various changes and excessive and natural expressions for the virtual object, the embodiment of the application provides a brand-new expression model training mode, and the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice are obtained according to the video containing the facial action expressions and the corresponding voice of the speaker. Because the acoustic features and the text features are obtained according to the same video, the time interval and the duration of the pronunciation elements identified by the text features can be determined according to the acoustic features. And determining a first corresponding relation according to the time interval and the time length of the pronunciation element identified by the text feature and the expression feature, wherein the first corresponding relation is used for showing the corresponding relation between the time length of the pronunciation element and the time interval of the pronunciation element in the sub-expression features corresponding to the expression feature.

For a target pronunciation element in the identified pronunciation elements, the sub-expression characteristics in the time interval can be determined from the expression characteristics through the time interval of the target pronunciation element, and the duration of the target pronunciation element can embody different durations of the target pronunciation element in various expression sentences of the video voice, so the determined sub-expression characteristics can embody the possible expression of the target pronunciation element spoken by a speaker in different expression sentences. Therefore, according to the expression model obtained by training the first corresponding relation, aiming at the text feature of the expression feature to be determined, the expression model can determine different sub-expression features for the same pronunciation element with different durations in the text feature, so that the change pattern of the speaking expression is increased, and the speaking expression generated according to the target expression feature determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, so that the excessive unnatural condition of the speaking expression change is improved to a certain extent.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

Fig. 1 is a schematic view of an application scenario of an expression model training method provided in an embodiment of the present application;

FIG. 2 is a flowchart of a model training method for synthesizing spoken expressions according to an embodiment of the present disclosure;

fig. 3 is a flowchart of an acoustic model training method provided in an embodiment of the present application;

fig. 4 is a schematic view of an application scenario of a method for synthesizing speaking expressions according to an embodiment of the present application;

FIG. 5 is a flowchart of a method for synthesizing spoken expressions according to an embodiment of the present disclosure;

fig. 6 is a schematic architecture diagram of a method for generating a visual speech synthesis in human-computer interaction according to an embodiment of the present application;

FIG. 7a is a block diagram of a device for training a synthesized speech expression model according to an embodiment of the present disclosure;

FIG. 7b is a block diagram of a device for training a synthesized speech expression model according to an embodiment of the present disclosure;

FIG. 7c is a block diagram of a device for training a synthesized speech expression model according to an embodiment of the present disclosure;

FIG. 8a is a block diagram of an apparatus for synthesizing spoken expressions according to an embodiment of the present application;

FIG. 8b is a block diagram of an apparatus for synthesizing spoken expressions according to an embodiment of the present disclosure;

fig. 9 is a block diagram of a server according to an embodiment of the present application;

fig. 10 is a structural diagram of a terminal device according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, when a virtual object and a user perform human-computer interaction, which speaking expression the virtual object makes is mainly determined according to a currently played pronunciation element, for example, a corresponding relation between the pronunciation element and the expression is established, generally, one pronunciation element corresponds to one speaking expression, and when a certain pronunciation element is played, the virtual object is enabled to make the speaking expression corresponding to the pronunciation element.

In order to solve the above technical problem, an embodiment of the present application provides a brand-new expression model training mode, and when performing expression model training, a text feature, a duration of a pronunciation element identified by the text feature, and a sub-expression feature corresponding to a time interval of the pronunciation element in an expression feature are used as training samples, so that an expression model is obtained by training according to a correspondence between the duration of the pronunciation element and the time interval of the pronunciation element in the sub-expression feature corresponding to the expression feature.

The method for synthesizing the speaking expression and the corresponding model training method for synthesizing the speaking expression provided by the embodiment of the application can be realized based on Artificial Intelligence (AI), which is a theory, a method, a technology and an application system for simulating, extending and expanding the Intelligence of a human, sensing the environment, acquiring knowledge and using the knowledge to acquire the best result by using a digital computer or a machine controlled by the digital computer. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject, and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the computer vision technology, the speech processing technology, the natural language processing technology, the deep learning and other directions.

For example, the present invention may relate to Image Processing (Image Processing), image Semantic Understanding (ISU), video Processing (video Processing), video Semantic Understanding (VSU), three-dimensional object reconstruction (3D object reconstruction), face recognition (face recognition), and the like in Computer Vision (Computer Vision).

For example, the present invention may relate to a Speech recognition Technology in Speech Technology (Speech Technology), which includes Speech signal preprocessing (Speech signal preprocessing), speech signal frequency domain analysis (Speech signal analysis), speech signal feature extraction (Speech signal feature extraction), speech signal feature matching/recognition (Speech signal feature matching/recognition), training of Speech (Speech training), and the like.

For example, text preprocessing (Text preprocessing) and Semantic understanding (Semantic understanding) in Natural Language Processing (NLP) may be involved, which includes words, word/sentence segmentation (word segmentation), word tagging (word tagging), sentence classification (word/sense classification), and the like.

For example, deep Learning (Deep Learning) in Machine Learning (ML) may be involved, including various types of artificial neural networks (artificial neural networks).

In order to facilitate understanding of the technical solution of the present application, the expression model training method provided in the embodiment of the present application is introduced below with reference to an actual application scenario.

The model training method provided by the application can be applied to data processing equipment, such as terminal equipment and a server, which has the capability of processing videos containing speeches spoken by speakers. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like; the server may be specifically an independent server or a cluster server.

The data processing equipment can have the capability of implementing a computer vision technology, wherein the computer vision is a science for researching how to enable a machine to see, and in particular, the computer vision is used for replacing human eyes to identify, track and measure a target and the like, and further performing graphic processing, so that the computer processing becomes an image which is more suitable for the human eyes to observe or is transmitted to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. The computer vision technology generally includes technologies such as image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, map construction and the like, and also includes common biometric technologies such as face recognition, fingerprint recognition and the like.

In the embodiment of the application, the data processing device can acquire various information such as the expression characteristics and the corresponding duration of the speaker from the video through a computer vision technology.

The data processing device may have the capability to implement automatic speech recognition techniques (ASR) in speech technology, voiceprint recognition techniques, and the like. The voice technology enables the data processing equipment to listen, see and feel, and is the development direction of future human-computer interaction, wherein voice becomes one of the best viewed human-computer interaction modes in the future.

In the embodiment of the present application, the data processing device may perform voice recognition on the obtained video by implementing the voice technology, so as to obtain various information such as acoustic features, corresponding pronunciation elements, corresponding duration time, and the like of a speaker in the video.

The data processing device may also have the capability to perform natural language processing, which is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between a person and a computer using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, and the like.

In the embodiment of the present application, the data processing device can determine the text feature of the voice from the video by implementing the NLP technology.

The data processing apparatus may be provided with Machine Learning (ML) capability. ML is a multi-field cross discipline and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks.

In the embodiment of the application, the model training method for synthesizing the speaking expression mainly relates to the application of various artificial neural networks, for example, the expression model is trained through the first corresponding relation.

Referring to fig. 1, fig. 1 is a schematic view of an application scenario of an expression model training method provided in an embodiment of the present application, where the application scenario includes a server 101, and the server 101 may acquire a video including facial motion expressions and corresponding voices of a speaker, where the video may be one or multiple videos. The language corresponding to the characters included in the voice in the video can be various languages such as Chinese, english, korean and the like.

The server 101 can obtain the expressive features of the speaker, the acoustic features of the voice and the text features of the voice according to the obtained video. The expression feature of the speaker may represent a facial motion expression of the speaker when speaking voice in the video, for example, the expression feature may include a mouth shape feature, an eye motion, and the like, and a video viewer may feel that the voice in the video is what the speaker utters through the expression feature of the speaker. The acoustic features of the speech may include sound waves of the speech. The text feature of the speech is used to identify a pronunciation element corresponding to the text content, and it should be noted that the pronunciation element in the embodiment of the present application may be a pronunciation corresponding to a character included in the speech for a speaker to speak.

It should be noted that, in this embodiment, the expressive feature, the acoustic feature and the text feature can all be represented in the form of a feature vector.

Since both the acoustic feature and the text feature are obtained from the same video, the server 101 may determine the time interval and duration of the pronunciation element identified by the text feature according to the text feature and the acoustic feature. The time interval is an interval between a start time and an end time corresponding to the sub-acoustic feature corresponding to the pronunciation element in the video, and the duration is a duration of the sub-acoustic feature corresponding to the pronunciation element, and may be, for example, a difference between the end time and the start time. One sub-acoustic feature is a part of the acoustic feature corresponding to one pronunciation element, and the acoustic feature may include a plurality of sub-acoustic features.

Then, the server 101 determines a first corresponding relationship according to the time interval and the duration of the pronunciation element identified by the text feature and the expression feature, where the first corresponding relationship is used to represent a corresponding relationship between the duration of the pronunciation element and the sub-expression feature corresponding to the time interval of the pronunciation element in the expression feature. One sub-expression feature is a part of expression features corresponding to one pronunciation element, and the expression features can comprise a plurality of sub-expression features.

For any pronunciation element in the identified pronunciation elements, such as the target pronunciation element, the time interval of the target pronunciation element is the time interval of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature in the video, the acoustic feature, the text feature and the expression feature are all obtained according to the same video and correspond to the same time axis, so that the sub-expression feature in the time interval can be determined from the expression feature through the time interval of the target pronunciation element. The duration of the target pronunciation element is the duration of the sub-acoustic features corresponding to the target pronunciation element, and different durations of the target pronunciation element in various expression sentences of the video voice can be embodied, so that the determined sub-expression features can embody possible speaking expressions of the target pronunciation element spoken by a speaker in different expression sentences.

Taking the example that the voice spoken by the speaker is 'do you have a meal', and the duration of the video containing the voice is 2s, wherein the text feature is used for identifying the pronunciation elements of the characters 'do you have a meal', the pronunciation elements identified by the text feature comprise 'ni' chi 'fan' le 'ma', the expression feature represents the speaking expression of the speaker when the speaker speaks the voice 'do you have a meal', and the acoustic feature is the sound wave sent when the speaker speaks the voice 'do you have a meal'. The target pronunciation element is any pronunciation element in 'ni' chi 'fan' le 'ma', if the target pronunciation element is 'ni', the time interval of 'ni' is the interval between 0s and 0.1s, the duration of 'ni' is 0.1s, and the sub-expression feature corresponding to 'ni' is a part of expression feature corresponding to the speaker speaking the voice in the interval between 0s and 0.1s in the video, for example, the sub-expression feature A can be used. When determining the first corresponding relationship, the server 101 may determine the sub-expression feature a corresponding to the time interval 0s and 0.1s of the pronunciation element "ni" identified by the text feature, so as to determine the corresponding relationship between the time duration 0.1s of the pronunciation element "ni" and the sub-expression feature a corresponding to the pronunciation element "ni" in the time interval 0s and 0.1s, where the first corresponding relationship includes the corresponding relationship between the time duration 0.1s of the pronunciation element "ni" and the sub-expression feature a corresponding to the pronunciation element "ni" in the time interval 0s and 0.1s.

The server 101 trains an expression model according to the first corresponding relation, wherein the expression model is used for determining a corresponding target expression characteristic according to the undetermined text characteristic and the duration of the pronunciation element identified by the undetermined text characteristic.

It can be understood that the present implementation focuses on the pronunciation element, and does not focus on what the character corresponding to the pronunciation element is. The same sentence in the speech spoken by the speaker may include different characters, but the different characters may correspond to the same pronunciation element, so that the same pronunciation element is located in different time intervals, and may have different durations, thereby corresponding to different sub-expression characteristics.

For example, the speech spoken by the speaker includes "and" you speak secret ", the pronunciation element of the text feature identifier corresponding to both the characters" secret "and" secret "is" mi ", the time interval of the pronunciation element" mi "of the text feature identifier corresponding to the character" secret "is the interval between 0.4s and 0.6s, the time duration is 0.2s, the time interval of the pronunciation element" mi "of the text feature identifier corresponding to the character" secret "is the interval between 0.6s and 0.7s, and the time duration is 0.1s. As can be seen, the text features corresponding to different characters "secret" and "secret" identify the same pronunciation element "mi", but the corresponding durations are different, and therefore, the pronunciation element "mi" corresponds to different sub-expression features.

In addition, according to different expression modes of the speaker during speaking, different sentences in the speech spoken by the speaker may include the same characters, and pronunciation elements of the text feature identifiers corresponding to the same characters may have different durations, so that the same pronunciation element corresponds to different sub-expression features.

For example, the voice spoken by the speaker is "hello", and the duration of the pronunciation element "ni" of the text feature identifier corresponding to the character "you" is 0.1s, but in another voice "you i he" spoken by the speaker, the duration of the pronunciation element "ni" of the text feature identifier corresponding to the character "you" may be 0.3s, and at this time, pronunciation elements of the text feature identifiers corresponding to the same character have different durations, so that the same pronunciation element may correspond to different sub-expression features.

Because one pronunciation element corresponds to different durations, the sub-expression characteristics corresponding to the pronunciation elements with different durations are different, and the first corresponding relation can embody the corresponding relation between the durations of the different pronunciation elements of one pronunciation element and the sub-expression characteristics, in this way, when the sub-expression characteristics are determined by using the expression model obtained by training according to the first corresponding relation, aiming at the text characteristics of the expression characteristics to be determined, the expression model can determine different sub-expression characteristics for the same pronunciation element with different durations in the text characteristics, and the variation pattern of the speaking expression is increased. In addition, the speaking expression generated according to the target expression characteristics determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, so that the excessive unnatural situation of the change of the speaking expression is improved to a certain extent.

It can be understood that, in order to solve the technical problems existing in the conventional manner, increase the change style of the speaking expression, and improve the excessively unnatural situation of the change of the speaking expression, the embodiment of the present application provides a new expression model training method, and the expression model is used to generate the speaking expression corresponding to the text content. Next, a model training method for synthesizing speaking expressions and a method for synthesizing speaking expressions provided by the embodiments of the present application will be described with reference to the drawings.

First, a model training method for synthesizing a speaking expression is described. Referring to fig. 2, fig. 2 shows a flow chart of a model training method for synthesizing talking expressions, the method comprising:

s201, obtaining a video containing the facial action expression and the corresponding voice of the speaker.

The video containing the facial action expression and the corresponding voice can be obtained by recording the voice spoken by the speaker in the recording environment with the camera and recording the facial action expression of the speaker through the camera.

S202, obtaining the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice according to the video.

The expression features can be obtained by performing feature extraction on facial action expressions in the video, the acoustic features can be obtained by performing feature extraction on the voice spoken by a speaker in the video, the text features can be obtained by performing feature extraction on the text corresponding to the voice spoken by the speaker in the video, and the expression features, the acoustic features and the text features are obtained according to the same video and have the same time axis.

The expression features can identify the characteristics of facial action expressions when a speaker speaks voice, and a video viewer can see which pronunciation elements are uttered by the speaker through the expression features.

S203, determining the time interval and the duration of the pronunciation element identified by the text feature according to the text feature and the acoustic feature.

In this embodiment, the pronunciation element may be, for example, a syllable corresponding to a character included in the speech spoken by the speaker, and the character may be a basic semantic unit in different languages, for example, in chinese, the character may be a chinese character, and the pronunciation element corresponding to the chinese character may be a pinyin syllable; in English, a character may be a word and the pronunciation element corresponding to the word may be a corresponding phonetic symbol or phonetic symbol combination. For example, when the character is a chinese character, the pronunciation element corresponding to the character may be a pinyin syllable, for example, the character is a chinese character "you", and the pronunciation element corresponding to the character may be a pinyin syllable "ni"; when the character is english, the pronunciation element corresponding to the character can be an english syllable, for example, the character is an english word "ball", and the pronunciation element corresponding to the character can be an english syllable

Of course, the pronunciation element can also beThe minimum pronunciation unit included in the pinyin syllable corresponding to the character, for example, the character included in the speech spoken by the speaker is "you", and the pronunciation element may include two pronunciation elements "n" and "i".

In some cases, the pronunciation elements may also be differentiated based on tone, and thus may also include tones. For example, in chinese, a speaker speaks that the speech includes the character "you are ni", wherein the pinyin syllables of the characters "you" and "ni" are both "ni", but the tone of "you" is three tones, and the tone of "ni" is one tone, so that the pronunciation elements corresponding to "you" include "ni" and three tones, the pronunciation elements corresponding to "ni" include "ni" and one tone, and the pronunciation elements corresponding to "you" and "ni" are distinguished according to the tone. When in use, the appropriate pronunciation elements can be determined according to requirements.

It should be noted that the language of the character and the language of the pronunciation element corresponding to the character may be different languages besides the above possible manners, and the language type of the character is not limited herein. For convenience of description, in the embodiments of the present application, the description will be mainly given by taking characters as chinese characters, and pronunciation elements corresponding to the characters as pinyin syllables.

S204, determining a first corresponding relation according to the time interval and the duration of the pronunciation element identified by the text feature and the expression feature.

Because the acoustic feature and the text feature are obtained according to the same video, and the acoustic feature comprises time information, the time interval and the time length of the pronunciation element identified by the text feature can be determined according to the acoustic feature. The time interval of the target pronunciation element is the time interval of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature in the video, the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element, and the target pronunciation element is any pronunciation element identified by the text feature.

The first corresponding relation is used for showing the corresponding relation between the duration of the pronunciation element and the corresponding sub-expression characteristics of the time interval of the pronunciation element in the expression characteristics.

When the first corresponding relation is determined, the sub-expression characteristics corresponding to the pronunciation element in the time interval can be determined from the expression characteristics through the time interval of the pronunciation element, so that the corresponding relation between the duration of the pronunciation element and the sub-expression characteristics corresponding to the time interval of the pronunciation element in the expression characteristics is determined.

It can be understood that the durations corresponding to the same pronunciation element may not only be different, but also be the same, and not only the durations corresponding to the same pronunciation element are different, but also the sub-expressive features corresponding to the same pronunciation element may be different, and even when the durations corresponding to the same pronunciation element are the same, the same pronunciation element having the same duration may have different sub-expressive features due to different moods for speaking voices, different expression habits, and the like.

For example, when the speaker utters the voice "ni" with excited tone and utters the voice "ni" with angry tone, even if the durations of the pronunciation elements identified by the text features are the same, the sub-expression features corresponding to the same pronunciation element may be different due to the speaker's tone.

S205, training an expression model according to the first corresponding relation.

The trained expression model can determine a target expression characteristic corresponding to the identified pronunciation element for the undetermined text characteristic, wherein the text content corresponding to the undetermined text characteristic is the text content needing to synthesize the speaking expression or needing to further generate the virtual sound. And taking the duration of the pronunciation element identified by the undetermined text characteristic and the undetermined text characteristic as the input of the expression model, and taking the target expression characteristic as the output of the expression model.

The training data used for training the expression model is a first corresponding relation, and in the first corresponding relation, the same pronunciation element with the same time length or different time lengths corresponds to different sub-expression characteristics, so that when the expression model obtained through subsequent training is used for determining the target expression characteristics, the situation similar to the training data can be obtained after the time lengths of the pronunciation elements marked by the undetermined text characteristics and the undetermined text characteristics are input into the expression model, namely the target expression characteristics obtained after the same pronunciation element with different time lengths is input into the expression model are possibly different, and different target expression characteristics can be obtained even after the same pronunciation element with the same time length is input into the expression model.

It should be noted that, in this embodiment, the same pronunciation element may have different durations, the same pronunciation element having different durations has different expression features, and even if the same duration also has different expression features, which duration the pronunciation element corresponds to and which expression the determined duration of the pronunciation element corresponds to can be accurately determined through the context information corresponding to the pronunciation element.

When a person pronounces normally, the characteristics of the same pronunciation element may be different in different contexts, for example, the duration of the pronunciation element is different, and then the same pronunciation element may have different sub-expression characteristics, that is, which duration a pronunciation element corresponds to, and which expression characteristic the pronunciation element with duration corresponds to, are related to the context of the pronunciation element. Therefore, in an implementation manner, the text features used for training the expression model may also be used for identifying the pronunciation element in the speech and the context information corresponding to the pronunciation element. Therefore, when the expression features are determined by using the expression model obtained through training, the duration of the pronunciation elements and the corresponding sub-expression features can be accurately determined according to the context information.

For example, the speaker utters "you are ni", and the pronunciation element identified by the corresponding text feature includes "ni 'shi' ni", where the pronunciation element "ni" appears three times, the duration of the first pronunciation element "ni" is 0.1s, the duration of the first pronunciation element "ni" corresponds to the sub-form feature a, the duration of the first pronunciation element "ni" is 0.2s, the duration of the third pronunciation element "ni" corresponds to the sub-form feature B, and the duration of the third pronunciation element "ni" is 0.1s, and the duration of the third pronunciation element "ni" corresponds to the sub-form feature C. If the text feature can also identify the pronunciation element in the voice and the context information corresponding to the pronunciation element, where the context information of the first pronunciation element "ni" is context information 1, the context information of the second pronunciation element "ni" is context information 2, and the context information of the third pronunciation element "ni" is context information 3, then when the expression feature is determined by using the expression model obtained through training, the duration of the pronunciation element "ni" can be accurately determined to be 0.2s according to the context information 1, the corresponding sub-expression feature is sub-expression feature a, and so on.

The context information can reflect the expression mode of a person in normal pronunciation, and the duration of the pronunciation element and the corresponding sub-expression characteristics are accurately determined through the context information, so that when the expression model obtained through training according to the first corresponding relation is used for determining the target expression characteristics of the pronunciation element sent by the virtual object, the expression mode of the virtual object can be more suitable for the expression of the person. In addition, the context information can inform that under the condition of sending the last pronunciation element, the speaker sends out what the sub-expression characteristics corresponding to the current pronunciation element are, so that the connection between the sub-expression characteristics corresponding to the current pronunciation element and the sub-expression characteristics corresponding to the context pronunciation element is related, and the excessive fluency degree of the later-period generated speaking expression is improved.

According to the technical scheme, in order to determine the speaking expressions with various changes and excessive and natural expressions for the virtual object, the embodiment of the application provides a brand-new expression model training mode, and the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice are obtained according to the video containing the facial action expressions and the corresponding voice of the speaker. Because the acoustic feature and the text feature are obtained according to the same video, the time interval and the duration of the pronunciation element identified by the text feature can be determined according to the acoustic feature.

For a target pronunciation element in the identified pronunciation elements, the sub-expression characteristics in the time interval can be determined from the expression characteristics through the time interval of the target pronunciation element, and the duration of the target pronunciation element can embody different durations of the target pronunciation element in various expression sentences of the video voice, so the determined sub-expression characteristics can embody the possible expression of the target pronunciation element spoken by a speaker in different expression sentences. Therefore, according to the expression model obtained by training the first corresponding relation, aiming at the text characteristics of the expression characteristics to be determined, the expression model can determine different sub-expression characteristics for the same pronunciation element with different durations in the text characteristics, the change pattern of the speaking expression is increased, and the speaking expression generated according to the target expression characteristics determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, so that the excessive unnatural situation of the speaking expression change is improved to a certain extent.

It can be understood that, when the expression model trained by the method provided by the embodiment corresponding to fig. 2 generates the speaking expression, the change pattern of the speaking expression can be increased, and the excessive unnatural situation of the change of the speaking expression is improved. And when the human-computer interaction is carried out, the speaking expression of the virtual object is displayed for the user, and virtual sound for interaction can be played. If the virtual sound is generated in the existing manner, a situation that the virtual sound is not matched with the speaking expression generated by the scheme provided by the embodiment of the present application may occur, in this case, the embodiment of the present application provides a new acoustic model training method, and the acoustic model trained by the manner may generate the virtual sound matched with the speaking expression, as shown in fig. 3, the method includes:

s301, determining a second corresponding relation between the pronunciation element identified by the text feature and the acoustic feature.

S302, training an acoustic model according to the second corresponding relation.

And the second corresponding relation is used for reflecting the corresponding relation between the duration of the pronunciation element and the corresponding sub-acoustic features of the pronunciation element in the acoustic features.

The trained acoustic model may determine, for the pending text feature, a target acoustic feature corresponding to the identified pronunciation element. And the undetermined text feature and the duration of the pronunciation element identified by the undetermined text feature are used as the input of the acoustic model, and the target acoustic feature is used as the output of the acoustic model.

When the second corresponding relation is determined, the pronunciation element is sent out by the speaker, the acoustic feature and the pronunciation element sent out by the speaker have a corresponding relation, so that the sub-acoustic feature corresponding to the pronunciation element in the acoustic feature can be determined according to the pronunciation element identified by the text feature, and the second corresponding relation between the pronunciation element identified by the text feature and the acoustic feature can be determined according to any acoustic feature identified by the text feature.

It can be understood that the durations corresponding to the same pronunciation element may not only be different, but also be the same, and not only when the durations corresponding to the same pronunciation element are different, the sub-acoustic features corresponding to the same pronunciation phoneme may be different, even when the durations corresponding to the same pronunciation element are the same, due to different moods and different expression modes for speaking the voice, the same pronunciation element having the same duration may have different sub-acoustic features.

The pronunciation element comprises time information, and the sub-expression characteristics corresponding to the pronunciation element in the time interval can be determined from the expression characteristics through the time interval of the pronunciation element, so that the corresponding relation between the duration of the pronunciation element and the sub-expression characteristics corresponding to the time interval of the pronunciation element in the expression characteristics is determined.

Because the training data used for training the acoustic model and the training data used for training the expression model are from the same video and correspond to the same time axis, when a speaker sends out a pronunciation element, the voice of the speaker is matched with the facial action expression of the speaker, so that the virtual voice generated according to the target acoustic characteristics determined by the acoustic model is matched with the speaking expression generated according to the target expression characteristics determined by the expression model, better feeling is provided for a user, and the user immersion is improved.

In addition, because the training data used for training the acoustic model is the second corresponding relationship, in the second corresponding relationship, the same pronunciation element with the same time length or different time lengths corresponds to different sub-acoustic features, thus, when the acoustic model obtained by training is used to determine the target acoustic feature subsequently, after the time lengths of the pronunciation elements identified by the undetermined text feature and the undetermined text feature are input into the acoustic model, the situation similar to the training data can also be obtained, namely, the target acoustic features obtained after the same pronunciation element with different time lengths is input into the acoustic model are different, and even if the same pronunciation element has the same time length, the same pronunciation element with the same time length can also be input into the acoustic model to obtain different target acoustic features.

Therefore, the acoustic model obtained by training according to the second corresponding relation can determine different sub-acoustic features for the same pronunciation element with different duration in the text feature aiming at the text feature of the acoustic feature to be determined, so that the change pattern of the virtual sound is increased.

It can be understood that when the expression model is used to determine the target expression feature corresponding to the duration of the pronunciation element identified by the pending text feature, the input of the expression model is the pending text feature and the duration of the pronunciation element identified by the pending text feature, where the duration of the pronunciation element directly determines what the determined target expression feature is. That is to say, in order to determine the target expression feature corresponding to the duration of the pronunciation element, the duration of the pronunciation element needs to be determined first, and the duration of the pronunciation element may be determined in various ways.

One way to determine the duration of a pronunciation element may be according to a duration model, and to this end, the present implementation provides a method for training a duration model, which includes training a duration model according to the text features and the durations of pronunciation elements identified by the text features.

The trained duration model may determine the durations of the phonemes it identifies for the characteristics of the pending text. And the undetermined text characteristics are used as the input of the duration model, and the duration of the pronunciation elements identified by the undetermined text characteristics is used as the output of the duration model.

Because the training data used for training the duration model and the training data used for training the expression model and the acoustic model are from the same video, the duration of the pronunciation element identified by the text feature and the text feature included in the training data used for training the duration model is the duration of the pronunciation element identified by the text feature and the text feature used for training the expression model and the acoustic model. Therefore, the duration of the pronunciation element determined by using the duration model is suitable for the expression model and the acoustic model obtained by training in the embodiment, the expression model determines the target expression characteristic according to the duration of the pronunciation element obtained by using the duration model, and the acoustic model determines the target acoustic characteristic according to the duration of the pronunciation element obtained by using the duration model, so that the target acoustic characteristic accords with the expression mode of a person during normal speaking.

Next, a method of synthesizing the speech expression will be described. The method for synthesizing the speaking expression provided by the embodiment of the application can be applied to equipment providing the function related to the synthesized speaking expression, such as terminal equipment, a server and the like, wherein the terminal equipment can be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer and the like; the server may specifically be an application server, and may also be a Web server, and when the application is deployed in practice, the server may be an independent server, and may also be a cluster server.

The method for determining the speaking expression in the voice interaction provided by the embodiment of the application can be applied to various application scenes, and the embodiment of the application takes two application scenes as an example.

The first application scenario may be a game scenario in which different users communicate with each other through a virtual object, and one user may interact with a virtual object corresponding to another user, for example, the user a communicates with the user B through the virtual object, the user a inputs text content, the user B sees a speech expression of the virtual object corresponding to the user a, and the user B interacts with the virtual object corresponding to the user a.

The second application scenario may be applied in an intelligent voice assistant, such as the intelligent voice assistant siri, where when the user uses the intelligent voice assistant siri, the intelligent voice assistant siri may also present the speaking expression of the virtual object to the user when feeding back the interaction information to the user, and the user interacts with the virtual object.

In order to facilitate understanding of the technical solution of the present application, a server is taken as an execution subject, and a method for synthesizing speaking expressions provided in the embodiments of the present application is described below with reference to an actual application scenario.

Referring to fig. 4, fig. 4 is a schematic view of an application scenario of the method for synthesizing speaking expressions according to the embodiment of the present application. The application scenario includes a terminal device 401 and a server 402, where the terminal device 401 is configured to send text content acquired by the terminal device 401 to the server 402, and the server 402 is configured to execute the method for synthesizing speaking expressions provided in the embodiment of the present application, so as to determine a target expression feature corresponding to the text content sent by the terminal device 401.

When the server 402 needs to determine the target expression feature corresponding to the text content, the server 402 first determines the text feature corresponding to the text content and the duration of the pronunciation element identified by the text feature, and then the server 402 inputs the text feature and the duration of the identified pronunciation element into the expression model trained in the embodiment corresponding to fig. 2 to obtain the target expression feature corresponding to the text content.

The expression model is obtained by training according to a first corresponding relation, the first corresponding relation is used for showing the corresponding relation between the duration of the pronunciation element and the time interval of the pronunciation element in the corresponding sub-expression characteristics in the expression characteristics, and the same pronunciation element with the same duration or different durations in the first corresponding relation corresponds to different sub-expression characteristics. Therefore, when the expression model is used for determining the target expression characteristics, aiming at the text characteristics of the expression characteristics to be determined, the expression model can determine different sub-expression characteristics for the same pronunciation element with different or same duration in the text characteristics, the change patterns of the speaking expression are increased, and the speaking expression generated according to the target expression characteristics determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, so that the excessive unnatural condition of the speaking expression change is improved to a certain extent.

A method for synthesizing speaking expressions according to an embodiment of the present application will be described with reference to the accompanying drawings.

Referring to fig. 5, fig. 5 is a flowchart illustrating a method for determining a speaking expression in a voice interaction, the method comprising:

s501, determining text characteristics corresponding to text content and duration of pronunciation elements identified by the text characteristics.

In this embodiment, the text content refers to a text that needs to be fed back to a user interacting with the virtual object, and may be different according to different text contents of the application scenario.

In the first application scenario mentioned above, the text content may be a text corresponding to the user input content. For example, if the user B sees the speaking expression of the virtual object corresponding to the user a and interacts with the virtual object corresponding to the user a, the text corresponding to the input content of the user a may be used as the text content.

In the second application scenario mentioned above, the text content may be a text corresponding to the interaction information fed back according to the content input by the user. For example, after the user inputs "how the weather is today", siri may answer the user input, and feedback the interactive information including the weather condition of the today to the user, so that the text corresponding to the interactive information including the weather condition of the today, which is fed back to the user, may be used as the text content.

The input mode of the user may be inputting text or inputting voice. When the mode of the user input may be text input, the text content is obtained by the terminal device 101 directly according to the user input or is fed back according to the text input by the user, and when the user inputs voice through the terminal device 101, the text content is obtained by the terminal device 101 recognizing the voice input by the user or is fed back according to the recognized voice input by the user.

The text content can be subjected to feature extraction, so that text features corresponding to the text content are obtained, the text features can comprise a plurality of sub-text features, and the duration of the pronunciation elements identified by the text features can be determined according to the text features.

It should be noted that, when the duration of the pronunciation element identified by the text feature is determined, the duration of the pronunciation element identified by the text feature can be obtained through the text feature and the duration model. The duration model is obtained by training the duration of the pronunciation elements identified by the historical text features and the historical text features. The training method of the duration model is described in the foregoing embodiments, and is not described herein again.

S502, obtaining target expression characteristics corresponding to the text content through the text characteristics, the duration of the identified pronunciation elements and the expression model.

That is to say, the text feature and the duration of the identified pronunciation element are used as the input of the expression model, so that the target expression feature corresponding to the text content is obtained through the expression model.

In the target expression features, sub-expression features corresponding to target pronunciation elements are determined according to sub-text features corresponding to the target pronunciation elements in the text features and duration of the target pronunciation elements.

The target pronunciation element is any one of the pronunciation elements identified by the text feature, and the expression model is obtained by training according to the method provided by the embodiment corresponding to fig. 2.

It should be noted that, in some cases, the text features used for training the expression model may be used to identify the pronunciation element in the speech and the context information corresponding to the pronunciation element. Then, when the target expression feature is determined by using the expression model, the text feature may also be used to identify the pronunciation element and the context information corresponding to the pronunciation element in the text content.

According to the technical scheme, the expression model is obtained through training according to a first corresponding relation, the first corresponding relation is used for representing the corresponding relation between the duration of the pronunciation element and the time interval of the pronunciation element in the corresponding sub-expression characteristics in the expression characteristics, and the same pronunciation element with the same duration or different durations in the first corresponding relation corresponds to different sub-expression characteristics. Therefore, when the expression model is used for determining the target expression characteristics, aiming at the text characteristics of the expression characteristics to be determined, the expression model can determine different sub-expression characteristics for the same pronunciation element with different or same duration in the text characteristics, the change patterns of the speaking expression are increased, and the speaking expression generated according to the target expression characteristics determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, so that the excessive unnatural condition of the speaking expression change is improved to a certain extent.

It can be understood that, when the speaking expression is synthesized by the method provided by the embodiment corresponding to fig. 5, the change pattern of the speaking expression can be increased, and the excessive unnatural situation of the change of the speaking expression is improved. And when the human-computer interaction is carried out, the speaking expression of the virtual object is displayed for the user, and virtual sound for interaction can be played. If the virtual sound is generated in the existing mode, the situation that the virtual sound and the speaking expression are not matched may occur, in this situation, the embodiment of the present application provides a method for synthesizing the virtual sound, the virtual sound synthesized by the method may be matched with the speaking expression, and the method includes obtaining the target acoustic feature corresponding to the text content through the text feature, the duration of the identified pronunciation element, and the acoustic model.

In the target acoustic features, sub-acoustic features corresponding to the target pronunciation elements are determined according to sub-text features corresponding to the target pronunciation elements in the text features and duration of the target pronunciation elements; the acoustic model is obtained by training the method provided by the corresponding embodiment of fig. 3.

Because the training data of the acoustic model used for determining the target acoustic characteristics and the training data of the expression model used for determining the target expression characteristics come from the same video and correspond to the same time axis, when a speaker sends out a pronunciation element, the voice of the speaker is matched with the expression of the speaker, therefore, the virtual voice generated according to the target acoustic characteristics determined by the acoustic model is matched with the speaking expression generated according to the target expression characteristics determined by the expression model, better feeling is provided for a user, and the immersion of the user is improved.

Next, a method for generating a visual speech synthesis method in human-computer interaction will be introduced based on the model training method and the method for synthesizing speaking expressions and virtual sounds provided in the embodiments of the present application, in combination with a specific application scenario.

The application scene can be a game scene, the user A communicates with the user B through the virtual object, the user A inputs text content, the user B sees the speaking expression of the virtual object corresponding to the user A and hears virtual sound, and the user B interacts with the virtual object corresponding to the user A. Referring to fig. 6, fig. 6 shows an architecture diagram of a method for generating a visual speech synthesis in human-computer interaction.

As shown in fig. 6, the architecture diagram includes a model training portion and a synthesis portion. Wherein, in the model training part, videos containing the facial action expression and the corresponding voice of the speaker can be collected. And performing text analysis and prosody analysis on the text corresponding to the voice spoken by the speaker so as to extract text characteristics. And extracting acoustic features of the voice spoken by the speaker so as to extract the acoustic features. And performing expression feature extraction on the facial action expression when the speaker speaks the voice, so as to extract the expression feature. And processing the voice spoken by the speaker through a forced alignment module, and determining the time interval and the time length of the pronunciation element identified by the text characteristic according to the text characteristic and the acoustic characteristic.

Secondly, performing expression model training according to the duration of the pronunciation element identified by the text feature, the corresponding expression feature and the text feature to obtain an expression model; performing acoustic model training according to the duration of the pronunciation element identified by the text characteristic, the corresponding acoustic characteristic and the text characteristic to obtain an acoustic model; and training a duration model according to the text characteristics and the duration of the pronunciation elements identified by the text characteristics to obtain the duration model. At this point, the model training section completes the training of the desired model.

And then, entering a synthesis part, wherein the synthesis part can complete the visualized speech synthesis by utilizing the expression model, the acoustic model and the duration model obtained by training. Specifically, text analysis and prosody analysis are carried out on text content of visual voice to be synthesized to obtain text characteristics corresponding to the text content, and the text characteristics are input into a duration model to carry out duration prediction to obtain duration of pronunciation elements identified by the text characteristics. And inputting the frame-level feature vectors generated by the text features and the duration of the identified pronunciation elements into an expression model for expression feature prediction to obtain target expression features corresponding to the text content. And inputting the frame-level feature vectors generated by the text features and the duration of the identified pronunciation elements into an acoustic model for acoustic feature prediction to obtain target acoustic features corresponding to the text content. And finally, rendering the obtained target expression characteristics and target acoustic characteristics to generate animation, so as to obtain visual voice.

The visual voice obtained by the scheme increases the change patterns of the speaking expression and the virtual voice on one hand, and improves the excessive unnatural situation of the change of the speaking expression to a certain extent, on the other hand, because the training data used by the training acoustic model and the training data used by the training expression model come from the same video and correspond to the same time shaft, when the speaker sends out a pronunciation element, the voice of the speaker is matched with the expression of the speaker, therefore, the virtual voice generated according to the target acoustic characteristic determined by the acoustic model is matched with the speaking expression generated according to the target expression characteristic determined by the expression model, so the synthesized visual voice provides better feeling for the user, and the immersion feeling of the user is improved.

Based on the model training method for synthesizing speaking expressions and the method for synthesizing speaking expressions provided by the foregoing embodiments, the related apparatuses provided by the embodiments of the present application are introduced. The present embodiment provides a model training apparatus 700 for synthesizing speaking expressions, referring to fig. 7a, the apparatus 700 includes an obtaining unit 701, a first determining unit 702, a second determining unit 703 and a first training unit 704:

the acquiring unit 701 is configured to acquire a video including a facial motion expression of a speaker and a corresponding voice, and to obtain an expression feature of the speaker, an acoustic feature of the voice, and a text feature of the voice according to the video; the acoustic feature comprises a plurality of sub-acoustic features;

the first determining unit 702 is configured to determine, according to the text feature and the acoustic feature, a time interval and a duration of a pronunciation element identified by the text feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval of a sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature in the video, and the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element;

the second determining unit 703 is configured to determine a first corresponding relationship according to the time interval and the duration of the pronunciation element identified by the text feature and the expression feature, where the first corresponding relationship is used to represent a corresponding relationship between the duration of the pronunciation element and a sub-expression feature corresponding to the time interval of the pronunciation element in the expression feature;

the first training unit 704 is configured to train an expression model according to the first corresponding relationship; the expression model is used for determining corresponding target expression characteristics according to the undetermined text characteristics and the duration of the pronunciation elements identified by the undetermined text characteristics.

In one implementation, referring to fig. 7b, the apparatus 700 further comprises a third determining unit 705 and a second training unit 706:

the third determining unit 705 is configured to determine a second correspondence between the pronunciation element identified by the text feature and the acoustic feature; the second corresponding relation is used for reflecting the corresponding relation between the duration of the pronunciation element and the corresponding sub-acoustic features of the pronunciation element in the acoustic features;

the second training unit 706 is configured to train an acoustic model according to the second correspondence, where the acoustic model is configured to determine a corresponding target acoustic feature according to the undetermined text feature and the duration of the pronunciation element identified by the undetermined text feature.

In one implementation, referring to fig. 7c, the apparatus 700 further comprises a third training unit 707:

the third training unit 707 is configured to train a duration model according to the text feature and the duration of the pronunciation element identified by the text feature, where the duration model is configured to determine the duration of the pronunciation element identified by the pending text feature according to the pending text feature.

In one implementation, the text feature is used to identify a pronunciation element in the speech and context information corresponding to the pronunciation element.

The embodiment of the present application further provides an apparatus 800 for synthesizing a speech expression, referring to fig. 8a, the apparatus 800 includes a determining unit 801 and a first obtaining unit 802:

the determining unit 801 is configured to determine a text feature corresponding to text content and a duration of a pronunciation element identified by the text feature; the text features comprise a plurality of sub-text features;

the first obtaining unit 802 is configured to obtain a target expression feature corresponding to the text content through the text feature, the duration of the identified pronunciation element, and an expression model; the target expression features comprise a plurality of sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features, the sub-expression features corresponding to the target pronunciation element are determined according to the sub-expression features corresponding to the target pronunciation element in the text features and the duration of the target pronunciation element.

In one implementation manner, the expression model is obtained by training according to a first corresponding relationship, where the first corresponding relationship is used to represent a corresponding relationship between the duration of the pronunciation element and the time interval of the pronunciation element in the sub-expression features corresponding to the expression features.

In one implementation, referring to fig. 8b, the apparatus 800 further includes a second obtaining unit 803:

the second obtaining unit 803 is configured to obtain, through the text feature, the duration of the identified pronunciation element, and an acoustic model, a target acoustic feature corresponding to the text content; in the target acoustic features, sub-acoustic features corresponding to the target pronunciation elements are determined according to sub-text features corresponding to the target pronunciation elements in the text features and the duration of the target pronunciation elements;

the acoustic model is obtained through training according to a second corresponding relation, and the second corresponding relation is used for showing the corresponding relation between the duration of the pronunciation element and the corresponding sub-acoustic features of the pronunciation element in the acoustic features.

In one implementation, the determining unit 801 is specifically configured to obtain, through the text feature and the duration model, a duration of a pronunciation element identified by the text feature; the duration model is obtained through duration training according to the historical text features and the pronunciation elements identified by the historical text features.

In one implementation, the text feature is used to identify a pronunciation element in the text content and context information corresponding to the pronunciation element.

According to the technical scheme, in order to determine the speaking expressions with various changes and excessively natural expressions for the virtual object, the embodiment of the application provides a brand-new expression model training device, and the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice are obtained according to the video containing the facial action expressions and the corresponding voice of the speaker. Because the acoustic features and the text features are obtained according to the same video, the time interval and the duration of the pronunciation elements identified by the text features can be determined according to the acoustic features. And determining a first corresponding relation according to the time interval and the time length of the pronunciation element identified by the text feature and the expression feature, wherein the first corresponding relation is used for showing the corresponding relation between the time length of the pronunciation element and the time interval of the pronunciation element in the sub-expression features corresponding to the expression feature.

For a target pronunciation element in the identified pronunciation elements, the sub-expression characteristics in the time interval can be determined from the expression characteristics through the time interval of the target pronunciation element, and the duration of the target pronunciation element can embody different durations of the target pronunciation element in various expression sentences of the video voice, so the determined sub-expression characteristics can embody the possible expression of the target pronunciation element spoken by a speaker in different expression sentences. Therefore, according to the expression model obtained by training the first corresponding relation, aiming at the text feature of the expression feature to be determined, the device for determining the speaking expression in the voice interaction can determine different sub-expression features for the same pronunciation element with different durations in the text feature through the expression model, so that the change pattern of the speaking expression is increased, and the speaking expression generated according to the target expression feature determined by the expression model has different change patterns for the speaking expression of the same pronunciation element, so that the excessive unnatural condition of the speaking expression change is improved to a certain extent.

The embodiment of the present application further provides a server, which may be used as a model training device for synthesizing the speaking expression, and may also be used as a device for synthesizing the speaking expression, and the server will be described below with reference to the accompanying drawings. Referring to fig. 9, a server 900, which may vary greatly in configuration or performance, may include one or more Central Processing Units (CPUs) 922 (e.g., one or more processors) and memory 932, one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. Memory 932 and storage media 930 can be, among other things, transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, central processor 922 may be disposed in communication with storage medium 930 to execute a sequence of instruction operations in storage medium 930 on device 900.

The device 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, and/or one or more operating systems 941, such as Windows ServerTM, mac OS XTM, unixTM, linuxTM, freeBSDTM, etc.

The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.

The CPU 922 is configured to execute the following steps:

acquiring the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice according to the video; the acoustic feature comprises a plurality of sub-acoustic features; determining the time interval and the duration of the pronunciation element identified by the text feature according to the text feature and the acoustic feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval of a sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature in the video, and the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element;

Alternatively, the CPU 922 is configured to perform the following steps:

Referring to fig. 10, an embodiment of the present application provides a terminal device, which may be used as a device for synthesizing a speaking expression, where the terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS), a vehicle-mounted computer, and the terminal device is a mobile phone:

fig. 10 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 10, the cellular phone includes: radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. Those skilled in the art will appreciate that the handset configuration shown in fig. 10 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

The following specifically describes each constituent component of the mobile phone with reference to fig. 10:

RF circuit 1010 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, for processing downlink information of a base station after receiving the downlink information to processor 1080; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuit 1010 includes, but is not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 1010 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), general Packet Radio Service (GPRS), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), long Term Evolution (LTE), email, short Message Service (SMS), and the like.

The memory 1020 can be used for storing software programs and modules, and the processor 1080 executes various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1020 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The input unit 1030 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also referred to as a touch screen, may collect touch operations by a user (e.g., operations by a user on or near the touch panel 1031 using any suitable object or accessory such as a finger, a stylus, etc.) and drive corresponding connection devices according to a preset program. Alternatively, the touch panel 1031 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1080, and can receive and execute commands sent by the processor 1080. In addition, the touch panel 1031 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 1030 may include other input devices 1032 in addition to the touch panel 1031. In particular, other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a track ball, a mouse, a joystick, and the like.

The display unit 1040 may be used to display information input by a user or information provided to the user and various menus of the cellular phone. The Display unit 1040 may include a Display panel 1041, and optionally, the Display panel 1041 may be configured in a form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1031 can cover the display panel 1041, and when the touch panel 1031 detects a touch operation on or near the touch panel 1031, the touch operation is transferred to the processor 1080 to determine the type of the touch event, and then the processor 1080 provides a corresponding visual output on the display panel 1041 according to the type of the touch event. Although in fig. 10, the touch panel 1031 and the display panel 1041 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1031 and the display panel 1041 may be integrated to implement the input and output functions of the mobile phone.

The handset may also include at least one sensor 1050, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the gesture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, the description is omitted here.

Audio circuitry 1060, speaker 1061, microphone 1062 may provide an audio interface between the user and the handset. The audio circuit 1060 can transmit the electrical signal converted from the received audio data to the speaker 1061, and convert the electrical signal into a sound signal for output by the speaker 1061; on the other hand, the microphone 1062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1060 and converted into audio data, which is then processed by the audio data output processor 1080 and then sent to, for example, another cellular phone via the RF circuit 1010, or output to the memory 1020 for further processing.

WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help the user to receive and send e-mail, browse web page and access streaming media etc. through WiFi module 1070, it provides wireless broadband internet access for the user. Although fig. 10 shows the WiFi module 1070, it is understood that it does not belong to the essential constitution of the handset, and may be omitted entirely as needed within the scope not changing the essence of the invention.

The processor 1080 is a control center of the mobile phone, connects various parts of the whole mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1020 and calling data stored in the memory 1020, thereby integrally monitoring the mobile phone. Optionally, processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which handles primarily the operating system, user interfaces, applications, etc., and a modem processor, which handles primarily the wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1080.

The handset also includes a power supply 1090 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 1080 via a power management system that may be used to manage charging, discharging, and power consumption.

Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.

In this embodiment, the processor 1080 included in the terminal device further has the following functions:

An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is configured to store a program code, and the program code is configured to execute the model training method for synthesizing spoken expressions according to the foregoing embodiments corresponding to fig. 2 to 3 or the method for synthesizing spoken expressions according to the embodiment corresponding to fig. 5.

The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" is used to describe the association relationship of the associated object, indicating that there may be three relationships, for example, "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b and c may be single or plural.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program codes.

The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method for synthesizing spoken expressions based on artificial intelligence, the method comprising:

acquiring text content sent by a terminal;

determining text features corresponding to the text content and duration of pronunciation elements identified by the text features; the text features comprise a plurality of sub-text features;

determining a corresponding target expression characteristic according to the text characteristic and the duration of the pronunciation element identified by the text characteristic through an expression model; the target expression features comprise a plurality of sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features, the sub-expression features corresponding to the target pronunciation element are determined according to the sub-expression features corresponding to the target pronunciation element in the text features and the duration of the target pronunciation element;

and returning the target expression characteristics to the terminal.

2. The method of claim 1, further comprising:

obtaining target acoustic characteristics corresponding to the text content through the text characteristics, the duration of the identified pronunciation elements and an acoustic model; in the target acoustic features, the sub-acoustic features corresponding to the target pronunciation elements are determined according to the sub-text features corresponding to the target pronunciation elements in the text features and the duration of the target pronunciation elements.

3. The method according to claim 1, wherein the determining the text feature corresponding to the text content and the duration of the pronunciation element identified by the text feature comprises:

and obtaining the duration of the pronunciation element identified by the text feature through the text feature and the duration model.

4. The method according to any one of claims 1 to 3, wherein the text content is a text fed back to a user interacting with the virtual object, and the text content includes a text corresponding to the user input content or a text corresponding to the interaction information fed back according to the user input content.

5. The method according to any one of claims 1-3, wherein the text feature is used for identifying a pronunciation element and context information corresponding to the pronunciation element in the text content.

6. The method according to any one of claims 1-3, wherein the expressive features comprise at least mouth shape features.

7. An apparatus for synthesizing speaking expressions based on artificial intelligence, the apparatus comprising an acquisition unit, a determination unit, a first acquisition unit, and a return unit:

the acquiring unit is used for acquiring the text content sent by the terminal;

the determining unit is used for determining a text feature corresponding to the text content and the duration of a pronunciation element identified by the text feature; the text features comprise a plurality of sub-text features;

the first acquisition unit is used for determining corresponding target expression characteristics according to the text characteristics and the duration of the pronunciation elements identified by the text characteristics through an expression model; the target expression features comprise a plurality of sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features, the sub-expression features corresponding to the target pronunciation element are determined according to the sub-expression features corresponding to the target pronunciation element in the text features and the duration of the target pronunciation element;

and the returning unit is used for returning the target expression characteristics to the terminal.

8. A method for synthesizing spoken expressions based on artificial intelligence, the method comprising:

acquiring the expression characteristics of the speaker, the acoustic characteristics of the voice and the text characteristics of the voice according to the video; the acoustic feature comprises a plurality of sub-acoustic features;

training an expression model and an acoustic model according to the text features, the time intervals and durations of the pronunciation elements identified by the text features, and the expression features and the acoustic features; the expression model is used for determining corresponding target expression characteristics according to the undetermined text characteristics and the duration of the pronunciation elements identified by the undetermined text characteristics; the acoustic model is used for determining corresponding target acoustic characteristics according to the undetermined text characteristics and the duration of the pronunciation elements identified by the undetermined text characteristics;

acquiring text content sent by a terminal;

obtaining target expression characteristics and target acoustic characteristics corresponding to the text content through the text characteristics, the duration of the identified pronunciation elements, the expression models and the acoustic models;

and rendering the target expression characteristics and the target acoustic characteristics to generate an animation.

9. The method of claim 8, wherein training an expression model and an acoustic model based on the textual features, the time intervals and durations of the vocal elements identified by the textual features, the expressive features, and the acoustic features comprises:

training the expression model according to the first corresponding relation;

determining a second corresponding relation between the pronunciation element identified by the text feature and the acoustic feature; the second corresponding relation is used for reflecting the corresponding relation between the duration of the pronunciation element and the corresponding sub-acoustic features of the pronunciation element in the acoustic features;

and training the acoustic model according to the second corresponding relation.

10. The method according to claim 8 or 9, characterized in that the method further comprises:

training a duration model according to the text features and the duration of the pronunciation elements identified by the text features, wherein the duration model is used for determining the duration of the pronunciation elements identified by the undetermined text features according to the undetermined text features;

the duration of the pronunciation element identified by the text feature is determined by the following method:

11. An apparatus for artificial intelligence based synthesized spoken expressions, the apparatus comprising a processor and a memory:

the processor is configured to execute the method for artificial intelligence based synthesized speech expression of any of claims 1-6 or 8-10 according to instructions in the program code.

12. A computer-readable storage medium for storing program code for performing the method for artificial intelligence based synthesized speech expressions according to any one of claims 1-6 or 8-10.