
CN114783403B - Method, apparatus, device, storage medium and program product for generating audio reading material - Google Patents

Method, apparatus, device, storage medium and program product for generating audio reading material

Info

Publication number
CN114783403B
Authority
CN
China
Prior art keywords
target
dubbing
sentence
reading material
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210149168.3A
Other languages
Chinese (zh)
Other versions
CN114783403A (en)
Inventor
程龙
王砚峰
刘恺
王睿敏
周志平
方鹏
周明
林国雯
冷永才
蒋维明
史小静
陆亮
张晶晶
段文君
曾可璇
张心愿
马浩然
郎勇
段枫
谢昆
许亚东
姜鹏
朱浩
陆飞
王宁
姜伟
鹿畅
韩晓明
朱立人
赵亮
栾佳慧
宋启亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210149168.3A
Publication of CN114783403A
Application granted
Publication of CN114783403B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a method, an apparatus, a device, a storage medium and a program product for generating an audio reading material, and relates to the technical field of artificial intelligence. The method comprises the following steps: displaying a dubber setting interface corresponding to a target reading material, and displaying, in the dubber setting interface, a plurality of characters contained in the target reading material and a plurality of candidate dubbers; in response to a dubber setting operation for a character, displaying the dubber set for the character in the dubber setting interface; in response to a setting completion operation, displaying a dubbing result display interface, and displaying, in the dubbing result display interface, at least one sentence of the target reading material and the character corresponding to the sentence; and in response to a play operation for the target reading material, playing the audio content of a target sentence in the target reading material generated by the target dubber. The application makes the voices in the audio reading material more diverse and improves the dubbing quality of the audio reading material.

Description

Method, apparatus, device, storage medium and program product for generating audio reading material
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for generating an audio reading material.
Background
Dubbing a novel to generate a voiced novel makes it convenient for people to consume the content of the novel by listening.
In order to reduce the production cost of voiced novels compared with the traditional way of generating them through manual dubbing, the related art provides a way of generating voiced novels automatically: the text data of a novel is converted into audio content using text-to-speech technology, so that a voiced novel is generated automatically.
However, the audio generated in this way has only a single timbre and is of poor quality.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, a device, a storage medium and a program product for generating an audio reading material, which can solve the technical problems that an audio reading material has a single timbre and poor quality. The technical scheme is as follows:
According to an aspect of the embodiment of the present application, there is provided a method for generating an audio reading material, the method including:
displaying a dubber setting interface corresponding to a target reading material, and displaying, in the dubber setting interface, a plurality of characters contained in the target reading material and a plurality of candidate dubbers;
in response to a dubber setting operation for a character, displaying the dubber set for the character in the dubber setting interface; wherein the dubbers set for at least two different characters in the target reading material are different;
in response to a setting completion operation, displaying a dubbing result display interface, and displaying, in the dubbing result display interface, at least one sentence of the target reading material and the character corresponding to the sentence;
and in response to a play operation for the target reading material, playing the audio content of a target sentence in the target reading material generated by a target dubber, wherein the target dubber is the dubber set for the character corresponding to the target sentence.
According to an aspect of the embodiment of the present application, there is provided a method for generating an audio reading material, the method including:
Acquiring text data of a target reading material to be dubbed;
Identifying a plurality of roles contained in the target reading based on the text data of the target reading;
Acquiring a dubbing player set for the role; wherein dubbing staff set for at least two different roles in the target reading material are different;
and generating an audio file corresponding to the target reading material based on dubbing staff corresponding to each role in the target reading material.
According to an aspect of an embodiment of the present application, there is provided a generating apparatus of a dubbing result, the apparatus including:
The setting interface display module is used for displaying a dubbing player setting interface corresponding to a target reading material, and displaying a plurality of roles and a plurality of candidate dubbing players contained in the target reading material in the dubbing player setting interface;
a dubber setting module for displaying a dubber set for the character in the dubber setting interface in response to a dubber setting operation for the character; wherein dubbing staff set for at least two different roles in the target reading material are different;
the dubbing result display module is used for responding to the setting completion operation, displaying a dubbing result display interface, and displaying at least one sentence of the target reading material and a role corresponding to the sentence in the dubbing result display interface;
And the dubbing content playing module is used for responding to the playing operation of the target reading material and playing the audio content of the target sentence in the target reading material generated by a target dubbing player, wherein the target dubbing player is a dubbing player set for a role corresponding to the target sentence.
According to an aspect of an embodiment of the present application, there is provided a generating apparatus of a dubbing result, the apparatus including:
the data acquisition module is used for acquiring text data of a target reading material to be dubbed;
the character recognition module is used for recognizing a plurality of characters contained in the target reading material based on the text data of the target reading material;
The dubbing setting module is used for acquiring a dubbing person set for the role; wherein dubbing staff set for at least two different roles in the target reading material are different;
and the file generation module is used for generating an audio file corresponding to the target reading material based on dubbing staff corresponding to each role in the target reading material.
According to an aspect of an embodiment of the present application, there is provided a computer apparatus including a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored in the memory, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for generating an audio readout as described above.
According to an aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for generating an audio reading as described above.
According to an aspect of an embodiment of the present application, there is provided a computer program product including computer instructions stored in a computer-readable storage medium, from which a processor reads and executes the computer instructions to implement the above-described method of generating an audible readout.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
The plurality of characters contained in the target reading material and the plurality of candidate dubbers are displayed in the dubber setting interface, at least two different dubbers are selected to dub the plurality of characters, a dubbing result is generated and displayed in the dubbing result display interface, and the dubbing result can be previewed by trial listening. The application uses multiple dubbers to dub different characters in the same target reading material. Because different dubbers produce sounds with different timbres, the generated audio reading material is no longer limited to a single dubber (or a single timbre), so the voices in the audio reading material are more diverse and the dubbing quality of the audio reading material is improved.
Drawings
FIG. 1 is a schematic illustration of an implementation environment for an embodiment of the present application;
FIG. 2 is a flow chart of a method for generating an audio reading according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for generating an audio reading according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a reader providing interface provided in accordance with one embodiment of the present application;
FIG. 5 is a schematic diagram of a dubbing operator setup interface provided by one embodiment of the present application;
FIG. 6 is a schematic diagram of a dubbing result display interface according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for generating an audio reading according to another embodiment of the present application;
FIG. 8 is a schematic diagram of a chapter selection interface provided by one embodiment of the present application;
FIG. 9 is a flowchart of a method for generating an audio reading according to another embodiment of the present application;
FIG. 10 is a flowchart of a method for generating an audio reading according to another embodiment of the present application;
FIG. 11 is a schematic diagram of the structure of BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model) provided by one embodiment of the present application;
FIG. 12 is a schematic diagram of an AI (Artificial Intelligence) acoustic model provided by one embodiment of the application;
fig. 13 is a schematic diagram of a sound quality extractor according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a bypass AI acoustic model provided in accordance with one embodiment of the application;
FIG. 15 is a block diagram of a dubbing result generating apparatus according to an embodiment of the present application;
Fig. 16 is a block diagram of a dubbing result generating apparatus according to another embodiment of the present application;
fig. 17 is a block diagram of a dubbing result generating apparatus according to another embodiment of the present application;
fig. 18 is a block diagram of a dubbing result generating apparatus according to another embodiment of the present application;
Fig. 19 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning, and the like.
The technical scheme of the application relates to the technology of natural language processing, machine learning and the like in the AI field, and is described by a plurality of embodiments.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The implementation environment of the scheme can be realized into a dubbing content generation system for generating the audio reading material. The implementation environment of the scheme can comprise: a terminal device 10 and a server 20.
The terminal device 10 may be an electronic device such as a mobile phone, a tablet computer, a PC (Personal Computer ), a wearable device, an in-vehicle terminal device, a VR (Virtual Reality) device, and an AR (Augmented Reality) device, which is not limited in this respect. The terminal device 10 may have installed therein a client running a target application. Alternatively, the target application may be an application having an audio book generation function (e.g., generating a corresponding dubbing result for the target book, resulting in an audio book). Illustratively, the target application may be, for example, a target reader application, a target reader authoring application, a social interaction application, an audio playing application, etc., as the application is not limited in this regard.
The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The server 20 may be a background server of the target application program, and is configured to provide background services for clients of the target application program, for example, to assist the clients in generating corresponding audio books according to text data of the target book.
Alternatively, the target Application may be a stand-alone APP (Application), an applet, or other Application such as a web Application, which is not limited in the present application.
In some embodiments, as shown in fig. 2, fig. 2 illustrates a process flow diagram of a method of generating an audio reading material. This example takes the generation of a voiced novel as an example. The object first imports the text data of the novel into the client, and the client transmits the novel text data imported by the object to the server. The server performs processes such as identifying the characters of the novel, identifying the chapters of the novel, identifying the character and dubber to which each line of dialogue belongs, and distinguishing narration from dialogue and recognizing the emotion of the dialogue, and acquires the dubber information that the object manually set in the client for the characters and the narration of the novel. The server generates a dubbing result corresponding to the novel according to the above steps and sends the dubbing result to the client for display. The object can manually modify the dubbing result in the client according to its own requirements, such as manually configuring the tone and speaking speed, manually changing the dubbers of the novel, and manually configuring sound effects or background music, so as to obtain a manually refined dubbing result. Finally, the object can select the dubbing result of one or more chapters of the novel to download and obtain the corresponding audio file.
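To make the flow above concrete, the intermediate data exchanged between the client and the server can be pictured as the structures below. This is only an illustrative Python sketch; the field names (speaker, emotion, dubber, and so on) are assumptions for exposition and are not prescribed by this application.
    from __future__ import annotations
    from dataclasses import dataclass, field

    # Illustrative data model for the flow described above (all names are assumptions).
    @dataclass
    class DubbedSentence:
        text: str
        speaker: str              # a recognized character name, or "narration"
        emotion: str = "neutral"  # emotion recognized for dialogue lines
        dubber: str = ""          # dubber assigned to the speaker
        audio: bytes = b""        # audio content synthesized for this sentence

    @dataclass
    class Chapter:
        title: str
        sentences: list[DubbedSentence] = field(default_factory=list)

    @dataclass
    class DubbingProject:
        reading_title: str
        dubber_of: dict[str, str] = field(default_factory=dict)  # character -> dubber
        chapters: list[Chapter] = field(default_factory=list)
Under this sketch, the server fills in the speaker, emotion and dubber fields during recognition and assignment, the client displays them and lets the object adjust them, and the audio fields are what the exported file is assembled from.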
Referring to fig. 3, a flowchart of a method for generating an audio reading material according to an embodiment of the application is shown. The method may be performed by the terminal device 10 in the implementation environment of the solution shown in fig. 1, for example, the steps may be performed by a client of the target application. The method may comprise at least one of the following steps (310-340):
Step 310, displaying a dubbing player setting interface corresponding to the target reading material, and displaying a plurality of characters and a plurality of candidate dubbing players contained in the target reading material in the dubbing player setting interface.
The dubber setting interface is an interface for setting a desired dubber for each of the plurality of characters of the target reading material, and the plurality of characters contained in the target reading material and a plurality of candidate dubbers are displayed in the dubber setting interface. The target reading material is a reading material to be dubbed, for example, a reading material uploaded or selected by the object. In the embodiment of the application, a reading material refers to content generated in text form; it may take the form of a novel, a comic, and the like. A novel is a literary genre that reflects social life through a complete story or environment description centered on the portrayal of characters. Optionally, a novel can be regarded as a narration that tells one story, in the way the author tells it to the reader. A comic is presented in the form of pictures, so that readers can understand the story through the combination of pictures and text; dubbing the dialogue of the characters in a comic can make the comic content more vivid. A character is a person or animal in the target reading material. Optionally, the persons and animals are given names so that each character in the target reading material is recognizable, and the characters are further described through portrayal and dialogue, so that the characters in the target reading material are complete and appealing. A candidate dubber is a dubber displayed in the client; the user can select any one of the candidate dubbers and set it as the dubber of a certain character in the target reading material. One dubber may dub only one character in the target reading material, that is, the dubbers corresponding to different characters in the target reading material are different; or the same dubber may dub a plurality of different characters in the target reading material, for example, the same dubber is set for the two characters A and B in the target reading material.
Optionally, the dubbing operator setup interface is further configured to display at least one of the following information: information of each of the plurality of dubbing persons, setting of the plurality of dubbing persons, and a plurality of characters contained in the target reading material. The information of the dubber refers to attribute information of the dubber, including but not limited to nickname of the dubber, style of the dubber, language (such as dialect) of the dubber, emotion of the dubber and the like. Optionally, the setting of the dubber refers to setting information of the dubber, including: adjusting the sex of the dubber, adjusting the speech speed of the dubber, adjusting the volume of the dubber, and the like. The dubbing player is used for dubbing the sentence of the character in the target reading material, wherein the sentence not only comprises the content which is spoken by the character, but also comprises the content which is not spoken by the character, such as psychological description, imagination description and the like. Optionally, different AI acoustic models are set for different dubbing staff, and the different AI acoustic models can have different prosody, timbre and other characteristics, so that audio content with different prosody and timbre can be generated.
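The requirement stated in this application that the dubbers set for at least two different characters differ, while one dubber may still voice several characters, can be expressed as a simple check. The snippet below is a minimal sketch under that assumption and is not an interface defined by this application.
    def satisfies_multi_dubber_constraint(dubber_of: dict) -> bool:
        # At least two different characters must be voiced by different dubbers;
        # the same dubber may still be set for several characters.
        return len(set(dubber_of.values())) >= 2

    # Example: one dubber voices characters A and B, another voices character C.
    assignment = {"Character A": "Dubber 1", "Character B": "Dubber 1", "Character C": "Dubber 2"}
    assert satisfies_multi_dubber_constraint(assignment)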
Alternatively, the multiple roles displayed in the dubber setting interface may be multiple roles that the object adds itself according to the roles that appear in the text data of the target reading. Alternatively, the multiple characters displayed in the dubbing player setting interface may be multiple characters recognized by the system from the text data of the target reading material. Of course, in some other embodiments, the multiple roles displayed in the dubbing operator setup interface may also be multiple roles that the object has added, modified or adjusted by itself, based on the multiple roles identified by the system. The application is not limited by the comparison.
Optionally, before displaying the dubbing player setting interface, text data of the target reading material can be displayed, and characters in the text data can be identified and displayed. Optionally, the client displays a reading material providing interface, acquires the target reading material determined in the reading material providing interface, and acquires a plurality of characters contained in the target reading material identified from text data of the target reading material.
The reading providing interface is an interface for displaying the target reading, the object can upload the target reading in the reading providing interface or select the target reading in the reading providing interface, and then the client displays the target reading uploaded or selected by the object in the reading providing interface. Then, the client can locally identify text data of the target reading material to obtain a plurality of roles contained in the target reading material; or the client sends the text data or the identification information of the target reading material to the server, and the server identifies the text data of the target reading material to obtain a plurality of roles contained in the target reading material. For a specific manner of identifying the character, please refer to the description in the following embodiments.
In some embodiments, as shown in FIG. 4, FIG. 4 illustrates a schematic view of a reading material providing interface. In part (a) of fig. 4, a reading material adding control 41 is displayed in the reading material providing interface 40; after the object clicks the reading material adding control 41, the object may select the target reading material to be uploaded and upload it to the server, and an upload waiting interface 42 as in part (b) of fig. 4 is displayed. At this time, in the upload waiting interface 42, the client transmits the text data of the target reading material uploaded by the object to the server, and the server acquires the text data of the target reading material and then recognizes the characters contained in the target reading material from the text data. While the server is identifying the characters contained in the target reading material, the server identification progress 43 is displayed in the upload waiting interface 42; after the server finishes the identification, the server identification progress 43 displayed in the upload waiting interface 42 reaches completion, and the next content is displayed. The server also selects a dubber whose style closely matches the style characteristics of each character in the text data to dub the character's sentences, generates the dubbing result of the target reading material after completing the dubbing of all characters and narration in part or all of the text data of the target reading material, and sends the dubbing result to the client for display.
In some embodiments, as shown in fig. 5, fig. 5 illustrates a schematic diagram of a dubber setting interface, in which a character 51 of the target reading material is displayed in the dubber setting interface 50. Optionally, a plurality of candidate dubbers 52, dubber information 53 corresponding to the characters contained in the target reading material, dubber information 54 corresponding to the narration, and a dubbing parameter setting area 55 are also displayed in the dubber setting interface 50.
Step 320, in response to the dubbing player setting operation for the character, displaying the dubbing player set for the character in the dubbing player setting interface; wherein, at least two different roles in the target reading material are correspondingly provided with different dubbing staff.
The object can select a specific dubber to dub for the role of the target reading material through the setting operation of the dubber, and the client correspondingly displays the dubber selected by the object and the dubbed role in the dubber setting interface. As shown in fig. 5, the character 51 of the target reading object selected by the subject and the corresponding dubber information 53 thereof are shown in fig. 5. For example, when the object wants to configure the dubbing player for the character 1, the object may complete the setup operation for the dubbing player of the character 1 by clicking on the entry corresponding to the character 1 and then clicking on the dubbing player that wants to dubbe the character. Alternatively, the object may drag any candidate dubbing operator to the entry corresponding to character 1 to complete the setting operation for the dubbing operator of character 1.
Optionally, the object may make adjustments to the displayed characters in the dubber settings interface, including adding characters, deleting characters, modifying characters (e.g., modifying character names), and so forth. Optionally, after the object adjusts the character, the corresponding dubbing person may also be reset by the adjusted character. In some embodiments, as shown in FIG. 5, an object may be added to a character by clicking on the add character control 56 in FIG. 5. Optionally, the object may delete the character by sliding the entry corresponding to the character 51 of the added target reading object, or rename the character of the added target reading object by clicking the entry corresponding to the character 51 of the target reading object.
Optionally, at least two different dubbers are set for the plurality of characters contained in the target reading material to dub the target reading material. For two different characters of the target reading material, either different dubbers or the same dubber can be selected for dubbing.
And 330, displaying a dubbing result display interface in response to the setting completion operation, and displaying at least one sentence of the target reading material and a role corresponding to the sentence in the dubbing result display interface.
In some embodiments, as shown in FIG. 5, after the object has configured the dubbers, the object completes the setting of the characters and the narration in the target reading material by clicking on the complete configuration control 57 in the dubber setting interface 50. The client responds to the setting completion operation and displays the dubbing result display interface.
The dubbing result display interface is used for displaying the target reading text and the corresponding dubbing result thereof, and in the dubbing result display interface, the target reading text of one chapter can be displayed at a time, the character name and the corresponding sentences thereof are displayed in the target reading text, and the dubbing sentence quantity of the chapter is displayed. The client determines the value of the number of dubbing sentences by counting the number of the dubbing sentences in the text data, and displays the value of the number of the dubbing sentences in a dubbing result display interface. Alternatively, the number of dubbing sentences may be the number of dubbing sentences of all characters, or may be the number of dubbing sentences of each character individually, which is not limited in the present application.
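Counting the number of dubbing sentences, overall and per character, is a straightforward aggregation. The following sketch assumes the chapter has already been split into (character, sentence) pairs; the sample data is invented for illustration.
    from collections import Counter

    chapter_sentences = [
        ("Character 1", "Hello there."),
        ("narration", "She turned around slowly."),
        ("Character 1", "I did not expect to see you."),
        ("Character 2", "Neither did I."),
    ]

    total_dubbing_sentences = len(chapter_sentences)                        # 4
    sentences_per_character = Counter(name for name, _ in chapter_sentences)
    # Counter({'Character 1': 2, 'narration': 1, 'Character 2': 1})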
In some embodiments, as shown in fig. 6, fig. 6 illustrates a schematic diagram of a dubbing result presentation interface. As shown in part (a) of fig. 6, a chapter selection area 61, a text display area 62, and a character selection area 63 are displayed in the dubbing result display interface 60. A plurality of chapters are displayed in the chapter selection area 61, and a text display area 62 corresponding to the target chapter 64 is displayed in the dubbing result display interface 60 by a click operation on the target chapter 64, wherein dubbing result information of the target chapter 64 is displayed in the text display area 62. The text display area 62 displays dubbing result information including a plurality of dubbing sentence information. For example, dubbing sentence 65 includes character 66, its corresponding sentence 67, audio content trial listening control 68, and audio content corresponding to the sentence. Meanwhile, each character and its corresponding number of dubbing sentences are displayed in the character selection area 63. For example, the character selection area 63 in fig. 6 shows the total number of dubbing sentences of all characters, and the number of dubbing sentences of a single character. By clicking the audio content listening test control 68, the dubbing result display interface is displayed to the part (b) in fig. 6 and listening of the dubbing content is performed, and listening of the entire chapter content can also be performed by clicking the chapter listening test control 69.
In step 340, in response to the play operation for the target reading material, playing the audio content of the target sentence in the target reading material generated by the target dubber, where the target dubber is a dubber set for the character corresponding to the target sentence.
And the server performs dubbing operation of the target reading material based on the AI acoustic model corresponding to the dubber, and generates an audio file of the target reading material. Among them, the AI acoustic model of the dubbing person is an acoustic model synthesized by a computer, and the acoustic model is one of the most important parts in the speech recognition system. Alternatively, the model parameters may be provided to the AI acoustic model of the dubber by way of manual dubbing, which is not limited in this regard by the present application.
Based on the operation of the target reading material playing control, the audio content of the target reading material is played. The audio content of the target reading material is the dubbing result of the target dubbing person on the target reading material, and the target dubbing person is the dubbing person set for the role corresponding to the target sentence. The playing operation of the target reading material may be playing the whole chapter content of the target reading material, or playing the target sentence in the target reading material.
In some embodiments, as shown in part (b) of fig. 6, after the object clicks on the audio content listening test control 68 in part (a) of fig. 6 or clicks on a character in the character selection area 63, the client further displays the dubbing result presentation interface 60. Wherein, the dubbing result area 68A corresponding to the selected character is additionally displayed in the part (b) as compared with the part (a). In the dubbing result area 68A corresponding to the selected character, all dubbing results of the selected character in the chapter text data of the target reading material are displayed. Alternatively, the subject may choose to listen to the dubbing results, cancel the dubbing results, or adjust the dubbing results in the dubbing results area 68A corresponding to the selected character. Wherein the subject may be listening to the entire chapter by clicking on the chapter listening control 69, or may be listening to a single sentence by clicking on the audio content listening control 66.
According to this embodiment, the plurality of characters contained in the target reading material and the plurality of candidate dubbers are displayed in the dubber setting interface, at least two different dubbers are selected to dub the plurality of characters, the dubbing result is generated and displayed in the dubbing result display interface, and the dubbing result can be previewed by trial listening. The application uses multiple dubbers to dub different characters in the same target reading material. Because different dubbers produce sounds with different timbres, the generated audio reading material is no longer limited to a single dubber (or a single timbre), so the voices in the audio reading material are more diverse and the dubbing quality of the audio reading material is improved.
Meanwhile, the object is provided with a way to input any target reading material and generate a dubbing result, so that the object can convert the text it uploads into an audio file. This satisfies the object's customization requirements for the generated dubbing result and improves the diversity of dubbing results.
Fig. 7 is a flowchart illustrating a method for generating an audio reading according to another embodiment of the application. The method may be performed by the terminal device 10 in the implementation environment of the solution shown in fig. 1, for example, the steps may be performed by a client of the target application. The method may comprise at least one of the following steps (710-770):
Step 710, displaying a dubbing player setting interface corresponding to the target reading material, and displaying a plurality of characters and a plurality of candidate dubbing players contained in the target reading material in the dubbing player setting interface.
Step 720, in response to the dubbing player setting operation for the character, displaying the dubbing player set for the character in the dubbing player setting interface; wherein, at least two different roles in the target reading material are correspondingly provided with different dubbing staff.
Step 730, displaying interface elements for setting dubbing parameters for the dubbing player in the dubbing player setting interface, the dubbing parameters including at least one of: speakable speed, speakable volume, speakable language, speakable emotion.
The execution timing of steps 720 and 730 is not limited, and step 720 may be executed first and then step 730 may be executed, or step 730 may be executed first and then step 720 may be executed.
Optionally, interface elements for the dubbing parameters of the plurality of dubbers are displayed in the dubber setting interface, where the dubbing parameters are used to indicate the differences among dubbers, and different dubbers have different dubbing parameters. The dubbing parameters may comprise at least one of: speaking speed, speaking volume, speaking language and speaking emotion. The speaking speed indicates how fast the dubber reads the text, and it is changed by adjusting the speaking speed parameter. The speaking volume indicates how loudly the dubber reads the text, and it is changed by adjusting the speaking volume parameter. The speaking language indicates the language used by the dubber to read the text; optionally, the dubbing language may be a language of a different country, such as Chinese, English or Japanese, or a regional dialect, such as Mandarin, Sichuan dialect or Northeastern dialect. The speaking emotion indicates the emotion with which the dubber reads the text. Optionally, the emotion may be an emotion such as happiness, sadness or anger, or a tone such as seriousness, liveliness or cuteness, which is not limited in the present application. Meanwhile, the interface elements for the dubbing parameters of the plurality of dubbers can be adjusted, so as to adjust the speaking speed, speaking volume, speaking language and speaking emotion. Optionally, the above dubbing parameters may be combined. For example, the dubbing parameters may indicate a cute-style Shaanxi-dialect voice, a fast and sad Shanghai-dialect voice, or a loud and excited Mandarin voice with a lively character, and the dubber may even switch between the above styles, which is not limited in the present application.
In some embodiments, as shown in fig. 5, the dubber setting interface 50 of fig. 5 also displays dubber sound features 58, such as affinity, lively cuteness, gentle intellect and maturity. The dubber setting interface 50 further displays a dubbing parameter setting area 55, where the dubbing parameter setting area 55 includes a speaking speed adjustment control, a speaking volume adjustment control, a speaking language adjustment control (not shown in the figure) and a speaking emotion adjustment control (not shown in the figure). The speaking speed adjustment control is used for adjusting the speaking speed of the aforementioned dubber, and the speaking volume adjustment control is used for adjusting the speaking volume of the aforementioned dubber. Optionally, the speaking speed adjustment control and the speaking volume adjustment control may also be used for adjusting the speaking speed and the speaking volume of the whole dubbing result. Optionally, a dialect adjustment control is also displayed in the dubber setting interface 50 to adjust the dubber's dialect; for example, the dubber's dialect may include dialects such as Mandarin, Northeastern, Cantonese and Sichuan.
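The dubbing parameters described above can be grouped into one setting object per dubber (or per dubbing result). The sketch below is illustrative only; the parameter ranges and defaults are assumptions rather than values specified by this application.
    from dataclasses import dataclass

    @dataclass
    class DubbingParameters:
        speed: float = 1.0           # speaking speed multiplier
        volume: float = 1.0          # speaking volume multiplier
        language: str = "Mandarin"   # speaking language or dialect
        emotion: str = "neutral"     # speaking emotion

    # Example: a lively Sichuan-dialect voice, slightly faster and louder than default.
    params = DubbingParameters(speed=1.2, volume=1.1, language="Sichuan dialect", emotion="lively")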
Optionally, the dubber setting interface also displays the dubber initially set for each character, where the initial setting is a dubber automatically set based on the style of the character. Step 720 may include: in response to a dubber setting operation for a character, changing the initially set dubber to the newly selected dubber, and displaying the newly selected dubber as the dubber of the character in the dubber setting interface.
Optionally, after the client sends the text data of the target reading material to the server, the server selects a suitable dubber to dub each character according to the style characteristics of the character in the text data, and sends the selection result to the client; the dubber corresponding to the character selected by the server is displayed in the client. Optionally, the object may change, according to its own preference, the dubber that the server selected for the character and the client displays, and the client then displays the character name of the target reading material together with the changed dubber.
In some embodiments, the server selects a mature and stable beijing speaker to perform dubbing for the character according to the style characteristics of the character in the text data, however, the object is not satisfied with the speaker selected by the server, and the object selects the speaker of the character. Optionally, the object selects a happy mandarin speaker to dub the character according to the style characteristics of the character and its own preference in the text data.
Optionally, the setting of the dubbing player for character initialization may also be performed by the client, which is not limited by the present application.
Step 740, in response to the setting completion operation, displaying a dubbing result display interface, and displaying at least one sentence of the target reading material and a role corresponding to the sentence in the dubbing result display interface.
Optionally, the dubbing result presentation interface includes a chapter selection area, a text presentation area, and a character selection area. Wherein the chapter selection area is used for displaying a plurality of chapters of the target reading. The text display area is used for displaying sentences contained in at least one section selected from the plurality of sections and roles corresponding to the sentences. The character selection area is used for displaying at least one character contained in at least one chapter selected. Step 740 is followed by the further step of: in a case where at least one sentence of a first chapter of the plurality of chapters is displayed in the text presentation area, in response to a selection operation for a second chapter of the plurality of chapters, at least one sentence of the second chapter and a character to which the sentence corresponds are displayed in the text presentation area; and displaying at least one character contained in the second chapter in the character selection area.
The dubbing result display interface comprises a chapter selection area, a text display area and a role selection area. The chapter selection area displays a plurality of chapters of the target reading. Optionally, the text presentation area initially displays text data and dubbing sentences of the first chapter, and the character selection area initially displays characters in the first chapter and the number of dubbing sentences thereof. Through the selection operation of the second chapter in the multiple chapters of the target reading material, the text data and the dubbing sentences of the second chapter of the target reading material are displayed in a text display area in a dubbing result display interface, and the characters and the number of the dubbing sentences in the second chapter are displayed in a character selection area. Optionally, the dubbing result display interface further displays the character name and the corresponding sentence contained in the first chapter.
Optionally, the client displays a dubbing result area of the first character according to a selection operation of the object for the first character in the first chapter. Wherein, only the character dubbing sentence selected by the object is displayed in the dubbing result area. Alternatively, the object may perform trial listening of the character dubbing sentence by clicking a dubbing content trial listening controller of the character dubbing sentence in the dubbing result area.
Optionally, the object may also modify the character dubbing sentence by clicking a dubbing content modification control of the character dubbing sentence in the dubbing region.
In some embodiments, as shown in part (a) of fig. 6, part (a) of fig. 6 shows a dubbing result display interface 60, and shows a chapter selection area 61 and a target chapter 64 selected by the object within the chapter selection area 61. A text display area 62 of the target chapter 64 is displayed in the client. In the text display area 62, a plurality of dubbing sentences are displayed, where the dubbing sentence 65 includes a character 66, its corresponding sentence 67, an audio content trial-listening control 68, and the corresponding audio content. As shown in part (b) of fig. 6, a character selection area 63 corresponding to the selected character is additionally displayed. The object may click on the selected character in the character selection area 63 to display the dubbing result area 68A corresponding to the selected character, and all the dubbing results of the selected character in the chapter text data of the target reading material are displayed in the dubbing result area 68A. Optionally, the object may choose to trial-listen to a dubbing result, cancel a dubbing result, or adjust a dubbing result in the dubbing result area 68A corresponding to the selected character. According to the object's dubbing configuration adjustment operation on the first sentence of the first character in the chapter of the target reading material, the dubbing result of the dubber is adjusted, so that the adjusted dubbing result meets the object's requirements and the dubbing of the text data becomes more diverse and reasonable.
In some embodiments, the following various adjustment operations on text data may also be performed in the dubbing result presentation interface.
In response to a dubbing setting operation for a target reading, performing a setting action corresponding to the dubbing setting operation; wherein the setting behavior comprises at least one of: inserting a pause character into the text data of the target reading material, adjusting the reading speed of a selected target word and sentence in the text data of the target reading material, adjusting the continuity between selected target word groups in the text data of the target reading material, setting the pronunciation of a selected target polyphone in the text data of the target reading material, setting the reading method of a selected target digital symbol in the text data of the target reading material, setting the reading method of a selected target word in the text data of the target reading material, and adjusting the dubbing parameters of a dubbing player in the text data of the target reading material.
Optionally, in response to a pause-adjusting operation for the text data, a pause character is inserted in the text data, the pause character being used to indicate a pause location in the text data. By inserting the pause character into the text data, when the dubbing operator encounters the pause character during dubbing of the text data, the dubbing operator pauses at the pause character and then continues dubbing of the text data.
Optionally, in response to a speed adjustment operation for the text data, a speaking speed of the selected target word or sentence in the text data is adjusted. And adjusting the speaking speed of the corresponding dubbing staff by adjusting the speaking speed of the selected target words and sentences in the text data. For example, the speaking speed of a selected target word and sentence in the text data is increased, and the speaking speed of a corresponding dubbing speaker is increased; and (3) slowing down the speaking speed of the selected target words and sentences in the text data, and slowing down the speaking speed of the corresponding dubbing staff.
Optionally, in response to a phrase setting operation for the text data, the continuity between selected target phrases in the text data is adjusted. By adjusting the continuity between the selected target phrases in the text data, the target phrases can be read continuously or with a break. For example, for the phrase "desk and chair" in a sentence, if it is set as a continuous phrase in the text data, the dubber reads "desk and chair" without a break; if it is set as a discontinuous phrase, the dubber reads "desk" and "chair" with a break between them.
Optionally, in response to a polyphone setting operation for the text data, the pronunciation of a selected target polyphone in the text data is set. By setting the correct pronunciation of the selected target polyphone in the text data, the dubber can read the target polyphone correctly during dubbing. For example, for the polyphonic character meaning "shell" in the word "crust", setting its correct pronunciation to "qiao" allows the dubber to read the word accurately.
Optionally, in response to a numeric symbol setting operation for the text data, the reading of a selected target numeric symbol in the text data is set. By setting the correct reading of the selected target numeric symbol in the text data, the dubber can read the target numeric symbol correctly during dubbing. For example, for "520", setting the correct reading to be the individual digits "five, two, zero" allows the dubber to read it accurately.
Optionally, in response to a word reading operation for the text data, a reading of the selected target word in the text data is set. By setting the reading method of the selected target word in the text data, the dubbing operator can correctly read the reading method of the target word during dubbing.
Optionally, in response to a dubbing operator adjustment operation for the text data, a dubbing parameter of the dubbing operator in the text data is adjusted. The dubbing parameters of the dubbing person in the text data are adjusted, so that the dubbing person can meet the dubbing requirements of the object during dubbing, wherein the dubbing requirements comprise the dubbing parameters such as the reading speed, the reading volume, the reading language, the reading emotion and the like of the dubbing person. Optionally, the object may also perform adjustment of the dubbing parameters of the dubbing player by means of voice input. The object carries out dubbing on the text data by adopting own voice to obtain a manual dubbing result, and the server analyzes the manual dubbing result to adjust the dubbing parameters of the dubbing staff, so that the dubbing staff meets the dubbing requirement of the object during dubbing.
The dubbing results of the dubbing staff are adjusted in various modes, so that the dubbing modes of the text content are more diversified and reasonable, and the adjusted dubbing results meet the use requirements of the objects.
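The setting behaviors above amount to per-sentence overrides that are applied to the text before synthesis. The sketch below illustrates three of them (pause insertion, polyphone pronunciation, and digit-by-digit reading) with a hypothetical pause marker; the marker and the annotation format are assumptions, not a format defined by this application.
    import re

    PAUSE_MARK = "<pause/>"  # hypothetical marker for a pause position

    def apply_text_overrides(text, pause_positions, pronunciations, digit_by_digit):
        # Insert pause markers at the requested character positions
        # (processing from right to left keeps earlier indices valid).
        for pos in sorted(pause_positions, reverse=True):
            text = text[:pos] + PAUSE_MARK + text[pos:]
        # Attach the chosen pronunciation to each selected polyphonic word.
        for word, reading in pronunciations.items():
            text = text.replace(word, f"{word}({reading})")
        # Optionally spell out digit strings one digit at a time, e.g. "520" -> "5 2 0".
        if digit_by_digit:
            text = re.sub(r"\d+", lambda m: " ".join(m.group()), text)
        return text

    print(apply_text_overrides("I read chapter 12 yesterday.", [6], {"read": "red"}, True))
    # -> I read(red)<pause/> chapter 1 2 yesterday.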
Step 750, in response to the play operation for the target reading material, playing the audio content of the target sentence in the target reading material generated by the target dubber, where the target dubber is a dubber set for the character corresponding to the target sentence.
For details not described in detail in steps 710-750 above, reference is made to the description in other embodiments of the context.
In some embodiments, step 740 is followed by the following steps (step 760 and step 770):
step 760, in response to the export operation, displaying a chapter selection interface in which a plurality of chapters of the target readout are displayed.
In step 770, in response to a selection operation for at least one chapter of the plurality of chapters, an audio file corresponding to the selected chapter is generated.
The client generates an audio file corresponding to the selected chapter according to the object's selection operation on at least one chapter, and exports the selected chapter. Alternatively, the audio file may be stored in MP3 (Moving Picture Experts Group Audio Layer III) format, WMA (Windows Media Audio) format, or another audio format, which is not limited in the present application. Optionally, the generated audio file may be downloaded to complete local storage of the audio file.
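Exporting the selected chapters then amounts to writing the already-encoded audio of each chapter to a local file in the chosen format. The sketch below assumes the server has returned the encoded audio bytes per chapter; the function and parameter names are illustrative, and the encoding step itself is omitted.
    from pathlib import Path

    def export_chapters(chapter_audio, selected_titles, out_dir, extension="mp3"):
        # chapter_audio: mapping from chapter title to encoded audio bytes.
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        paths = []
        for title in selected_titles:
            path = out / f"{title}.{extension}"
            path.write_bytes(chapter_audio[title])   # local storage of the audio file
            paths.append(path)
        return paths

    # Example: export chapters 1 and 3 in MP3 format.
    export_chapters({"chapter-1": b"...", "chapter-3": b"..."}, ["chapter-1", "chapter-3"], "exports")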
In some embodiments, as shown in fig. 8, fig. 8 illustrates a schematic diagram of a chapter selection interface. The chapter selection interface 80 in fig. 8 displays a plurality of chapters of the target reading material. The current chapter 81 may be selected as the export object by clicking on it, and chapters may be selected in batches as export objects through the batch selection chapter control 82 in the chapter selection interface 80. The object may choose an audio format through the audio format selection control 83, and finally export the selected chapters by clicking the make-audio control 84.
By storing and downloading the audio file corresponding to the target reading material, the object can listen to the audio file without a network connection.
The application adjusts the dubbing result of the dubbing staff in various modes, so that the dubbing mode of the target reading material is more diversified and reasonable, and the quality of the dubbing result is improved.
Fig. 9 is a flowchart illustrating a method for generating an audio reading according to another embodiment of the application. The execution subject of the method may be the server 20 in the implementation environment of the scheme shown in fig. 1, or may be the terminal device (e.g. client) 10 in the implementation environment of the scheme shown in fig. 1. In this embodiment, description will be made taking the example in which each step execution subject is a server. The method may comprise at least one of the following steps (910-940):
In step 910, text data of the target reading to be dubbed is obtained.
The server acquires text data of the target reading matter sent by the client.
Step 920, identifying a plurality of roles included in the target reading based on the text data of the target reading.
The server analyzes the text data sent by the client and identifies a plurality of roles contained in the text data.
Step 930, obtaining a dubbing player set for the character; wherein dubbing staff set for at least two different roles in the target reading material are different.
The server selects a suitable dubber to dub each character according to the style characteristics of the characters contained in the text data. Alternatively, the server acquires the dubber selected by the object for each of the multiple roles and dubs the multiple roles with the dubbers selected by the object.
Step 940, generating an audio file corresponding to the target reading material based on dubbing staff corresponding to each role in the target reading material.
Optionally, the server acquires the AI acoustic model corresponding to each dubber, and performs dubbing processing on the text data corresponding to each role according to the AI acoustic model of the corresponding dubber to obtain a dubbing file corresponding to the dubbing result. After the server sends the dubbing file to the client, the object obtains the dubbing file through its save and download operations.
In this embodiment, the server identifies a plurality of characters included in the target reading material, determines the dubber corresponding to each character, and then performs dubbing processing on the text data corresponding to each character, so as to generate an audio file corresponding to the target reading material. The application uses the AI acoustic models of multiple dubbers to dub different roles in the same target reading material; because different AI acoustic models generate sounds with different timbres, the audio reading material generated in this way is no longer limited to a single dubber (or a single timbre), so the sounds in the audio reading material are more diverse and the dubbing quality of the audio reading material is improved.
Fig. 10 is a flowchart illustrating a method for generating an audio reading material according to another embodiment of the application. The execution subject of the method may be the server 20 in the implementation environment of the scheme shown in fig. 1, or may be the terminal device (e.g. client) 10 in the implementation environment of the scheme shown in fig. 1. In this embodiment, description will be made taking the example in which each step execution subject is a server. The method may include at least one of the following steps (1010-1070):
In step 1010, text data of the target reading to be dubbed is obtained.
Step 1020, identifying roles respectively contained in each sentence for each sentence contained in the text data of the target reading material.
The server analyzes and recognizes the text data of the target reading material to obtain the plurality of roles contained in the text data. Optionally, a named entity recognition method is adopted to recognize the characters in the text data. Named entity recognition methods include: rule-based methods with manually designed extraction rules, unsupervised learning methods, supervised learning methods based on hand-crafted features, supervised learning methods based on deep learning, and the like. The supervised method based on deep learning is mainly described below.
For example, take the sentence "Xiaoming watched a Chinese men's basketball match in the Yanyuan of Beijing University". The NER (Named Entity Recognition) model classifies "Xiaoming" as PER (person name), "Beijing University" as ORG (organization name), "Yanyuan" as LOC (place name), and "Chinese men's basketball team" as ORG (organization name). Entity recognition is thus the process of picking out the entities of the desired types from a sentence.
NER is a sequence labeling problem, so its data annotation follows the conventions of sequence labeling, mainly the BIO and BIOES schemes. BIOES is described directly below; once BIOES is understood, BIO follows.
First, what BIOES stands for:
B, i.e. Begin, marks the beginning of an entity
I, i.e. Intermediate, marks the middle of an entity
E, i.e. End, marks the end of an entity
S, i.e. Single, marks a single-character entity
O, i.e. Other, marks other, irrelevant characters
Labeling the sentence "Xiaoming watched a Chinese men's basketball match in the Yanyuan of Beijing University" gives the following result:
[B-PER,E-PER,O,B-ORG,I-ORG,I-ORG,E-ORG,O,B-LOC,E-LOC, O,O,B-ORG,I-ORG,I-ORG,E-ORG,O,O,O,O]
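A minimal Python sketch of decoding such a BIOES tag sequence back into entity spans is shown below; the function name and the character-level input are assumptions made for illustration:

    def decode_bioes(chars, tags):
        # chars: list of characters; tags: BIOES tags aligned with chars.
        entities, start = [], None
        for i, tag in enumerate(tags):
            if tag.startswith("B-"):
                start = i                          # entity begins here
            elif tag.startswith("I-"):
                pass                               # inside an entity, keep the current start
            elif tag.startswith("E-") and start is not None:
                entities.append(("".join(chars[start:i + 1]), tag[2:]))
                start = None                       # entity ends here
            elif tag.startswith("S-"):
                entities.append((chars[i], tag[2:]))
                start = None                       # single-character entity
            else:                                  # "O" or anything unexpected
                start = None
        return entities

    # Applied to the labeled sentence above, this yields the PER, ORG and LOC mentions
    # (the person name, the university, the place and the team) together with their types.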
Optionally, step 1020 includes: obtaining a vector representation sequence corresponding to the target sentence for the target sentence contained in the text data of the target reading material; inputting the vector representation sequence into a character recognition model, and extracting characteristic information in the vector representation sequence through a context encoder of the character recognition model; outputting marking results corresponding to each word in the target sentence according to the characteristic information by a marking decoder of the character recognition model, wherein the marking results are used for indicating entity types of the words; and obtaining roles contained in the target sentence based on the labeling results respectively corresponding to the words in the target sentence.
The server performs word segmentation processing on the target sentence to obtain at least one word segment contained in the target sentence, and then obtains vector representations corresponding to the word segments respectively to obtain a vector representation sequence corresponding to the target sentence. The vector representation sequence refers to representing each word in the target sentence by a vector. Alternatively, the word segmentation process may be word level division or character level division, and the corresponding vector representation sequence of the word level or the vector representation sequence of the character level is obtained, which is not limited in the present application.
One distributed representation used after word segmentation is the word-level representation, such as CBOW (continuous bag of words) and skip-gram (an unsupervised learning technique); based on these models, pre-trained word vector representations (pretrained word embeddings) are obtained by unsupervised learning on a large-scale text corpus. Different pre-trained word vectors are trained on different large-scale corpora, and a suitable pre-trained word vector can be selected according to the data distribution of the actual task.
These pre-trained word vectors have played a driving role in the NLP (Natural Language Processing) field in the deep learning era, simplifying complex feature engineering into a table lookup through a vocabulary.
The server inputs the vector representation sequence corresponding to the target sentence into a character recognition model, where the character recognition model includes a context encoder and a labeling decoder, and the feature information in the vector representation sequence is obtained through the context encoder. The context encoder is one part of the character recognition model; it analyzes the relationship between each word segment and the other word segments in the vector representation sequence to obtain the corresponding feature information. The labeling decoder, the other part of the character recognition model, then outputs, according to the feature information, the labeling results corresponding to the words in the target sentence, where the labeling results indicate the entity category of each word. Finally, the words whose entity category is "role" can be extracted to obtain the roles contained in the target sentence.
The context encoder aims to mine hidden patterns between contexts in the word sequence. Optionally, the encoder includes: a CNN (Convolutional Neural Network), a recurrent or recursive neural network, a language model (Language Model), and the like.
CNN-based methods extract local features well but handle long-term dependencies within a sentence poorly, so encoders based on Bi-LSTM (Bidirectional Long Short-Term Memory) are widely used. The recursive NN treats a sentence as a tree structure rather than a sequence and is theoretically more expressive, but it suffers from difficult sample annotation, vanishing gradients in deep layers and difficulty in parallel computation, so it is rarely used in practice. The Transformer is a widely used network structure that combines characteristics of CNNs and RNNs (Recurrent Neural Networks); it extracts global features well and has an advantage over RNNs in parallel computation.
The labeling decoder is part of the named entity recognition model; it takes the representation obtained through the context encoder as input. Common decoder choices are: MLP (Multilayer Perceptron) + Softmax, CRF (Conditional Random Field), RNN, Pointer Networks, and the like.
MLP + Softmax is a classification output layer that can fit the label sequence output of the NER task fairly efficiently, provided the upstream distributed representation and encoding layers represent the text information effectively.
CRF is an evergreen of sequence labeling tasks, especially the linear-chain CRF (Linear Chain CRF), which models the probability from the observation sequence to the output sequence and thereby captures the front-to-back constraints within the output sequence.
Pointer Networks were originally applied to combinatorial optimization problems; their characteristic is that the output sequence length is determined by the input sequence length, so they can also be applied well to sequence labeling tasks.
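A simplified sketch combining one common choice from each of the parts above (a Bi-LSTM context encoder with a per-token softmax labeling decoder) is given below; it is only an assumed instantiation, since this text does not fix a concrete architecture, and a CRF decoder could equally be substituted:

    import torch
    import torch.nn as nn

    class RoleRecognitionModel(nn.Module):
        def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 128, hidden: int = 256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # context encoder: mines dependencies between the tokens of the sentence
            self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            # labeling decoder: maps contextual features to BIOES tag scores per token
            self.decoder = nn.Linear(2 * hidden, num_tags)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:  # (batch, seq_len)
            x = self.embed(token_ids)          # vector representation sequence
            features, _ = self.encoder(x)      # feature information from the context encoder
            return self.decoder(features)      # per-token tag logits

    # tags = logits.argmax(-1); the tag sequence is then decoded with the BIOES scheme shown earlier.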
For example, take the target sentence "Wei somewhere could not help hiding behind blue somewhere and said: 'Your son's wedding, what does it have to do with me!'". Word segmentation is first performed at the character level to obtain each character in the target sentence. The prediction result is then obtained through the context encoder and labeling decoder: B-PER I-PER E-PER O OO B-PER I-PER E-PER OOOOOOOOOOOOOOOOOOOOOOO, where B-PER I-PER E-PER indicates characters that combine into a role and O indicates characters that do not belong to a role. Finally, the roles in the target sentence are "Wei somewhere" and "blue somewhere".
As shown in fig. 11, fig. 11 illustrates a schematic structural diagram of a BERT. BERT is divided into a pre-training and a Fine-Tuning phase.
In the pre-training stage, two tasks are used for training: MLM (Masked Language Model) and NSP (Next Sentence Prediction).
In the MLM task, part of the input is masked and the model is trained to predict the masked tokens, producing a labeling result for the masked positions.
In the NSP task, two sentences are input and the model determines whether sentence B is the next sentence of sentence A.
In the BERT fine-tuning stage, the model head is designed according to the characteristics and objectives of each downstream task.
Machine reading comprehension tasks are usually evaluated in a question-answering form: natural-language questions related to the content of an article are designed, and the model must understand the questions and answer them based on the article. To evaluate the correctness of an answer, the reference answer generally takes one of the following forms:
Multiple choice, i.e. the model needs to select the correct answer from a given set of options;
Span answer, i.e. the answer is restricted to a span of the article, and the model is required to mark the start and end positions of the correct answer in the article;
Free answer, i.e. the form of the generated answer is not restricted, and the model is allowed to generate sentences freely;
Cloze (gap filling), i.e. several keywords are removed from the text, and the model is required to fill in the correct words or phrases.
The application provides a scheme for determining characters contained in a text by adopting a named entity recognition mode, and characters can be automatically recognized from the text. Meanwhile, the characters contained in the text are identified by adopting a named entity identification technology based on deep learning, so that the characters contained in the text can be identified efficiently and accurately, and the quality of dubbing results is improved.
Step 1030, counting the number of occurrences of each identified character in the target reading.
Based on each character identified from the text data of the target reading, the number of occurrences of each character in the target reading is counted.
Step 1040, selecting the roles whose appearance times satisfy the first condition, and obtaining a plurality of roles contained in the target reading material.
The occurrence counts of the characters in the target reading material are filtered by the first condition to obtain the plurality of characters contained in the target reading material. The first condition may be a threshold: only characters whose occurrence count exceeds the threshold are counted among the plurality of characters. Alternatively, the first condition may be a top-N cut, for example N = 20: only the 20 characters that appear most often are counted among the plurality of characters. The more often a character appears in the target reading material, the more important that character is in the target reading material, and filtering by occurrence count selects the important characters for which the object sets dubbers.
By selecting the characters that appear more often and providing them to the object for dubbing setting, the important roles in the target reading material are guaranteed to have dubbers that meet the object's requirements, which improves the quality of the dubbing result. Meanwhile, providing only the important roles in the target reading material for dubber selection reduces the operations required of the object.
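A minimal sketch of this counting and filtering step is shown below; both readings of the first condition (an occurrence threshold and a top-N cut) are included, and the function and parameter names are assumptions:

    from collections import Counter

    def select_main_roles(role_mentions, min_count=None, top_n=None):
        # role_mentions: every role mention recognized in the target reading material.
        counts = Counter(role_mentions)                        # occurrences per role
        if min_count is not None:
            return [r for r, c in counts.items() if c >= min_count]
        if top_n is not None:
            return [r for r, _ in counts.most_common(top_n)]   # e.g. the 20 most frequent roles
        return list(counts)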
Optionally, step 1040 is followed by the following steps: for a target character contained in the target reading material, determining characteristic information of the target character according to text data of the target character, wherein the characteristic information is used for representing style characteristics of the target character; according to the characteristic information of the target character and the characteristic information of each candidate dubbing person, determining the matching degree corresponding to each candidate dubbing person respectively, optionally, the characteristic information of the candidate dubbing person is determined by the dubbing parameters of the candidate dubbing person; and selecting the candidate dubbing staff with the matching degree meeting the third condition as the dubbing staff of the target role.
And the server determines the characteristic information of the target role according to the text data of the target role, and matches the characteristic information of the target role with the characteristic information of each candidate dubbing player to obtain the corresponding matching degree. And selecting the candidate dubbing staff with the matching degree meeting the third condition as the dubbing staff of the target role.
Alternatively, the third condition may be that the server selects the candidate dubber with the highest matching degree as the dubber of the target character. Alternatively, the third condition may be that the server selects several candidate dubbers of different styles with the highest matching degrees as recommended dubbers of the target character and sends them to the client, and any one of these differently styled candidate dubbers is selected as the dubber of the target character according to the object's preference.
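The text does not fix how the matching degree is computed; as a hedged illustration, the sketch below scores each candidate dubber by the cosine similarity between the target role's feature vector and the dubber's feature vector (derived from that dubber's dubbing parameters):

    import numpy as np

    def match_dubbers(role_vec, dubber_vecs):
        # dubber_vecs: {dubber name: feature vector}; returns dubbers sorted by matching degree.
        scores = {}
        for name, vec in dubber_vecs.items():
            denom = np.linalg.norm(role_vec) * np.linalg.norm(vec) + 1e-8
            scores[name] = float(np.dot(role_vec, vec) / denom)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # The third condition can then be "take the best match" or "take the top matches of
    # different styles and let the object choose", as described above.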
Step 1050, obtaining a dubbing player set for the character; wherein dubbing staff set for at least two different roles in the target reading material are different.
Step 1060, for the target sentence contained in the text data of the target reading material, determining the role corresponding to the target sentence according to the target sentence and the context information of the target sentence.
Optionally, the context information of the target sentence includes at least one of: at least one statement preceding the target statement, at least one statement following the target statement.
Optionally, step 1060 includes the steps of: acquiring at least one candidate role contained in the context information of the target sentence; extracting characteristic information corresponding to each candidate role from the target statement and the context information of the target statement; according to the characteristic information corresponding to each candidate character, determining the score corresponding to each candidate character; and selecting the candidate roles with scores meeting the second condition as the roles corresponding to the target sentences.
The roles contained in the target sentence and its context information are taken as candidate roles. For each candidate role, its feature information is obtained by analyzing the target sentence and the context information of the target sentence. Position information of the candidate role is also obtained, including the distance between the candidate role and the feature information in the target sentence, whether the mention spans a boundary, and the like. The candidate roles, the feature information in the target sentence, the position information of each candidate role and the text data are input together into a neural network, and the neural network outputs a score for each candidate role. The role of the target sentence is then determined according to the scores: the higher the score, the more likely the candidate is the role of the target sentence.
For example, assume that the target sentence is "Wei somewhere could not help hiding behind blue somewhere and said: 'Your son's wedding, what does it have to do with me!'", and that the context information of the target sentence is as follows:
Wei somewhere was watching from the side when Madam Mo suddenly walked over. Blue somewhere, quick of eye and hand, blocked her with a hand, and Madam Mo said: "My son is getting married, and I invited her to attend the wedding! Why are you blocking me?"
Wei somewhere could not help hiding behind blue somewhere and said: "Your son's wedding, what does it have to do with me!"
During the day, blue somewhere in the east wing had seen what happened to Wei somewhere, and later heard from others a good deal of embellished gossip about her; feeling sympathy for her, she could not help speaking up for her: "Madam Mo, Wei somewhere has secretly loved your son for years. Now that your son is getting married, Wei somewhere is already hurt; do not make things harder for her."
In the above passage, the candidate roles found are "Wei somewhere", "blue somewhere" and "Madam Mo". When the candidate role is "Wei somewhere" and the feature information in the target sentence is "your son", the distance between "Wei somewhere" and "your son" is 3 and there is no span, while "Wei somewhere" appears 5 times in this piece of text. Likewise, the distance, whether there is a span, and the number of occurrences with respect to the feature information in the target sentence are obtained for "blue somewhere" and "Madam Mo". The above information is input into the neural network to obtain the respective scores, for example a score of 0.9 for "Wei somewhere" and scores of 0.1 for "blue somewhere" and "Madam Mo". Then "Wei somewhere" is selected as the role to which the target sentence belongs.
Optionally, determining the score corresponding to each candidate role according to the feature information corresponding to each candidate role includes the following steps: for each candidate character, generating input data of a character selection model based on the feature information, the target sentence and the context information of the target sentence corresponding to the candidate character; input data is input to the character selection model, and scores corresponding to candidate characters are output through the character selection model.
Alternatively, the second condition may be that the server selects the candidate role with the highest score as the role corresponding to the target sentence. Optionally, the second condition may be a threshold: the server selects the candidate roles whose scores exceed the threshold as recommended candidates, sends them to the client, and the role corresponding to the target sentence is determined among these candidate roles through manual review.
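The following sketch illustrates one way such a role selection model might score candidates: a small MLP over hand-built features (distance to the quoted speech, whether the mention crosses a span boundary, occurrence count) stands in for the unspecified neural network, which in this text also consumes the target sentence and its context; all names and feature choices here are assumptions:

    import torch
    import torch.nn as nn

    class RoleSelector(nn.Module):
        def __init__(self, feat_dim: int = 3, hidden: int = 32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, features: torch.Tensor) -> torch.Tensor:  # (num_candidates, feat_dim)
            return self.net(features).squeeze(-1)                   # one score per candidate

    # one feature row per candidate, e.g. [distance, crosses_span, occurrence_count]
    feats = torch.tensor([[3.0, 0.0, 5.0], [12.0, 1.0, 2.0], [20.0, 1.0, 1.0]])
    scores = torch.softmax(RoleSelector()(feats), dim=0)
    # The candidate whose score satisfies the second condition (e.g. the highest score)
    # is taken as the role corresponding to the target sentence.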
By analyzing the context information of the target sentence, the role to which the target sentence belongs can be identified automatically, which ensures the accuracy of the dubbing result. In addition, when identifying the role of the target sentence, only the roles contained in the context information of the target sentence need to be taken as candidates, rather than all roles in the target reading material, which greatly reduces the computation required for the judgment without affecting accuracy.
Step 1070, generating an audio file corresponding to the target reading material based on the dubbing staff corresponding to each role in the target reading material.
After determining the role to which each sentence in the target reading material belongs, the text data of each role can be obtained, and dubbing processing is then performed on the text data of each role to generate the audio file corresponding to the target reading material. The server inputs the text data corresponding to a target role into the AI acoustic model of the dubber corresponding to that role to obtain the sentence dubbing result of the target role. The sentence dubbing results of all the roles are combined to obtain the audio file corresponding to the target reading material. The audio file corresponding to the target reading material also contains the dubbing result corresponding to the bystander (narration) content: the server inputs the text data corresponding to the bystander into the AI acoustic model of the dubber set for the bystander to obtain the bystander dubbing result. The sentence dubbing results of the roles and the bystander dubbing result are combined to obtain the audio file corresponding to the target reading material.
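A minimal sketch of this assembly step is given below; the per-sentence synthesis interface, the sampling rate and the inter-sentence pause are assumptions made only for illustration:

    import numpy as np

    def build_audiobook(sentences, role_of, model_of, narration_model, sr=24000):
        # sentences: sentences of the target reading material in reading order;
        # role_of: sentence -> role (None for bystander/narration);
        # model_of: role -> AI acoustic model of the dubber set for that role.
        pieces = []
        for sent in sentences:
            role = role_of.get(sent)
            model = model_of[role] if role is not None else narration_model
            pieces.append(model.synthesize(sent))       # waveform as a NumPy array (assumed API)
            pieces.append(np.zeros(int(0.3 * sr)))      # short pause between sentences
        return np.concatenate(pieces)                   # samples of the final audio file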
According to the method, the important roles in the target reading material can be configured for the expected dubbing staff of the object, and finally the audio reading material meeting the requirement of the object is generated.
In addition, by analyzing the context information of the target sentence, this embodiment can automatically identify the role to which the target sentence belongs, which ensures the accuracy of the dubbing result. Moreover, when identifying the role to which the target sentence belongs, only the roles contained in the context information of the target sentence need to be taken as candidates, rather than all roles in the target reading material, which greatly reduces the computation required for the judgment without affecting accuracy.
In some embodiments, as shown in fig. 12, fig. 12 illustrates a use process and a training process of the AI acoustic model.
The dubbing person carries out dubbing of the target reading material through the corresponding AI acoustic model, wherein the AI acoustic model comprises a first encoder, a prosody model and a tone model, and the use process of the AI acoustic model is as follows:
for a target sentence contained in text data of a target reading material, performing coding processing on the target sentence through a first coder to obtain phoneme-level tone quality information corresponding to the target sentence; wherein the phoneme-level sound quality information is used for representing the sound quality of the target sentence;
Expanding frames of the phoneme-level tone quality information of the target sentence to obtain frame-level tone quality information corresponding to the target sentence; the frame-level tone quality information is used for representing tone quality of the target sentence after frame expansion;
determining a predicted prosody code representation vector corresponding to the target sentence through a prosody model according to the frame-level tone quality information corresponding to the target sentence; the prediction prosody coding representation vector comprises a prediction sentence-level prosody coding representation vector, a prediction frame-level prosody coding representation vector and a prediction phoneme-level prosody coding representation vector;
Generating, through the timbre model, spectrum information corresponding to the target sentence according to the predicted prosody coding representation vector corresponding to the target sentence, where the spectrum information corresponding to the target sentence is used to generate the audio content of the target sentence in the audio file. The predicted sentence-level prosody coding representation vector is a prosody coding representation vector in units of the whole target sentence, the predicted frame-level prosody coding representation vector is in units of frames, and the predicted phoneme-level prosody coding representation vector is in units of the phonemes of the target sentence. These three prosody representation vectors characterize the prosody of the target sentence more specifically and clearly.
The sound quality information is used to represent the sound quality of the dubbing result to be generated. The tone quality information includes sentence-level tone quality information, phoneme-level tone quality information, and frame-level tone quality information. The sentence-level tone quality information is tone quality information of a whole sentence of a target sentence, the phoneme-level tone quality information is tone quality information of each phoneme in the target sentence, and the frame-level tone quality information is tone quality information of each frame in the target sentence.
Optionally, the phoneme-level tone quality information is frame-expanded through a duration model to generate the frame-level tone quality information. Frame expansion increases the length of the tone quality information by compressing its width, so the length is increased without affecting the overall size of the information; the lengthened tone quality information is more suitable for the use and training of the model.
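The inference flow above can be summarized in the following sketch; the encoder, duration, prosody and timbre models (plus a vocoder turning spectrum frames into a waveform) are treated as opaque callables, since this text does not fix their concrete implementations:

    def synthesize_sentence(sentence, encoder, duration_model, prosody_model, timbre_model, vocoder):
        phone_info = encoder(sentence)                  # phoneme-level tone quality information
        frame_info = duration_model.expand(phone_info)  # frame expansion to frame-level information
        prosody = prosody_model(frame_info)             # predicted prosody coding representation vectors
        spectrum = timbre_model(prosody)                # spectrum information, e.g. mel-spectrogram frames
        return vocoder(spectrum)                        # waveform used for the audio content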
Illustratively, the training process of the AI acoustic model is as follows:
acquiring a training sentence;
encoding the training sentences through a first encoder to obtain phoneme-level voice information corresponding to the training sentences;
expanding frames of the phoneme-level tone quality information of the training sentences to obtain frame-level tone quality information corresponding to the training sentences;
determining a predicted prosody code representation vector corresponding to the training statement through a prosody model according to the frame-level tone quality information corresponding to the training statement;
Determining an error of the training process based on the real prosody code representation vector of the training sentence and the predicted prosody code representation vector of the training sentence;
Parameters of the first encoder, the prosody model, and the timbre model are adjusted based on errors of the training process.
The training sentences are sentences whose prosody, timbre and spectrum information are known. The training sentences are analyzed with the AI acoustic model to obtain an analysis result, and the analysis result is compared with the known prosody, timbre and spectrum information to obtain the training error, which is used to optimize the parameters of the first encoder, the prosody model and the timbre model.
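One training step matching this procedure might look as follows; the MSE loss, the optimizer and the prosody extractor interface are illustrative assumptions rather than requirements of this text:

    import torch

    def train_step(batch, encoder, duration_model, prosody_model, prosody_extractor, optimizer):
        phone_info = encoder(batch["text"])
        frame_info = duration_model.expand(phone_info)
        pred_prosody = prosody_model(frame_info)                 # predicted prosody coding vector
        real_prosody = prosody_extractor(batch["audio"])         # real prosody coding vector
        loss = torch.nn.functional.mse_loss(pred_prosody, real_prosody)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # the optimizer holds the parameters of the encoder, prosody and timbre models
        return loss.item()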
In some embodiments, as shown in fig. 12, in the model using part in fig. 12, after the server inputs the text data of the target sentence into the first encoder, the text data is encoded to obtain the phoneme-level sound quality information of the target sentence, and then the phoneme-level sound quality information is compressed and expanded by the duration model to obtain the frame-level sound quality information corresponding to the target sentence. And inputting the obtained frame-level tone quality information into a prosody model to obtain a predicted prosody coding representation vector, and inputting the predicted prosody coding representation vector into a tone model to obtain the frequency spectrum information corresponding to the target sentence. Wherein, the tone quality information of the frequency spectrum information is influenced by parameters in the prosody model and the tone model, and an audio file is generated according to the frequency spectrum information.
In the model training section in fig. 12, the model uses the same method as described above to obtain a predicted prosody code representation vector (predicted timbre vector) from training samples through a first encoder, a duration model, and a prosody model. Meanwhile, the training sample obtains a real prosody coding representation vector (real tone quality vector) through a tone quality extractor, a training error between the predicted prosody coding representation vector and the real prosody coding representation vector is obtained through calculation, and parameters of the first encoder, the prosody model and the tone color model are adjusted through the training error.
Alternatively, the timbre prediction decoder in the prosody model described above may use a VAE (Variational Auto-Encoder) for decoding. As shown in fig. 13, fig. 13 illustrates a schematic structural view of the VAE. Hierarchical VAE coding refers to extracting unsupervised VAE vectors at the sentence level, phoneme level and frame level from the acoustic features (MEL) using variational autoencoder techniques to represent prosodic features.
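As a hedged illustration of extracting variational prosody codes from mel features, the sketch below shows a single-level VAE encoder; the hierarchical sentence/phoneme/frame structure is omitted and all dimensions are assumed:

    import torch
    import torch.nn as nn

    class ProsodyVAEEncoder(nn.Module):
        def __init__(self, mel_dim: int = 80, latent_dim: int = 16):
            super().__init__()
            self.backbone = nn.GRU(mel_dim, 128, batch_first=True)
            self.to_mu = nn.Linear(128, latent_dim)
            self.to_logvar = nn.Linear(128, latent_dim)

        def forward(self, mel: torch.Tensor):           # (batch, frames, mel_dim)
            _, h = self.backbone(mel)
            h = h[-1]                                   # sentence-level summary of the acoustic features
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
            return z, mu, logvar                        # z serves as the unsupervised prosody code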
By training the AI acoustic model, the accuracy of the AI acoustic model is increased, and the quality of the generated dubbing result is improved.
In some embodiments, as shown in fig. 14, fig. 14 illustrates the use and training process of the AI acoustic model corresponding to the bystander (narration) data. For the bystander data contained in the text data of the target reading material, the audio content corresponding to the bystander data is generated by the AI acoustic model corresponding to the bystander. Optionally, the AI acoustic model corresponding to the bystander includes a second encoder, a bystander prosody model and a bystander timbre model.
The use process of the AI acoustic model corresponding to the bypass is as follows:
encoding, through the second encoder, the bystander data contained in the text data of the target reading material to obtain tone quality information corresponding to the bystander data;
determining, through the bystander prosody model, a predicted prosody coding representation vector corresponding to the bystander data according to the tone quality information corresponding to the bystander data;
And generating spectral information corresponding to the bystander data according to the predicted prosody code representation vector corresponding to the bystander data through the bystander tone model, wherein the spectral information corresponding to the bystander data is used for generating the audio content of the bystander data in the audio file.
The training process of the bypass corresponding AI acoustic model is as follows:
Acquiring a bystander training text;
encoding the bystander training text through the second encoder to obtain tone quality information corresponding to the bystander training text;
determining, through the bystander prosody model, a predicted prosody coding representation vector corresponding to the bystander training text according to the tone quality information corresponding to the bystander training text;
determining an error of the training process based on the real prosody code representation vector and the predicted prosody code representation vector of the bystander training text;
Parameters of the second encoder, the bystander prosody model and the bystander timbre model are adjusted based on the errors of the training process.
The method in the embodiment shown in fig. 14 is the same as or similar to the method in the embodiment shown in fig. 12.
In some embodiments, as shown in fig. 14, in the model using part in fig. 14, the server inputs the text data (text sequence) into the second encoder to obtain the tone quality information corresponding to the text data, and inputs the obtained tone quality information together with the position information of the bystander content in the text data into the bystander prosody model to obtain the predicted prosody coding representation vector of the bystander content. The predicted prosody coding representation vector of the bystander content is then input into the bystander timbre model to obtain the spectrum information of the bystander content.
In the model training section in fig. 14, the training text is passed through the second encoder to obtain the tone quality information of the training text. At the same time, the real prosody representation vector is obtained from the real acoustic features of the training text. The error of the training process is determined based on the predicted prosody coding representation vector and the real prosody representation vector corresponding to the training text, and the parameters of the second encoder, the bystander prosody model and the bystander timbre model are adjusted according to the error. Alternatively, the predicted spectrum information corresponding to the training text may be compared with the real spectrum information to determine the error of the training process, and the parameters of the second encoder, the bystander prosody model and the bystander timbre model may be adjusted according to that error.
By training the AI acoustic model, the accuracy of the AI acoustic model is increased and the quality of the generated dubbing result is improved. Meanwhile, emotion determination is performed on the bystander content, so that the sounds in the audio reading material are more diversified and the dubbing quality of the audio reading material is improved.
In some embodiments, performing dubbing processing on the text data corresponding to each role based on the AI acoustic models of the dubbers corresponding to the roles in the target reading material, to generate the audio file corresponding to the target reading material, includes the following:
Acquiring at least one regular expression for identifying different chapters in the target reading material; matching the text data of the target reading material with the regular expression and searching for adapted sentences that match the regular expression; dividing the target reading material into chapters based on the adapted sentences and determining the names of the chapters; acquiring selection information for one or more second chapters in the target reading material; and performing dubbing processing on the text data corresponding to each role in the second chapter based on the AI acoustic models of the dubbers corresponding to the roles in the second chapter, generating the audio file corresponding to the second chapter.
The regular expressions capture the various chapter-heading forms in the target reading material; chapter matching is performed with the regular expressions, and each chapter in the target reading material is identified. For example, for a regular expression of the "Chapter X" form, chapter headings such as "Chapter One" and "Chapter Eighteen" in the text data are matched as chapter headings.
The obtained chapters are divided according to their original order in the text, and whether any chapter content has been matched incorrectly is checked against the stored chapters. Optionally, the obtained chapters are divided according to the actual order of the chapters, and whether the chapter contents have been matched incorrectly is likewise checked against the stored chapters.
The regular expression is used for automatically identifying the chapter of the target reading material, so that the time required by the object for chapter setting is saved. Meanwhile, the target reading chapter is automatically identified through the regular expression, so that the accuracy of chapter content is ensured, and the time required by subsequent manual auditing is saved.
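A minimal sketch of such regular-expression-based chapter recognition is shown below; it assumes Chinese chapter headings of the "第X章" family (the "Chapter X" form referred to above), and the pattern is only one example of the expressions that could be used:

    import re

    CHAPTER_RE = re.compile(r"^第[0-9一二三四五六七八九十百千]+章.*$", re.MULTILINE)

    def split_chapters(text):
        # Find every heading that matches the expression, then cut the text between headings.
        headings = [(m.start(), m.group().strip()) for m in CHAPTER_RE.finditer(text)]
        chapters = []
        for i, (pos, title) in enumerate(headings):
            end = headings[i + 1][0] if i + 1 < len(headings) else len(text)
            chapters.append((title, text[pos:end]))     # (chapter name, chapter text)
        return chapters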
A regular expression (Regular Expression, abbreviated RE) is a concept from computer science. A regular expression is a logical formula for operating on strings (consisting of ordinary characters, for example the letters a to z, and special characters called "metacharacters"): predefined specific characters and combinations of those characters form a "rule string", and this "rule string" expresses a filtering logic over strings. Table 1 below shows, by way of example, some of the metacharacters used in regular expressions and their descriptions:
TABLE 1
The server obtains the selection information of the object in the client for one or more second chapters in the target reading material. And the server performs dubbing processing on the text data corresponding to the one or more second chapters based on the AI acoustic model of the dubber to obtain one or more audio files corresponding to the second chapters.
By dubbing the selected chapters of the target reading material and obtaining the corresponding audio files through saving and downloading, a collection function for the audio files is provided to the object, which improves the diversity of functions.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Referring to fig. 15, a block diagram of an apparatus for generating an audio reading material according to an embodiment of the application is shown. The apparatus has the function of implementing the method for generating an audio reading material; the function can be implemented by hardware or by hardware executing corresponding software. The apparatus may be a terminal device or may be arranged in a terminal device. The apparatus 1500 may include: a setting interface display module 1510, a dubber setting module 1520, a dubbing result display module 1530, and a dubbing content playing module 1540.
The setting interface display module 1510 is configured to display a dubber setting interface corresponding to a target reading material, in which the multiple roles and multiple candidate dubbers contained in the target reading material are displayed.
A dubber setting module 1520 for displaying a dubber set for the character in the dubber setting interface in response to a dubber setting operation for the character; wherein dubbing staff set for at least two different roles in the target reading is different.
The dubbing result display module 1530 is configured to display a dubbing result display interface in response to a setting completion operation, where at least one sentence of the target reading material and a role corresponding to the sentence are displayed in the dubbing result display interface.
The dubbing content playing module 1540 is configured to respond to a playing operation for the target reading material, and play audio content of a target sentence in the target reading material generated by a target dubber, where the target dubber is a dubber set for a role corresponding to the target sentence.
In an exemplary embodiment, the dubber setting interface further displays an initially set dubber for the character, where the initially set dubber is a dubber set automatically based on the style of the character. The dubber setting module 1520 is configured to change the initially set dubber to a newly selected dubber in response to a dubber setting operation for the character, and to display the newly selected dubber as the character's dubber in the dubber setting interface.
In an exemplary embodiment, as shown in fig. 16, the apparatus 1500 further includes a parameter setting module 1550 for displaying interface elements for setting dubbing parameters for the dubbing player in the dubbing player setting interface, the dubbing parameters including at least one of: speakable speed, speakable volume, speakable language, speakable emotion.
In an exemplary embodiment, the dubbing result presentation interface includes a chapter selection area, a text presentation area, and a character selection area; the chapter selection area is used for displaying a plurality of chapters of the target reading material; the text display area is used for displaying sentences contained in at least one chapter selected from the chapters and roles corresponding to the sentences; the character selection area is used for displaying at least one character contained in the selected at least one chapter.
In an exemplary embodiment, the dubbing result display module 1530 is further configured to perform a setting action corresponding to a dubbing setting operation for the target reading material in response to the dubbing setting operation; wherein the setting behavior comprises at least one of: inserting a pause character into the text data of the target reading material, adjusting the reading speed of a selected target word and sentence in the text data of the target reading material, adjusting the continuity between selected target word groups in the text data of the target reading material, setting the pronunciation of a selected target polyphone in the text data of the target reading material, setting the reading method of a selected target digital symbol in the text data of the target reading material, setting the reading method of a selected target word in the text data of the target reading material, and adjusting the dubbing parameters of the dubbing player in the text data of the target reading material.
In an exemplary embodiment, the dubbing result display module 1530 is further configured to display a chapter selection interface in which a plurality of chapters of the target reading are displayed in response to a export operation; in response to a selection operation for at least one chapter of the plurality of chapters, an audio file corresponding to the selected chapter is generated.
According to this embodiment, the multiple roles and multiple candidate dubbers contained in the target reading material are displayed in the dubber setting interface, at least two different dubbers are selected to dub the multiple roles, the dubbing result is generated and displayed in the dubbing result display interface, and the dubbing result can be auditioned. The application uses multiple dubbers to dub different roles in the same target reading material; because different dubbers produce sounds with different timbres, the generated audio reading material is no longer limited to a single dubber (or a single timbre), so the sounds in the audio reading material are more diverse and the dubbing quality of the audio reading material is improved.
Meanwhile, a method for inputting any target reading material to generate dubbing results is provided for the object, so that the object can convert the text uploaded by the object into an audio file, the self-definition requirement of the object on the generated dubbing results is met, and the diversity of the dubbing results is improved.
Referring to fig. 17, a block diagram of an apparatus for generating an audio reading material according to another embodiment of the present application is shown. The device has the function of realizing the method for generating the audio reading material, and the function can be realized by hardware or can be realized by executing corresponding software by hardware. The apparatus may be a computer device (such as a terminal device or a server), or may be provided in a computer device. The apparatus 1700 may include: a data acquisition module 1710, a character recognition module 1720, a dubbing setting module 1730, and a file generation module 1740.
The data obtaining module 1710 is configured to obtain text data of a target reading to be dubbed.
Character recognition module 1720 is configured to recognize a plurality of characters included in the target reading object based on the text data of the target reading object.
A dubbing setting module 1730, configured to obtain a dubbing player set for the character; wherein dubbing staff set for at least two different roles in the target reading material are different.
The file generating module 1740 is configured to generate an audio file corresponding to the target reading material based on the speaker corresponding to each role in the target reading material.
In an exemplary embodiment, the role identification module 1720 is configured to:
identifying roles respectively contained in each sentence for each sentence contained in the text data of the target reading material;
counting the occurrence times of each identified character in the target reading material;
And selecting the roles of which the occurrence times meet a first condition to obtain the multiple roles contained in the target reading material.
In an exemplary embodiment, the role identification module 1720 is configured to:
For a target sentence contained in the text data of the target reading material, acquiring a vector representation sequence corresponding to the target sentence;
Inputting the vector representation sequence into a character recognition model, and extracting feature information in the vector representation sequence through a context encoder of the character recognition model;
Outputting marking results corresponding to each word in the target sentence according to the characteristic information by a marking decoder of the character recognition model, wherein the marking results are used for indicating the entity class of the word;
and obtaining roles contained in the target sentence based on the labeling results respectively corresponding to the words in the target sentence.
In an exemplary embodiment, as shown in fig. 18, the apparatus 1700 further comprises: the role determination module 1750.
A role determining module 1750, configured to determine, for a target role included in the target reading material, feature information of the target role according to text data of the target role; determining the matching degree corresponding to each candidate dubbing player according to the characteristic information of the target role and the characteristic information of each candidate dubbing player; and selecting the candidate dubbing staff with the matching degree meeting a third condition as the dubbing staff corresponding to the target role.
In an exemplary embodiment, the role determination module 1750 is configured to:
acquiring at least one candidate role contained in the context information of the target sentence;
Extracting characteristic information corresponding to each candidate character from the target sentence and the context information of the target sentence;
determining the score corresponding to each candidate role according to the characteristic information corresponding to each candidate role;
And selecting the candidate roles with the scores meeting the second condition as the roles corresponding to the target sentences.
In an exemplary embodiment, the role determination module 1750 is configured to:
For each candidate role, generating input data of a role selection model based on the feature information corresponding to the candidate role, the target sentence and the context information of the target sentence;
and inputting the input data into the role selection model, and outputting the scores corresponding to the candidate roles through the role selection model.
In an exemplary embodiment, the dubber performs dubbing of the target reading through its corresponding AI acoustic model, the AI acoustic model including a first encoder, a prosodic model, and a tone model, the file generation module 1740 is for:
For a target sentence contained in the text data of the target reading material, carrying out coding processing on the target sentence through the first coder to obtain phoneme-level tone quality information corresponding to the target sentence; wherein the phoneme-level sound quality information is used for representing the sound quality of the target sentence;
Expanding frames of the phoneme-level tone quality information of the target sentence to obtain frame-level tone quality information corresponding to the target sentence; the frame-level tone quality information is used for representing tone quality of the target sentence after frame expansion;
Determining a predicted prosody code representation vector corresponding to the target sentence through the prosody model according to the frame-level tone quality information corresponding to the target sentence; wherein the predicted prosody encoding representation vector comprises a predicted sentence-level prosody encoding representation vector, a predicted frame-level prosody encoding representation vector and a predicted phoneme-level prosody encoding representation vector;
and generating spectrum information corresponding to the target sentence according to the predicted prosody coding representation vector corresponding to the target sentence through the timbre model, wherein the spectrum information corresponding to the target sentence is used for generating the audio content of the target sentence in the audio file.
In an exemplary embodiment, the file generation module 1740 is for:
Acquiring selection information aiming at one or more second chapters in the target reading material;
And carrying out dubbing processing on text data corresponding to each role in the second chapter based on the AI acoustic models of dubbing staff corresponding to each role in the second chapter, and generating an audio file corresponding to the second chapter.
In an exemplary embodiment, the file generation module 1740 is for:
acquiring at least one regular expression for identifying different chapters in the target reading material;
Matching the text data of the target reading material by adopting the regular expression, and searching an adaptation sentence matched with the regular expression;
and dividing the target reading material into chapters based on the adaptation sentence, and determining the names of the chapters.
In an exemplary embodiment, for the bystander data contained in the text data of the target reading, the audio content corresponding to the bystander data is generated by adopting an AI acoustic model corresponding to the bystander, and the AI acoustic model corresponding to the bystander comprises a second encoder, a bystander rhythm model and a bystander timbre model; the file generation module 1740 is further configured to:
The second encoder is used for encoding the side white data to obtain tone quality information corresponding to the side white data;
According to the tone quality information corresponding to the bystander data, determining a predicted prosody coding representation vector corresponding to the bystander data through the bystander model;
And generating spectrum information corresponding to the bystander data according to the predicted prosody coding representation vector corresponding to the bystander data through the bystander tone model, wherein the spectrum information corresponding to the bystander data is used for generating the audio content of the bystander data in the audio file.
Referring to fig. 19, a schematic structural diagram of a computer device according to an embodiment of the application is shown. The computer device may be any electronic device having data computing, processing and storage functions, such as a mobile phone, a tablet computer, a PC (Personal Computer) or a server. The computer device may be used to implement the method for generating an audio reading material provided in the above-described embodiments. Specifically:
The computer device 1900 includes a central processing unit 1901 (such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit) or an FPGA (Field Programmable Gate Array)), a system memory 1904 including a RAM (Random-Access Memory) 1902 and a ROM (Read-Only Memory) 1903, and a system bus 1905 connecting the system memory 1904 and the central processing unit 1901. The computer device 1900 also includes a basic input/output system (Input Output System, I/O system) 1906 that facilitates the transfer of information between the devices within the server, and a mass storage device 1907 for storing an operating system 1913, application programs 1914 and other program modules 1915.
The basic input/output system 1906 includes a display 1908 for displaying information and an input device 1909, such as a mouse or a keyboard, for a user to input information. The display 1908 and the input device 1909 are both connected to the central processing unit 1901 through an input/output controller 1910 connected to the system bus 1905. The basic input/output system 1906 may also include the input/output controller 1910 for receiving and processing input from a number of other devices, such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 1910 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1907 is connected to the central processing unit 1901 through a mass storage controller (not shown) connected to the system bus 1905. The mass storage device 1907 and its associated computer-readable media provide non-volatile storage for the computer device 1900. That is, the mass storage device 1907 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory), an EEPROM (Electrically Erasable Programmable Read-Only Memory), a flash memory or other solid-state storage technology, a CD-ROM, a DVD (Digital Video Disc) or other optical storage, a magnetic cassette, a magnetic tape, a magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage media are not limited to the above. The system memory 1904 and the mass storage device 1907 described above may be collectively referred to as memory.
According to an embodiment of the present application, the computer device 1900 may also be connected, through a network such as the Internet, to a remote computer on the network for operation. That is, the computer device 1900 may be connected to the network 1912 through a network interface unit 1911 connected to the system bus 1905, or may be connected to other types of networks or remote computer systems (not shown) through the network interface unit 1911.
The memory further stores at least one instruction, at least one program, a code set, or an instruction set, which is configured to be executed by one or more processors to implement the method for generating an audio reading material described above.
In an exemplary embodiment, a computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set which, when executed by a processor of a computer device, implements the method for generating an audio reading material described above. Optionally, the computer-readable storage medium may include a ROM (Read-Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random access memory may include a ReRAM (Resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or the computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method for generating an audio reading material described above.
It should be understood that references herein to "a plurality" mean two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that both A and B exist, or that B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it. In addition, the step numbers described herein merely illustrate one possible execution order of the steps. In some other embodiments, the steps may be performed out of the numbered order; for example, two differently numbered steps may be performed simultaneously, or in an order opposite to that shown. This is not limited in the embodiments of the present application.
The foregoing is illustrative of the present application and is not to be construed as limiting thereof, and any modifications, equivalent arrangements, improvements, etc., which fall within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (26)

1. A method for generating an audio reading, the method comprising:
Displaying a dubbing player setting interface corresponding to a target reading material, and displaying a plurality of roles and a plurality of candidate dubbing players contained in the target reading material in the dubbing player setting interface;
In response to a dubbing person setting operation for the character, displaying a dubbing person set for the character in the dubbing person setting interface; wherein dubbing staff set for at least two different roles in the target reading material are different;
Responding to the setting completion operation, displaying a dubbing result display interface, and displaying at least one sentence of the target reading material and a role corresponding to the sentence in the dubbing result display interface;
In response to a play operation for the target reading material, playing audio content of a target sentence in the target reading material generated by a target dubber, wherein the target dubber is a dubber set for a role corresponding to the target sentence;
Wherein dubbing processing is performed by adopting an artificial intelligence (AI) acoustic model corresponding to the dubber, the AI acoustic model comprising a first encoder, a prosody model and a timbre model; and the generating of the audio content of the target sentence in the target reading material comprises the following steps:
For a target sentence contained in the text data of the target reading material, carrying out encoding processing on the target sentence through the first encoder to obtain phoneme-level tone quality information corresponding to the target sentence; wherein the phoneme-level tone quality information is used for representing the tone quality of the target sentence;
expanding frames of the phoneme-level tone quality information of the target sentence to obtain frame-level tone quality information corresponding to the target sentence; the frame-level tone quality information is used for representing tone quality of the target sentence after frame expansion;
determining a predicted prosody code representation vector corresponding to the target sentence through the prosody model according to the frame-level tone quality information corresponding to the target sentence; the prediction prosody coding representation vector comprises a prediction sentence-level prosody coding representation vector, a prediction frame-level prosody coding representation vector and a prediction phoneme-level prosody coding representation vector;
and generating spectrum information corresponding to the target sentence according to the predicted prosody coding representation vector corresponding to the target sentence through the timbre model, wherein the spectrum information corresponding to the target sentence is used for generating the audio content of the target sentence in the audio file.
2. The method of claim 1, wherein an initialized dubbing player set for the character is also displayed in the dubbing player setting interface, the initialized dubbing player being a dubbing player automatically set based on a style of the character;
the displaying, in the dubbing player setting interface, a dubbing player set for the character in response to a dubbing player setting operation for the character, including:
In response to a dubbing player setting operation for the character, changing the initialized dubbing player to a reselected dubbing player, and displaying the reselected dubbing player for the character in the dubbing player setting interface.
3. The method of claim 1, wherein after displaying the dubbing player setting interface corresponding to the target reading, further comprising:
Displaying interface elements for setting dubbing parameters for the dubbing player in the dubbing player setting interface, wherein the dubbing parameters comprise at least one of the following: reading speed, reading volume, reading language, and reading emotion.
4. The method of claim 1, wherein the dubbing result presentation interface comprises a chapter selection area, a text presentation area, and a character selection area; wherein,
The chapter selection area is used for displaying a plurality of chapters of the target reading material;
The text display area is used for displaying sentences contained in at least one chapter selected from the chapters and roles corresponding to the sentences;
The character selection area is used for displaying at least one character contained in the selected at least one chapter.
5. The method of claim 1, further comprising, after displaying the dubbing result presentation interface:
Executing a setting behavior corresponding to a dubbing setting operation for the target reading in response to the dubbing setting operation; wherein the setting behavior comprises at least one of: inserting a pause character into the text data of the target reading material, adjusting the reading speed of a selected target word and sentence in the text data of the target reading material, adjusting the continuity between selected target word groups in the text data of the target reading material, setting the pronunciation of a selected target polyphone in the text data of the target reading material, setting the reading method of a selected target digital symbol in the text data of the target reading material, setting the reading method of a selected target word in the text data of the target reading material, and adjusting the dubbing parameters of the dubbing player in the text data of the target reading material.
6. The method of claim 1, further comprising, after displaying the dubbing result presentation interface:
Responding to the export operation, displaying a chapter selection interface, and displaying a plurality of chapters of the target reading material in the chapter selection interface;
In response to a selection operation for at least one chapter of the plurality of chapters, an audio file corresponding to the selected chapter is generated.
7. The method according to any one of claims 1 to 6, wherein before the displaying the dubbing player setting interface corresponding to the target reading, further comprises:
Displaying a reading material providing interface;
Acquiring the target reading material determined in the reading material providing interface;
And acquiring the characters contained in the target reading material, which are identified from the text data of the target reading material.
8. A method for generating an audio reading, the method comprising:
Acquiring text data of a target reading material to be dubbed;
Identifying a plurality of roles contained in the target reading based on the text data of the target reading;
Acquiring a dubbing player set for the role; wherein the dubbing players set for at least two different roles in the target reading material are different, and artificial intelligence (AI) acoustic models corresponding to the dubbing players are adopted for dubbing processing, wherein the AI acoustic models comprise a first encoder, a prosody model and a timbre model;
For a target sentence contained in the text data of the target reading material, carrying out encoding processing on the target sentence through the first encoder to obtain phoneme-level tone quality information corresponding to the target sentence; wherein the phoneme-level tone quality information is used for representing the tone quality of the target sentence;
expanding frames of the phoneme-level tone quality information of the target sentence to obtain frame-level tone quality information corresponding to the target sentence; the frame-level tone quality information is used for representing tone quality of the target sentence after frame expansion;
determining a predicted prosody code representation vector corresponding to the target sentence through the prosody model according to the frame-level tone quality information corresponding to the target sentence; the prediction prosody coding representation vector comprises a prediction sentence-level prosody coding representation vector, a prediction frame-level prosody coding representation vector and a prediction phoneme-level prosody coding representation vector;
and generating spectrum information corresponding to the target sentence according to the predicted prosody coding representation vector corresponding to the target sentence through the timbre model, wherein the spectrum information corresponding to the target sentence is used for generating the audio content of the target sentence in the audio file.
9. The method of claim 8, wherein the identifying the plurality of characters contained in the target reading comprises:
Identifying roles respectively contained in each sentence for each sentence contained in the text data of the target reading material;
counting the occurrence times of each identified character in the target reading material;
And selecting the roles of which the occurrence times meet a first condition to obtain the multiple roles contained in the target reading material.
10. The method of claim 9, wherein for each sentence included in the text data of the target reading material, identifying a character included in each sentence includes:
for a target sentence contained in the text data of the target reading material, acquiring a vector representation sequence corresponding to the target sentence;
Inputting the vector representation sequence into a character recognition model, and extracting characteristic information in the vector representation sequence through a context encoder of the character recognition model;
Outputting, by a labeling decoder of the character recognition model, labeling results corresponding to the words in the target sentence according to the characteristic information, wherein the labeling results are used for indicating entity categories of the words;
and obtaining roles contained in the target sentence based on the labeling results respectively corresponding to the words in the target sentence.
11. The method of claim 8, wherein the method further comprises:
determining a role corresponding to the target sentence according to the target sentence and the context information of the target sentence for the target sentence contained in the text data of the target reading material;
Wherein the context information of the target sentence includes at least one of: at least one statement preceding the target statement, at least one statement following the target statement.
12. The method of claim 11, wherein the determining the role corresponding to the target sentence according to the target sentence and the context information of the target sentence comprises:
acquiring at least one candidate role contained in the context information of the target sentence;
Extracting characteristic information corresponding to each candidate role from the target statement and the context information of the target statement;
Determining the score corresponding to each candidate role according to the characteristic information corresponding to each candidate role;
And selecting the candidate roles with the scores meeting the second condition as the roles corresponding to the target sentences.
13. The method of claim 12, wherein determining the score for each candidate character based on the characteristic information for each candidate character comprises:
for each candidate role, generating input data of a role selection model based on the feature information corresponding to the candidate role, the target statement and the context information of the target statement;
And inputting the input data into the role selection model, and outputting the scores corresponding to the candidate roles through the role selection model.
14. The method according to any one of claims 8 to 13, further comprising, after said identifying the plurality of characters contained in the target reading,:
For a target character contained in the target reading material, determining characteristic information of the target character according to text data of the target character;
determining the matching degree corresponding to each candidate dubbing player according to the characteristic information of the target role and the characteristic information of each candidate dubbing player;
and selecting the candidate dubbing player with the matching degree meeting a third condition as the dubbing player corresponding to the target role.
15. A dubbing result generation apparatus, the apparatus comprising:
The setting interface display module is used for displaying a dubbing player setting interface corresponding to a target reading material, and displaying a plurality of roles and a plurality of candidate dubbing players contained in the target reading material in the dubbing player setting interface;
A dubber setting module for displaying a dubber set for the character in the dubber setting interface in response to a dubber setting operation for the character; wherein dubbing staff set for at least two different roles in the target reading material are different;
The dubbing result display module is used for responding to the setting completion operation, displaying a dubbing result display interface, and displaying at least one sentence of the target reading material and a role corresponding to the sentence in the dubbing result display interface;
The dubbing content playing module is used for responding to the playing operation of the target reading material and playing the audio content of the target sentence in the target reading material generated by a target dubbing player, wherein the target dubbing player is a dubbing player set for a role corresponding to the target sentence;
Wherein dubbing processing is performed by adopting an artificial intelligence (AI) acoustic model corresponding to the dubber, the AI acoustic model comprising a first encoder, a prosody model and a timbre model; and the generating of the audio content of the target sentence in the target reading material comprises the following steps:
For a target sentence contained in the text data of the target reading material, carrying out encoding processing on the target sentence through the first encoder to obtain phoneme-level tone quality information corresponding to the target sentence; wherein the phoneme-level tone quality information is used for representing the tone quality of the target sentence;
expanding frames of the phoneme-level tone quality information of the target sentence to obtain frame-level tone quality information corresponding to the target sentence; the frame-level tone quality information is used for representing tone quality of the target sentence after frame expansion;
determining a predicted prosody code representation vector corresponding to the target sentence through the prosody model according to the frame-level tone quality information corresponding to the target sentence; the prediction prosody coding representation vector comprises a prediction sentence-level prosody coding representation vector, a prediction frame-level prosody coding representation vector and a prediction phoneme-level prosody coding representation vector;
and generating spectrum information corresponding to the target sentence according to the predicted prosody coding representation vector corresponding to the target sentence through the timbre model, wherein the spectrum information corresponding to the target sentence is used for generating the audio content of the target sentence in the audio file.
16. The apparatus of claim 15, wherein an initialized dubbing player set for the character is also displayed in the dubbing player setting interface, the initialized dubbing player being a dubbing player automatically set based on a style of the character;
The dubbing player setting module is used for, in response to a dubbing player setting operation for the character, changing the initialized dubbing player to a reselected dubbing player, and displaying the reselected dubbing player for the character in the dubbing player setting interface.
17. The apparatus of claim 15, wherein the apparatus further comprises:
the parameter setting module is used for displaying, in the dubbing player setting interface, interface elements for setting dubbing parameters for the dubbing player, wherein the dubbing parameters comprise at least one of the following: reading speed, reading volume, reading language, and reading emotion.
18. The apparatus of claim 15, wherein the dubbing result presentation interface comprises a chapter selection area, a text presentation area, and a character selection area; wherein,
The chapter selection area is used for displaying a plurality of chapters of the target reading material; the text display area is used for displaying sentences contained in at least one chapter selected from the chapters and roles corresponding to the sentences; the character selection area is used for displaying at least one character contained in the selected at least one chapter.
19. The apparatus of claim 15, wherein the dubbing result display module is further configured to perform a setting action corresponding to a dubbing setting operation for the target reading in response to the dubbing setting operation; wherein the setting behavior comprises at least one of: inserting a pause character into the text data of the target reading material, adjusting the reading speed of a selected target word and sentence in the text data of the target reading material, adjusting the continuity between selected target word groups in the text data of the target reading material, setting the pronunciation of a selected target polyphone in the text data of the target reading material, setting the reading method of a selected target digital symbol in the text data of the target reading material, setting the reading method of a selected target word in the text data of the target reading material, and adjusting the dubbing parameters of the dubbing player in the text data of the target reading material.
20. The apparatus of claim 15, wherein the dubbing result display module is further configured to display a chapter selection interface in which a plurality of chapters of the target reading are displayed in response to a export operation; in response to a selection operation for at least one chapter of the plurality of chapters, an audio file corresponding to the selected chapter is generated.
21. A dubbing result generation apparatus, the apparatus comprising:
the data acquisition module is used for acquiring text data of a target reading material to be dubbed;
The character recognition module is used for recognizing a plurality of characters contained in the target reading material based on the text data of the target reading material;
The dubbing setting module is used for acquiring a dubbing person set for the role; wherein the dubbing persons set for at least two different roles in the target reading material are different, and dubbing processing is performed by adopting an artificial intelligence (AI) acoustic model corresponding to the dubbing person, wherein the AI acoustic model comprises a first encoder, a prosody model and a timbre model;
The file generation module is used for, for a target sentence contained in the text data of the target reading material, carrying out encoding processing on the target sentence through the first encoder to obtain phoneme-level tone quality information corresponding to the target sentence; wherein the phoneme-level tone quality information is used for representing the tone quality of the target sentence;
The file generation module is further used for expanding frames of the phoneme-level tone quality information of the target sentence to obtain frame-level tone quality information corresponding to the target sentence; the frame-level tone quality information is used for representing tone quality of the target sentence after frame expansion;
The file generation module is further used for determining a predicted prosody code representation vector corresponding to the target sentence through the prosody model according to the frame-level tone quality information corresponding to the target sentence; the prediction prosody coding representation vector comprises a prediction sentence-level prosody coding representation vector, a prediction frame-level prosody coding representation vector and a prediction phoneme-level prosody coding representation vector;
the file generation module is further configured to generate, according to the predicted prosody encoding representation vector corresponding to the target sentence through the timbre model, spectrum information corresponding to the target sentence, where the spectrum information corresponding to the target sentence is used to generate audio content of the target sentence in the audio file.
22. The apparatus of claim 21, wherein the character recognition module is configured to recognize, for each sentence included in the text data of the target reading material, a character respectively included in each sentence; counting the occurrence times of each identified character in the target reading material; and selecting the roles of which the occurrence times meet a first condition to obtain the multiple roles contained in the target reading material.
23. The apparatus of claim 22, wherein the character recognition module is configured to: obtain, for a target sentence included in the text data of the target reading material, a vector representation sequence corresponding to the target sentence; input the vector representation sequence into a character recognition model, and extract characteristic information in the vector representation sequence through a context encoder of the character recognition model; output, by a labeling decoder of the character recognition model, labeling results corresponding to the words in the target sentence according to the characteristic information, wherein the labeling results are used for indicating entity categories of the words; and obtain the roles contained in the target sentence based on the labeling results respectively corresponding to the words in the target sentence.
24. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set that is loaded and executed by the processor to implement the method of any one of claims 1 to 7 or to implement the method of any one of claims 8 to 14.
25. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of any one of claims 1 to 7, or the method of any one of claims 8 to 14.
26. A computer program product comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium and executed by a processor to implement the method of any one of claims 1 to 7 or the method of any one of claims 8 to 14.
CN202210149168.3A 2022-02-18 2022-02-18 Method, apparatus, device, storage medium and program product for generating audio reading material Active CN114783403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210149168.3A CN114783403B (en) 2022-02-18 2022-02-18 Method, apparatus, device, storage medium and program product for generating audio reading material

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210149168.3A CN114783403B (en) 2022-02-18 2022-02-18 Method, apparatus, device, storage medium and program product for generating audio reading material

Publications (2)

Publication Number Publication Date
CN114783403A CN114783403A (en) 2022-07-22
CN114783403B (en) 2024-08-13

Family

ID=82424319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210149168.3A Active CN114783403B (en) 2022-02-18 2022-02-18 Method, apparatus, device, storage medium and program product for generating audio reading material

Country Status (1)

Country Link
CN (1) CN114783403B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118942440A (en) * 2023-05-12 2024-11-12 北京有竹居网络技术有限公司 Audio generation method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020105A (en) * 2011-09-27 2013-04-03 株式会社东芝 Document reading-out support apparatus and method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021102647A1 (en) * 2019-11-25 2021-06-03 深圳市欢太科技有限公司 Data processing method and apparatus, and storage medium
CN111524501B (en) * 2020-03-03 2023-09-26 北京声智科技有限公司 Voice playing method, device, computer equipment and computer readable storage medium
CN111339771B (en) * 2020-03-09 2023-08-18 广州深声科技有限公司 A text prosody prediction method based on multi-task multi-level model
CN111428079B (en) * 2020-03-23 2023-11-28 广州酷狗计算机科技有限公司 Text content processing method, device, computer equipment and storage medium
CN112270918A (en) * 2020-10-22 2021-01-26 北京百度网讯科技有限公司 Information processing method, device, system, electronic equipment and storage medium
CN112183122A (en) * 2020-10-22 2021-01-05 腾讯科技(深圳)有限公司 Character recognition method and device, storage medium and electronic equipment
CN112329453B (en) * 2020-10-27 2024-02-27 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating sample chapter
CN112906380B (en) * 2021-02-02 2024-09-27 北京有竹居网络技术有限公司 Character recognition method and device in text, readable medium and electronic equipment
CN113704420A (en) * 2021-03-19 2021-11-26 腾讯科技(深圳)有限公司 Method and device for identifying role in text, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114783403A (en) 2022-07-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant