Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Taking a computer terminal as an example, Fig. 1 is a block diagram of the hardware structure of a computer terminal for performing a method for converting speech and sign language according to an embodiment of the present application. As shown in Fig. 1, the computer terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will appreciate that the configuration shown in Fig. 1 is merely illustrative and does not limit the configuration of the computer terminal; for example, the computer terminal may include more or fewer components than shown in Fig. 1, or have a different configuration from that shown in Fig. 1.
The memory 104 may be used to store a computer program, for example, a software program and modules of application software, such as a computer program corresponding to the method for converting speech and sign language in the embodiment of the present invention. The processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-mentioned method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate with the Internet wirelessly.
The embodiment of the present application may run on the voice and sign language conversion system shown in Fig. 2. As shown in Fig. 2, the system comprises a new call operation management module, a new call sign language translation service application server (Application Server, AS), a VoNR capability network element, a VoNR+ media plane, a media capability platform, and terminals.
The new call operation management module is responsible for provisioning and configuring the sign language translation service;
the new call sign language translation service AS is responsible for logical scheduling of the sign language translation service;
the VoNR capability network element is responsible for signaling forwarding and for providing a media capability interface;
the VoNR+ media plane provides media stream duplication and user data interaction, and hosts a GPU-based cloud rendering center that generates the sign language digital person and synthesizes real-time auxiliary captions;
the media capability platform is responsible for AI speech-to-text conversion, AI text-to-sign-language-code conversion, AI sign language action recognition, AI expression recognition, and AI lip recognition;
and the terminal is responsible for capturing and pushing the audio and video streams to the media plane in real time, interacting with the user, submitting interaction data to the media plane, controlling the terminal display, expanding a multi-pane video window, and presenting captions.
In this embodiment, a method for converting speech to sign language is provided. Fig. 3 is a flowchart of a method for converting speech to sign language according to an embodiment of the present invention; as shown in Fig. 3, the flow includes the following steps:
Step S302: obtaining a first sign language action video sent by a first terminal while the first terminal and a second terminal are in a call, converting a first sign language action in the first sign language action video into first voice, and sending first interaction information to the second terminal, where the first interaction information includes the first voice; and/or obtaining second voice sent by the second terminal while the first terminal and the second terminal are in a call, converting the second voice into a second sign language action, generating a second sign language action video, and sending second interaction information to the first terminal, where the second interaction information includes the second sign language action video.
In step S302, converting the first sign language action in the first sign language action video into the first voice includes converting the first sign language action in the first sign language action video into first text and converting the first text into the first voice, where the first interaction information further includes the first text; and/or converting the second voice into the second sign language action and generating the second sign language action video includes converting the second voice into second text and converting the second text into the second sign language action video, where the second interaction information further includes the second text.
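For illustration, the following is a minimal sketch of the two conversion pipelines of step S302, not the patented implementation: the stub helpers stand in for the AI services of the media capability platform, and all names and signatures are illustrative assumptions.

```python
# A minimal sketch of the step S302 pipelines; the stubs below stand in
# for the AI services of the media capability platform (hypothetical names).
from dataclasses import dataclass
from typing import Optional

def recognize_sign_actions(video: bytes) -> str:
    return "hello"                          # stub: AI sign language recognition

def synthesize_speech(text: str) -> bytes:
    return text.encode("utf-8")             # stub: AI text-to-speech

def transcribe_speech(audio: bytes) -> str:
    return audio.decode("utf-8", "ignore")  # stub: AI speech-to-text

def render_sign_video(text: str) -> bytes:
    return b"rendered-video"                # stub: digital person rendering

@dataclass
class InteractionInfo:
    voice: Optional[bytes] = None           # first voice, for the second terminal
    text: Optional[str] = None              # intermediate text, sent to both sides
    sign_video: Optional[bytes] = None      # second sign action video, for the first terminal

def first_pipeline(first_sign_video: bytes) -> InteractionInfo:
    """First sign language action video -> first text -> first voice."""
    text = recognize_sign_actions(first_sign_video)
    return InteractionInfo(voice=synthesize_speech(text), text=text)

def second_pipeline(second_voice: bytes) -> InteractionInfo:
    """Second voice -> second text -> second sign language action video."""
    text = transcribe_speech(second_voice)
    return InteractionInfo(sign_video=render_sign_video(text), text=text)
```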
By converting the sign language actions into text and sending the text to the second terminal, the user of the second terminal can refer to the text while receiving the voice generated from the sign language conversion; the communication information is thus conveyed in multiple dimensions, making communication more accurate and fluent.
In step S302, sending the first interaction information to the second terminal includes sending the first interaction information to the second terminal through a data channel of the Internet Protocol Multimedia Subsystem (IMS) based on the 5G ultra-high-definition audio and video call service (VoNR); and/or sending the second interaction information to the first terminal includes sending the second interaction information to the first terminal through a data channel of the IMS based on the VoNR technology.
The 5G ultra-high-definition audio and video call service (VoNR) already provides high-quality calls with low connection delay. On this basis, the interaction information generated in the embodiment of the present invention is sent to the first terminal and the second terminal over the IMS data channel, so that barrier-free communication between voice and sign language is achieved while the original call is preserved.
In step S302, obtaining the first sign language action video sent by the first terminal while the first terminal and the second terminal are in a call and converting the first sign language action in the first sign language action video into the first voice includes: duplicating the video stream of the first sign language action video sent by the first terminal during the call, identifying and extracting the first sign language action in the first sign language action video, converting the first sign language action into a sign word sequence, converting the sign word sequence into the first text, and converting the first text into the first voice.
In step S302, obtaining the second voice sent by the second terminal while the first terminal and the second terminal are in a call, converting the second voice into the second sign language action, and generating the second sign language action video includes: duplicating the audio stream of the second voice sent by the second terminal during the call, converting the spoken-language logic of the second voice into the corresponding sign-language logic, generating a corresponding digital sign language action video according to the sign-language logic, adding a digital sign language action display window to the first terminal, and sending the digital sign language action video to that window.
In the process of converting voice into sign language, the spoken-language logic of the voice is first converted into the corresponding sign-language logic. This step ensures the accuracy of the generated sign language action video, avoids incorrect conversion caused by the differences between spoken-language logic and sign-language logic, and improves the quality of conversion between sign language and voice.
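As an illustration only, the toy rule-based sketch below shows what the spoken-logic-to-sign-logic step amounts to. The embodiment uses a trained natural language processing model rather than fixed rules; the word lists here are invented for the example.

```python
# A toy illustration of spoken-logic -> sign-logic conversion: drop function
# words and front time references. Real conversion uses a trained NLP model;
# the word lists here are illustrative assumptions, not linguistic facts.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "of", "to"}
TIME_WORDS = {"yesterday", "today", "tomorrow"}

def spoken_to_sign_logic(sentence: str) -> list[str]:
    """Return a sign gloss sequence for a spoken sentence."""
    words = [w.lower().strip(".,?!") for w in sentence.split()]
    content = [w for w in words if w not in STOP_WORDS]
    fronted = [w for w in content if w in TIME_WORDS]   # time words first
    rest = [w for w in content if w not in TIME_WORDS]
    return fronted + rest

print(spoken_to_sign_logic("I am going to the hospital tomorrow"))
# -> ['tomorrow', 'i', 'going', 'hospital']
```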
In step S302, obtaining the first sign language action video sent by the first terminal while the first terminal and the second terminal are in a call includes: judging whether the sign language and voice conversion service is enabled, and obtaining the first sign language action video sent by the first terminal during the call when at least one of the first terminal and the second terminal has the service enabled; and/or obtaining the second voice sent by the second terminal while the first terminal and the second terminal are in a call includes: judging whether the sign language and voice conversion service is enabled, and obtaining the second voice sent by the second terminal during the call when at least one of the first terminal and the second terminal has the service enabled.
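A small sketch of this service gate follows; the provisioning store and its field names are assumptions standing in for the operation management module's data.

```python
# Media-plane processing is triggered only when at least one call party has
# the sign language and voice conversion service enabled. The store and its
# field names are illustrative assumptions.
PROVISIONING = {
    "+8613800000001": {"sign_translation": True},   # first terminal: enabled
    "+8613800000002": {"sign_translation": False},  # second terminal: disabled
}

def conversion_service_enabled(*subscribers: str) -> bool:
    """True if any party in the call has enabled the conversion service."""
    return any(PROVISIONING.get(s, {}).get("sign_translation", False)
               for s in subscribers)

assert conversion_service_enabled("+8613800000001", "+8613800000002")
```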
In an exemplary embodiment, generating the corresponding digital sign language action video according to the sign-language logic includes querying a sign language coding library to determine the standard sign language actions corresponding to the sign-language logic, and rendering the standard sign language actions to obtain the digital sign language action video, where the rendering modes include avatar rendering, expression rendering, action rendering, mouth-shape rendering, and scene rendering.
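The coding-library lookup and layered rendering can be sketched as follows; the library contents, layer names, and instruction format are invented for illustration.

```python
# Query a sign language coding library for standard actions, then emit one
# layered render instruction per action. All data here is illustrative.
SIGN_CODE_LIBRARY = {
    "SL-0001": {"gloss": "hello", "keyframes": [(0, "raise-hand"), (12, "wave")]},
    "SL-0042": {"gloss": "thanks", "keyframes": [(0, "chin-touch"), (10, "move-forward")]},
}
RENDER_LAYERS = ("avatar", "expression", "action", "mouth_shape", "scene")

def render_instructions(codes: list[str]) -> list[dict]:
    """Look up each sign code and build a multi-layer render instruction."""
    result = []
    for code in codes:
        action = SIGN_CODE_LIBRARY[code]            # standard sign action
        result.append({
            "gloss": action["gloss"],
            "keyframes": action["keyframes"],
            "layers": {layer: "default" for layer in RENDER_LAYERS},
        })
    return result

print(render_instructions(["SL-0001", "SL-0042"]))
```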
In an exemplary embodiment, after converting the first sign language action in the first sign language action video into the first voice, the method further comprises sending the first text to the first terminal through a data channel of the IMS based on the VoNR technology; and/or after converting the second voice into the second text, the method further comprises sending the second text to the second terminal through a data channel of the IMS based on the VoNR technology.
Sending the first text, generated from the first sign language action, back to the first terminal allows the user of the first terminal to check whether the conversion result is accurate.
This embodiment also provides an apparatus for converting voice and sign language, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a voice and sign language conversion apparatus according to an embodiment of the present invention. As shown in Fig. 4, the voice and sign language conversion apparatus 400 includes at least one of the following:
the first conversion module 10 is configured to obtain a first sign language action video sent by a first terminal while the first terminal and a second terminal are in a call, convert a first sign language action in the first sign language action video into first voice, and send first interaction information to the second terminal, where the first interaction information includes the first voice;
The second conversion module 20 is configured to obtain a second voice sent by the second terminal in a state where the first terminal and the second terminal are in a call, convert the second voice into a second sign language action, generate a second sign language action video, and send second interaction information to the first terminal, where the second interaction information includes the second sign language action video.
Fig. 5 is a block diagram of a voice and sign language conversion apparatus according to still another embodiment of the present invention, and as shown in fig. 5, the voice and sign language conversion apparatus 500 includes a first conversion unit 11, a first transmission unit 12, a second conversion unit 21, and a second transmission unit 22.
A first conversion unit 11, configured to convert the first sign language action in the first sign language action video into the first voice, including converting the first sign language action into first text and converting the first text into the first voice, where the first interaction information includes the first text;
A first sending unit 12, configured to send the first interaction information to the second terminal through a data channel of the IMS based on the VoNR technology;
A second converting unit 21, configured to convert the second voice into a second sign language action and generate a second sign language action video, including converting the second voice into a second text, and converting the second text into a second sign language action video, where the second interaction information includes the second text;
and a second sending unit 22, configured to send the second interaction information to the first terminal through a data channel of the IMS based on the VoNR technology.
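As an illustrative sketch only, the apparatus of Figs. 4 and 5 might be organized as follows; the conversion and sending logic is stubbed, and every name is an assumption rather than the claimed structure.

```python
# A structural sketch of the apparatus: each module owns a conversion step
# and a sending unit (here a callable standing in for the IMS data channel).
class FirstConversionModule:
    """First sign language action video -> first voice + text -> second terminal."""
    def __init__(self, send_to_second):
        self.send = send_to_second              # first sending unit 12 (stub)

    def handle_sign_video(self, video: bytes) -> None:
        text = "hello"                          # stub: first conversion unit 11
        self.send({"voice": text.encode(), "text": text})

class SecondConversionModule:
    """Second voice -> second text + sign action video -> first terminal."""
    def __init__(self, send_to_first):
        self.send = send_to_first               # second sending unit 22 (stub)

    def handle_voice(self, audio: bytes) -> None:
        text = audio.decode("utf-8", "ignore")  # stub: second conversion unit 21
        self.send({"sign_video": b"rendered", "text": text})

# Usage: wire both modules to simple collectors standing in for the channel.
to_second, to_first = [], []
FirstConversionModule(to_second.append).handle_sign_video(b"frames")
SecondConversionModule(to_first.append).handle_voice("Hello".encode())
```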
To facilitate understanding of the technical solutions provided by the present invention, the following describes the details with reference to embodiments of specific scenarios.
Fig. 6 is a schematic diagram of user data transmission according to an embodiment of the present invention. As shown in Fig. 6, the embodiment of the present invention implements real-time bidirectional translation of sign language and voice mainly by means of the VoNR+ technology. VoNR+ uses the IMS data channel (IMS Data Channel) technology to add a data channel alongside the voice channel and the video channel. The VoNR+ network side performs layered coding and transmission of the audio and video channel services, provides different 5G QoS Identifiers (5QI) for guaranteeing Quality of Service (QoS), identifies different data packets so as to apply finer-grained QoS control to data channel services, and introduces new QoS parameters to support the transmission of tactile or sensor data. Along with the call, the data channel can carry richer interaction information, such as location, pictures, and text, and even auditory, visual, tactile, kinesthetic, and environmental information, upgrading the call from plain voice to a multimedia form. The transmission of the user interaction data involved in the present invention uses the data channel (DC).
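How interaction data might be framed for transmission on the DC is sketched below. The IMS Data Channel API itself is operator- and stack-specific and not specified in this description, so the message schema and the injected sender callable are assumptions.

```python
# A hedged sketch of framing user interaction data for the DC: JSON messages
# carrying a session id, a kind tag, and a timestamp for receiver-side
# ordering. The schema and the send_on_dc callable are assumptions.
import json
import time

def frame_interaction(session_id: str, kind: str, payload: str) -> bytes:
    """Serialize one interaction message for the data channel."""
    return json.dumps({
        "session": session_id,   # ties the message to the unique call session
        "kind": kind,            # e.g. "caption", "text", "control"
        "ts": time.time(),       # sender timestamp for time-sequence control
        "payload": payload,
    }).encode("utf-8")

def send_interaction(send_on_dc, session_id: str, kind: str, payload: str) -> None:
    send_on_dc(frame_interaction(session_id, kind, payload))

# Usage with any byte-oriented sender standing in for the channel:
sent = []
send_interaction(sent.append, "sess-001", "caption", "hello")
```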
Fig. 7 is a schematic diagram of a terminal display interface according to an embodiment of the present invention. As shown in Fig. 7, the hearing-impaired person needs to keep the head and the arms above the elbows visible in his or her video window, and the system recognizes the sign language and converts it into voice for broadcast. The system collects data such as voice and expression in the hearing party's video window, and the sign language digital person window displays the sign language converted and restored from the hearing party's voice. The caption scrolling area presents the chat records of both parties on a shared chat whiteboard, so that the hearing-impaired person can confirm the voice information broadcast by the system and the natural-language text translated from the sign language, and both parties can review the context of the conversation.
Fig. 8 is a flowchart of a method for converting speech and sign language according to another embodiment of the present invention; the flow specifically includes:
The user of terminal A initiates a call and enables real-time sign language translation;
terminal A requests the media plane to enable real-time sign language translation (taking sign-language-to-voice conversion as an example);
after determining that at least one of the two call parties has enabled the sign language and voice conversion service, the real-time sign language translation system switches the video of terminal A to a four-pane grid mode;
the system duplicates the media stream (the video stream) and performs AI sign language recognition on the video stream;
the system converts the voice generated by the sign language translation into text and sends the sign language recognition caption stream to terminal A and terminal B;
and terminal B plays the voice generated by the sign language translation.
Fig. 9 is a flowchart of a method for converting speech to sign language according to still another embodiment of the present invention. In contrast to the flow shown in Fig. 8, here the system converts voice into sign language; the specific flow is as follows:
Terminal B requests the media plane to enable real-time sign language translation (taking voice-to-sign-language conversion as an example);
after determining that at least one of the two call parties has enabled the sign language and voice conversion service, the real-time sign language translation system switches the video of terminal A to a four-pane grid mode;
the system duplicates the media stream (the audio stream), converts the voice into text, converts the text into sign language codes, and renders the digital person according to the sign language codes;
and the system sends the caption stream and the rendered digital person video stream to terminal A and terminal B.
The key modules involved in the sign language and voice conversion methods of Figs. 8 and 9 comprise a sign language and natural language bidirectional translation module, a sign language digital person rendering center, and an auxiliary caption synthesis module.
The sign language and natural language bidirectional translation module records and encodes the national standard sign language gestures. Through data input and deep learning, a natural language and sign language analysis model is generated. When the module converts spoken language into sign language, a natural language processing model converts the spoken-language logic into the corresponding sign-language logic, and the sign-language logic codes are transmitted to the rendering center.
Sign-language-to-spoken-language conversion executes the reverse logical process: the system receives a sign language action video as input and outputs spoken text or voice. Extraction of the sign language actions from the video is accomplished by image recognition, and the corresponding sign language action analysis model then parses the actions to obtain a sign word sequence.
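A minimal sketch of this reverse pipeline follows; the per-frame classifier is a stub standing in for the image-recognition and sign-action analysis models described above.

```python
# Sign -> spoken sketch: classify a gloss per frame, collapse consecutive
# duplicates (one sign spans many frames) into a sign word sequence, and
# join it into text. The classifier stub is an illustrative assumption.
from itertools import groupby

def classify_frame(frame: bytes) -> str:
    """Stub for the sign action recognizer; returns one gloss per frame."""
    return {b"f1": "I", b"f2": "I", b"f3": "go", b"f4": "hospital"}.get(frame, "")

def frames_to_text(frames: list[bytes]) -> str:
    glosses = [classify_frame(f) for f in frames]
    sequence = [g for g, _ in groupby(glosses) if g]  # sign word sequence
    return " ".join(sequence)

print(frames_to_text([b"f1", b"f2", b"f3", b"f4"]))  # -> "I go hospital"
```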
After receiving a rendering instruction, the sign language digital person rendering center obtains the corresponding sign language actions according to the sign code sequence. To this end, the rendering center preloads a model library of the sign language codes and actions used by the translation module, queries the action library by code to generate the action instructions for each frame, and thereby obtains the rendering action parameters.
In addition, the system supports multidimensional superimposed rendering of avatar, expression, mouth shape, action, scene, and the like, and pushes the video data to the user terminal as a real-time interactive video stream (streaming).
Avatar: two styles are supported, realistic and animated. For realistic rendering, face parameters are recognized from the first few frames of the video and a model is generated rapidly; for animated rendering, the user selects a digital persona provided by the system.
Clothing: the system provides a clothing library from which the user can select during the call.
Expression: expressions are realized according to an expression knowledge base and context judgment.
Action: the sign language code library is kept synchronized, and the corresponding gesture is looked up for a specified sign language code.
Scene: for realistic rendering, the background is extracted from the video frames; for animated rendering, the background is synthesized from a background provided by the system at the user's selection or from a background picture uploaded by the user.
A unique session is generated during the call and associated with the calling party and the called party. A session-level caption generation controller is established; it performs time-sequence control on the caption streams returned by the AI recognition, merges them into a new caption stream, and pushes the new stream to the calling and called terminals. After receiving the stream, the terminal plays it in a display area configured on the screen; if the captions exceed the display area, the area automatically scrolls upward over the older content.
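A minimal sketch of such a session-level caption controller follows; the class shape and field names are assumptions for illustration.

```python
# Merge the caption streams returned by AI recognition for both parties in
# timestamp order and drain them as display lines for the two terminals.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Caption:
    ts: float                           # ordering key: sender timestamp
    speaker: str = field(compare=False)
    text: str = field(compare=False)

class CaptionController:
    """Session-level controller: time-sequences and merges caption streams."""
    def __init__(self, session_id: str):
        self.session_id = session_id
        self._heap: list[Caption] = []

    def on_caption(self, ts: float, speaker: str, text: str) -> None:
        heapq.heappush(self._heap, Caption(ts, speaker, text))

    def flush(self) -> list[str]:
        """Drain captions in time order; terminals scroll older lines away."""
        lines = []
        while self._heap:
            c = heapq.heappop(self._heap)
            lines.append(f"[{c.speaker}] {c.text}")
        return lines

ctrl = CaptionController("sess-001")
ctrl.on_caption(2.0, "B", "Hello")
ctrl.on_caption(1.0, "A", "TOMORROW I GO HOSPITAL")
print(ctrl.flush())  # merged, time-ordered caption stream
```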
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; they are not repeated here.
High-definition video calling under the VoNR+ technical architecture has already been realized, and as this technology is combined with AI technology, video call scenarios under VoNR+ have great development potential. Meanwhile, the high bandwidth and low-latency deterministic network of 5G brings high-performance processing capability to video call scenarios. The technical solution disclosed in the embodiments of the present invention is applied to improving the experience of video calls between hearing-impaired and hearing users: on the basis of a 5G multimedia call, the IMS DC channel is used to carry the user interaction information, and a neural network model is adopted to realize bidirectional translation between voice and sign language. The hearing person's voice is converted into text, translated into sign language codes, and restored into a sign language digital person through real-time rendering during the call; meanwhile, the hearing-impaired person's sign language is converted into text and broadcast as voice to the other party. This brings each party its most familiar mode of communication, and with the real-time auxiliary captions the system conveys the communication information in multiple dimensions, through text, voice, and sign language, making communication smoother.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices; in some cases, the steps shown or described may be performed in a different order than shown or described herein; alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.