Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Taking a computer terminal as an example, Fig. 1 is a block diagram of the hardware structure of a computer terminal for performing a method for converting speech and sign language according to an embodiment of the present application. As shown in Fig. 1, the computer terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will appreciate that the configuration shown in Fig. 1 is merely illustrative and does not limit the configuration of the computer terminal; for example, the computer terminal may include more or fewer components than shown in Fig. 1, or have a different configuration from that shown in Fig. 1.
The memory 104 may be used to store a computer program, for example, a software program and modules of application software, such as a computer program corresponding to the method for converting speech and sign language in the embodiment of the present invention. The processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-mentioned method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate with the Internet wirelessly.
The embodiment of the present application may run on the voice and sign language conversion system shown in Fig. 2. As shown in Fig. 2, the system comprises a new call operation management module, a new call sign language translation service application server (Application Server, AS), a VoNR capability network element, a VoNR+ media plane, a media capability platform, and terminals.
The new call operation management module is responsible for provisioning and configuring the sign language translation service;
the new call sign language translation service AS is responsible for logical scheduling of the sign language translation service;
the VoNR capability network element is responsible for signaling forwarding and for providing a media capability interface;
the VoNR+ media plane provides media stream duplication and user data interaction, and hosts a GPU-based cloud rendering center that generates the sign language digital person and synthesizes real-time auxiliary captions;
the media capability platform is responsible for AI speech-to-text conversion, AI text-to-sign-language-code conversion, AI sign language action recognition, AI expression recognition, and AI lip recognition;
and the terminal is responsible for capturing and pushing the audio and video streams to the media plane in real time, interacting with the user, submitting interaction data to the media plane, controlling the terminal display, expanding a multi-pane video window, and presenting captions.
In this embodiment, a method for converting speech to sign language is provided. Fig. 3 is a flowchart of a method for converting speech to sign language according to an embodiment of the present invention; as shown in Fig. 3, the flow includes the following steps:
Step S302: obtaining a first sign language action video sent by a first terminal while the first terminal and a second terminal are in a call, converting a first sign language action in the first sign language action video into first voice, and sending first interaction information to the second terminal, where the first interaction information includes the first voice; and/or obtaining second voice sent by the second terminal while the first terminal and the second terminal are in a call, converting the second voice into a second sign language action, generating a second sign language action video, and sending second interaction information to the first terminal, where the second interaction information includes the second sign language action video.
In step S302, converting the first sign language action in the first sign language action video into the first voice includes converting the first sign language action in the first sign language action video into first text and converting the first text into the first voice, where the first interaction information further includes the first text; and/or converting the second voice into the second sign language action and generating the second sign language action video includes converting the second voice into second text and converting the second text into the second sign language action video, where the second interaction information further includes the second text.
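For illustration, the following is a minimal sketch of the two conversion pipelines of step S302, not the patented implementation: the stub helpers stand in for the AI services of the media capability platform, and all names and signatures are illustrative assumptions.

```python
# A minimal sketch of the step S302 pipelines; the stubs below stand in
# for the AI services of the media capability platform (hypothetical names).
from dataclasses import dataclass
from typing import Optional

def recognize_sign_actions(video: bytes) -> str:
    return "hello"                          # stub: AI sign language recognition

def synthesize_speech(text: str) -> bytes:
    return text.encode("utf-8")             # stub: AI text-to-speech

def transcribe_speech(audio: bytes) -> str:
    return audio.decode("utf-8", "ignore")  # stub: AI speech-to-text

def render_sign_video(text: str) -> bytes:
    return b"rendered-video"                # stub: digital person rendering

@dataclass
class InteractionInfo:
    voice: Optional[bytes] = None           # first voice, for the second terminal
    text: Optional[str] = None              # intermediate text, sent to both sides
    sign_video: Optional[bytes] = None      # second sign action video, for the first terminal

def first_pipeline(first_sign_video: bytes) -> InteractionInfo:
    """First sign language action video -> first text -> first voice."""
    text = recognize_sign_actions(first_sign_video)
    return InteractionInfo(voice=synthesize_speech(text), text=text)

def second_pipeline(second_voice: bytes) -> InteractionInfo:
    """Second voice -> second text -> second sign language action video."""
    text = transcribe_speech(second_voice)
    return InteractionInfo(sign_video=render_sign_video(text), text=text)
```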
By converting the sign language actions into text and sending the text to the second terminal, the user of the second terminal can refer to the text while receiving the voice generated from the sign language conversion; the communication information is thus conveyed in multiple dimensions, making communication more accurate and fluent.
In step S302, sending the first interaction information to the second terminal includes sending the first interaction information to the second terminal through a data channel of the Internet Protocol Multimedia Subsystem (IMS) based on the 5G ultra-high-definition audio and video call service (VoNR); and/or sending the second interaction information to the first terminal includes sending the second interaction information to the first terminal through a data channel of the IMS based on the VoNR technology.
The 5G ultra-high-definition audio and video call service (VoNR) already provides high-quality calls with low connection delay. On this basis, the interaction information generated in the embodiment of the present invention is sent to the first terminal and the second terminal over the IMS data channel, so that barrier-free communication between voice and sign language is achieved while the original call is preserved.
In step S302, obtaining the first sign language action video sent by the first terminal while the first terminal and the second terminal are in a call and converting the first sign language action in the first sign language action video into the first voice includes: duplicating the video stream of the first sign language action video sent by the first terminal during the call, identifying and extracting the first sign language action in the first sign language action video, converting the first sign language action into a sign word sequence, converting the sign word sequence into the first text, and converting the first text into the first voice.
In step S302, obtaining the second voice sent by the second terminal while the first terminal and the second terminal are in a call, converting the second voice into the second sign language action, and generating the second sign language action video includes: duplicating the audio stream of the second voice sent by the second terminal during the call, converting the spoken-language logic of the second voice into the corresponding sign-language logic, generating a corresponding digital sign language action video according to the sign-language logic, adding a digital sign language action display window to the first terminal, and sending the digital sign language action video to that window.
In the process of converting voice into sign language, the spoken-language logic of the voice is first converted into the corresponding sign-language logic. This step ensures the accuracy of the generated sign language action video, avoids incorrect conversion caused by the differences between spoken-language logic and sign-language logic, and improves the quality of conversion between sign language and voice.
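As an illustration only, the toy rule-based sketch below shows what the spoken-logic-to-sign-logic step amounts to. The embodiment uses a trained natural language processing model rather than fixed rules; the word lists here are invented for the example.

```python
# A toy illustration of spoken-logic -> sign-logic conversion: drop function
# words and front time references. Real conversion uses a trained NLP model;
# the word lists here are illustrative assumptions, not linguistic facts.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "of", "to"}
TIME_WORDS = {"yesterday", "today", "tomorrow"}

def spoken_to_sign_logic(sentence: str) -> list[str]:
    """Return a sign gloss sequence for a spoken sentence."""
    words = [w.lower().strip(".,?!") for w in sentence.split()]
    content = [w for w in words if w not in STOP_WORDS]
    fronted = [w for w in content if w in TIME_WORDS]   # time words first
    rest = [w for w in content if w not in TIME_WORDS]
    return fronted + rest

print(spoken_to_sign_logic("I am going to the hospital tomorrow"))
# -> ['tomorrow', 'i', 'going', 'hospital']
```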
In step S302, obtaining the first sign language action video sent by the first terminal while the first terminal and the second terminal are in a call includes: judging whether the sign language and voice conversion service is enabled, and obtaining the first sign language action video sent by the first terminal during the call when at least one of the first terminal and the second terminal has the service enabled; and/or obtaining the second voice sent by the second terminal while the first terminal and the second terminal are in a call includes: judging whether the sign language and voice conversion service is enabled, and obtaining the second voice sent by the second terminal during the call when at least one of the first terminal and the second terminal has the service enabled.
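A small sketch of this service gate follows; the provisioning store and its field names are assumptions standing in for the operation management module's data.

```python
# Media-plane processing is triggered only when at least one call party has
# the sign language and voice conversion service enabled. The store and its
# field names are illustrative assumptions.
PROVISIONING = {
    "+8613800000001": {"sign_translation": True},   # first terminal: enabled
    "+8613800000002": {"sign_translation": False},  # second terminal: disabled
}

def conversion_service_enabled(*subscribers: str) -> bool:
    """True if any party in the call has enabled the conversion service."""
    return any(PROVISIONING.get(s, {}).get("sign_translation", False)
               for s in subscribers)

assert conversion_service_enabled("+8613800000001", "+8613800000002")
```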
In an exemplary embodiment, generating the corresponding digital sign language action video according to the sign-language logic includes querying a sign language coding library to determine the standard sign language actions corresponding to the sign-language logic, and rendering the standard sign language actions to obtain the digital sign language action video, where the rendering modes include avatar rendering, expression rendering, action rendering, mouth-shape rendering, and scene rendering.
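The coding-library lookup and layered rendering can be sketched as follows; the library contents, layer names, and instruction format are invented for illustration.

```python
# Query a sign language coding library for standard actions, then emit one
# layered render instruction per action. All data here is illustrative.
SIGN_CODE_LIBRARY = {
    "SL-0001": {"gloss": "hello", "keyframes": [(0, "raise-hand"), (12, "wave")]},
    "SL-0042": {"gloss": "thanks", "keyframes": [(0, "chin-touch"), (10, "move-forward")]},
}
RENDER_LAYERS = ("avatar", "expression", "action", "mouth_shape", "scene")

def render_instructions(codes: list[str]) -> list[dict]:
    """Look up each sign code and build a multi-layer render instruction."""
    result = []
    for code in codes:
        action = SIGN_CODE_LIBRARY[code]            # standard sign action
        result.append({
            "gloss": action["gloss"],
            "keyframes": action["keyframes"],
            "layers": {layer: "default" for layer in RENDER_LAYERS},
        })
    return result

print(render_instructions(["SL-0001", "SL-0042"]))
```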
In an exemplary embodiment, after converting the first sign language action in the first sign language action video into the first voice, the method further comprises sending the first text to the first terminal through a data channel of the IMS based on the VoNR technology; and/or after converting the second voice into the second text, the method further comprises sending the second text to the second terminal through a data channel of the IMS based on the VoNR technology.
Sending the first text, generated from the first sign language action, back to the first terminal allows the user of the first terminal to check whether the conversion result is accurate.
This embodiment also provides an apparatus for converting voice and sign language, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a voice and sign language conversion apparatus according to an embodiment of the present invention. As shown in Fig. 4, the voice and sign language conversion apparatus 400 includes at least one of the following:
the first conversion module 10 is configured to obtain a first sign language action video sent by a first terminal while the first terminal and a second terminal are in a call, convert a first sign language action in the first sign language action video into first voice, and send first interaction information to the second terminal, where the first interaction information includes the first voice;
The second conversion module 20 is configured to obtain a second voice sent by the second terminal in a state where the first terminal and the second terminal are in a call, convert the second voice into a second sign language action, generate a second sign language action video, and send second interaction information to the first terminal, where the second interaction information includes the second sign language action video.
Fig. 5 is a block diagram of a voice and sign language conversion apparatus according to still another embodiment of the present invention, and as shown in fig. 5, the voice and sign language conversion apparatus 500 includes a first conversion unit 11, a first transmission unit 12, a second conversion unit 21, and a second transmission unit 22.
A first conversion unit 11, configured to convert the first sign language action in the first sign language action video into the first voice, including converting the first sign language action into first text and converting the first text into the first voice, where the first interaction information includes the first text;
A first sending unit 12, configured to send the first interaction information to the second terminal through a data channel of the IMS based on the VoNR technology;
A second converting unit 21, configured to convert the second voice into a second sign language action and generate a second sign language action video, including converting the second voice into a second text, and converting the second text into a second sign language action video, where the second interaction information includes the second text;
and a second sending unit 22, configured to send the second interaction information to the first terminal through a data channel of the IMS based on the VoNR technology.
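As an illustrative sketch only, the apparatus of Figs. 4 and 5 might be organized as follows; the conversion and sending logic is stubbed, and every name is an assumption rather than the claimed structure.

```python
# A structural sketch of the apparatus: each module owns a conversion step
# and a sending unit (here a callable standing in for the IMS data channel).
class FirstConversionModule:
    """First sign language action video -> first voice + text -> second terminal."""
    def __init__(self, send_to_second):
        self.send = send_to_second              # first sending unit 12 (stub)

    def handle_sign_video(self, video: bytes) -> None:
        text = "hello"                          # stub: first conversion unit 11
        self.send({"voice": text.encode(), "text": text})

class SecondConversionModule:
    """Second voice -> second text + sign action video -> first terminal."""
    def __init__(self, send_to_first):
        self.send = send_to_first               # second sending unit 22 (stub)

    def handle_voice(self, audio: bytes) -> None:
        text = audio.decode("utf-8", "ignore")  # stub: second conversion unit 21
        self.send({"sign_video": b"rendered", "text": text})

# Usage: wire both modules to simple collectors standing in for the channel.
to_second, to_first = [], []
FirstConversionModule(to_second.append).handle_sign_video(b"frames")
SecondConversionModule(to_first.append).handle_voice("Hello".encode())
```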
To facilitate understanding of the technical solutions provided by the present invention, the following describes the details with reference to embodiments of specific scenarios.
Fig. 6 is a schematic diagram of user data transmission according to an embodiment of the present invention. As shown in Fig. 6, the embodiment of the present invention implements real-time bidirectional translation of sign language and voice mainly by means of the VoNR+ technology. VoNR+ uses the IMS data channel (IMS Data Channel) technology to add a data channel alongside the voice channel and the video channel. The VoNR+ network side performs layered coding and transmission of the audio and video channel services, provides different 5G QoS Identifiers (5QI) for guaranteeing Quality of Service (QoS), identifies different data packets so as to apply finer-grained QoS control to data channel services, and introduces new QoS parameters to support the transmission of tactile or sensor data. Along with the call, the data channel can carry richer interaction information, such as location, pictures, and text, and even auditory, visual, tactile, kinesthetic, and environmental information, upgrading the call from plain voice to a multimedia form. The transmission of the user interaction data involved in the present invention uses the data channel (DC).
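How interaction data might be framed for transmission on the DC is sketched below. The IMS Data Channel API itself is operator- and stack-specific and not specified in this description, so the message schema and the injected sender callable are assumptions.

```python
# A hedged sketch of framing user interaction data for the DC: JSON messages
# carrying a session id, a kind tag, and a timestamp for receiver-side
# ordering. The schema and the send_on_dc callable are assumptions.
import json
import time

def frame_interaction(session_id: str, kind: str, payload: str) -> bytes:
    """Serialize one interaction message for the data channel."""
    return json.dumps({
        "session": session_id,   # ties the message to the unique call session
        "kind": kind,            # e.g. "caption", "text", "control"
        "ts": time.time(),       # sender timestamp for time-sequence control
        "payload": payload,
    }).encode("utf-8")

def send_interaction(send_on_dc, session_id: str, kind: str, payload: str) -> None:
    send_on_dc(frame_interaction(session_id, kind, payload))

# Usage with any byte-oriented sender standing in for the channel:
sent = []
send_interaction(sent.append, "sess-001", "caption", "hello")
```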
Fig. 7 is a schematic diagram of a terminal display interface according to an embodiment of the present invention. As shown in Fig. 7, the hearing-impaired person needs to keep the head and the arms above the elbows visible in his or her video window, and the system recognizes the sign language and converts it into voice for broadcast. The system collects data such as voice and expression in the hearing party's video window, and the sign language digital person window displays the sign language converted and restored from the hearing party's voice. The caption scrolling area presents the chat records of both parties on a shared chat whiteboard, so that the hearing-impaired person can confirm the voice information broadcast by the system and the natural-language text translated from the sign language, and both parties can review the context of the conversation.
Fig. 8 is a flowchart of a method for converting speech and sign language according to another embodiment of the present invention; the flow specifically includes:
The user of terminal A initiates a call and enables real-time sign language translation;
terminal A requests the media plane to enable real-time sign language translation (taking sign-language-to-voice conversion as an example);
after determining that at least one of the two call parties has enabled the sign language and voice conversion service, the real-time sign language translation system switches the video of terminal A to a four-pane grid mode;
the system duplicates the media stream (the video stream) and performs AI sign language recognition on the video stream;
the system converts the voice generated by the sign language translation into text and sends the sign language recognition caption stream to terminal A and terminal B;
and terminal B plays the voice generated by the sign language translation.
Fig. 9 is a flowchart of a method for converting speech to sign language according to still another embodiment of the present invention. In contrast to the flow shown in Fig. 8, here the system converts voice into sign language; the specific flow is as follows:
Terminal B requests the media plane to enable real-time sign language translation (taking voice-to-sign-language conversion as an example);
after determining that at least one of the two call parties has enabled the sign language and voice conversion service, the real-time sign language translation system switches the video of terminal A to a four-pane grid mode;
the system duplicates the media stream (the audio stream), converts the voice into text, converts the text into sign language codes, and renders the digital person according to the sign language codes;
and the system sends the caption stream and the rendered digital person video stream to terminal A and terminal B.
The key modules involved in the sign language and voice conversion methods of Figs. 8 and 9 comprise a sign language and natural language bidirectional translation module, a sign language digital person rendering center, and an auxiliary caption synthesis module.
The sign language and natural language bidirectional translation module records and encodes the national standard sign language gestures. Through data input and deep learning, a natural language and sign language analysis model is generated. When the module converts spoken language into sign language, a natural language processing model converts the spoken-language logic into the corresponding sign-language logic, and the sign-language logic codes are transmitted to the rendering center.
Sign-language-to-spoken-language conversion executes the reverse logical process: the system receives a sign language action video as input and outputs spoken text or voice. Extraction of the sign language actions from the video is accomplished by image recognition, and the corresponding sign language action analysis model then parses the actions to obtain a sign word sequence.
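A minimal sketch of this reverse pipeline follows; the per-frame classifier is a stub standing in for the image-recognition and sign-action analysis models described above.

```python
# Sign -> spoken sketch: classify a gloss per frame, collapse consecutive
# duplicates (one sign spans many frames) into a sign word sequence, and
# join it into text. The classifier stub is an illustrative assumption.
from itertools import groupby

def classify_frame(frame: bytes) -> str:
    """Stub for the sign action recognizer; returns one gloss per frame."""
    return {b"f1": "I", b"f2": "I", b"f3": "go", b"f4": "hospital"}.get(frame, "")

def frames_to_text(frames: list[bytes]) -> str:
    glosses = [classify_frame(f) for f in frames]
    sequence = [g for g, _ in groupby(glosses) if g]  # sign word sequence
    return " ".join(sequence)

print(frames_to_text([b"f1", b"f2", b"f3", b"f4"]))  # -> "I go hospital"
```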
After receiving a rendering instruction, the sign language digital person rendering center obtains the corresponding sign language actions according to the sign code sequence. To this end, the rendering center preloads a model library of the sign language codes and actions used by the translation module, queries the action library by code to generate the action instructions for each frame, and thereby obtains the rendering action parameters.
In addition, the system supports multidimensional superimposed rendering of avatar, expression, mouth shape, action, scene, and the like, and pushes the video data to the user terminal as a real-time interactive video stream (streaming).
Avatar: two styles are supported, realistic and animated. For realistic rendering, face parameters are recognized from the first few frames of the video and a model is generated rapidly; for animated rendering, the user selects a digital persona provided by the system.
Clothing: the system provides a clothing library from which the user can select during the call.
Expression: expressions are realized according to an expression knowledge base and context judgment.
Action: the sign language code library is kept synchronized, and the corresponding gesture is looked up for a specified sign language code.
Scene: for realistic rendering, the background is extracted from the video frames; for animated rendering, the background is synthesized from a background provided by the system at the user's selection or from a background picture uploaded by the user.
A unique session is generated during the call and associated with the calling party and the called party. A session-level caption generation controller is established; it performs time-sequence control on the caption streams returned by the AI recognition, merges them into a new caption stream, and pushes the new stream to the calling and called terminals. After receiving the stream, the terminal plays it in a display area configured on the screen; if the captions exceed the display area, the area automatically scrolls upward over the older content.
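A minimal sketch of such a session-level caption controller follows; the class shape and field names are assumptions for illustration.

```python
# Merge the caption streams returned by AI recognition for both parties in
# timestamp order and drain them as display lines for the two terminals.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Caption:
    ts: float                           # ordering key: sender timestamp
    speaker: str = field(compare=False)
    text: str = field(compare=False)

class CaptionController:
    """Session-level controller: time-sequences and merges caption streams."""
    def __init__(self, session_id: str):
        self.session_id = session_id
        self._heap: list[Caption] = []

    def on_caption(self, ts: float, speaker: str, text: str) -> None:
        heapq.heappush(self._heap, Caption(ts, speaker, text))

    def flush(self) -> list[str]:
        """Drain captions in time order; terminals scroll older lines away."""
        lines = []
        while self._heap:
            c = heapq.heappop(self._heap)
            lines.append(f"[{c.speaker}] {c.text}")
        return lines

ctrl = CaptionController("sess-001")
ctrl.on_caption(2.0, "B", "Hello")
ctrl.on_caption(1.0, "A", "TOMORROW I GO HOSPITAL")
print(ctrl.flush())  # merged, time-ordered caption stream
```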
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; they are not repeated here.
High-definition video calling under the VoNR+ technical architecture has already been realized, and as this technology is combined with AI technology, video call scenarios under VoNR+ have great development potential. Meanwhile, the high bandwidth and low-latency deterministic network of 5G brings high-performance processing capability to video call scenarios. The technical solution disclosed in the embodiments of the present invention is applied to improving the experience of video calls between hearing-impaired and hearing users: on the basis of a 5G multimedia call, the IMS DC channel is used to carry the user interaction information, and a neural network model is adopted to realize bidirectional translation between voice and sign language. The hearing person's voice is converted into text, translated into sign language codes, and restored into a sign language digital person through real-time rendering during the call; meanwhile, the hearing-impaired person's sign language is converted into text and broadcast as voice to the other party. This brings each party its most familiar mode of communication, and with the real-time auxiliary captions the system conveys the communication information in multiple dimensions, through text, voice, and sign language, making communication smoother.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices; in some cases, the steps shown or described may be performed in a different order than shown or described herein; alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.