CN117915149A - AI enhanced video conference system, method and computer program product - Google Patents
- Publication number
- CN117915149A CN117915149A CN202410122788.7A CN202410122788A CN117915149A CN 117915149 A CN117915149 A CN 117915149A CN 202410122788 A CN202410122788 A CN 202410122788A CN 117915149 A CN117915149 A CN 117915149A
- Authority
- CN
- China
- Prior art keywords
- local
- picture
- facial expression
- static image
- participant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/462—Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
- H04N21/4621—Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
Abstract
The invention discloses an AI-enhanced video conference system, method and computer program product. The method comprises the following steps: collecting a current still image at the local end, transmitting it to the remote end, and storing it at the remote end as a second local still image; the local end collects local video pictures and sound in real time, recognizes the facial expression and limb motion data of the local participant, and transmits these data to the remote end; the remote end fuses the facial expression and limb motion data of the local participant with the second local still image and reconstructs a dynamic ultra-high-definition video picture of the local end; the dynamic ultra-high-definition video picture and the sound file are played synchronously according to their timestamps. Because only the facial expressions and limb motions captured by the camera are transmitted, the volume of transmitted data is low, the response speed is high, and the picture is smooth. The reconstructed image provides a better visual experience, so that users can observe and understand the participants and content of the teleconference more clearly, improving the conference and interaction experience.
Description
Technical Field
The invention relates to the technical field of video conferences, in particular to an AI enhanced video conference system, an AI enhanced video conference method and a computer program product.
Background
Video conferencing is now becoming more and more popular as an effective means of communication in everyday offices.
However, video transmission involves a large amount of data and demands high network bandwidth. Owing to bandwidth and transmission limitations, existing remote video conferencing systems often suffer from the following problems:
Video pictures in a remote video conference are frequently blurred, stuttering or distorted, and cannot achieve a high-definition or smooth effect, which degrades the audiovisual experience and communication quality. In addition, video conferences often suffer from delay, freezing and instability, especially when network conditions are poor or there are many participants.
In view of this, existing remote video conferencing systems need improvement so as to provide high-definition, smooth pictures and avoid delay, freezing and instability in video conferences.
Disclosure of Invention
In view of the above defects, the technical problem to be solved by the invention is to provide an AI-enhanced video conference system and method, so as to solve the problems of blurred video pictures, delay, stuttering and instability in the prior art.
To this end, the present invention provides a method for AI-enhanced video conferencing, the video conferencing including a local end and a remote end connected by a network, the method comprising the steps of:
The local terminal acquires a current static picture of the local terminal according to a static picture acquisition instruction to obtain a first local static image, wherein the first local static image comprises a local participant;
The local end transmits the first local static image to the remote end and stores the first local static image as a second local static image at the remote end;
The local end collects local video pictures and sound in real time, recognizes the facial expression and limb motion data of the local participant, and transmits these data to the remote end, wherein timestamps are attached to the facial expression and limb motion data, timestamps are likewise attached to the local-end sound file collected in real time, and the facial expression and limb motion data and the local-end sound file are transmitted to the remote end through independent channels;
The remote end fuses the facial expression and limb motion data of the local participant with the second local static image, and reconstructs to obtain a dynamic ultra-high definition video picture of the local end;
and the remote terminal synchronously plays the dynamic ultra-high definition video picture of the local terminal and the local terminal sound file according to the time stamp.
In the above method, preferably, the first local still image is saved on a storage device of the local side.
In the above method, preferably, the operation of collecting the current still picture of the local terminal is triggered by a still picture collecting button, where the still picture collecting button is disposed on a software interface of the video conference system or on a camera of the local terminal.
In the above-described method, preferably,
Facial expression data of the local participant are obtained through recognition by a facial expression recognition technique;
limb motion data of the local participant are obtained through recognition by a limb motion recognition technique.
In the above method, preferably, an AI image intelligent processing module is built in the local camera, and the video picture is enhanced in real time.
In the above method, preferably, the enhancing of the video picture includes adjusting brightness, contrast, and sharpening.
In the above method, preferably, the local terminal collects a plurality of first local still images with different angles, and transmits the first local still images to the remote terminal to be stored as a plurality of second local still images;
And the remote end constructs and obtains a 3D model of the local participant by utilizing a plurality of second local static images, fuses the 3D model with facial expression and limb motion data of the local participant, and reconstructs and obtains a dynamic ultra-high definition video picture of the local end.
In the above method, preferably, during the video conference, the still picture collection button may be pressed at any time, so as to collect the current still picture and transmit the current still picture to the remote terminal to replace the second local still picture.
The invention also provides an AI-enhanced video conference system, comprising a local end and a remote end which are connected by a network,
The local end is provided with an acquisition module and a transmission module, the acquisition module acquires a current static picture of the local end according to a static picture acquisition instruction to obtain a first local static image, and the first local static image comprises a local participant; collecting local video pictures and sounds in real time, identifying and obtaining facial expression and limb action data of a local participant, wherein time stamps are arranged on the facial expression and limb action data, and time stamps are also arranged on a local sound file collected in real time; the transmission module transmits the first local static image to a remote end, and transmits facial expression and limb motion data of the participant and a local end sound file to the remote end through independent channels in real time;
The remote end is provided with a receiving module, a reconstruction module and a playing module, and the receiving module receives the first local static image and stores the first local static image as a second local static image; receiving facial expression and limb motion data of the local participant and a local-end sound file in real time, and fusing the facial expression and limb motion data of the local participant with the second local static image by the reconstruction module to obtain a dynamic ultra-high definition video picture of the local end in a reconstruction mode; and the playing module synchronously plays the dynamic ultra-high definition video picture and the local sound file of the local terminal according to the time stamp.
The invention also provides a computer program product comprising a computer program/instruction which, when executed by a processor, implements the method described above.
According to the technical scheme above, the AI-enhanced video conference system, method and computer program product provided by the invention solve the problems of blurred video pictures, delay, stuttering and instability that frequently occur in the prior art. Compared with the prior art, the invention has the following beneficial effects:
The local end collects its current still picture and sends it to the remote end; the local end collects the local video picture in real time, recognizes the facial expression and limb motion data of the local participant, and transmits these data to the remote end; the remote end fuses the facial expression and limb motion data of the local participant with the second local still image and reconstructs a dynamic ultra-high-definition video picture of the local end. Because only the facial expressions and limb motions captured by the camera are transmitted, the volume of transmitted data is low, the response speed is high, and the picture is smooth. The reconstructed image provides a better visual experience, so that users can observe and understand the participants and content of the teleconference more clearly, improving the conference and interaction experience.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart of a method for AI-enhanced video conferencing provided by the invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention, and it is apparent that the embodiments described below are only some embodiments of the present invention, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort are intended to fall within the scope of the present invention.
In order to make the explanation and the description of the technical solution and the implementation of the present invention clearer, several preferred embodiments for implementing the technical solution of the present invention are described below.
In this document, the terms "inner, outer", "front, rear", and "left, right" are expressions based on the usage status of the product, and it is apparent that the usage of the corresponding terms does not limit the scope of the present solution.
Referring to fig. 1, fig. 1 is a flowchart of a method for AI-enhanced video conferencing provided by the invention.
Video conferencing systems typically include a local end and a remote end connected by a network. As shown in fig. 1, the method for AI-enhanced video conferencing provided by the invention comprises the following steps:
Step 110, the local terminal collects the current static picture of the local terminal according to the static picture collection instruction, and obtains the first local static image.
If there are multiple local participants, the first local still image may include all the local participants, or, of course, the camera may be adjusted to make the first local still image include only the speaker according to the speaking condition of the local participant.
The operation of collecting the current still picture at the local end can be triggered by a still-picture collection button, which may be provided on the software interface of the video conference system or on the camera of the local end. When the button is pressed, the local end captures the current picture and saves it as the first local still image on a local storage device, for example in JPEG or PNG format. The user can then access and view the first local still image on the local storage device to confirm the sharpness and integrity of the captured image, which facilitates adjusting the capture parameters.
The local side transmits the first local still image to the remote side and saves it as a second local still image at the remote side, step 120.
Step 130, the local terminal collects the local video image and sound in real time, and recognizes and obtains the facial expression and limb motion data of the local participant, and transmits the facial expression and limb motion data to the remote terminal.
Facial expression data of the local participant, such as smiles, blinks and frowns, are obtained by recognition using a facial expression recognition technology such as the Microsoft Azure Face API.
Limb motion data of the local participant are obtained by recognition using a limb motion recognition technology such as OpenPose.
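To make the kind of data involved concrete, here is a toy expression feature computed from face landmarks. The landmark names and the smile metric are illustrative assumptions; a real system would obtain landmarks from a model such as those behind the Azure Face API or an OpenPose-style detector.

```python
def smile_score(landmarks: dict) -> float:
    """Toy expression feature: how far the mouth corners sit above the mouth
    centre (positive -> smiling). Coordinates are (x, y) with y increasing
    downwards, as is usual in images, so raised corners have smaller y."""
    left = landmarks["mouth_left"]
    right = landmarks["mouth_right"]
    centre = landmarks["mouth_centre"]
    # Centre height minus average corner height: positive when corners are raised.
    return centre[1] - (left[1] + right[1]) / 2.0
```

A few scalar features like this per frame are orders of magnitude smaller than the raw video frame, which is the core of the bandwidth saving described here.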
The facial expression and limb motion data recognized from the video picture carry timestamps, and the local-end sound file collected in real time also carries timestamps; the facial expression and limb motion data and the local-end sound file are transmitted to the remote end through independent channels.
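One plausible shape for the two timestamped streams is sketched below. The packet fields and the JSON encoding are assumptions for illustration; the patent only requires that motion data and audio each carry timestamps and travel on independent channels.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class MotionPacket:
    """Facial-expression / limb-motion data sent on its own channel."""
    ts: float                      # capture timestamp, in seconds
    expression: dict = field(default_factory=dict)   # e.g. {"smile": 0.8}
    pose_keypoints: list = field(default_factory=list)  # e.g. [[x, y, conf], ...]

@dataclass
class AudioPacket:
    """Audio chunk sent on a separate, independent channel."""
    ts: float
    pcm: bytes = b""

def encode_motion(p: MotionPacket) -> bytes:
    # Motion data is tiny compared with raw video, so a simple text
    # encoding like JSON is plausible for the motion channel.
    return json.dumps(asdict(p)).encode()
```

Keeping the channels independent means audio delivery is never blocked behind motion data, and the timestamps let the remote end re-associate the streams at playback time.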
In step 140, the remote end fuses the facial expression and limb motion data of the local participant into the second local still image, and reconstructs the dynamic ultra-high definition video frame of the local end. And synchronously transferring the facial expression and the timestamp on the limb motion data to the dynamic ultra-high definition video picture.
And step 150, the remote terminal synchronously plays the dynamic ultra-high definition video picture of the local terminal and the local terminal sound file according to the time stamp.
The timestamps of the local-end sound file are used as the reference, and the timestamps of the dynamic ultra-high-definition video picture of the local end are aligned to them for synchronous playback.
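The audio-as-master-clock alignment just described can be sketched as a simple lookup: for each audio timestamp, show the latest reconstructed frame that is not newer than it. The function and its sorted-list interface are illustrative assumptions, not the patent's implementation.

```python
import bisect

def frame_for_audio_ts(frame_ts: list, audio_ts: float) -> int:
    """Index of the reconstructed video frame to display while the audio
    chunk stamped audio_ts plays: the latest frame whose timestamp does not
    exceed the audio timestamp (audio is the reference clock).

    frame_ts must be sorted ascending; returns 0 if no frame is early enough."""
    i = bisect.bisect_right(frame_ts, audio_ts) - 1
    return max(i, 0)
```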
In a typical video conference scene, the background remains largely unchanged; the dynamic changes come mainly from the facial expressions and limb motions of the participants.
In the method provided by the invention, the remote end stores a still picture of the local end; the local end collects the local video picture in real time, recognizes the facial expression and limb motion data of the local participant, and transmits these data to the remote end; the dynamic ultra-high-definition video picture of the local end is then reconstructed by fusing the facial expression and limb motion data with the still picture stored at the remote end. Because only data such as facial expressions and limb motions need to be transmitted, the transmitted data volume is much lower than in traditional video transmission, which reduces bandwidth pressure, improves speed, makes the video picture smoother, and achieves a high-definition effect.
In the method, the camera can incorporate a built-in AI image intelligent processing module, which enhances the image using image processing algorithms, deep learning and neural network models, for example by adjusting brightness and contrast and by sharpening, thereby improving picture quality and the participants' experience.
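As a minimal stand-in for such an enhancement module, the sketch below applies a linear brightness/contrast adjustment to grayscale pixel values and clamps them to the 0-255 range. A real module would use learned models; this only illustrates the kind of per-pixel adjustment involved, and the pivot value 128 and default parameters are assumptions.

```python
def enhance(pixels, brightness=10, contrast=1.2):
    """Linear brightness/contrast adjustment around mid-grey (128),
    clamped to the valid 8-bit range [0, 255]."""
    out = []
    for p in pixels:
        v = (p - 128) * contrast + 128 + brightness
        out.append(max(0, min(255, round(v))))
    return out
```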
In the method, the local end can collect a plurality of first local still images from different angles and transmit them to the remote end to be stored as a plurality of second local still images; the remote end constructs a 3D model of the local participant from the plurality of second local still images, fuses the 3D model with the facial expression and limb motion data of the local participant, and reconstructs a dynamic ultra-high-definition video picture of the 3D character.
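A much simpler cousin of the multi-angle idea, shown purely for intuition: with several stills stored at known capture angles, pick the one closest to the participant's current head yaw. The angle-keyed dictionary interface is an assumption; the patent itself fuses the stills into a full 3D model rather than selecting among them.

```python
def nearest_view(stored_views: dict, head_yaw: float) -> str:
    """Return the name of the stored still whose capture angle (degrees)
    is closest to the participant's current head yaw."""
    return min(stored_views, key=lambda name: abs(stored_views[name] - head_yaw))
```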
In the method, the still-picture collection button can be pressed at any time during the video conference to collect the current still picture and transmit it to the remote end to replace the second local still image. Thus, if the current still picture of the local end is found to be defective during the conference, for example if the participant is cut off or not centered, the still picture can be collected again and the picture reconstructed from the newly collected one, so that the picture looks more comfortable and achieves a better visual effect and conference experience.
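The replacement mechanism amounts to a small piece of state at the remote end, sketched below. The bytes-based interface and the version counter are illustrative assumptions.

```python
class RemoteStillStore:
    """Remote-end holder for the 'second local still image'. Pressing the
    still-capture button again at the local end ships a new image, which
    replaces the stored one so that subsequent frames are reconstructed
    from the new still."""

    def __init__(self):
        self.still = None
        self.version = 0

    def replace(self, image_bytes: bytes) -> int:
        self.still = image_bytes
        self.version += 1   # lets the reconstruction module notice the swap
        return self.version
```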
The method is described above in terms of the remote end reconstructing the picture of the local end in real time. In practical applications the local end and the remote end have the same functions: using the same method, the local end can simultaneously reconstruct the picture of the remote end in real time, so that the two parties of the video conference can communicate with each other.
The complete AI-enhanced video conference method, in which the local end and the remote end reconstruct simultaneously, comprises the following steps:
Step 210, the local terminal acquires a current static picture of the local terminal through a camera according to a static picture acquisition instruction of the local terminal, and a first local static image is obtained, wherein the first local static image comprises a local participant;
The remote terminal acquires a current static picture of the remote terminal through a camera according to a static picture acquisition instruction of the remote terminal, and a first remote static image is obtained, wherein the first remote static image comprises a remote participant.
Step 220, the local end transmits the first local static image to the remote end, and stores the first local static image as a second local static image at the remote end; the remote terminal transmits the first remote static image to the local terminal and stores the first remote static image as a second remote static image at the local terminal.
Step 230, the local terminal collects the local video image and sound in real time, and recognizes and obtains the facial expression and limb action data of the local participant, and transmits the facial expression and limb action data to the remote terminal; the remote terminal collects remote video pictures and sounds in real time, recognizes and obtains facial expression and limb action data of a remote participant, and transmits the facial expression and limb action data to the local terminal.
Step 240, the remote end fuses the facial expression and limb motion data of the local participant with the second local static image, and reconstructs to obtain a dynamic ultra-high definition video picture of the local end; the local terminal fuses the facial expression and limb motion data of the remote participant with the second remote static image, and the dynamic ultra-high definition video picture of the remote terminal is obtained through reconstruction.
Step 250, the remote end synchronously plays the dynamic ultra-high-definition video picture of the local end and the local-end sound file; the local end synchronously plays the dynamic ultra-high-definition video picture of the remote end and the remote-end sound file.
Likewise, the time stamp of the sound file is used as a reference to align the time stamp of the dynamic ultra-high definition video picture for synchronous playing.
In the method, the local end and the remote end work simultaneously, and simultaneously display the dynamic ultra-high definition video picture of the video conference in real time.
The above-described method of AI-enhanced video conferencing may be designed as a computer program, and for this purpose the application also provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the above-described method of AI-enhanced video conferencing.
Based on the above method, the invention also provides an AI-enhanced video conference system, which comprises a local end and a remote end which are connected through a network,
The local end is provided with an acquisition module and a transmission module, the acquisition module acquires a current static picture of the local end according to a static picture acquisition instruction to obtain a first local static image, and the first local static image comprises a local participant; collecting local video images and sounds in real time, and identifying and obtaining facial expression and limb action data of a local participant; the transmission module transmits the first local static image to a remote end and transmits facial expression and limb action data of the participant to the remote end in real time;
the remote end is provided with a receiving module, a reconstruction module and a playing module, and the receiving module receives the first local static image and stores the first local static image as a second local static image; receiving facial expression and limb motion data of the local participant and a local-end sound file in real time, and fusing the facial expression and limb motion data of the local participant with a second local static image by the reconstruction module to obtain a dynamic ultra-high definition video picture of the local end in a reconstruction mode; the playing module synchronously plays the dynamic ultra-high definition video picture of the local terminal and the local terminal sound file.
In view of the above description of the specific embodiments, the AI-enhanced videoconferencing system, method, and computer program product provided by the present invention have the following advantages over the prior art:
Firstly, a local terminal collects a current static picture of the local terminal and sends the current static picture to a remote terminal, the local terminal collects a local video picture in real time, and recognizes and obtains facial expression and limb action data of a local participant and transmits the facial expression and limb action data to the remote terminal; the remote end fuses the facial expression and limb motion data of the local participant with the second local static image to reconstruct a dynamic ultra-high definition video picture of the local end. Because only facial expression and limb actions acquired by the camera are transmitted, the data transmission quantity is low, the response speed is high, and the picture is smooth.
Second, the reconstructed image provides a better visual experience, enabling users to more clearly view and understand the participants and content in the teleconference, improving the teleconference and interactive experience.
Thirdly, in the video conference process, if the picture reconstructed based on the current static picture of the local terminal is found to be defective, the static picture acquisition button can be pressed at any time to acquire the current static picture and transmit the current static picture to the remote terminal to replace the second local static picture for reconstruction, so that the picture looks more comfortable, and better visual effect and conference experience are achieved.
Fourth, the collection of facial and expressive features can provide more accurate and realistic user feedback, improving the experience of teleconferencing and interaction.
Finally, it is also noted that the terms "comprises," "comprising," or any other variation thereof, as used herein, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The present invention is not limited to the above-mentioned preferred embodiments, and any person who can learn the structural changes made under the teaching of the present invention can fall within the scope of the present invention if the present invention has the same or similar technical solutions.
Claims (10)
1. A method of AI-enhanced video conferencing, the video conferencing comprising a local end and a remote end connected by a network, the method comprising the steps of:
The local terminal acquires a current static picture of the local terminal according to a static picture acquisition instruction to obtain a first local static image, wherein the first local static image comprises a local participant;
The local end transmits the first local static image to the remote end and stores the first local static image as a second local static image at the remote end;
The local terminal collects local video pictures and sounds in real time, and recognizes and obtains facial expression and limb action data of a local participant, a timestamp is arranged on the facial expression and limb action data, a timestamp is also arranged on a local terminal sound file collected in real time, and the facial expression and limb action data and the local terminal sound file are respectively transmitted to the remote terminal through independent channels;
The remote end fuses the facial expression and limb motion data of the local participant with the second local static image, and reconstructs to obtain a dynamic ultra-high definition video picture of the local end;
and the remote terminal synchronously plays the dynamic ultra-high definition video picture of the local terminal and the local terminal sound file according to the time stamp.
2. The method of claim 1, wherein the first local still image is saved to a storage device on a local side.
3. The method of claim 1, wherein the operation of capturing the current still picture at the local side is triggered by a still picture capture button provided on a software interface of the video conference system or on a camera at the local side.
4. The method of claim 1, wherein:
facial expression data of the local participant are obtained by facial expression recognition; and
limb motion data of the local participant are obtained by limb motion recognition.
5. The method of claim 1, wherein the local camera has a built-in AI image processing module that enhances the video picture in real time.
6. The method of claim 5, wherein enhancing the video picture comprises adjusting brightness, adjusting contrast, and sharpening.
7. The method of claim 1, wherein:
the local end captures a plurality of first local still images at different angles and transmits them to the remote end, where they are stored as a plurality of second local still images; and
the remote end constructs a 3D model of the local participant from the plurality of second local still images, fuses the 3D model with the facial expression and limb motion data of the local participant, and reconstructs the dynamic ultra-high-definition video picture of the local end.
8. The method of claim 1, wherein the still picture capture button may be pressed at any time during the video conference to capture a current still picture and transmit it to the remote end to replace the second local still image.
9. An AI-enhanced video conferencing system comprising a local end and a remote end connected by a network, characterized in that:
the local end is provided with a capture module and a transmission module; the capture module captures a current still picture of the local end according to a still picture capture instruction to obtain a first local still image, the first local still image including a local participant; it captures local video pictures and sound in real time and recognizes facial expression and limb motion data of the local participant, timestamps being attached to the facial expression and limb motion data and to the local-end sound file captured in real time; the transmission module transmits the first local still image to the remote end, and transmits the facial expression and limb motion data of the participant and the local-end sound file to the remote end in real time through independent channels;
the remote end is provided with a receiving module, a reconstruction module and a playing module; the receiving module receives the first local still image and stores it as a second local still image, and receives the facial expression and limb motion data of the local participant and the local-end sound file in real time; the reconstruction module fuses the facial expression and limb motion data of the local participant with the second local still image and reconstructs a dynamic ultra-high-definition video picture of the local end; and the playing module synchronously plays the dynamic ultra-high-definition video picture of the local end and the local-end sound file according to the timestamps.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
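The transmission scheme of claim 1 (a one-time still image, then lightweight timestamped motion and audio packets on independent channels, fused at the remote end) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: `MotionPacket`, `AudioPacket`, and `RemoteEnd` are hypothetical names, and `fuse` is a placeholder for the actual animation/reconstruction model that would warp the still image using the motion data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MotionPacket:
    """Facial expression and limb motion data for one instant (claim 1)."""
    timestamp_ms: int
    expression: Dict[str, float]   # e.g. blendshape weights (illustrative)
    limb_pose: List[float]         # e.g. joint angles (illustrative)

@dataclass
class AudioPacket:
    """Timestamped local-end sound data, sent on its own channel."""
    timestamp_ms: int
    samples: bytes

@dataclass
class RemoteEnd:
    """Remote side: holds the second local still image and fuses incoming
    motion packets with it to reconstruct timestamped video frames."""
    still_image: bytes             # the "second local still image"
    frames: List[Tuple] = field(default_factory=list)

    def fuse(self, pkt: MotionPacket) -> Tuple:
        # Placeholder: a real system would animate the still image with
        # the expression and limb-motion data to produce an UHD frame.
        frame = (pkt.timestamp_ms, self.still_image, pkt.expression, pkt.limb_pose)
        self.frames.append(frame)
        return frame

remote = RemoteEnd(still_image=b"<ultra-hd still>")
frame = remote.fuse(MotionPacket(40, {"smile": 0.8}, [0.1, 0.2]))
print(frame[0])  # 40
```

The point of the design is bandwidth: after the single still image is sent, only small motion and audio packets cross the network, while full-resolution frames are synthesized at the receiver.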
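The final step of claim 1, synchronous playback according to the timestamps, amounts to pairing each audio chunk with the reconstructed frame whose timestamp is nearest. A minimal sketch under that assumption (function and variable names are illustrative, not from the patent):

```python
import bisect

def sync_playback(frames, audio_chunks):
    """Pair each audio chunk with the reconstructed video frame whose
    timestamp is nearest. Both inputs are (timestamp_ms, payload) lists
    sorted by timestamp."""
    ts = [t for t, _ in frames]
    paired = []
    for at, chunk in audio_chunks:
        i = bisect.bisect_left(ts, at)
        # choose the closer of the two neighbouring frames
        if i == 0:
            j = 0
        elif i == len(ts):
            j = len(ts) - 1
        else:
            j = i if ts[i] - at < at - ts[i - 1] else i - 1
        paired.append((at, chunk, frames[j][1]))
    return paired

frames = [(0, "f0"), (40, "f1"), (80, "f2")]
audio = [(5, "a0"), (42, "a1"), (79, "a2")]
print(sync_playback(frames, audio))
# [(5, 'a0', 'f0'), (42, 'a1', 'f1'), (79, 'a2', 'f2')]
```

Because the motion/audio channels are independent (claim 1), some such timestamp alignment at the remote end is what keeps lip movement and sound in step despite differing channel latencies.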
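Claim 7 builds a 3D model from several still images captured at different angles. As a much simpler stand-in for that idea, the sketch below merely selects, among multi-angle reference stills, the one closest to the participant's current head yaw; a real system would instead fit a 3D model to the stills. All names and the angle-keyed representation are assumptions for illustration.

```python
def pick_reference_still(stills, head_yaw_deg):
    """Choose, among still images captured at different angles (keyed by
    capture angle in degrees), the one nearest the current head yaw."""
    return min(stills.items(), key=lambda kv: abs(kv[0] - head_yaw_deg))[1]

# Hypothetical set of "second local still images" at three capture angles.
stills = {-30: "left-profile", 0: "frontal", 30: "right-profile"}
print(pick_reference_still(stills, 22))   # right-profile
print(pick_reference_still(stills, -20))  # left-profile
```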
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410122788.7A CN117915149A (en) | 2024-01-29 | 2024-01-29 | AI enhanced video conference system, method and computer program product |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410122788.7A CN117915149A (en) | 2024-01-29 | 2024-01-29 | AI enhanced video conference system, method and computer program product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117915149A true CN117915149A (en) | 2024-04-19 |
Family
ID=90696033
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410122788.7A Pending CN117915149A (en) | 2024-01-29 | 2024-01-29 | AI enhanced video conference system, method and computer program product |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117915149A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118509616A (en) * | 2024-07-16 | 2024-08-16 | 东方通信股份有限公司 | Low-bandwidth high-definition video conferencing system and video transmission method based on the financial industry |
Events
- 2024-01-29: Application CN202410122788.7A filed (CN), patent/CN117915149A/en, status: active Pending
Similar Documents
| Publication | Title |
|---|---|
| US7710450B2 (en) | System and method for dynamic control of image capture in a video conference system |
| Deshpande et al. | A real-time interactive virtual classroom multimedia distance learning system |
| CN112235530B (en) | Method and device for realizing teleconference, electronic device and storage medium |
| EP1201082B1 (en) | Method and system for video conferences |
| CN101453662B (en) | Stereo video communication terminal, system and method |
| US20080235724A1 (en) | Face Annotation In Streaming Video |
| US20030016750A1 (en) | Frame-interpolated variable-rate motion imaging system |
| CN104335575A (en) | Method, server, and terminal for conducting a video conference |
| CN112672090B (en) | Method for optimizing audio and video effects in cloud video conference |
| CN101166161B (en) | A method for dynamically obtaining exchange on screen of other party in instant communication tool |
| CN118318436A (en) | System and method for generating a video stream |
| CN105933637A (en) | Video communication method and system |
| CN117915149A (en) | AI enhanced video conference system, method and computer program product |
| JP2006054830A (en) | Image compression communication method and apparatus |
| CN114286021B (en) | Rendering method, rendering device, server, storage medium, and program product |
| CN106412617A (en) | Remote debugging control method and device |
| CN113810725B (en) | Video processing method, device, storage medium and video communication terminal |
| CN115190289A (en) | 3D holographic video screen communication method, cloud server, storage medium and electronic device |
| JP7325865B1 (en) | Screen Synthesis Method Using Web Conferencing System |
| WO2025125559A1 (en) | Method, video-conferencing endpoint, server |
| CN118714255A (en) | A video conferencing method and device based on interpolation technology |
| Zhang | Image Quality Assessment and Saliency Detection: human visual perception modeling and applications |
| CN120378568A (en) | Method, device, equipment and medium for processing video conference based on ultra-low bandwidth |
| CN118337973A (en) | Naked eye 3D-based mixed reality live broadcast interaction system and method |
| CN116009700A (en) | Data processing method and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |