CN119182875A - Video interaction method and device and related equipment - Google Patents
- Publication number
- CN119182875A (application number CN202411269348.0A)
- Authority
- CN
- China
- Prior art keywords
- terminal
- picture
- target
- video call
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
Abstract
The application discloses a video interaction method, a video interaction device, and related equipment, belonging to the technical field of video. The method includes: receiving first data to be processed, where the first data to be processed includes a first video call picture and a target object specified by a user in the first video call picture, the first video call picture being generated during a video call between a first terminal and a second terminal; processing the first data to be processed to obtain first target data, where the first target data includes a first target picture and a first 3D model of the target object, the first target picture being the picture after the target object is deleted from the first video call picture; and sending the first target data to the first terminal and the second terminal so that the two terminals update the first video call picture according to the first target picture and the first 3D model. The user can thereby describe the target object more intuitively, improving the communication efficiency of both parties to the video call.
Description
Technical Field
The application belongs to the technical field of videos, and particularly relates to a video interaction method, a video interaction device and related equipment.
Background
Video call technology facilitates communication, but when the two parties to a video call need to discuss a specific object in the video, the existing approach is for one user to describe the object of interest verbally. A verbal description, however, requires a certain capacity for language expression; otherwise the other user is prone to misunderstanding, with the result that the two parties are not talking about the same object.
Therefore, existing video interaction methods cannot operate directly on video call content and suffer from low communication efficiency.
Disclosure of Invention
The embodiment of the application aims to provide a video interaction method, a video interaction device and related equipment, which can solve the problem of low communication efficiency of the existing video interaction method.
In a first aspect, an embodiment of the present application provides a video interaction method, where the method includes:
Receiving first data to be processed, wherein the first data to be processed comprises a first video call picture and a target object appointed by a user in the first video call picture, and the first video call picture is generated in the video call process of a first terminal and a second terminal;
Processing the first data to be processed to obtain first target data, wherein the first target data comprises a first target picture and a first 3D model of the target object, and the first target picture is a picture after deleting the target object in the first video call picture;
And sending the first target data to the first terminal and the second terminal so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model.
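The three steps of the first aspect can be sketched as a minimal server-side handler. This is an illustrative sketch only: all function and field names are hypothetical and not taken from the application; a "picture" is modeled as a 2D list of pixels and the target object as a bounding box `(r0, c0, r1, c1)`.

```python
def process_first_data(frame, target_box):
    """Step 2: build the first 3D model and the first target picture."""
    r0, c0, r1, c1 = target_box
    first_3d_model = {"region": target_box}  # stand-in for real 3D modeling
    # Delete the target object's pixels to form the first target picture.
    first_target_picture = [
        [None if (r0 <= r < r1 and c0 <= c < c1) else px
         for c, px in enumerate(row)]
        for r, row in enumerate(frame)
    ]
    return {"picture": first_target_picture, "model": first_3d_model}

def handle_first_pending_data(pending, send):
    """Steps 1 and 3: receive the first data to be processed and send the
    resulting first target data to both terminals via the `send` callable."""
    target_data = process_first_data(pending["frame"], pending["target_box"])
    for terminal in ("first_terminal", "second_terminal"):
        send(terminal, target_data)
    return target_data
```

In a real system the deleted region would also be repaired and the 3D model reconstructed from the image, as described later in the detailed description.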
In a second aspect, an embodiment of the present application provides a video interaction device, including:
The first receiving module is used for receiving first data to be processed, wherein the first data to be processed comprises a first video call picture and a target object appointed by a user in the first video call picture, and the first video call picture is generated in the video call process of a first terminal and a second terminal;
The first processing module is used for processing the first data to be processed to obtain first target data, wherein the first target data comprises a first target picture and a first 3D model of the target object, and the first target picture is a picture after deleting the target object in the first video call picture;
And the first sending module is used for sending the first target data to the first terminal and the second terminal so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor, a memory, and a program stored on the memory and executable on the processor, where the program is executed by the processor to implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method as described in the first aspect.
In the embodiment of the application, a first server receives first data to be processed during a video call, where the first data to be processed includes a first video call picture and a target object specified by a user in that picture. The first server processes the first data to be processed to obtain first target data comprising a first target picture and a first 3D model, and then sends the first target data to the first terminal and the second terminal respectively, so that the two terminals synchronously present the first target picture and the first 3D model of the target object, with the first 3D model superimposed on the first target picture. The user can thus continue normal video communication through the first target picture and, when describing the target object based on the first 3D model, can do so more intuitively, reducing the repeated confirmation and communication otherwise required by unclear descriptions or by indicating the target object in other ways, and improving the communication efficiency of both parties to the video call. The method is applicable to a variety of video call scenarios, especially those in which a target object needs to be accurately described or operated on, such as teaching, work meetings, and remote technical support.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present application, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is the first flowchart of a video interaction method provided by an embodiment of the present application;
Fig. 2 is the second flowchart of a video interaction method provided by an embodiment of the present application;
Fig. 3 is the third flowchart of a video interaction method provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a video interaction device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The method provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, the first flowchart of the video interaction method provided by an embodiment of the present application, the execution subject of the video interaction method of this embodiment may be a first server. The method specifically includes the following steps:
Step 101, receiving first data to be processed, wherein the first data to be processed comprises a first video call picture and a target object appointed by a user in the first video call picture, and the first video call picture is generated in the video call process of a first terminal and a second terminal;
The first terminal may be a terminal device used by user A, such as a mobile phone, tablet computer, or video phone, and the second terminal may be a terminal device used by user B, such as a mobile phone, tablet computer, or video phone. The first terminal can collect a picture of user A's environment and send it to the second terminal, so that the picture of user A's environment is presented to user B on the second terminal; likewise, the second terminal can collect a picture of user B's environment and send it to the first terminal, so that the picture of user B's environment is presented to user A on the first terminal, thereby realizing a real-time video call between user A and user B.
The first server may be the server corresponding to the first terminal, i.e. the backend of the first terminal. When user A and user B conduct a video call through the corresponding first and second terminals, user A may want to describe a target object in the video picture to user B in detail. If the target object is an object in user A's environment, user A can pick it up and hold it close to the first terminal, and based on the communication connection between the two terminals, user B can clearly view the target object on the second terminal. However, when the target object is an object in user B's environment, or when user A cannot pick up the target object, user A can describe the target object to user B in detail as follows:
In one example, when the first video call picture is collected by the camera of the second terminal — that is, the target object is an object in user B's environment — the first terminal and the second terminal synchronously display the first video call picture. User A performs a first input on the target object in the first video call picture on the first terminal; the first terminal generates first data to be processed based on this input and sends it to the first server, where the first data to be processed includes the first video call picture and the position information of the target object in that picture.
In another example, when the first video call picture is collected by the camera of the first terminal — that is, the target object is an object in user A's environment — the first terminal and the second terminal likewise synchronously display the first video call picture. User A may again perform a first input on the target object on the first terminal, and the first terminal generates first data to be processed based on this input and sends it to the first server, where the first data to be processed includes the first video call picture and the position information of the target object in that picture.
In the above example, the first data to be processed may be directly transmitted to the first server by the first terminal. It should be understood that the first data to be processed may also be sent to the first server through other transmission media, so that the first server receives the first data to be processed of the first terminal, which may also achieve the same technical effects, and will not be described herein again.
The position information of the target object specified by the user in the first video call picture may be determined by the user performing a first input on the target object in the first video call picture on the first terminal. For details, see the following:
In one example, the first terminal and the second terminal synchronously display the first video call picture, and when user A designates a target object in it, the target object may be selected by long-pressing the first terminal's screen with a finger. To make user A's selection clear, the first terminal may superimpose a conspicuous color as a cue on the display area of the target object in the first video call picture. The first terminal then generates first data to be processed based on user A's first input and sends it to the first server.
In another example, the target object may be selected by voice input; in yet another example, it may be selected via a screenshot combined with image recognition. In both cases the first terminal likewise superimposes a conspicuous color on the target object's display area as a cue, generates first data to be processed based on user A's first input, and sends it to the first server.
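All of the selection modes above end in the same payload. A minimal sketch of how the first terminal might package the first data to be processed follows; the class, field, and function names are hypothetical, and `detect_region` stands in for whatever segmentation locates the object around the user's input.

```python
from dataclasses import dataclass

@dataclass
class FirstPendingData:
    frame: list        # first video call picture (2D list of pixels)
    target_box: tuple  # position of the target object: (r0, c0, r1, c1)

def on_first_input(frame, touch_point, detect_region):
    """Build the first data to be processed from user A's first input.

    `detect_region` is a callable that, given the picture and the point the
    user indicated (by long-press, voice, or screenshot recognition),
    returns the target object's bounding box.
    """
    return FirstPendingData(frame=frame,
                            target_box=detect_region(frame, touch_point))
```

The resulting object is what the first terminal would serialize and send to the first server.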
It should be appreciated that for user B, the target object in the first video call frame may be specified in the manner described above to determine the location information of the target object in the first video call frame.
The first server receives the first data to be processed sent by the first terminal, and further processes the first data to be processed through the subsequent step 102, which can be seen in the following expression:
102, processing the first data to be processed to obtain first target data, wherein the first target data comprises a first target picture and a first 3D model of the target object, and the first target picture is a picture after deleting the target object in the first video call picture;
In processing the first data to be processed, the first server can perform three-dimensional (3D) modeling from the image information of the target object in the first video call picture to obtain the first 3D model of the target object, and delete the target object from the first video call picture according to its position information to obtain the first target picture. The first 3D model can be superimposed on the first target picture; user A and user B can still carry out normal video communication through the first target picture, and when the target object is described based on the first 3D model, the users can see and describe it more intuitively, reducing misunderstandings caused by vague descriptions or by indicating the target object in other ways.
In some examples, the processing the first data to be processed to obtain first target data includes:
deleting the target object from the first video call picture according to the position information of the target object in the first video call picture to obtain a first intermediate picture;
and performing image restoration on the position of deleting the target object in the first intermediate picture to obtain the first target picture.
In this example, the first server deletes the target object from the first video call picture according to its position information, i.e. performs matting on the target object's position to obtain a first intermediate picture; the first server then uses a neural-network technique to perform image restoration on the deleted position in the first intermediate picture, reconstructing the background at that position so as to present a realistic environment. In addition, to ensure that the target object attracts the user's attention, the first server may add a special effect in the first target picture, alternating the target object between display and concealment and applying a blinking effect while it is visible. This attracts the user's attention more effectively and helps the user quickly locate the target object during the video call.
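The matting-and-restoration step could be approximated as below. This is a crude stand-in that fills the deleted region with the mean of the ring of pixels immediately surrounding it; a real system would use the neural-network restoration the text describes. All names are hypothetical.

```python
def inpaint_deleted_region(frame, box):
    """Delete the target object's region and repair it with the mean of the
    one-pixel border ring just outside the box (a toy inpainting stand-in)."""
    r0, c0, r1, c1 = box
    border = []
    for r, row in enumerate(frame):
        for c, px in enumerate(row):
            inside = r0 <= r < r1 and c0 <= c < c1
            # One-pixel ring around the box, excluding the box itself.
            near = (r0 - 1 <= r <= r1 and c0 - 1 <= c <= c1) and not inside
            if near:
                border.append(px)
    fill = sum(border) // len(border) if border else 0
    return [
        [fill if (r0 <= r < r1 and c0 <= c < c1) else px
         for c, px in enumerate(row)]
        for r, row in enumerate(frame)
    ]
```

Production inpainting (e.g. learned background reconstruction) would blend texture rather than a flat fill, but the input/output contract is the same: picture plus box in, first target picture out.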
Step 103, the first target data is sent to the first terminal and the second terminal, so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model.
In an example, the first server may send the first target data to the first terminal to cause the first terminal to update the first video call screen according to the first target screen and the first 3D model.
In another example, the first server may send the first target data to the first terminal and the second terminal, respectively, so that the first terminal and the second terminal update the displayed first video call screen, respectively.
Updating the displayed first video call picture includes updating it to the first target picture and superimposing the first 3D model of the target object on the first target picture.
The first server sends the first target data to the first terminal and the second terminal respectively, so that both terminals can synchronously present the first target picture and the first 3D model of the target object, with the first 3D model superimposed on the first target picture. The user can still carry out normal video communication through the first target picture and, when describing the target object based on the first 3D model, can do so more intuitively, reducing the repeated confirmation and communication otherwise required by unclear descriptions or by indicating the target object in other ways, and improving the communication efficiency of both parties to the video call. The method is applicable to a variety of video call scenarios, especially those in which a target object needs to be accurately described or operated on, such as teaching, work meetings, and remote technical support.
In the embodiment of the application, the first server receives first data to be processed during the video call between the first terminal and the second terminal, where the first data to be processed includes a first video call picture and a target object specified by a user in that picture. The first server processes the first data to be processed to obtain first target data comprising a first target picture and a first 3D model, and sends the first target data to the first terminal and the second terminal respectively, so that the two terminals synchronously present the first target picture and the first 3D model of the target object, with the first 3D model superimposed on the first target picture. The user can thus continue normal video communication through the first target picture and, when describing the target object based on the first 3D model, can do so more intuitively, reducing the repeated confirmation and communication otherwise required by unclear descriptions or by indicating the target object in other ways, and improving the communication efficiency of both parties to the video call. The method is applicable to a variety of video call scenarios, especially those in which a target object needs to be accurately described or operated on, such as teaching, work meetings, and remote technical support.
In some optional embodiments, after the sending of the first target data to the first terminal and the second terminal so that the two terminals update the first video call picture according to the first target picture and the first 3D model, the method further includes:
in a case that a first request of the first terminal or the second terminal is received, updating the first 3D model in response to the first request to obtain a second 3D model, where the first request is used to request updating at least one of the following parameters of the first 3D model: display position, model posture, and model size;
And sending the second 3D model to the first terminal and the second terminal so that the first terminal and the second terminal update the displayed first 3D model respectively.
In an example, user A uses the first terminal and user B uses the second terminal, and while the two users conduct a video call through their respective terminals, the first terminal and the second terminal synchronously display the first target picture and the first 3D model of the target object. When user A wants to introduce the displayed first 3D model in a more detailed and comprehensive manner, user A may perform a second input on the displayed first 3D model at the first terminal; the first terminal generates a first request based on this second input and sends it to the first server, the first request being used to request updating at least one of the display position, model posture, and model size of the first 3D model. On receiving the first request from the first terminal, the first server responds to it by updating the first 3D model to obtain a second 3D model.
Illustratively, the user a may make a second input to the displayed first 3D model on the first terminal, which may be described as follows:
The first terminal and the second terminal synchronously display the first target picture and the first 3D model superimposed on it. When user A wants to describe the superimposed first 3D model in more detail, user A can long-press or double-tap the first 3D model on the first terminal to select it; with the model selected, a three-finger slide adjusts its display position (i.e., moves the first 3D model), a two-finger slide adjusts its model posture (i.e., rotates the first 3D model), a two-finger pinch inward shrinks the first 3D model, and a two-finger pinch outward enlarges it. In this way user A completes the second input on the first terminal, which generates a first request based on the second input and sends it to the first server. On receiving the first request, the first server parses and responds to it, updating the first 3D model to obtain a second 3D model. The first terminal and the second terminal then synchronously display the second 3D model in real time, realizing gesture-based interaction with the three-dimensional model (such as movement, rotation, and scaling), enriching the displayed content of the video call process and further improving the communication efficiency of both parties.
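The gesture-to-parameter mapping described above might look like the following sketch; the gesture names, model fields, and amount conventions are hypothetical illustrations, not the claimed implementation.

```python
def apply_gesture(model, gesture, amount):
    """Update the first 3D model's display position, posture, or size
    in response to user A's second input."""
    updated = dict(model)
    if gesture == "three_finger_slide":      # move the model
        dx, dy = amount                      # on-screen displacement
        x, y = model["position"]
        updated["position"] = (x + dx, y + dy)
    elif gesture == "two_finger_slide":      # rotate the model
        updated["rotation"] = (model["rotation"] + amount) % 360
    elif gesture == "pinch":                 # amount < 1 shrinks, > 1 enlarges
        updated["scale"] = model["scale"] * amount
    return updated
```

The server would apply this update on receiving the first request and broadcast the resulting second 3D model to both terminals.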
In other optional embodiments, in the case where the first video call picture is generated from pictures captured by the camera of the second terminal during the video call between the first terminal and the second terminal, the method further includes, after the sending of the first target data to the first terminal and the second terminal so that the two terminals update the first video call picture according to the first target picture and the first 3D model:
receiving second data to be processed, where the second data to be processed includes a second video call picture and the pose change amount of the second terminal's camera, the second video call picture being generated from the second terminal's camera-collected picture after the pose of that camera changed;
processing the second data to be processed and the first 3D model to obtain second target data, where the second target data includes a second target picture and a third 3D model of the target object, the second target picture being the picture after the target object is deleted from the second video call picture, and the third 3D model being obtained by updating the pose of the first 3D model based on the pose change amount;
and sending the second target data to the first terminal and the second terminal so that the two terminals update the displayed video call picture according to the second target picture and the third 3D model.
The first server may send the second target data to the first terminal and the second terminal, so that the first terminal and the second terminal update the displayed video call frames, respectively, where updating the displayed video call frames includes updating the first target frame to the second target frame and updating the first 3D model to the third 3D model.
In this embodiment, as shown in fig. 2, the description takes as its example a first video call picture generated from the picture collected by the camera of the second terminal during a video call between the first terminal and the second terminal. It should be understood that the same technical effect can be achieved when the first video call picture is generated from the picture collected by the camera of the first terminal.
The first terminal may be a terminal device used by user a and the second terminal may be a terminal device used by user B. The camera of the first terminal can collect the picture of the environment where the user A is located and send the picture to the second terminal through the first terminal so as to present the picture of the environment where the user A is located to the user B on the screen of the second terminal, and likewise, the camera of the second terminal can collect the picture of the environment where the user B is located and send the picture to the first terminal through the second terminal so as to present the picture of the environment where the user B is located to the user A on the screen of the first terminal, thereby realizing real-time video call between the user A and the user B. The method comprises the steps of enabling a user A to conduct first input on a first terminal on a target object in a first video call picture, enabling the first terminal to generate first data to be processed based on the first input of the user A and send the first data to be processed to a first server, enabling the first server to process the first data to be processed to obtain first target data, enabling the first server to send the first target data to the first terminal to display the first target picture in the first terminal and to display a first 3D model of the target object in the first target picture in an overlapping mode, and enabling the first target data to be synchronized to a second terminal through the first terminal to display the first target picture in the second terminal and to display the first 3D model of the target object in the first target picture in an overlapping mode.
Further, as shown in fig. 3, the second server (i.e., the server corresponding to the second terminal) records the pose of the second terminal's camera in real time. When the pose of the second terminal's camera changes — for example, user B moves the second terminal, and the camera pose changes accordingly — the first server receives second data to be processed sent by the second terminal. The second data to be processed includes a second video call picture and the pose change amount of the second terminal's camera; the second video call picture is the picture collected by the second terminal's camera (i.e., the picture of user B's environment) after the camera's pose has changed, and the second terminal synchronizes the collected picture to the first terminal.
The first terminal sends the picture collected by the camera of the second terminal (namely, the second video call picture: the picture of the environment where user B is located, collected after the pose of the camera of the second terminal changed) to the first server. The first server then performs picture update processing on the second video call picture to obtain a second target picture, and performs model update processing on the first 3D model to obtain a third 3D model; together these constitute the second target data. The second target picture is the picture after the target object in the second video call picture is deleted, and the third 3D model is a 3D model obtained by updating the pose of the first 3D model based on the pose change amount.
The first server sends the second target data to the first terminal and the second terminal respectively, so that both terminals update the presented first target picture to the second target picture, update the presented first 3D model to the third 3D model, and display the third 3D model superimposed on the second target picture, thereby updating the displayed video call picture. The spatial position of the target object is tracked in real time using Simultaneous Localization and Mapping (SLAM) technology and adjusted according to the position and pose of the mobile phone, which improves the communication efficiency of the two parties of the video call. The method can be applied widely to video call scenarios, especially where the target object needs to be accurately described or operated on, such as teaching, working conferences, and remote technical support.
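The pose update of the first 3D model can be illustrated with a small sketch. The idea is that when the camera moves, the model's camera-space pose must be transformed by the inverse of the camera's pose change amount so that the model stays fixed in world space. Restricting rotation to yaw about the z-axis is a simplifying assumption of this sketch, not a limitation of the patent:

```python
import math

def rotate_z(p, yaw):
    # Rotate a 3D point about the z-axis by `yaw` radians.
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

def update_model_pose(model_cam_pos, cam_delta_pos, cam_delta_yaw):
    """Re-express the model's camera-space position after the camera moved.

    `cam_delta_pos` and `cam_delta_yaw` play the role of the pose change
    amount; applying the inverse camera motion to the model keeps the model
    anchored at the same spot in the world.
    """
    # Undo the camera translation, then undo the camera rotation.
    x, y, z = model_cam_pos
    dx, dy, dz = cam_delta_pos
    translated = (x - dx, y - dy, z - dz)
    return rotate_z(translated, -cam_delta_yaw)
```

For example, if the camera moves one unit toward the model, the model appears one unit closer in camera space while its world position is unchanged.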
In the case where the first video call picture is collected by the camera of the second terminal, that is, the target object is an object in the environment where user B is located, the first target data further includes a position identifier and a distance identifier. Updating the first video call picture then specifically includes: updating the displayed first video call picture to the first target picture, displaying the first 3D model of the target object and the distance identifier superimposed at a first position of the first target picture, and displaying the position identifier superimposed at a second position of the first target picture. The position identifier indicates the display position of the target object in the first video call picture, and the distance identifier indicates the distance between the target object and the second terminal.
The spatial position of the target object is tracked in real time using SLAM techniques. The first server continuously calculates the distance between the position of the camera of the second terminal and the position of the target object, and displays it in real time through the distance identifier; meanwhile, a dynamic arrow appears on the screen to indicate the approximate direction of the target object, that is, the display position of the target object in the first video call picture. To keep the 3D model stable in the video, the first server also updates the angle and position of the camera of the second terminal in real time and repositions the 3D model so that its spatial position does not change. Therefore, even when the target object is outside the camera range of the second terminal, it can still be tracked quickly through the position identifier and the distance identifier.
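The distance identifier and the guiding arrow can be sketched as two small geometric computations. The assumption here is that SLAM has already expressed the camera and the target object in a common world coordinate frame; the function names are illustrative:

```python
import math

def distance_identifier(cam_pos, target_pos):
    """Euclidean distance between the second terminal's camera and the
    target object, displayed to the user via the distance identifier.
    Positions are (x, y, z) in the shared SLAM world frame."""
    return math.dist(cam_pos, target_pos)

def direction_arrow(cam_pos, target_pos):
    """Unit vector from the camera toward the target object, used to
    orient the on-screen arrow when the target is outside the view."""
    d = distance_identifier(cam_pos, target_pos)
    return tuple((t - c) / d for c, t in zip(cam_pos, target_pos))
```

The server would re-evaluate both functions every time the camera pose is updated, so the identifiers track the target in real time.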
The position identifier includes at least one of a static identifier and a dynamic identifier. A dynamic identifier, such as a flashing special effect, is preferred for indicating the target object, since it attracts the user's attention more effectively and helps the user quickly locate the target object.
In some optional embodiments, the processing the first data to be processed to obtain first target data includes:
acquiring a reference image, wherein the reference image is an image of the target object in the first video call picture;
Generating a target image based on the reference image, wherein the target image is an image obtained after updating the target object from a first visual angle to a second visual angle, and the first visual angle is a shooting visual angle of the target object in the first video call picture;
taking the reference image as a benchmark and the first viewing angle and the target image as constraint conditions, optimizing the acquired initial neural radiance field model to obtain a target neural radiance field model;
inputting the reference image into the target neural radiance field model for identification to obtain texture information of the reference image;
and performing point cloud mapping based on the texture information to obtain the first 3D model.
In this embodiment, the image corresponding to the target object that user A designated in the first video call picture on the first terminal is first taken as a reference image, and a 2D diffusion model is used to generate a target image at a new viewing angle (i.e., the second viewing angle) to enhance 3D perception.
Next, the acquired initial neural radiance field (NeRF) model is optimized. A NeRF model maps 3D coordinates and viewing directions to color and density. The NeRF model is optimized with the reference image as a benchmark and the first viewing angle and the target image as constraint conditions; by rendering views that match the reference image, consistency is ensured when the scene is observed from the second viewing angle. During optimization, the second-view image generated by the diffusion model is introduced as prior information to help the NeRF model infer the plausible 3D structure.
After the target neural radiance field model is obtained, it is converted into a point cloud representation comprising the position, color, and density of each point. The reference image is input into the target neural radiance field model for identification, and the texture information in the reference image is used to texture-map the point cloud, adding detail and realism. The point cloud is then further optimized to ensure that its visual effect at the new viewing angle is consistent with the diffusion prior, and refined to add extra detail, so that the resulting first 3D model exhibits high-quality texture and geometric detail at different viewing angles.
Finally, the generated first 3D model is post-processed to remove noise and apply smoothing, improving visual quality.
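The five-step generation flow above can be summarized as a pipeline skeleton. Every function body below is a stub standing in for a real component (a 2D diffusion model, a NeRF implementation, a point-cloud mapper); only the composition of the steps mirrors the method described here, and all names are illustrative:

```python
def generate_target_image(reference_image, second_view_angle):
    """Step 2: diffusion-based novel-view synthesis (stubbed)."""
    return {"view": second_view_angle, "source": reference_image}

def optimize_nerf(reference_image, first_view_angle, target_image):
    """Step 3: optimize the initial NeRF with the reference image as the
    benchmark and (first view, target image) as constraints (stubbed)."""
    return {"constraints": (first_view_angle, target_image["view"])}

def extract_texture(nerf_model, reference_image):
    """Step 4: identify the reference image with the optimized NeRF to
    obtain its texture information (stubbed)."""
    return {"texture_of": reference_image}

def point_cloud_mapping(texture_info):
    """Step 5: texture-map the point cloud into the first 3D model (stubbed)."""
    return {"model": "first_3d_model", "texture": texture_info}

def build_first_3d_model(reference_image, first_view, second_view):
    # Compose steps 2..5; step 1 (acquiring the reference image) is the input.
    target = generate_target_image(reference_image, second_view)
    nerf = optimize_nerf(reference_image, first_view, target)
    texture = extract_texture(nerf, reference_image)
    return point_cloud_mapping(texture)
```

In a real system the post-processing (denoising, smoothing) would be a final stage after `point_cloud_mapping`.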
As shown in fig. 4, an embodiment of the present application further provides a video interaction device, where the video interaction device 400 includes:
A first receiving module 401, configured to receive first data to be processed, where the first data to be processed includes a first video call picture and a target object specified by a user in the first video call picture, and the first video call picture is generated during a video call between a first terminal and a second terminal;
A first processing module 402, configured to process the first data to be processed to obtain first target data, where the first target data includes a first target picture and a first 3D model of the target object, and the first target picture is a picture after deleting the target object in the first video call picture;
And a first sending module 403, configured to send the first target data to the first terminal and the second terminal, so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model.
Optionally, the video interaction device 400 further includes:
The updating module is used for, in a case where a first request from the first terminal or the second terminal is received, updating the first 3D model in response to the first request to obtain a second 3D model, wherein the first request is used for requesting an update of at least one of the following parameters of the first 3D model: a display position, a model pose, and a model size;
and the second sending module is used for sending the second 3D model to the first terminal and the second terminal so that the first terminal and the second terminal update the displayed first 3D model.
Optionally, the video interaction device 400 further includes:
The second receiving module is used for receiving second data to be processed, wherein the second data to be processed includes a second video call picture and the pose change amount of the camera of the second terminal, and the second video call picture is generated from the picture collected by the camera of the second terminal after the pose of the camera changes;
The second processing module is configured to process the second data to be processed and the first 3D model to obtain second target data, wherein the second target data includes a second target picture and a third 3D model of the target object, the second target picture is the picture after the target object in the second video call picture is deleted, and the third 3D model is a 3D model obtained by updating the pose of the first 3D model based on the pose change amount;
And the third sending module is used for sending the second target data to the first terminal and the second terminal so that the first terminal and the second terminal update the displayed video call picture according to the second target picture and the third 3D model.
Optionally, the first processing module 402 includes:
the deleting sub-module is used for deleting the target object from the first video call picture according to the position information of the target object in the first video call picture to obtain a first intermediate picture;
And the restoration sub-module is used for carrying out image restoration on the position of deleting the target object in the first intermediate picture to obtain the first target picture.
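The deletion and restoration steps handled by these two sub-modules can be sketched on a toy image. The row-wise nearest-neighbor fill below is a deliberately crude stand-in for a real image-restoration (inpainting) model, and the frame is modeled as a small grid of integers; both are assumptions of this sketch:

```python
def delete_and_inpaint(picture, box):
    """Delete the target object's bounding box from a frame and repair the
    hole from surrounding pixels.

    `picture` is a list of rows of ints; `box` is (x0, y0, x1, y1) with
    x1/y1 exclusive, standing in for the target object's position info.
    """
    x0, y0, x1, y1 = box
    out = [row[:] for row in picture]  # leave the input frame untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Prefer the nearest pixel left of the hole, else the right one.
            if x0 > 0:
                out[y][x] = out[y][x0 - 1]
            elif x1 < len(out[y]):
                out[y][x] = picture[y][x1]
    return out
```

A production system would instead run a learned inpainting network over the masked region; the interface (frame in, repaired first target picture out) is the same.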
Optionally, the first processing module 402 includes:
The first acquisition sub-module is used for acquiring a reference image, wherein the reference image is an image of the target object in the first video call picture;
An updating sub-module, configured to generate a target image based on the reference image, where the target image is an image obtained after updating the target object from a first viewing angle to a second viewing angle, where the first viewing angle is a shooting viewing angle of the target object in the first video call picture;
The second acquisition sub-module is used for optimizing the acquired initial neural radiance field model, taking the reference image as a benchmark and the first viewing angle and the target image as constraint conditions, to obtain a target neural radiance field model;
the identification sub-module is used for inputting the reference image into the target neural radiance field model for identification to obtain texture information of the reference image;
and the mapping sub-module is used for carrying out point cloud mapping based on the texture information to obtain the first 3D model.
Optionally, the first target data further includes a position identifier and a distance identifier, and updating the first video call picture specifically includes:
Updating the displayed first video call picture into the first target picture, displaying a first 3D model of the target object and the distance identifier in a superposition manner at a first position of the first target picture, and displaying the position identifier in a superposition manner at a second position of the first target picture, wherein the position identifier is used for indicating the display position of the target object in the first video call picture, and the distance identifier is used for indicating the distance between the target object and the second terminal.
It should be noted that the video interaction device 400 provided in this embodiment of the present application can implement all technical processes of the video interaction method shown in the embodiment of fig. 1 and achieve the same technical effects; to avoid repetition, a detailed description is omitted here.
An embodiment of the present application further provides an electronic device, including a processor, a memory, and a program stored in the memory and executable on the processor. When executed by the processor, the program implements the processes of the method embodiment shown in fig. 1 and achieves the same technical effects; to avoid repetition, details are not described here again.
Specifically, referring to fig. 5, an embodiment of the present application further provides an electronic device, including a bus 501, a transceiver 502, an antenna 503, a bus interface 504, a processor 505, and a memory 506.
In this embodiment, the electronic device further comprises a computer program stored on the memory 506 and executable on the processor 505. The computer program, when executed by the processor 505, may implement the respective processes of the video interaction method as shown in the embodiment of fig. 1, and achieve the same technical effects, and for avoiding repetition, will not be described herein.
Fig. 5 shows a bus architecture (represented by bus 501). The bus 501 may include any number of interconnected buses and bridges, linking together various circuits, including one or more processors represented by processor 505 and memory represented by memory 506. The bus 501 may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore not described further here. Bus interface 504 provides an interface between bus 501 and transceiver 502. The transceiver 502 may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 505 is transmitted over a wireless medium via the antenna 503; the antenna 503 also receives data and transmits it to the processor 505.
The processor 505 is responsible for managing the bus 501 and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 506 may be used to store data used by processor 505 in performing operations.
Alternatively, the processor 505 may be a CPU, an ASIC, an FPGA, or a CPLD.
An embodiment of the present application further provides a computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the processes of the video interaction method embodiment described above and achieves the same technical effects; to avoid repetition, details are not described here again. The computer readable storage medium is, for example, a ROM, a RAM, a magnetic disk, or an optical disk.
The embodiment of the present application further provides a computer program product, which includes computer instructions, where the computer instructions, when executed by a processor, implement each process of the embodiment of the method shown in fig. 1 and achieve the same technical effects, and in order to avoid repetition, are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, and of course may also be implemented by hardware, although in many cases the former is preferred. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.
Claims (10)
1. A method of video interaction, the method comprising:
Receiving first data to be processed, wherein the first data to be processed comprises a first video call picture and a target object appointed by a user in the first video call picture, and the first video call picture is generated in the video call process of a first terminal and a second terminal;
Processing the first data to be processed to obtain first target data, wherein the first target data comprises a first target picture and a first 3D model of the target object, and the first target picture is a picture after deleting the target object in the first video call picture;
And sending the first target data to the first terminal and the second terminal so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model.
2. The method of claim 1, wherein after the sending of the first target data to the first terminal and the second terminal so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model, the method further comprises:
in a case where a first request from the first terminal or the second terminal is received, updating the first 3D model in response to the first request to obtain a second 3D model, wherein the first request is used for requesting an update of at least one of the following parameters of the first 3D model: a display position, a model pose, and a model size;
And sending the second 3D model to the first terminal and the second terminal so that the first terminal and the second terminal update the displayed first 3D model.
3. The method according to claim 1, wherein, in a case where the first video call picture is generated from pictures collected by a camera of the second terminal during a video call between the first terminal and the second terminal, after the sending of the first target data to the first terminal and the second terminal so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model, the method further comprises:
Receiving second data to be processed, wherein the second data to be processed includes a second video call picture and a pose change amount of the camera of the second terminal, and the second video call picture is generated from the picture collected by the camera of the second terminal after the pose of the camera changes;
Processing the second data to be processed and the first 3D model to obtain second target data, wherein the second target data includes a second target picture and a third 3D model of the target object, the second target picture is the picture after the target object in the second video call picture is deleted, and the third 3D model is a 3D model obtained by updating the pose of the first 3D model based on the pose change amount;
And sending the second target data to the first terminal and the second terminal so that the first terminal and the second terminal update the displayed video call picture according to the second target picture and the third 3D model.
4. The method of claim 1, wherein processing the first data to be processed to obtain first target data comprises:
deleting the target object from the first video call picture according to the position information of the target object in the first video call picture to obtain a first intermediate picture;
and performing image restoration on the position of deleting the target object in the first intermediate picture to obtain the first target picture.
5. The method of claim 1, wherein processing the first data to be processed to obtain first target data comprises:
acquiring a reference image, wherein the reference image is an image of the target object in the first video call picture;
Generating a target image based on the reference image, wherein the target image is an image obtained after updating the target object from a first visual angle to a second visual angle, and the first visual angle is a shooting visual angle of the target object in the first video call picture;
taking the reference image as a benchmark and the first viewing angle and the target image as constraint conditions, optimizing the acquired initial neural radiance field model to obtain a target neural radiance field model;
inputting the reference image into the target neural radiance field model for identification to obtain texture information of the reference image;
and performing point cloud mapping based on the texture information to obtain the first 3D model.
6. The method of claim 1, wherein the first target data further includes a position identifier and a distance identifier, and updating the first video call picture specifically includes:
Updating the displayed first video call picture into the first target picture, displaying a first 3D model of the target object and the distance identifier in a superposition manner at a first position of the first target picture, and displaying the position identifier in a superposition manner at a second position of the first target picture, wherein the position identifier is used for indicating the display position of the target object in the first video call picture, and the distance identifier is used for indicating the distance between the target object and the second terminal.
7. A video interaction device, comprising:
The first receiving module is used for receiving first data to be processed, wherein the first data to be processed comprises a first video call picture and a target object appointed by a user in the first video call picture, and the first video call picture is generated in the video call process of a first terminal and a second terminal;
The first processing module is used for processing the first data to be processed to obtain first target data, wherein the first target data comprises a first target picture and a first 3D model of the target object, and the first target picture is a picture after deleting the target object in the first video call picture;
And the first sending module is used for sending the first target data to the first terminal and the second terminal so that the first terminal and the second terminal update the first video call picture according to the first target picture and the first 3D model.
8. An electronic device comprising a processor, a memory and a program stored on the memory and executable on the processor, the program when executed by the processor implementing the steps of the video interaction method of any one of claims 1 to 6.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the video interaction method according to any of claims 1 to 6.
10. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the video interaction method of any one of claims 1 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411269348.0A CN119182875A (en) | 2024-09-11 | 2024-09-11 | Video interaction method and device and related equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119182875A true CN119182875A (en) | 2024-12-24 |