Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer readable media in the form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
In one implementation of the present application, a user device for generating augmented reality video information of a user scene is provided; in another implementation of the application, a network device for generating augmented reality video information of a user scene is also provided; further, in an implementation of the present application, a system for generating augmented reality video information of a user scene is also provided, where the system includes the one or more user devices and the network device. The user device may include, but is not limited to, various mobile devices such as a smartphone, a tablet, and a smart wearable device. In one implementation, the user equipment includes an acquisition module, such as a camera for image or video acquisition, and may further include a microphone for audio acquisition. The network device may include, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server, where the cloud server is a virtual supercomputer running in a distributed system and composed of a group of loosely coupled computers, and is used to provide a simple, efficient, secure, reliable computing service with scalable processing power. In the present application, the user device may be referred to as user equipment 1, and the network device may be referred to as network device 2 (refer to FIG. 1).
FIG. 1 illustrates a system diagram for generating augmented reality video information of a user scene in accordance with an aspect of the present application. The system comprises a user equipment 1 and a network device 2. The user equipment 1 comprises a video key frame sending device 11, a scene object related information obtaining device 12, an image calibration recognition device 13, and a synthesizing device 14; the network device 2 comprises a video key frame obtaining device 21, an image matching identification device 22, and a scene object related information sending device 23.
The video key frame sending device 11 may send the video key frame of the first video stream corresponding to the user scene to the corresponding network device 2; correspondingly, the video key frame obtaining device 21 may obtain a video key frame corresponding to a user scene of the user equipment 1; then, the image matching identification device 22 may perform image matching identification on the video key frame to determine the scene object related information corresponding to the video key frame; then, the scene object related information sending means 23 may send the scene object related information to the user equipment 1; correspondingly, the scene object related information obtaining device 12 may obtain the scene object related information corresponding to the video key frame, which is determined by the network device 2 based on the image matching identification; then, the image calibration recognition device 13 may perform image calibration recognition on the target frame of the second video stream acquired by the user equipment 1 based on the scene object related information; then, the synthesizing device 14 may synthesize the corresponding virtual object and the second video stream into the augmented reality video information based on the result of the image calibration recognition.
In the present application, the generated augmented reality video information of the user scene may be applied to the scene video presentation of a single user, such as a single-user video recording mode, and may also be shared by each user with other users when multiple users interact, such as a multi-user video chat mode, so that the other users can see the augmented reality video information of that user's scene. In addition, any other mode to which the augmented reality video information of the user scene can be applied may serve as an application scene of the present application and is included in the protection scope of the present application.
Specifically, the video key frame sending device 11 may send the video key frame of the first video stream corresponding to the user scene to the corresponding network device 2. Then, correspondingly, the video key frame obtaining device 21 may obtain a video key frame corresponding to the user scene of the user device 1.
In one implementation, the user equipment 1 further includes a capturing device (not shown) configured to capture a first video stream corresponding to a user scene. Here, the capturing device is used for collecting video information, namely the video stream, of the corresponding user during video recording or interaction with other users. In this application, the first video stream may be a video stream captured at any time. In one implementation, the first video stream of the user scene may be captured by various types of cameras, or a combination of cameras, on the user device 1. Here, the first video stream corresponds to a plurality of consecutive frames, each frame corresponds to corresponding image information, and each object in the image information is a scene object in the user scene. In one implementation, the user equipment 1 may acquire, in real time, the first video stream corresponding to the scene objects.
The user equipment 1 then further comprises a video key frame determination device (not shown), which may determine video key frames from the first video stream. Here, a video key frame may be one or more frames in the first video stream, and the criteria for identifying a video key frame may be customized based on different scene needs. In one implementation, when the image information of a frame of the first video stream changes greatly compared with that of a previous frame, for example, a scene object is added or removed, or a scene object moves enough to reach a preset image information change threshold, the frame is determined to be a video key frame. Then, the video key frame sending device 11 may send the video key frame corresponding to the scene objects to the corresponding network device 2, so that image matching identification can be performed on the video key frame in the network device 2; the image matching identification is used to effectively determine the core information for identifying the scene objects, such as attribute information, position information, and surface information of the scene objects. In contrast, a frame whose image information does not change much compared to the previous frame may be determined as a non-video key frame that does not need to be uploaded; in actual operation, such a non-video key frame may simply be ignored, or it may be recognized on the user equipment 1 through image calibration recognition. In the application, only a small number of video key frames need to be transmitted between the user equipment 1 and the corresponding network device 2, so the transmission data volume is small, the network delay is low, the burden on data communication is light, and the user experience is not affected; meanwhile, the strong computing and storage capacity of the network device 2 effectively compensates for the inability of the user equipment 1 to perform a large amount of complex image recognition.
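By way of illustration only, the following sketch shows one possible customization of the key-frame criterion described above, assuming frames are available as numpy arrays; the differencing metric and the threshold value are illustrative assumptions rather than values specified by the present application.

```python
import numpy as np

# Illustrative threshold: fraction of mean absolute pixel change above which
# a frame is treated as having "changed greatly" relative to the last key frame.
CHANGE_THRESHOLD = 0.12  # hypothetical value, tuned per scene

def is_video_key_frame(frame: np.ndarray, last_key_frame: np.ndarray) -> bool:
    """Decide whether `frame` should become a new video key frame.

    Both arguments are H x W x C uint8 images from the same video stream.
    The rule below (normalized mean absolute difference) is only one possible
    realization of the "image information change" criterion in the text.
    """
    diff = np.abs(frame.astype(np.int16) - last_key_frame.astype(np.int16))
    change_ratio = diff.mean() / 255.0
    return change_ratio >= CHANGE_THRESHOLD

def select_key_frames(frames):
    """Yield (index, frame) pairs for frames selected as video key frames."""
    last_key = None
    for i, frame in enumerate(frames):
        if last_key is None or is_video_key_frame(frame, last_key):
            last_key = frame
            yield i, frame
        # Non-key frames are not uploaded; they may be ignored or handled
        # locally by image calibration recognition, as described above.
```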
In one implementation, an information transmission channel may be established between the network device 2 and one or more user devices, and between multiple user devices that interact with each other through video. The information transmission channel may include a signaling channel and a data channel, where the signaling channel is responsible for transmitting small-volume contents such as control instructions, and the data channel is responsible for transmitting large-volume contents such as video key frames, video streams, and virtual object sets.
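As a non-limiting sketch of how the two channels might be distinguished in code, the following separates small control messages from bulky payloads; the message fields and kind names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class SignalingMessage:
    """Small control messages, e.g. session setup or a key-frame notification."""
    kind: str                               # e.g. "start_session", "keyframe_ready"
    params: Dict[str, Any] = field(default_factory=dict)

@dataclass
class DataMessage:
    """Bulky payloads: encoded key frames, video streams, virtual object sets."""
    kind: str                               # e.g. "video_key_frame", "virtual_object_set"
    payload: bytes = b""

def route(message) -> str:
    """Pick the channel for a message; bulky content goes over the data channel."""
    return "data_channel" if isinstance(message, DataMessage) else "signaling_channel"
```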
In one implementation, the user equipment 1 may acquire a video stream corresponding to the scene object in real time. Further, there may be video key frames in each video stream. For example, one or more key frames may be present in both the first video stream and the subsequent second video stream. Furthermore, in one implementation, the video key frame may be determined in real time, and the video key frame may be set to be sent to the corresponding network device 2. For example, the determination and uploading of video key frames in the first video stream may be performed as described above; in another example, the determination and uploading of the video key frame may also be performed on the subsequent second video stream.
Then, the image matching identification device 22 may perform image matching identification on the video key frame to determine the scene object related information corresponding to the video key frame; then, the scene object related information sending device 23 may send the scene object related information to the user equipment 1; correspondingly, the scene object related information obtaining device 12 may obtain the scene object related information corresponding to the video key frame, which is determined by the network device 2 based on image matching identification. In one implementation, the image matching identification may be performed on the video key frame through a scene object database preset in or callable by the network device 2, or through image recognition models preset in the network device 2 and trained through machine learning on a large amount of data, so as to recognize one or more scene objects in the video key frame and match corresponding scene object related information to those scene objects.
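The following sketch illustrates, under simplifying assumptions, how image matching identification on a key frame might look on the network device 2 side: a toy global descriptor is matched against a preset scene object database. The descriptor, the database contents, and the result fields are hypothetical stand-ins for a real scene object database or trained recognition model.

```python
import numpy as np

def compute_descriptor(image: np.ndarray) -> np.ndarray:
    """Toy global descriptor: a normalized gray-level histogram of the image.
    A real system would use a trained recognition model instead."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    hist, _ = np.histogram(gray, bins=32, range=(0, 255))
    return hist / max(hist.sum(), 1)

# Hypothetical preset scene object database: object name -> reference descriptor.
SCENE_OBJECT_DB = {}

def register_reference(name: str, reference_image: np.ndarray) -> None:
    SCENE_OBJECT_DB[name] = compute_descriptor(reference_image)

def match_scene_objects(key_frame: np.ndarray, max_distance: float = 0.5):
    """Return candidate scene objects recognized in a video key frame.

    Each result carries the attribute information (what the object is) plus
    placeholders for position and surface information, which a real matcher
    would also estimate from the key frame.
    """
    desc = compute_descriptor(key_frame)
    results = []
    for name, ref in SCENE_OBJECT_DB.items():
        distance = float(np.abs(desc - ref).sum())
        if distance <= max_distance:
            results.append({
                "attribute": name,
                "position": None,   # e.g. image coordinates of the object
                "surface": None,    # e.g. contour of its upper surface
                "score": 1.0 - distance,
            })
    return sorted(results, key=lambda r: r["score"], reverse=True)
```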
In one implementation, the scene object related information includes at least any one of: 1) attribute information of the scene object; 2) position information of the scene object; 3) surface information of the scene object. For example, a table image in a video key frame needs to be identified as a table object, and the position coordinates of the table in the image as well as the orientation of the table surface, e.g., the orientation of the table top, need to be identified, so that a virtual object can subsequently be placed on the table and interaction can be provided.
Specifically, in one implementation, the attribute information of the scene object may indicate what the scene object is. Here, fuzzy matching may be implemented, for example identifying the scene object as a building, furniture, or plant; further, more accurate matching may also be achieved, for example identifying the scene object as a tower, a table, or a tree. In one implementation, the position information of the scene object may include image position information of the scene object in the video key frame, such as coordinate information, for example the contour coordinates of a tower or the position coordinates of a table. In one implementation, the surface information of the scene object may include surface contour information of the object, where the surface contour of the scene object to be identified may be specified; for example, the upper surface of a table needs to be identified so that a virtual object can subsequently be added on the table top, and in that case the identified surface information mainly includes the information of the table's upper surface.
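For illustration, the three kinds of scene object related information could be carried in a structure such as the following; the field types and the example values for a recognized table are assumptions chosen for clarity.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneObjectInfo:
    """Scene object related information returned with a video key frame."""
    attribute: str                          # what the object is, e.g. "table"
    position: Tuple[int, int, int, int]     # image-space bounding box (x, y, w, h)
    surface: List[Tuple[int, int]]          # contour of the surface of interest,
                                            # e.g. the table's upper surface

# Example: a table recognized in a key frame, with its top-surface outline,
# so that a virtual object can later be placed on the table top.
table_info = SceneObjectInfo(
    attribute="table",
    position=(120, 260, 300, 180),
    surface=[(120, 260), (420, 260), (420, 310), (120, 310)],
)
```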
Here, those skilled in the art should understand that the attribute information, position information, and surface information of the scene object are only examples, and other existing or future forms of scene object related information, as applicable to the present application, should also be included in the protection scope of the present application and are hereby incorporated by reference.
Then, the image calibration recognition device 13 may perform image calibration recognition on the target frame of the second video stream acquired by the user equipment 1, based on the scene object related information. Here, the image calibration recognition is a supplement to the image matching identification of the network device 2: the image matching identification is only performed on video key frames, whereas on the user device 1, during video recording, video chat, or other interaction, the capturing device, such as a camera, collects the video stream in real time, that is, collects multiple consecutive frames in real time, and the picture information of each frame may change compared with the previous frame. Such changes may be slight and may be identified without complex image matching operations; in that case, image calibration recognition may be used in cooperation. Here, based on the scene object related information already identified for the video key frame through image matching identification, such as the attribute information, position information, and surface information of the scene objects, image calibration recognition may be performed on the target frame of the second video stream, which is the new video stream currently acquired by the user equipment 1. The image calibration recognition aims to determine the scene object related information of the target frame, and in particular to identify slight changes in the position information, surface information, and the like of the scene objects, so that, based on the scene object related information of the target frame determined from the recognition result, virtual objects can be overlaid and synthesized, rendering the second video stream with an augmented reality effect. In one implementation, each frame in the second video stream may be set as a target frame, or one or more frames in the second video stream may be set as target frames.
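A minimal sketch of one possible image calibration recognition step is given below, assuming the object's position in the key frame is known from image matching identification: the previously identified region is re-located by a small local search in the target frame. The search radius, step size, and sum-of-absolute-differences cost are illustrative choices, not requirements of the application.

```python
import numpy as np

def calibrate_position(target_frame, key_frame, box, search_radius=16):
    """Re-locate a scene object in a target frame of the second video stream.

    `box` is the object's (x, y, w, h) bounding box known from image matching
    identification on the key frame. Because frame-to-frame changes are usually
    slight, a small local search around the previous position is enough; the
    sum-of-absolute-differences criterion below is only one possible choice.
    """
    x, y, w, h = box
    template = key_frame[y:y + h, x:x + w].astype(np.int32)
    best, best_pos = None, (x, y)
    for dy in range(-search_radius, search_radius + 1, 2):
        for dx in range(-search_radius, search_radius + 1, 2):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or ny + h > target_frame.shape[0] or nx + w > target_frame.shape[1]:
                continue
            patch = target_frame[ny:ny + h, nx:nx + w].astype(np.int32)
            cost = np.abs(patch - template).mean()
            if best is None or cost < best:
                best, best_pos = cost, (nx, ny)
    # Calibrated position information of the scene object in the target frame.
    return (*best_pos, w, h)
```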
Then, the synthesizing device 14 may synthesize the corresponding virtual object and the second video stream into the augmented reality video information based on the result of the image calibration recognition. In one implementation, one or more target frames in the second video stream that have undergone image calibration recognition may each be composited with the corresponding virtual object. For example, the image information of one target frame is superimposed with the image information corresponding to a virtual object or model, thereby synthesizing augmented reality image information corresponding to the image information of that target frame. The augmented reality video information corresponding to the second video stream may include one or more frames of augmented reality image information, for example augmented reality image information corresponding to consecutive frames of the video stream. In one implementation, the image information of the target frame of the second video stream may be replaced with the augmented reality image information. In addition, in one implementation, the virtual object may come from a set of virtual objects acquired from the network device 2 or another third-party device, such as various virtual article images or models; in another implementation, the virtual object may also be extracted from the user equipment 1, for example a picture in a picture application of the user equipment 1, such as a photo in a mobile phone album. Furthermore, in one implementation, the corresponding virtual object may be a single virtual object or a combination of multiple virtual objects; for example, a virtual photo frame determined from a virtual object set may be combined with a photo in the user's mobile phone album to form a framed photo.
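The following sketch illustrates one way the synthesis could be performed per target frame, assuming the virtual object is an RGBA image and the anchor point (e.g. a point on a table's upper surface) comes from image calibration recognition; the alpha blending shown is only one possible compositing method.

```python
import numpy as np

def composite_virtual_object(target_frame, virtual_rgba, anchor_xy):
    """Overlay a virtual object (RGBA image) onto one target frame.

    `anchor_xy` is where the object should appear, e.g. a point on the table's
    upper surface obtained from image calibration recognition. The result is
    one frame of augmented reality image information; repeating this for each
    target frame yields the augmented reality video information.
    """
    frame = target_frame.copy()
    x, y = anchor_xy
    h, w = virtual_rgba.shape[:2]
    h = min(h, frame.shape[0] - y)
    w = min(w, frame.shape[1] - x)
    if h <= 0 or w <= 0:
        return frame
    obj = virtual_rgba[:h, :w]
    alpha = obj[..., 3:4].astype(np.float32) / 255.0
    region = frame[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * obj[..., :3].astype(np.float32) + (1.0 - alpha) * region
    frame[y:y + h, x:x + w] = blended.astype(frame.dtype)
    return frame
```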
Herein, the video key frame corresponding to the scene objects is sent to the corresponding network device 2, and the scene object related information corresponding to the video key frame, such as the attribute information, position information, and surface information of the scene objects, determined by the network device 2 based on image matching identification, is acquired; then, the user device 1 performs image calibration recognition on each target frame in the second video stream currently acquired by the user device 1 in real time, in combination with the scene object related information acquired from the network device 2, and synthesizes the corresponding virtual object and the second video stream into augmented reality video information based on the image calibration recognition result. Here, the approach of combining the image matching identification of the network device 2 with the image calibration recognition of the user device 1 breaks through the limitation in the prior art that only simple face recognition can be realized because of the limited computing power and storage capacity of mobile devices, so that the range of recognizable objects can be effectively expanded to any scene object in the user scene. On one hand, the core information for identifying the scene objects, such as attribute information, position information, and surface information, can be effectively determined by using the computing and storage capacity of the network device 2, which is stronger than that of the user device 1, to perform image matching identification on the video key frame; on the other hand, the user equipment 1 may further perform image calibration recognition, aimed at deviation correction, on the video stream updated in real time in the user equipment 1, such as the target frames of the second video stream, based on the result of the image matching identification of the network device 2, so that scene objects in each frame of image on the current user equipment 1 can be accurately recognized; then, based on the result of the image calibration recognition, the corresponding virtual object is synthesized with the second video stream and rendered as augmented reality video information that can be presented to the user. In the application, because any scene object corresponding to the user equipment 1 can be recognized and synthesized, the augmented reality video information presented by the application offers an obvious visual breakthrough compared with traditional video applications or existing augmented reality video chat applications, and the variability of the augmented reality video information seen by the user is greatly enhanced, thereby increasing the user's interest in interaction and optimizing the user's intelligent video experience.
Meanwhile, only a small amount of video key frames or scene object related information corresponding to the video key frames need to be transmitted between the user equipment 1 and the corresponding network equipment 2, so that the transmission data volume is small, the network delay is small, the burden on data communication is small, and the user experience is not influenced.
In one implementation, the image calibration recognition device 13 includes a first image calibration recognition unit (not shown) and a first determination unit (not shown). The first image calibration recognition unit may perform image calibration recognition on a first target frame of the second video stream acquired by the user equipment 1, based on the scene object related information; the first determination unit may determine the scene object related information corresponding to the first target frame based on the image calibration recognition performed on the first target frame.
In particular, in this implementation, a target frame in the second video stream, such as the first target frame, may undergo image calibration recognition with reference to the scene object related information of a video key frame of the first video stream. First, the image information of the first target frame is compared with the image information of the video key frame to determine the difference between the two, for example by comparing the outlines of the scene objects or comparing the positions of the scene objects; then, based on the known scene object related information of the video key frame, such as the attribute information, position information, and surface information of the scene objects, each specific item of scene object related information corresponding to the first target frame is calculated. For example, when the first target frame is compared with the video key frame and the image position of a scene object, such as a table, has moved, the comparison yields the position offset of the table between the two frames, and the actual position coordinates of the table in the first target frame can be determined by combining this offset with the known position coordinates of the table in the video key frame. In one implementation, any target frame in the second video stream may serve as the first target frame, so that one or more first target frames may be identified based on the scene object related information with reference to a video key frame of the first video stream.
Then, the synthesizing device 14 may synthesize the corresponding virtual object and the first target frame into first augmented reality image information based on the scene object related information corresponding to the first target frame; then, augmented reality video information is generated based on the first augmented reality image information. In an implementation manner, the image information included in the augmented reality video information may consist entirely of augmented reality image information similar or identical to the first augmented reality image information, or may also include some ordinary image information without an augmented reality effect.
Further, in one implementation, the image calibration identification device 13 further includes a second image calibration identification unit (not shown), and a second determination unit (not shown). The second image calibration identification unit may perform image calibration identification on a second target frame of a second video stream acquired by the user equipment 1 based on scene object related information corresponding to the first target frame; next, the second determining unit may determine scene object related information corresponding to the second target frame based on image calibration recognition performed on the second target frame.
In particular, in this implementation, a target frame in the second video stream, such as the second target frame, may undergo image calibration recognition with reference to the scene object related information of the first target frame. In one implementation, the second target frame may be a frame in the second video stream that follows the first target frame in sequence. In this case, the first target frame appears closer in time to the second target frame than the video key frame of the first video stream does, so it is reasonable to expect that the image information of the first target frame is more likely to be similar to the image information of the second target frame.
Further, in an implementation manner, if the user equipment 1 acquires a new video key frame after the video key frame of the first video stream, and the new video key frame appears after the first target frame in sequence, then the image information of the new video key frame is more likely to be close to the image information of the second target frame than that of the first target frame is; in this case, the new video key frame may be preferentially used as the reference for identifying the image information of the second target frame.
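A trivial sketch of this reference selection, under the assumption that each candidate reference carries its frame index, could look as follows; the tuple layout is hypothetical.

```python
def choose_calibration_reference(references):
    """Pick the reference most likely to resemble the next target frame.

    `references` is a list of (frame_index, frame, scene_object_info) tuples
    that may mix video key frames and already-calibrated target frames. The
    simple rule sketched here follows the text: the most recently acquired
    reference (e.g. a new key frame appearing after the first target frame)
    is preferred over older ones.
    """
    return max(references, key=lambda ref: ref[0])
```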
Then, the synthesizing device 14 may synthesize the corresponding virtual object and the second target frame into second augmented reality image information based on the scene object related information corresponding to the second target frame; then, augmented reality video information is generated based on the first augmented reality image information and the second augmented reality image information. In an implementation manner, the image information included in the augmented reality video information may consist entirely of augmented reality image information similar or identical to the first augmented reality image information or the second augmented reality image information, or may also include some ordinary image information without an augmented reality effect.
In one implementation, the user equipment 1 further comprises a presentation device (not shown); the presenting means may present the augmented reality video information corresponding to the second video stream.
Specifically, the user equipment 1 may play the augmented reality video information in real time on the display screen of the corresponding device. For example, while the user equipment 1, such as a mobile phone, is shooting and recording, the application performs augmented reality effect processing on the video stream acquired in real time and presents the corresponding augmented reality video information on the mobile phone in real time; for another example, when the user video chats with another user through the user equipment 1, the user's mobile phone may present a video picture with an augmented reality effect, and the mobile phone of the other user interacting with the user may also view the augmented reality video information.
In one implementation, the user equipment 1 further includes a user interaction device (not shown), which may provide the augmented reality video information to one or more other user equipments corresponding to the user equipment 1. In the application, the user scene video presentation based on augmented reality may be not only a scene video presentation of a single user, such as a single-user video recording mode, but also a user scene video sharing mode in which each user shares its own user scene video with other users during multi-user interaction, such as a multi-user video chat mode. In an implementation manner, the augmented reality video information, for example an augmented reality video stream, may be sent by the user equipment 1 to the corresponding network device, such as the network device 2, and then the network device 2 forwards the augmented reality video information to the corresponding other user equipments. In another implementation, the user equipment 1 and the other user equipments may also directly exchange their respective augmented reality video information without the intermediation of the network device 2.
In one implementation, the user equipment 1 further includes a scene interaction device (not shown), and the scene interaction device can obtain operation instruction information of the user on the virtual object and execute a corresponding operation based on the operation instruction information. For example, a user may control a virtual object in a video recording scene or a video chat scene by touch or voice; for instance, a virtual pet may be placed on a table surface in the real environment, and a user who records the video or participates in the chat may control the virtual pet to perform a series of actions by touching the screen, speaking, and the like. In one implementation, the interaction with the virtual object in the augmented reality video information may be performed by the user corresponding to the user equipment 1; in another implementation, if the user interacts with other users, such as in a multi-user video chat, the interaction with the virtual object may also be carried out by the other users based on the shared augmented reality video information.
Further, in one implementation, the scene interaction device includes at least any one of the following. A first scene interaction unit (not shown) may acquire touch screen operation information of the user and determine the operation instruction information of the user on the virtual object based on the touch screen operation information; for example, if the virtual object is a pet puppy, the user may instruct the puppy in the video to perform a corresponding reaction by tapping a preset region of the screen, such as the region where the puppy is located, and the virtual puppy may, for instance, wag its tail when the user taps the screen; for another example, if the virtual object is a photo set in the user's mobile phone, photos may be switched through a sliding operation on the touch screen. A second scene interaction unit (not shown) may obtain gesture information of the user through the camera device of the user equipment and determine the operation instruction information of the user on the virtual object based on the gesture information; for example, the user's hand motion is captured by a camera, gesture information such as tapping or clicking is extracted from the captured images, and the operation instruction information is then determined based on a preset correspondence between gesture information and operation instruction information. A third scene interaction unit (not shown) may acquire voice information of the user and determine the operation instruction information of the user on the virtual object based on the voice information, where the voice information of the user may be acquired through a microphone built into the user equipment 1, and the operation instruction information is determined based on a preset correspondence between voice information and operation instruction information. Therefore, the interaction experience of the user can be further enriched through the interaction between the user and the virtual object in the augmented reality video information.
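Purely as an illustration of how the three interaction units might map user input to operation instruction information, the following dispatch sketch uses hypothetical instruction names and input encodings.

```python
# Hypothetical correspondences between user input and operation instructions.
TOUCH_INSTRUCTIONS = {
    "tap_on_object": "wag_tail",     # e.g. tapping the virtual puppy's region
    "swipe_left": "next_photo",      # e.g. switching photos in a virtual frame
}
GESTURE_INSTRUCTIONS = {"tap": "wag_tail", "click": "select_object"}
VOICE_INSTRUCTIONS = {"sit": "pet_sit", "jump": "pet_jump"}

def to_operation_instruction(channel: str, event: str):
    """Translate touch / gesture / voice input into an operation instruction."""
    table = {
        "touch": TOUCH_INSTRUCTIONS,
        "gesture": GESTURE_INSTRUCTIONS,
        "voice": VOICE_INSTRUCTIONS,
    }[channel]
    return table.get(event)

def execute(instruction: str) -> None:
    """Apply the instruction to the virtual object (placeholder)."""
    print(f"virtual object performs: {instruction}")

execute(to_operation_instruction("touch", "tap_on_object"))  # -> wag_tail
```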
In one implementation, the user equipment 1 further includes a virtual object set obtaining device (not shown) and a target virtual object determining device (not shown), and the network device 2 further includes a virtual object set sending device (not shown). Specifically, the virtual object set sending device may send a virtual object set matching the scene object related information corresponding to the video key frame to the user equipment 1, and the virtual object set is correspondingly obtained by the virtual object set obtaining device. For example, the network device 2 may screen out a set of virtual objects matching the scene object determined in the video key frame based on the attribute information of that scene object; if the scene object is a tree, a set of virtual objects including various small virtual animals may be screened out according to the needs of the user scene. For another example, filtering parameters such as the size of the virtual object may be set in combination with scene object related information such as the position information and surface information of the scene object. Then, the target virtual object determining device may determine one or more target virtual objects from the virtual object set, and the synthesizing device 14 may synthesize the target virtual objects and the second video stream into augmented reality video information based on the result of the image calibration recognition. Here, matching a corresponding virtual object set for the user equipment 1 can enrich the composite presentation effect of the augmented reality video information and, at the same time, optimize the user's intelligent experience.
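The following sketch illustrates, with a hypothetical catalogue and hypothetical filtering fields, how the network device 2 might screen a virtual object set against a scene object's attribute and a size constraint derived from its position or surface information.

```python
# Hypothetical catalogue of virtual objects on the network device; the
# "fits_on" and "max_size" fields are illustrative filtering parameters.
VIRTUAL_OBJECT_CATALOGUE = [
    {"name": "virtual_squirrel", "fits_on": {"tree"}, "max_size": 80},
    {"name": "virtual_bird", "fits_on": {"tree", "table"}, "max_size": 60},
    {"name": "virtual_photo_frame", "fits_on": {"table"}, "max_size": 120},
]

def match_virtual_object_set(scene_object_attribute: str, surface_size: int):
    """Screen out virtual objects matching a recognized scene object.

    `scene_object_attribute` is the attribute information (e.g. "tree");
    `surface_size` is an illustrative size derived from position/surface
    information and used here to bound the virtual object's size.
    """
    return [
        obj for obj in VIRTUAL_OBJECT_CATALOGUE
        if scene_object_attribute in obj["fits_on"] and obj["max_size"] <= surface_size
    ]

# Example: virtual objects that could be placed on a recognized tree.
print(match_virtual_object_set("tree", surface_size=100))
```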
Fig. 2 shows a flowchart of a method for generating augmented reality video information of a user scene at a user device side and a network device side according to another aspect of the present application. The method includes step S301, step S302, step S303, step S304, step S401, step S402, and step S403.
In step S301, the user equipment 1 may send a video key frame of a first video stream corresponding to a user scene to a corresponding network device 2; correspondingly, in step S401, the network device 2 may obtain a video key frame corresponding to a user scene of the user device 1; next, in step S402, the network device 2 may perform image matching recognition on the video key frame to determine scene object related information corresponding to the video key frame; next, in step S403, the network device 2 may send the scene object related information to the user device 1; correspondingly, in step S302, the user equipment 1 may obtain scene object related information corresponding to the video key frame, which is determined by the network equipment 2 based on image matching identification; next, in step S303, the user equipment 1 may perform image calibration recognition on a target frame of the second video stream acquired by the user equipment 1 based on the scene object related information; next, in step S304, the user equipment 1 may synthesize the corresponding virtual object and the second video stream into augmented reality video information based on the result of the image calibration recognition.
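To summarize the interplay of steps S301 through S403, the following condensed sketch strings the steps together with direct function calls; `select_key_frames`, `network_match`, `calibrate`, and `composite` are placeholders for the key-frame selection, image matching recognition, image calibration recognition, and synthesis steps described in this application, not an actual API.

```python
def generate_ar_video(first_stream, second_stream, virtual_object,
                      select_key_frames, network_match, calibrate, composite):
    """Condensed client-side view of steps S301-S304 and S401-S403."""
    # S301 / S401-S403: upload key frames, receive scene object related info.
    scene_info = None
    for _, key_frame in select_key_frames(first_stream):
        scene_info = network_match(key_frame)              # S402 on the network device

    # S302-S304: calibrate each target frame locally and composite the result.
    ar_frames = []
    for target_frame in second_stream:
        scene_info = calibrate(target_frame, scene_info)   # S303
        ar_frames.append(composite(target_frame, virtual_object, scene_info))  # S304
    return ar_frames
```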
In the present application, the generated augmented reality video information of the user scene may be applied to the scene video presentation of a single user, such as a single-user video recording mode, and may also be shared by each user with other users when multiple users interact, such as a multi-user video chat mode, so that the other users can see the augmented reality video information of that user's scene. In addition, any other mode to which the augmented reality video information of the user scene can be applied may serve as an application scene of the present application and is included in the protection scope of the present application.
Specifically, in step S301, the user equipment 1 may send a video key frame of a first video stream corresponding to a user scene to a corresponding network device. In step S401, the network device 2 may obtain a video key frame corresponding to the user scene of the user device 1.
In one implementation, the method further includes step S306 (not shown), and in step S306, the user equipment 1 may capture a first video stream corresponding to a user scene. Here, the capturing device is used for collecting video information, namely the video stream, of the corresponding user during video recording or interaction with other users. In this application, the first video stream may be a video stream captured at any time. In one implementation, the first video stream of the user scene may be captured by various types of cameras, or a combination of cameras, on the user device 1. Here, the video stream corresponds to a plurality of consecutive frames, each frame corresponds to corresponding image information, and each object in the image information is a scene object in the user scene. In one implementation, the user equipment 1 may acquire, in real time, the first video stream corresponding to the scene objects.
Next, the method further comprises step S307 (not shown), where in step S307 the user equipment 1 may determine video key frames from the first video stream. Here, a video key frame may be one or more frames in the first video stream, and the criteria for identifying a video key frame may be customized based on different scene needs. In one implementation, when the image information of a frame of the first video stream changes greatly compared with that of a previous frame, for example, a scene object is added or removed, or a scene object moves enough to reach a preset image information change threshold, the frame is determined to be a video key frame. Next, in step S301, the user equipment 1 may send the video key frame corresponding to the scene objects to the corresponding network device 2, so that image matching recognition can be performed on the video key frame in the network device 2; the image matching recognition is used to effectively determine the core information for identifying the scene objects, such as attribute information, position information, and surface information of the scene objects. In contrast, a frame whose image information does not change much compared to the previous frame may be determined as a non-video key frame that does not need to be uploaded; in actual operation, such a non-video key frame may simply be ignored, or it may be recognized on the user equipment 1 through image calibration recognition. In the application, only a small number of video key frames need to be transmitted between the user equipment 1 and the corresponding network device 2, so the transmission data volume is small, the network delay is low, the burden on data communication is light, and the user experience is not affected; meanwhile, the strong computing and storage capacity of the network device 2 effectively compensates for the inability of the user equipment 1 to perform a large amount of complex image recognition.
In one implementation, an information transmission channel may be established between the network device 2 and one or more user devices, and between multiple user devices that interact with each other through video. The information transmission channel may include a signaling channel and a data channel, where the signaling channel is responsible for transmitting small-volume contents such as control instructions, and the data channel is responsible for transmitting large-volume contents such as video key frames, video streams, and virtual object sets.
In one implementation, the user equipment 1 may acquire a video stream corresponding to the scene object in real time. Further, there may be video key frames in each video stream. For example, one or more key frames may be present in both the first video stream and the subsequent second video stream. Furthermore, in one implementation, the video key frame may be determined in real time, and the video key frame may be set to be sent to the corresponding network device 2. For example, the determination and uploading of video key frames in the first video stream may be performed as described above; in another example, the determination and uploading of the video key frame may also be performed on the subsequent second video stream.
Next, in step S402, the network device 2 may perform image matching recognition on the video key frame to determine the scene object related information corresponding to the video key frame; next, in step S403, the network device 2 may send the scene object related information to the user device 1; correspondingly, in step S302, the user equipment 1 may acquire the scene object related information corresponding to the video key frame, which is determined by the network device 2 based on image matching identification. In one implementation, the image matching recognition may be performed on the video key frame through a scene object database preset in or callable by the network device 2, or through image recognition models preset in the network device 2 and trained through machine learning on a large amount of data, so as to recognize one or more scene objects in the video key frame and match corresponding scene object related information to those scene objects.
In one implementation, the scene object related information includes at least any one of: 1) attribute information of the scene object; 2) position information of the scene object; 3) surface information of the scene object. For example, a table image in a video key frame needs to be identified as a table object, and the position coordinates of the table in the image as well as the orientation of the table surface, e.g., the orientation of the table top, need to be identified, so that a virtual object can subsequently be placed on the table and interaction can be provided.
Specifically, in one implementation, the attribute information of the scene object may indicate what the scene object is. Here, fuzzy matching may be implemented, for example identifying the scene object as a building, furniture, or plant; further, more accurate matching may also be achieved, for example identifying the scene object as a tower, a table, or a tree. In one implementation, the position information of the scene object may include image position information of the scene object in the video key frame, such as coordinate information, for example the contour coordinates of a tower or the position coordinates of a table. In one implementation, the surface information of the scene object may include surface contour information of the object, where the surface contour of the scene object to be identified may be specified; for example, the upper surface of a table needs to be identified so that a virtual object can subsequently be added on the table top, and in that case the identified surface information mainly includes the information of the table's upper surface.
Here, those skilled in the art should understand that the attribute information, position information, and surface information of the scene object are only examples, and other existing or future forms of scene object related information, as applicable to the present application, should also be included in the protection scope of the present application and are hereby incorporated by reference.
Next, in step S303, the user equipment 1 may perform image calibration recognition on the target frame of the second video stream acquired by the user equipment 1, based on the scene object related information. Here, the image calibration recognition is a supplement to the image matching identification of the network device 2: the image matching identification is only performed on video key frames, whereas on the user device 1, during video recording, video chat, or other interaction, the capturing device, such as a camera, collects the video stream in real time, that is, collects multiple consecutive frames in real time, and the picture information of each frame may change compared with the previous frame. Such changes may be slight and may be identified without complex image matching operations; in that case, image calibration recognition may be used in cooperation. Here, based on the scene object related information already identified for the video key frame through image matching identification, such as the attribute information, position information, and surface information of the scene objects, image calibration recognition may be performed on the target frame of the second video stream, which is the new video stream currently acquired by the user equipment 1. The image calibration recognition aims to determine the scene object related information of the target frame, and in particular to identify slight changes in the position information, surface information, and the like of the scene objects, so that, based on the scene object related information of the target frame determined from the recognition result, virtual objects can be overlaid and synthesized, rendering the second video stream with an augmented reality effect. In one implementation, each frame in the second video stream may be set as a target frame, or one or more frames in the second video stream may be set as target frames.
Next, in step S304, the user equipment 1 may synthesize the corresponding virtual object and the second video stream into augmented reality video information based on the result of the image calibration recognition. In one implementation, one or more target frames in the second video stream that have undergone image calibration recognition may each be composited with the corresponding virtual object. For example, the image information of one target frame is superimposed with the image information corresponding to a virtual object or model, thereby synthesizing augmented reality image information corresponding to the image information of that target frame. The augmented reality video information corresponding to the second video stream may include one or more frames of augmented reality image information, for example augmented reality image information corresponding to consecutive frames of the video stream. In one implementation, the image information of the target frame of the second video stream may be replaced with the augmented reality image information. In addition, in one implementation, the virtual object may come from a set of virtual objects acquired from the network device 2 or another third-party device, such as various virtual article images or models; in another implementation, the virtual object may also be extracted from the user equipment 1, for example a picture in a picture application of the user equipment 1, such as a photo in a mobile phone album. Furthermore, in one implementation, the corresponding virtual object may be a single virtual object or a combination of multiple virtual objects; for example, a virtual photo frame determined from a virtual object set may be combined with a photo in the user's mobile phone album to form a framed photo.
Herein, the video key frame corresponding to the scene objects is sent to the corresponding network device 2, and the scene object related information corresponding to the video key frame, such as the attribute information, position information, and surface information of the scene objects, determined by the network device 2 based on image matching identification, is acquired; then, the user device 1 performs image calibration recognition on each target frame in the second video stream currently acquired by the user device 1 in real time, in combination with the scene object related information acquired from the network device 2, and synthesizes the corresponding virtual object and the second video stream into augmented reality video information based on the image calibration recognition result. Here, the approach of combining the image matching identification of the network device 2 with the image calibration recognition of the user device 1 breaks through the limitation in the prior art that only simple face recognition can be realized because of the limited computing power and storage capacity of mobile devices, so that the range of recognizable objects can be effectively expanded to any scene object in the user scene. On one hand, the core information for identifying the scene objects, such as attribute information, position information, and surface information, can be effectively determined by using the computing and storage capacity of the network device 2, which is stronger than that of the user device 1, to perform image matching identification on the video key frame; on the other hand, the user equipment 1 may further perform image calibration recognition, aimed at deviation correction, on the video stream updated in real time in the user equipment 1, such as the target frames of the second video stream, based on the result of the image matching identification of the network device 2, so that scene objects in each frame of image on the current user equipment 1 can be accurately recognized; then, based on the result of the image calibration recognition, the corresponding virtual object is synthesized with the second video stream and rendered as augmented reality video information that can be presented to the user. In the application, because any scene object corresponding to the user equipment 1 can be recognized and synthesized, the augmented reality video information presented by the application offers an obvious visual breakthrough compared with traditional video applications or existing augmented reality video chat applications, and the variability of the augmented reality video information seen by the user is greatly enhanced, thereby increasing the user's interest in interaction and optimizing the user's intelligent video experience.
Meanwhile, only a small amount of video key frames or scene object related information corresponding to the video key frames need to be transmitted between the user equipment 1 and the corresponding network equipment 2, so that the transmission data volume is small, the network delay is small, the burden on data communication is small, and the user experience is not influenced.
In one implementation, the step S303 includes a step S3031 (not shown), and a step S3032 (not shown). In step S3031, the user equipment 1 may perform image calibration recognition on a first target frame of a second video stream acquired by the user equipment 1 based on the scene object related information; in step S3032, the user equipment 1 may determine scene object related information corresponding to the first target frame based on image calibration identification performed on the first target frame.
In particular, in this implementation, a target frame in the second video stream, such as the first target frame, may undergo image calibration recognition with reference to the scene object related information of a video key frame of the first video stream. First, the image information of the first target frame is compared with the image information of the video key frame to determine the difference between the two, for example by comparing the outlines of the scene objects or comparing the positions of the scene objects; then, based on the known scene object related information of the video key frame, such as the attribute information, position information, and surface information of the scene objects, each specific item of scene object related information corresponding to the first target frame is calculated. For example, when the first target frame is compared with the video key frame and the image position of a scene object, such as a table, has moved, the comparison yields the position offset of the table between the two frames, and the actual position coordinates of the table in the first target frame can be determined by combining this offset with the known position coordinates of the table in the video key frame. In one implementation, any target frame in the second video stream may serve as the first target frame, so that one or more first target frames may be identified based on the scene object related information with reference to a video key frame of the first video stream.
Next, in step S304, the user equipment 1 may synthesize a corresponding virtual object and the first target frame into first augmented reality image information based on the scene object related information corresponding to the first target frame; then, augmented reality video information is generated based on the first augmented reality image information. In an implementation manner, the image information included in the augmented reality video information may consist entirely of augmented reality image information similar or identical to the first augmented reality image information, or may also include some ordinary image information without an augmented reality effect.
Further, in one implementation, the step S303 further includes a step S3033 (not shown), and a step S3034 (not shown). In step S3033, the user equipment 1 may perform image calibration identification on a second target frame of a second video stream acquired by the user equipment 1 based on the scene object related information corresponding to the first target frame; next, in step S3034, the user equipment 1 may determine scene object related information corresponding to the second target frame based on image calibration identification performed on the second target frame.
In particular, in this implementation, a target frame in the second video stream, such as the second target frame, may undergo image calibration recognition with reference to the scene object related information of the first target frame. In one implementation, the second target frame may be a frame in the second video stream that follows the first target frame in sequence. In this case, the first target frame appears closer in time to the second target frame than the video key frame of the first video stream does, so it is reasonable to expect that the image information of the first target frame is more likely to be similar to the image information of the second target frame.
Further, in an implementation manner, if the user equipment 1 acquires a new video key frame after the video key frame of the first video stream, and the new video key frame appears after the first target frame in sequence, then the image information of the new video key frame is more likely to be close to the image information of the second target frame than that of the first target frame is; in this case, the new video key frame may be preferentially used as the reference for identifying the image information of the second target frame.
Next, in step S304, the user equipment 1 may synthesize a corresponding virtual object and the second target frame into second augmented reality image information based on the scene object related information corresponding to the second target frame; then, augmented reality video information is generated based on the first augmented reality image information and the second augmented reality image information. In an implementation manner, the image information included in the augmented reality video information may consist entirely of augmented reality image information similar or identical to the first augmented reality image information or the second augmented reality image information, or may also include some ordinary image information without an augmented reality effect.
In one implementation, the method further includes step S305 (not shown); in step S305, the user equipment 1 may present the augmented reality video information corresponding to the second video stream.
Specifically, the user equipment 1 may play the augmented reality video information in real time on the display screen of the corresponding device. For example, while the user equipment 1, such as a mobile phone, is shooting and recording, the application performs augmented reality effect processing on the video stream acquired in real time and presents the corresponding augmented reality video information on the mobile phone in real time; for another example, when the user video chats with another user through the user equipment 1, the user's mobile phone may present a video picture with an augmented reality effect, and the mobile phone of the other user interacting with the user may also view the augmented reality video information.
In one implementation, the method further includes step S308 (not shown); in step S308, the user device 1 may provide the augmented reality video information to one or more other user devices corresponding to the user device 1. In the present application, the augmented-reality-based presentation of a user scene video may be not only a scene video presentation for a single user, such as a single-user video recording mode, but also a sharing mode in which, during interaction among multiple users, each user shares its own user scene video with the other users, such as a multi-user video chat mode. In an implementation manner, the augmented reality video information, for example an augmented reality video stream, may be sent by the user equipment 1 to a corresponding network device, such as the network device 2, and the network device 2 then forwards the augmented reality video information to the corresponding other user equipment. In another implementation, the user equipment 1 and the other user equipments may also exchange their respective augmented reality video information directly, without the network device 2 acting as an intermediary.
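A hedged sketch of step S308: each synthesized AR frame is encoded and either pushed to the network device, which relays it to the other user devices in the session, or sent to a peer device directly. The relay URL, JPEG framing, and use of HTTP POST are illustrative assumptions; an actual deployment would more likely use a streaming protocol such as RTP or WebRTC.

```python
import cv2
import requests

RELAY_URL = "http://example-network-device/relay"  # hypothetical relay endpoint

def share_ar_frame(ar_frame_bgr, session_id, via_network_device=True, peer_url=None):
    """Encode one augmented reality frame as JPEG and forward it, either
    through the network device or directly to a peer user device."""
    ok, jpeg = cv2.imencode(".jpg", ar_frame_bgr)
    if not ok:
        return False
    target = RELAY_URL if via_network_device else peer_url
    resp = requests.post(
        target,
        data=jpeg.tobytes(),
        headers={"X-Session-Id": session_id, "Content-Type": "image/jpeg"})
    return resp.status_code == 200
```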
In one implementation, the method further includes step S309 (not shown); in step S309, the user equipment 1 may acquire operation instruction information of the user on the virtual object, and execute a corresponding operation based on the operation instruction information. For example, a user may control a virtual object in a recorded video scene or a video chat scene by touching it or speaking to it; for instance, a virtual pet may be placed on a table surface in the real environment, and a user who records the video or participates in the chat may control the virtual pet to perform a series of actions by touch, voice, and the like. In one implementation, the interaction with the virtual object in the augmented reality video information may be performed by the user corresponding to the user equipment 1; in another implementation, if the user interacts with other users, such as in a multi-user video chat, the other users may also interact with the virtual object based on the shared augmented reality video information.
Further, in an implementation manner, step S309 further includes at least any one of step S3091 (not shown), step S3092 (not shown), and step S3093 (not shown). In step S3091, the user equipment 1 may obtain touch screen operation information of the user, and determine the operation instruction information of the user on the virtual object based on the touch screen operation information; for example, if the virtual object is a pet puppy, the user may instruct the puppy in the video to perform a corresponding reaction by tapping a preset region of the screen, such as the region where the puppy is located, and the virtual puppy may, for instance, wag its tail when the user taps the screen. For another example, if the virtual object is a photo set on the user's mobile phone, switching between photos may be performed through a sliding operation on the touch screen. In step S3092, the user equipment 1 may obtain gesture information of the user through the camera device of the user equipment, and determine the operation instruction information of the user on the virtual object based on the gesture information; for example, the camera captures the user's hand motion, gesture information such as tapping or pointing is extracted from the captured images, and the operation instruction information is then determined based on a preset correspondence between the gesture information and the operation instruction information. In step S3093, the user equipment 1 may acquire voice information of the user, and determine the operation instruction information of the user on the virtual object based on the voice information, where the voice information may be acquired through a microphone built into the user equipment 1, and the operation instruction information is determined based on a preset correspondence between the voice information and the operation instruction information. In this way, the interaction between the user and the virtual object in the augmented reality video information can further enrich the user's interaction experience.
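The three input paths of steps S3091 to S3093 all reduce to looking up a preset correspondence between raw user input and an operation instruction; the event names, instruction strings, and dispatch tables below are assumptions used purely to illustrate that mapping, not part of the method itself.

```python
# Preset correspondences between user input and operation instruction
# information (hypothetical values for illustration).
TOUCH_MAP = {"tap_pet_region": "wag_tail", "swipe_left": "next_photo"}
GESTURE_MAP = {"beckon": "come_here", "point": "sit"}
VOICE_MAP = {"sit": "sit", "roll over": "roll_over"}

def instruction_from_touch(event_name):
    """Step S3091: map a touch-screen event to an operation instruction."""
    return TOUCH_MAP.get(event_name)

def instruction_from_gesture(gesture_name):
    """Step S3092: map a camera-extracted gesture to an operation instruction."""
    return GESTURE_MAP.get(gesture_name)

def instruction_from_voice(transcript):
    """Step S3093: map a recognized voice phrase to an operation instruction."""
    return VOICE_MAP.get(transcript.strip().lower())

def execute_instruction(virtual_object, instruction):
    """Have the virtual object perform the named action, if any was matched."""
    if instruction is not None:
        virtual_object.perform(instruction)  # perform() is assumed on the object
```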
In one implementation, the method further includes step S310 (not shown), step S311 (not shown), and step S404 (not shown).
Specifically, in step S404, the network device 2 may send the set of virtual objects matching the scene object related information corresponding to the video key frame to the user device 1, and correspondingly, in step S310, the user device 1 obtains the set of virtual objects. For example, the network device 2 may screen out a set of virtual objects matching the scene object determined in the video key frame based on the attribute information of that scene object; if the scene object is determined to be a tree, a set of virtual objects including various small virtual animals may be screened out to suit the user scene. For another example, filtering parameters such as the size of the virtual object may be set in combination with scene object related information such as the position information and surface information of the scene object. Next, in step S311, the user equipment 1 may determine one or more target virtual objects from the set of virtual objects, so that, in step S304, the user equipment 1 may synthesize the target virtual objects and the second video stream into the augmented reality video information based on the result of the image calibration identification. By matching a corresponding set of virtual objects for the user equipment 1, this implementation can enrich the composite rendering effect of the augmented reality video information and, at the same time, improve the intelligent experience of the user.
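One way to read steps S404, S310, and S311 is as a filter over a catalogue of virtual objects on the network device side, followed by a selection on the user device side; the attribute names and the size-fit heuristic below are assumptions chosen only to illustrate that flow.

```python
def match_virtual_objects(catalogue, scene_object):
    """Network device side (step S404): keep only virtual objects whose
    declared habitats include the scene object type and whose footprint
    fits the detected surface."""
    return [v for v in catalogue
            if scene_object["type"] in v["habitats"]
            and v["footprint"] <= scene_object["surface_area"]]

def choose_target(virtual_object_set, preferred_name=None):
    """User device side (step S311): pick the user's preferred virtual object
    if it is in the matched set, otherwise fall back to the first entry."""
    for v in virtual_object_set:
        if v["name"] == preferred_name:
            return v
    return virtual_object_set[0] if virtual_object_set else None

# Example: a tree scene object matches only small tree-dwelling virtual animals.
catalogue = [{"name": "squirrel", "habitats": {"tree"}, "footprint": 0.1},
             {"name": "elephant", "habitats": {"savannah"}, "footprint": 4.0}]
scene_object = {"type": "tree", "surface_area": 0.5}
target = choose_target(match_virtual_objects(catalogue, scene_object), "squirrel")
```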
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.