
CN113536866B - A person tracking display method and electronic device - Google Patents


Info

Publication number: CN113536866B
Application number: CN202010323761.6A
Authority: CN (China)
Prior art keywords: frame, electronic device, targets, latest frame, protagonist
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN113536866A
Inventors: 陈泽曦, 邹双一, 王凡
Current assignee: Huawei Technologies Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Huawei Technologies Co Ltd

Events:

  • Application filed by Huawei Technologies Co Ltd
  • Priority to CN202010323761.6A
  • Publication of CN113536866A
  • Application granted
  • Publication of CN113536866B


Landscapes

  • Studio Devices (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

A person tracking display method. In this method, the electronic device intelligently performs target detection on the persons in a video stream, crops the video stream according to the number and positions of the detected targets, and outputs and displays the cropped video stream. With the technical solution provided in this application, the electronic device can automatically track persons in a video scene, and the output video shows more detail of the persons.

Description

Person tracking display method and electronic device
Technical Field
The present application relates to the field of terminals and image processing technologies, and in particular, to a person tracking display method and an electronic device.
Background
With the development of network technology, more and more users have become accustomed to remote video interaction. For example, employees can hold online meetings via real-time video, students can attend remote classes via real-time video during special periods, and even family members thousands of miles apart can conveniently stay in touch and share their lives through online video calls.
However, at present, when a video of a person is being shot, the person may end up at the edge of the frame because the camera is not facing the person or because the person is moving. The person then appears small in the shot, and details of the person are difficult to display clearly.
Disclosure of Invention
This application provides a person tracking display method and an electronic device that automatically track persons in a video scene, so that the output video shows more detail of the persons.
In a first aspect, an embodiment of this application provides a person tracking display method, including: the electronic device performs target detection on the latest frame in a video stream to obtain the number of targets and the positions of the targets in the latest frame, where the targets are persons in the latest frame; the electronic device determines a cropping region according to the number of targets and the positions of the targets in the latest frame, where the cropping region covers the positions of the targets in the latest frame and is smaller than the extent of the latest frame; and the electronic device outputs a cropped frame according to the cropping region.
In this embodiment, by performing target detection on the video stream, the electronic device intelligently determines, from the number of detected targets and their positions in the latest frame, a cropping region that covers the targets and is smaller than the latest frame, and outputs a cropped frame according to that region. Person tracking is thus completed automatically, and because the cropping region is smaller than the latest frame in the video stream, the persons in the output video appear larger than in the original video and more of their details can be displayed.
With reference to the first aspect, in some embodiments, the electronic device determines the cropping region according to the number of targets and the positions of the targets in the latest frame, specifically including: when there is one target, determining the cropping region according to the position of that target in the latest frame, the cropping region being centered on that position, covering it, and smaller than the extent of the latest frame; and, when there are multiple targets, determining the cropping region according to the positions of the multiple targets in the latest frame, the cropping region covering those positions and being smaller than the extent of the latest frame.
In the above embodiment, the way the electronic device determines the cropping region differs with the number of targets detected, so a suitable cropping region can be chosen intelligently for different numbers of subjects, and the final cropped frame better matches the user's expectations.
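As an illustration of the single-target case, the centered, clamped crop can be sketched as follows. This is a hypothetical helper, not the patented implementation; the coordinate convention and sizes are assumptions:

```python
def crop_region_single(cx, cy, frame_w, frame_h, crop_w, crop_h):
    """Center a crop_w x crop_h cropping region on the target at (cx, cy),
    clamped so the region stays entirely inside the frame extent."""
    x = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h // 2, 0), frame_h - crop_h)
    return x, y, crop_w, crop_h

# A target at the frame center yields a centered crop; a target near a
# corner yields a crop clamped to the frame boundary.
print(crop_region_single(960, 540, 1920, 1080, 640, 360))  # (640, 360, 640, 360)
print(crop_region_single(10, 10, 1920, 1080, 640, 360))    # (0, 0, 640, 360)
```

Clamping is what keeps the target centered whenever possible while still guaranteeing the cropping region never extends past the edge of the latest frame.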
With reference to some embodiments of the first aspect, in some embodiments, the electronic device determines the cropping region according to the positions of the multiple targets in the latest frame, specifically including: the electronic device determines whether there is a protagonist among the multiple targets, and determines the cropping region according to the positions of the multiple targets and the position of the protagonist in the latest frame; when there is no protagonist among the multiple targets, the cropping region covers the positions of the multiple targets in the latest frame and is smaller than the extent of the latest frame; when there is a protagonist, the cropping region is centered on the protagonist's position in the latest frame, covers the positions of the multiple targets, and is smaller than the extent of the latest frame.
In the above embodiment, when multiple targets are detected, the electronic device may determine a protagonist among them and then intelligently choose a suitable cropping region according to whether a protagonist exists and where the protagonist is in the latest frame, so that the final cropped frame meets the needs of the actual scene.
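For the no-protagonist branch, a minimal sketch of a cropping region covering all targets might look like the following. The (x, y, w, h) box format and the margin value are assumptions, not from the patent:

```python
def crop_region_multi(boxes, frame_w, frame_h, margin=40):
    """boxes: list of (x, y, w, h) prediction boxes, one per detected target.
    Returns the smallest region covering every target plus a margin,
    clamped to the extent of the latest frame."""
    x0 = max(min(b[0] for b in boxes) - margin, 0)
    y0 = max(min(b[1] for b in boxes) - margin, 0)
    x1 = min(max(b[0] + b[2] for b in boxes) + margin, frame_w)
    y1 = min(max(b[1] + b[3] for b in boxes) + margin, frame_h)
    return x0, y0, x1 - x0, y1 - y0

# Two persons at the left of a 1920x1080 frame -> a tight region around both.
print(crop_region_multi([(100, 100, 50, 80), (300, 200, 60, 90)], 1920, 1080))
```

The union of the prediction boxes guarantees every person stays in view; the margin keeps some background context, as shown in the use-effect figures.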
With reference to some embodiments of the first aspect, in some embodiments, the electronic device determines the protagonist among the multiple targets, specifically including: if no protagonist among the multiple targets has been determined in a historical frame, the electronic device performs posture analysis on the multiple targets to determine the protagonist, where a historical frame is a frame preceding the latest frame in the video stream; if a protagonist has already been determined in a historical frame, the electronic device determines the protagonist among the multiple targets by tracking, according to the protagonist's position and feature information in the historical frame.
In the above embodiment, the electronic device determines the protagonist in the latest frame differently depending on whether a protagonist has already been determined in a historical frame. When a protagonist was already determined in a historical frame, there is no need to redetermine it through posture analysis; it is enough to track the protagonist among the targets using its position and feature information from the historical frame. Posture analysis therefore need not be performed on every target in every frame, which reduces the power consumption of the electronic device and speeds up protagonist determination.
With reference to some embodiments of the first aspect, in some embodiments, the electronic device performs posture analysis on the multiple targets to determine the protagonist, specifically including: the electronic device determines that a target among the multiple targets that holds a preset protagonist posture for a preset protagonist duration is the protagonist.
In the above embodiment, the protagonist is determined by a target holding the preset protagonist posture for the preset protagonist duration. On the one hand, this avoids misjudging a target's action and improves the accuracy of protagonist determination. On the other hand, users can choose for themselves whether to hold the preset protagonist posture, which improves the interactivity between the electronic device and the user.
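The hold-for-a-duration rule above can be sketched as a per-frame counter update. This is a hypothetical illustration; the frame rate, threshold, and the pose detector feeding `holding_pose` are all assumptions:

```python
def update_protagonist(hold_counts, holding_pose, fps=24, hold_seconds=2.0):
    """holding_pose: dict target_id -> bool, True if the target holds the
    preset protagonist posture in the current frame. hold_counts keeps a
    consecutive-frame count per target; a target is promoted to protagonist
    once it has held the posture for hold_seconds of video."""
    needed = int(fps * hold_seconds)
    for tid, holding in holding_pose.items():
        hold_counts[tid] = hold_counts.get(tid, 0) + 1 if holding else 0
    for tid, count in hold_counts.items():
        if count >= needed:
            return tid
    return None

# Target 1 holds the pose over consecutive frames; target 2 does not.
counts = {}
winner = None
for _ in range(48):  # 2 seconds at 24 FPS
    winner = update_protagonist(counts, {1: True, 2: False})
print(winner)  # 1
```

Resetting the count to zero whenever the posture is dropped is what filters out a momentary, accidental match and so reduces misjudgment.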
With reference to some embodiments of the first aspect, in some embodiments, the electronic device determines the protagonist among the multiple targets by tracking according to the protagonist's position and feature information in the historical frame, specifically including: the electronic device determines candidate targets in the latest frame, where a candidate target is a target among the multiple targets whose position is within a preset distance threshold of the protagonist's position in the previous frame; when there is one candidate target, determining that it is the protagonist; and, when there are multiple candidate targets, determining that the candidate whose feature information is closest to the protagonist's feature information in the historical frame is the protagonist.
In the above embodiment, candidate targets are selected by their distance from the protagonist's position in the previous frame, and among multiple candidates the protagonist is chosen by similarity to the protagonist's feature information in the historical frame. This keeps protagonist determination accurate while reducing the power consumption of the electronic device.
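The two-stage selection just described (distance gate, then feature similarity) can be sketched as follows. The position/feature tuple format, Euclidean metric, and threshold are assumptions for illustration:

```python
def track_protagonist(targets, prev_pos, prev_feat, dist_thresh=100.0):
    """targets: list of (position, feature_vector) pairs for the persons
    detected in the latest frame. Candidates are targets within dist_thresh
    of the protagonist's position in the previous frame; among several
    candidates, the one whose feature vector is closest (Euclidean) to the
    recorded protagonist features is chosen."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    candidates = [t for t in targets if dist(t[0], prev_pos) <= dist_thresh]
    if not candidates:
        return None  # protagonist lost; fall back to posture analysis
    return min(candidates, key=lambda t: dist(t[1], prev_feat))

# Two nearby targets pass the distance gate; features break the tie.
targets = [((10, 10), (1.0, 0.0)), ((50, 50), (0.0, 1.0)), ((500, 500), (1.0, 0.0))]
print(track_protagonist(targets, (0, 0), (1.0, 0.0))[0])  # (10, 10)
```

The cheap distance gate prunes most targets before the (costlier) feature comparison runs, which matches the stated goal of reducing power consumption.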
With reference to some embodiments of the first aspect, in some embodiments, the method further includes: the electronic device records the feature information of the protagonist among the multiple targets.
In the above embodiment, after the protagonist is determined in the latest frame, its feature information is recorded so that the protagonist can be tracked in newly generated frames without performing posture analysis on every frame, reducing the power consumption of the electronic device.
With reference to some embodiments of the first aspect, in some embodiments, the electronic device performs target detection on the latest frame in the video stream to obtain the number and positions of the targets, specifically including: the electronic device downsamples the original latest frame in the video stream to obtain the latest frame, whose resolution is lower than that of the original latest frame; and the electronic device performs target detection on the latest frame to obtain the number and positions of the targets in it. The electronic device outputs a cropped frame according to the cropping region, specifically including: the electronic device crops and upsamples the original latest frame according to the cropping region found in the latest frame to obtain a cropped frame whose resolution equals that of the original latest frame, and the electronic device outputs the cropped frame.
In the above embodiment, the frame is downsampled before target detection, reducing the computational load on the electronic device; upsampling after cropping raises the resolution of the cropped frame, so the persons' details are displayed more clearly.
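One detail implied by this pipeline: detection runs on the downsampled frame, so the cropping region it produces must be mapped back to original-resolution coordinates before the full-resolution frame is cropped. A sketch, with the downsampling factor `s` as an assumption:

```python
def scale_crop_to_original(region, s):
    """Map a cropping region (x, y, w, h) found on the s-times-downsampled
    frame back to the coordinate system of the original full-resolution frame."""
    x, y, w, h = region
    return x * s, y * s, w * s, h * s

# A region found on a 2x-downsampled frame, mapped to original coordinates.
print(scale_crop_to_original((10, 20, 30, 40), 2))  # (20, 40, 60, 80)
```

Because every coordinate scales by the same factor, the region covers exactly the same persons in the original frame as it did in the downsampled one.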
With reference to some embodiments of the first aspect, in some embodiments, before the electronic device outputs the cropped frame, the method further includes: the electronic device performs distortion correction on the cropped frame.
In the above embodiment, distortion correction on the cropped frame corrects the distortion of images captured by a wide-angle camera, so the video more faithfully reflects the subject.
With reference to some embodiments of the first aspect, in some embodiments, before the electronic device performs target detection on the latest frame in the video stream to obtain the number and positions of the targets, the method further includes: the electronic device merges video streams captured by multiple cameras to obtain the video stream.
In the above embodiment, the video streams captured by multiple cameras are merged before subsequent processing, widening the viewing angle of the video.
In a second aspect, an embodiment of this application provides an electronic device, including a camera, one or more processors, and a memory. The camera is used to capture a video stream. The memory is coupled to the one or more processors and stores computer program code, the computer program code comprising computer instructions. The one or more processors invoke the computer instructions to cause the electronic device to perform: target detection on the latest frame in the video stream to obtain the number of targets and the positions of the targets in the latest frame, where the targets are persons in the latest frame; determining a cropping region according to the number of targets and the positions of the targets in the latest frame, where the cropping region covers the positions of the targets in the latest frame and is smaller than the extent of the latest frame; and outputting a cropped frame according to the cropping region.
With reference to the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: when there is one target, determining the cropping region according to the position of that target in the latest frame, the cropping region being centered on that position, covering it, and smaller than the extent of the latest frame; and, when there are multiple targets, determining the cropping region according to the positions of the multiple targets in the latest frame, the cropping region covering those positions and being smaller than the extent of the latest frame.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: determining whether there is a protagonist among the multiple targets, and determining the cropping region according to the positions of the multiple targets and the position of the protagonist in the latest frame; when there is no protagonist among the multiple targets, the cropping region covers the positions of the multiple targets in the latest frame and is smaller than the extent of the latest frame; when there is a protagonist, the cropping region is centered on the protagonist's position in the latest frame, covers the positions of the multiple targets, and is smaller than the extent of the latest frame.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: when no protagonist among the multiple targets has been determined in a historical frame, performing posture analysis on the multiple targets to determine the protagonist, where a historical frame is a frame preceding the latest frame in the video stream; and, when a protagonist has already been determined in a historical frame, determining the protagonist by tracking according to the protagonist's position and feature information in the historical frame.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: determining that a target among the multiple targets that holds the preset protagonist posture for the preset protagonist duration is the protagonist.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: determining candidate targets in the latest frame, where a candidate target is a target among the multiple targets whose position is within a preset distance threshold of the protagonist's position in the previous frame; when there is one candidate target, determining that it is the protagonist; and, when there are multiple candidate targets, determining that the candidate whose feature information is closest to the protagonist's feature information in the historical frame is the protagonist.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: recording the feature information of the protagonist among the multiple targets.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: downsampling the original latest frame in the video stream to obtain the latest frame, whose resolution is lower than that of the original latest frame; performing target detection on the latest frame to obtain the number and positions of the targets in it; cropping and upsampling the original latest frame according to the cropping region found in the latest frame to obtain a cropped frame whose resolution equals that of the original latest frame; and outputting the cropped frame.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: distortion correction on the cropped frame.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: merging video streams captured by multiple cameras to obtain the video stream.
In a third aspect, an embodiment of this application provides a chip system applied to an electronic device, the chip system comprising one or more processors configured to invoke computer instructions to cause the electronic device to perform the method described in the first aspect and any possible implementation thereof.
It can be understood that the chip system may include the processor 110 in the electronic device 100 shown in FIG. 5, or multiple processors 110 in the electronic device 100 shown in FIG. 5, or one or more other chips, for example an image signal processing chip in the camera 193 of the electronic device 100 shown in FIG. 5 or an image display chip in the display 194; this is not limited here.
In a fourth aspect, an embodiment of this application provides a computer program product comprising instructions that, when run on an electronic device, cause the electronic device to perform the method described in the first aspect and any possible implementation thereof.
In a fifth aspect, an embodiment of this application provides a computer-readable storage medium comprising instructions that, when executed on an electronic device, cause the electronic device to perform the method described in the first aspect and any possible implementation thereof.
It can be understood that the electronic device of the second aspect, the chip system of the third aspect, the computer program product of the fourth aspect, and the computer storage medium of the fifth aspect are all configured to perform the method provided in the embodiments of this application. For the benefits they achieve, reference may be made to the benefits of the corresponding method, which are not repeated here.
Drawings
FIG. 1 is a schematic diagram of the relationship between a video stream and frames in an embodiment of this application;
FIG. 2 is a schematic diagram of a use effect of the method in an embodiment of this application;
FIG. 3 is a schematic diagram of another use effect of the method in an embodiment of this application;
FIG. 4 is a schematic diagram of another use effect of the person tracking display method in an embodiment of this application;
FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of this application;
FIG. 6 is a block diagram of the software architecture of an electronic device in an embodiment of this application;
FIG. 7 is an exemplary diagram of a video capture interface in an embodiment of this application;
FIG. 8 is an exemplary diagram of a camera settings interface in an embodiment of this application;
FIG. 9 is an exemplary diagram of target detection in the person tracking display method in an embodiment of this application;
FIG. 10 is an exemplary diagram of determining a cropping region in the person tracking display method in an embodiment of this application;
FIG. 11 is an exemplary diagram of cropping and output in the person tracking display method in an embodiment of this application;
FIG. 12 is another exemplary diagram of target detection in the person tracking display method in an embodiment of this application;
FIG. 13 is another exemplary diagram of determining a cropping region in the person tracking display method in an embodiment of this application;
FIG. 14 is another exemplary diagram of cropping and output in the person tracking display method in an embodiment of this application;
FIG. 15 is an exemplary diagram of determining a protagonist in the person tracking display method in an embodiment of this application;
FIG. 16 is another exemplary diagram of determining a cropping region in the person tracking display method in an embodiment of this application;
FIG. 17 is another exemplary diagram of cropping and output in the person tracking display method in an embodiment of this application;
FIG. 18 is an exemplary diagram of determining candidate targets in the person tracking display method in an embodiment of this application;
FIG. 19 is another exemplary diagram of determining candidate targets in the person tracking display method in an embodiment of this application.
Detailed Description
The terminology used in the following embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in the specification of this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure refers to and encompasses any and all possible combinations of one or more of the listed items.
The terms "first," "second," and the like are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of this application, unless otherwise indicated, "multiple" means two or more.
Since the embodiments of this application involve the application of image processing technology, for ease of understanding, related terms and concepts are described below.
(1) SSD (Single Shot MultiBox Detector) algorithm:
The SSD algorithm is an artificial neural network algorithm for target detection. While maintaining accuracy, it detects faster than many other target detection algorithms and can meet real-time detection requirements.
Before the SSD algorithm is used for target detection, it is trained with labeled samples as a training set, yielding an SSD model that meets the user's requirements.
For example, in an embodiment of this application, a series of images in which persons are labeled may be used to train the SSD algorithm, so that it can detect the persons in an input image and output their position information. Typically, in the SSD algorithm, the position of a person in the input image may be represented by a prediction box that frames the person. The size of the prediction box may be preset, or adjusted automatically according to a preset rule based on the detected size of the target; this is not limited here.
Depending on the requirements of the target detection scenario, the "persons" in embodiments of this application may include only people, people together with family pets (such as cats and dogs), or more objects related to people; this is not limited here. For each target requirement, it is only necessary to train a corresponding SSD model using samples labeled with the targets to be recognized as the training set.
(2) Downsampling (subsampling) and upsampling of an image:
Downsampling an image is the process of shrinking it. Its main purpose here is to reduce the resolution of the image used for target detection, thereby reducing the artificial intelligence (AI) computational load.
For example, if an image has a resolution of M×N, downsampling it by a factor of s yields an image with a resolution of (M/s)×(N/s), where s should be a common divisor of M and N.
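The (M/s)×(N/s) rule can be demonstrated with a toy sketch that keeps every s-th pixel (illustrative only; real pipelines typically average or filter rather than drop pixels):

```python
def downsample(img, s):
    """s-times downsample an image given as a list of pixel rows, keeping
    every s-th pixel in each direction: an M x N image becomes (M/s) x (N/s),
    so s should be a common divisor of M and N."""
    return [row[::s] for row in img[::s]]

# A 4x4 image downsampled by 2 becomes 2x2.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
print(downsample(img, 2))  # [[0, 2], [8, 10]]
```

The result has 1/s² of the original pixels, which is exactly why detection on the downsampled frame is so much cheaper.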
Upsampling an image is the process of enlarging it so that it can be displayed on a higher-resolution display device. Image enlargement almost always uses interpolation: on the basis of the original pixels, new pixels are inserted between existing ones using a suitable interpolation algorithm.
In embodiments of this application, the image is downsampled to reduce its resolution and thus the computational load of image processing. For example, when targets are detected in an image with the SSD algorithm, or when posture is analyzed, a lower-resolution image imposes a smaller computational load and is processed faster.
In embodiments of this application, the image is upsampled because, after the original video stream is cropped according to the determined cropping region, the resolution of the frames in the resulting video stream is reduced. The frames of the cropped video stream are therefore upsampled, restoring their resolution to that of the frames in the original video stream.
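As the simplest instance of the interpolation idea, here is a nearest-neighbour upsampling sketch, in which each pixel is repeated rather than newly interpolated (an assumption for illustration; production code would use bilinear or bicubic interpolation):

```python
def upsample_nearest(img, s):
    """Enlarge an image s times using nearest-neighbour scaling: each pixel
    is repeated s times horizontally and s times vertically."""
    return [[px for px in row for _ in range(s)]
            for row in img for _ in range(s)]

# A 1x2 image upsampled by 2 becomes 2x4.
print(upsample_nearest([[1, 2]], 2))  # [[1, 1, 2, 2], [1, 1, 2, 2]]
```

Note that upsampling restores the pixel count but cannot recover detail lost to cropping; smarter interpolation only smooths the enlargement.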
(3) Video stream and frames:
A video stream is made up of frames. For example, a video stream with a frame rate of 24 frames per second (FPS) is made up of 24 frames for each second of video.
FIG. 1 is a schematic diagram of the relationship between a video stream and frames. In embodiments of this application, the video stream is generated by the camera shooting in real time, so new frames are continuously produced over time. Suppose shooting starts at time T0 and one frame is generated at each of times T1 through the latest time T13; the frames generated from T1 to T12 may be called historical frames, and the frame generated at the latest time T13 may be called the latest frame.
With the person tracking display method in embodiments of this application, each latest frame is processed, and the processed frames form a new video stream.
In the prior art, when a video is being shot, a person is likely to be at the edge of the frame because the camera is not facing the person or because the person is moving. The person then appears small, and details of the person cannot be displayed clearly.
With the person tracking display method in embodiments of this application, the user does not need to manually adjust the camera or the video frame. In a video scene, the electronic device automatically tracks persons, and the output video shows more of their details, improving the user experience in video scenes.
The person tracking display method in embodiments of this application works well in a variety of application scenarios.
For example, when the shooting target is a single person, fig. 2 is a schematic diagram of the application effect of the person tracking display method in the embodiment of the application. Fig. 2 (a) shows the original captured video, where the single person is located at the edge of the video and appears small, so the details are difficult to observe. After processing by the person tracking display method in the embodiment of the application, the video shown in fig. 2 (b) is output: the output video shows the details of the person, the background near the person is retained, and the person is kept at the center of the video.
For another example, when the shooting target is a multi-person scene, fig. 3 is another schematic diagram of the application effect of the person tracking display method in the embodiment of the application. Fig. 3 (a) shows the original captured video, where the captured persons are located at the edges of the video and their details cannot be observed. After processing by the person tracking display method in the embodiment of the application, the video shown in fig. 3 (b) is output: the output video contains all the persons in the original video and retains their details as far as possible, improving the user's video experience.
For another example, when a plurality of cameras shoot at the same time, fig. 4 is another schematic diagram of the application effect of the person tracking display method in the embodiment of the application. Fig. 4 (a) shows the original videos respectively shot by two cameras, where a person is located at the boundary between the two camera views, so the person's actions cannot be completely observed. After processing by the person tracking display method in the embodiment of the application, the video shown in fig. 4 (b) is output: the output video contains all the persons in the videos shot by the two original cameras, the persons' actions can be observed clearly and completely, and the viewing angle of the video is enlarged.
An exemplary electronic device 100 provided by an embodiment of the present application is first described below.
Fig. 5 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application.
The embodiment will be specifically described below taking the electronic device 100 as an example. It should be understood that electronic device 100 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be the neural center and command center of the electronic device 100. The controller can generate operation control signals according to the instruction operation codes and the timing signals, to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or used cyclically. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (inter-integrated circuit, I2C) interface, an inter-integrated circuit sound (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver/transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (general-purpose input/output, GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (serial data line, SDA) and a serial clock line (serial clock line, SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, charger, flash, camera 193, etc., respectively, through different I2C bus interfaces. For example, the processor 110 may couple the touch sensor 180K through an I2C interface, such that the processor 110 communicates with the touch sensor 180K through an I2C bus interface, to implement a touch function of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering a call through the bluetooth headset.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement a function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example, the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement bluetooth functions. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.
The MIPI interface may be used to connect the processor 110 to peripheral devices such as a display 194, a camera 193, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the photographing functions of electronic device 100. The processor 110 and the display 194 communicate via a DSI interface to implement the display functionality of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.
The SIM interface may be used to communicate with the SIM card interface 195 to perform functions of transferring data to or reading data from the SIM card.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flex light-emitting diode, FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diode, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, and the like.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. Thus, the electronic device 100 may play or record video in a variety of encoding formats, such as moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The internal memory 121 may be used to store computer executable program code including instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an operating system, an application required for at least one function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.), and the like. The storage data area may store data created during use of the electronic device 100 (e.g., face information template data, fingerprint information templates, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. A capacitive pressure sensor may comprise at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation intensities, may correspond to different operation instructions. For example, when a touch operation with a touch operation intensity smaller than a first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold is applied to the short message application icon, an instruction to create a new short message is executed.
The touch sensor 180K, also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is for detecting a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a different location than the display 194.
In the embodiment of the present application, the camera 193 in the electronic device 100 may be triggered to record video by a user operation collected by the pressure sensor 180A and/or the touch sensor 180K. After the processor 110 invokes the operation instructions stored in the internal memory 121 to process the recorded video, the processed video may be displayed on the display 194.
Fig. 6 is a software configuration block diagram of the electronic device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each with distinct roles and divisions of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime (Android runtime) and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in fig. 6, the application package may include applications (also referred to as apps) such as a person tracking display module, camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short messages, and so on.
In the embodiment of the application, after the camera application is started, the person tracking display module in the application program package can be called, so that the person tracking display method in the embodiment of the application is executed, and the person is tracked and displayed in a video scene.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 6, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100. Such as the management of call status (including on, hung-up, etc.).
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify that a download is complete, to give a message alert, and so on. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or present notifications on the screen in the form of a dialog interface. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light blinks.
The Android runtime (Android runtime) includes a core library and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core library comprises two parts: one part is the functions that the Java language needs to call, and the other part is the Android core library.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. Such as surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (e.g., openGL ES), two-dimensional graphics engine (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of two-dimensional (2D) and three-dimensional (3D) layers for multiple applications.
Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio and video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, etc.
The three-dimensional graphic processing library is used for realizing 3D graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver, a sensor driver and a virtual card driver.
The workflow of the electronic device 100 software and hardware is illustrated below in connection with capturing a photo scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the time stamp of the touch operation). The raw input event is stored at the kernel layer. The application framework layer acquires the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation being a click operation on the control of the camera application icon as an example: the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera 193. In the embodiment of the present application, the camera application may call the person tracking display module in the application package to process the video captured by the camera 193, and output the processed video to the display 194.
The following describes the person tracking display method in the embodiment of the present application with reference to the software and hardware architecture of the electronic device 100 described above. Depending on the number of shooting targets and the mode selected by the user, the person tracking display method has the following cases, which are described separately below:
1. The shooting target is a single target.
Fig. 7 is a schematic diagram of an exemplary video capturing interface of an electronic device according to an embodiment of the present application.
The video capture interface 700 may be the video capture interface displayed after a user opens the camera application and clicks on the video recording control. It may also be the video capture interface displayed after a video chat function is started in another application in the electronic device 100, such as a video conference application or a chat application, which is not limited herein.
The video shooting interface 700 may include a shooting screen display area 701, a setting control 702, and a recording control 703.
The photographed-screen display area 701 is used to display a screen photographed by the camera 193 in the electronic apparatus 100.
The setting control 702 is used to trigger the display of a shooting setting interface.
The recording control 703 is used to control the start, pause, and stop of video recording.
When electronic device 100 receives a user click on setting control 702 in fig. 7, electronic device 100 may display a capture setting interface. Fig. 8 is a schematic diagram illustrating an exemplary shooting setting interface according to an embodiment of the present application.
The shooting setting interface 800 may include a person tracking control 801, a distortion correction control 802, a protagonist mode control 803, and a video stitching control 804. It will be appreciated that in practical applications, the shooting setting interface 800 may further include many other functional controls, such as a portrait beautifying control, a time-lapse shooting control, etc., which are not limited herein.
Wherein the person tracking control 801 is used to trigger the start of the person tracking function. After the person tracking function is started, the electronic device 100 may execute the person tracking display method in the embodiment of the present application on the video stream captured by the camera 193.
The distortion correction control 802 is used to trigger the start of the picture distortion correction function. After the picture distortion correction function is started, pictures in the video stream undergo distortion correction before being sent to the display screen 194 for display, so that the output pictures are more stable and clearer.
The protagonist mode control 803 is used to trigger the start of the protagonist mode. In the protagonist mode, when the shooting target is a plurality of persons, one of the persons is determined as the protagonist and displayed in the center region of the screen.
The video stitching control 804 is used to trigger the video stream stitching function. After the video stream stitching function is started, the electronic device 100 can process and display the video streams captured by a plurality of cameras 193 together as one video stream.
Assume the current shooting target is a single target, and the user starts the person tracking function by clicking the person tracking control 801 in the shooting setting interface 800 shown in fig. 8. Then, during video capturing, the electronic device 100 performs target detection on the frames in the currently captured video stream, and determines the number of targets and the positions of the targets in the frame.
Fig. 9 is a schematic diagram illustrating exemplary target detection in the person tracking display method according to the embodiment of the present application. The electronic device 100 may detect a target in the frame and determine the coordinates (Xm, Ym) of the center point of the target in the frame.
It will be appreciated that the initiation of the person tracking function by clicking the person tracking control 801 is merely an example, and in practical applications, there may be many ways to initiate the person tracking function, for example, the electronic device may default to initiate the person tracking function, or may control to initiate the person tracking function by a preset gesture operation instruction, which is not limited herein.
It will be appreciated that fig. 9 is merely one example of determining the position of the target by target detection; in practical applications, there may be many different ways to represent the position of the target in the frame. For example, if an XY coordinate system is established on the frame picture, the position of the target in the frame picture can be represented by feeding back the maximum X value, the minimum X value, the maximum Y value, and the minimum Y value of the region where the target is located. A bounding box can also be used to frame the region where the target is located, and the position of the target in the frame picture can be represented by feeding back the coordinates of the bounding box. This is not limited herein.
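As a minimal sketch (with hypothetical names, not from the patent), the center point (Xm, Ym) fed back for a target can be derived from the extreme X/Y values of the region where the target is located:

```python
def center_from_bbox(x_min, x_max, y_min, y_max):
    """Return the center point (Xm, Ym) of a detected target, given the
    maximum and minimum X/Y values of the region the detector feeds back.
    Illustrative helper only."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

# A target whose region spans x in [100, 300] and y in [50, 250].
xm, ym = center_from_bbox(100, 300, 50, 250)
```

The two position representations described above (extreme values and bounding box) are thus interchangeable with the center-point representation of fig. 9.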
There are a wide variety of options for the artificial intelligence neural network used for target detection, such as the Faster R-CNN algorithm, the R-FCN algorithm, and the SSD algorithm. Preferably, in the embodiment of the application, a trained SSD algorithm for detecting persons can be used for target detection.
It will be appreciated that the target detection of the frame is mainly performed by the processor 110 in the electronic device 100.
Optionally, in the embodiment of the present application, before performing target detection on a frame in the currently captured video stream, the electronic device 100 may first downsample the frame to reduce its resolution, and then perform target detection on the downsampled frame, thereby reducing the computational load of target detection on the frame.
When the electronic device 100 performs target detection and determines that there is only one target in the frame, the electronic device may determine a clipping region according to the position of the target in the frame, where the clipping region is centered on the target and is smaller than the picture range of the original frame. Fig. 10 is a schematic diagram illustrating an exemplary determination of a clipping region in the person tracking display method according to an embodiment of the present application. The electronic device 100 may determine, as the clipping region, a picture range that is centered on the target and whose X-axis and Y-axis lengths are 1/2 (or another fraction) of those of the original frame. For example, in fig. 10, if the frame has an X-axis length of X1 and a Y-axis length of Y1, the electronic device can determine that the region 1001 centered on the target, with an X-axis length of X1/2 and a Y-axis length of Y1/2, is the clipping region.
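A minimal sketch of this clipping-region rule, assuming a half-size window centered on the target; the clamping behavior (shifting the window so it stays inside the frame when the target is near an edge) is an added assumption for illustration, not specified by the patent:

```python
def clip_region(xm, ym, frame_w, frame_h, scale=0.5):
    """Compute a clipping region centered on the target center (xm, ym),
    whose X/Y lengths are `scale` times those of the original frame, shifted
    so it stays inside the frame. Names and clamping are illustrative."""
    w, h = frame_w * scale, frame_h * scale
    left = min(max(xm - w / 2, 0), frame_w - w)   # clamp to frame bounds
    top = min(max(ym - h / 2, 0), frame_h - h)
    return (left, top, left + w, top + h)

# Target near the right edge of a 1920x1080 frame: the region is clamped.
edge_region = clip_region(1900, 540, 1920, 1080)
# Target at the frame center: the region is exactly centered on it.
center_region = clip_region(960, 540, 1920, 1080)
```

With `scale=0.5` the clipped picture covers one quarter of the original area, which is why the person appears roughly twice as large after output.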
It can be appreciated that the electronic device 100 may automatically determine the clipping region according to a preset composition rule and according to the position of the target in the frame. The preset composition rule may be a factory preset rule, a rule which is added by a user independently, or a trained artificial intelligent model, which is not limited herein.
The electronic device 100 can clip the frame according to the clipping region and output it. Fig. 11 is a schematic diagram showing an exemplary display in the person tracking display method according to an embodiment of the present application. Compared with the frame in the original video stream, in the clipped frame displayed on the display screen 194 the person is located at the center of the display screen 194 and appears larger than in the original frame, so more person details are shown.
When clipping, it is the frame in the captured video stream that is clipped. If the frame was downsampled before target detection to reduce computational load, the coordinates of the clipping region in the captured frame need to be derived from its coordinates in the downsampled frame; clipping is then performed on the captured frame according to that region.
Since the resolution of the clipped frame is lower than that of the original frame, upsampling is required to restore the clipped frame to the original frame's resolution.
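The coordinate mapping described above reduces to a simple scaling: if the frame was uniformly downsampled by a factor before detection, the crop rectangle found on the small frame is multiplied back up before cropping the full-resolution frame. A minimal sketch, assuming uniform downsampling on both axes:

```python
def map_crop_to_original(crop_ds, ds_factor):
    """Scale a crop rectangle (left, top, width, height) found on the
    downsampled frame back to the coordinates of the full-resolution
    captured frame. Assumes the same downsampling factor on both axes."""
    left, top, w, h = crop_ds
    return (left * ds_factor, top * ds_factor, w * ds_factor, h * ds_factor)
```

For example, a crop found on a 4x-downsampled frame is scaled by 4 before being applied to the captured frame; the crop is then upsampled back to the original resolution for display.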
2. There are multiple shooting targets, and the user has turned on protagonist mode.
If there are currently multiple shooting targets, the user turns on the person tracking function by clicking the person tracking control 801 in the shooting setting interface 800 shown in fig. 8, and turns on protagonist mode by clicking the protagonist mode control 803 in the same interface. Then, during video capture, the electronic device 100 performs target detection on the frame in the currently captured video stream and determines the number of targets and their positions in the frame.
It will be appreciated that turning on protagonist mode by clicking the protagonist mode control 803 is only an example; in practice there are many ways to turn it on. For example, the electronic device may turn on protagonist mode by default, or turn it on via a preset gesture operation instruction, which is not limited here.
Fig. 12 is a schematic diagram illustrating another exemplary target detection in the person tracking display method according to an embodiment of the present application. Through target detection, the electronic device 100 can determine that the number of targets is 3 and obtain the coordinates of each target in the frame.
Protagonist mode of the electronic device 100 is on. In protagonist mode, when there are multiple shooting targets, the electronic device 100 displays the protagonist among the multiple targets in the central area of the frame.
In the case where there are multiple shooting targets and the user has turned on protagonist mode, the electronic device 100's processing of the frame falls into 3 cases, described in turn below:
Case 1: The electronic device has not yet determined the protagonist among the multiple targets.
To determine a protagonist among the multiple targets, the electronic device determines through pose analysis whether any of the targets meets the preset protagonist condition. Specifically, the electronic device may perform pose analysis on the multiple targets in the frame, and when it determines that a target has maintained the preset protagonist action for the preset protagonist duration Tz, it determines that target to be the protagonist among the multiple targets.
The preset protagonist action may be a factory preset or set by the user, which is not limited here. For example, it may be an OK gesture, or a pose in which the elbow keypoint is recognized as higher than the shoulder keypoint and the wrist-elbow-shoulder angle exceeds a certain value; many other actions or gestures could serve as the preset protagonist action, which is not limited here.
It can be understood that, since determining the protagonist among multiple targets by pose analysis takes at least the preset protagonist duration Tz, and a target may only perform the preset protagonist action some time after shooting begins, no protagonist can be determined for a period at the start of video capture. At this time, the electronic device 100 may determine the clipping region according to the number of targets found by target detection and their positions in the frame. The clipping region is smaller than the range of the original frame and covers all detected targets. Fig. 13 is another exemplary schematic diagram illustrating determination of a clipping region in the person tracking display method according to an embodiment of the present application. The electronic device 100 determines the region 1301 covering the multiple targets as the clipping region.
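One plausible reading of region 1301 is the smallest rectangle covering every detected target box, padded slightly and clamped to the frame. The box format (left, top, right, bottom) and the margin fraction are illustrative assumptions:

```python
def multi_target_crop(boxes, frame_w, frame_h, margin=0.05):
    """Smallest region covering all detected target boxes, padded by a
    small margin and clamped to the frame. Boxes are (left, top, right,
    bottom) tuples; returns the padded union in the same format."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    # Pad proportionally to the union's size so targets are not flush
    # against the crop edges, then clamp to the frame.
    pad_x = (right - left) * margin
    pad_y = (bottom - top) * margin
    return (max(left - pad_x, 0), max(top - pad_y, 0),
            min(right + pad_x, frame_w), min(bottom + pad_y, frame_h))
```

As long as the union of target boxes is smaller than the frame, the result is a crop smaller than the original frame that still contains every detected person.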
The electronic device 100 can clip the frame according to the clipping region and output it. Fig. 14 is a schematic diagram showing another exemplary display in the person tracking display method according to an embodiment of the present application. Compared with the frame in the original video stream, in the clipped frame displayed on the display screen 194 each person appears larger than in the original frame, and all targets detected in the original frame are displayed in full, so more person details are shown.
Case 2: The electronic device determines the protagonist among the multiple targets in the latest frame.
If one of the multiple targets in the latest frame has maintained the preset protagonist action throughout the historical frames, and its hold has just reached the preset protagonist duration Tz at the latest frame, the electronic device 100 may determine that target to be the protagonist among the multiple targets. Fig. 15 is a schematic diagram illustrating exemplary protagonist determination in the person tracking display method according to an embodiment of the present application. Assume the preset protagonist action is a specific one-hand pose. As time T passes, the captured video stream continuously generates new frames. In the frames generated at times T1 and T2, the electronic device 100 does not detect any target maintaining the preset protagonist action; at time T3, the electronic device 100 determines through pose analysis that the middle one of the 3 targets is maintaining the preset protagonist action. The electronic device 100 continues pose detection on each subsequently generated frame, and in the latest frame generated at time T13 determines that the middle target's hold of the preset protagonist action has reached the preset protagonist duration. The electronic device 100 then determines the middle target to be the protagonist among the multiple targets. After determining that a target in the frame is the protagonist, the electronic device 100 may record that target's feature information.
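The hold-for-Tz logic above can be sketched as a per-target frame counter that resets whenever the pose is broken and fires once it has accumulated Tz worth of frames. The uninterrupted-hold assumption and the data shapes are illustrative:

```python
def update_protagonist(hold_frames, poses, fps, tz_seconds):
    """Per-frame update of pose-hold counters.

    `poses` maps target id -> True if that target holds the preset
    protagonist pose in this frame; `hold_frames` carries the counts
    between frames and is mutated in place. Returns the id of a target
    whose uninterrupted hold has reached Tz, else None."""
    protagonist = None
    for tid, holding in poses.items():
        # Reset the counter the moment the pose is broken.
        hold_frames[tid] = hold_frames.get(tid, 0) + 1 if holding else 0
        if hold_frames[tid] >= tz_seconds * fps:
            protagonist = tid
    return protagonist
```

Called once per frame, this returns None during the early frames (as at times T1-T12 in fig. 15) and first reports the protagonist on the frame where the hold reaches Tz (time T13).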
After determining the protagonist among the multiple targets in the frame, the electronic device 100 may determine the clipping region according to the number of targets, their positions in the frame, and the protagonist's position in the frame. The clipping region is smaller than the range of the original frame, covers all detected targets, and is centered on the protagonist. Fig. 16 is another exemplary schematic diagram illustrating determination of a clipping region in the person tracking display method according to an embodiment of the present application. The electronic device 100 determines the region 1601, which covers the multiple targets and is centered on the protagonist, as the clipping region.
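A region that is centered on the protagonist yet still covers all targets can be sketched by taking, around the protagonist's center, the largest horizontal and vertical extents needed to reach any box edge. This symmetric-extent construction is one illustrative reading of "centered on the protagonist", not the patent's exact rule:

```python
def protagonist_centered_crop(boxes, protagonist_box, frame_w, frame_h):
    """Region centered on the protagonist's center that covers every
    detected box, clamped to the frame. Boxes are (left, top, right,
    bottom) tuples; returns the crop in the same format."""
    pcx = (protagonist_box[0] + protagonist_box[2]) / 2
    pcy = (protagonist_box[1] + protagonist_box[3]) / 2
    # Half-extents: distance from the protagonist's center to the
    # farthest box edge, so symmetric extents cover everything.
    half_w = max(max(abs(pcx - b[0]), abs(b[2] - pcx)) for b in boxes)
    half_h = max(max(abs(pcy - b[1]), abs(b[3] - pcy)) for b in boxes)
    return (max(pcx - half_w, 0), max(pcy - half_h, 0),
            min(pcx + half_w, frame_w), min(pcy + half_h, frame_h))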
The electronic device 100 can clip the frame according to the clipping region and output it. Fig. 17 is a schematic diagram showing another exemplary display in the person tracking display method according to an embodiment of the present application. Compared with the frame in the original video stream, the clipped frame displayed on the display screen 194 is centered on the protagonist in the middle; each person appears larger than in the original frame, all targets detected in the original frame are displayed in full with more person details, and the main person can be identified at a glance, improving the user experience in video scenarios.
Case 3: The electronic device has determined the protagonist from the historical frames and tracks the protagonist in the latest frame.
If the electronic device 100 has already determined the protagonist in the historical frames, then in a newly generated frame it does not need to determine the protagonist by pose analysis; it can track the protagonist in the latest frame according to the protagonist's feature information in the historical frames and its position in the previous frame.
Specifically, the electronic device 100 may first determine as candidate targets those targets in the latest frame whose positions are within a preset distance threshold s of the protagonist's position in the previous frame. If there is one candidate, the electronic device 100 may determine that candidate to be the protagonist in the latest frame. If there are several candidates, the electronic device 100 may determine the candidate whose feature information is closest to the protagonist's feature information in the historical frames to be the protagonist in the latest frame.
Fig. 18 is a schematic diagram illustrating exemplary candidate-target determination in the person tracking display method according to an embodiment of the present application. In the historical frames generated at times T1 to T13, the electronic device 100 has already determined the protagonist among the multiple targets. In the frame preceding the latest frame generated at time T14, that is, the frame generated at time T13, the coordinates of the protagonist's center point are (Xz, Yz). It will be appreciated that, due to factors such as camera movement or target movement, a target's position in the frame generated at time T14 may not be exactly the same as its position in the frame generated at time T13. As shown in fig. 18, in the frame generated at time T14 there is only one candidate target within the preset distance threshold s of the protagonist's position in the previous frame. When the electronic device 100 determines that there is only one candidate in the latest frame, it may determine that candidate to be the protagonist in the latest frame, and may record the protagonist's feature information.
Fig. 19 is another exemplary schematic diagram illustrating candidate-target determination in the person tracking display method according to an embodiment of the present application. At time T13, the coordinates of the protagonist's center point in the frame are (Xz, Yz). In the frame generated at time T14, there are 2 candidate targets within the preset distance threshold s of the protagonist's position in the previous frame. In this case, the electronic device 100 compares the protagonist feature information it recorded when determining the protagonist in the historical frames with the feature information of the 2 candidates, and determines the protagonist in the frame generated at time T14 by this comparison.
Specifically, in the historical frames, the electronic device may record the protagonist's feature information each time the protagonist is determined. After multiple candidates are determined in the latest frame, the electronic device can compare the recorded protagonist feature information with each candidate's feature information by cosine distance, and lock onto the closest candidate as the protagonist.
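Candidate selection and cosine-distance matching might be sketched as follows. The feature vectors, the distance threshold, and the tie-breaking are illustrative assumptions; in practice the feature vectors would come from a trained re-identification network:

```python
import math

def pick_protagonist(targets, prev_pos, prev_feat, dist_threshold):
    """targets: list of (center_xy, feature_vector) for the latest frame.

    Candidates lie within `dist_threshold` of the protagonist's center
    in the previous frame; among several candidates, the one whose
    feature vector has the smallest cosine distance to the recorded
    protagonist features wins. Returns the chosen target or None."""
    def cos_dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    candidates = [t for t in targets
                  if math.dist(t[0], prev_pos) <= dist_threshold]
    if not candidates:
        return None
    # A single candidate is trivially the minimum, matching the
    # one-candidate shortcut described in the text.
    return min(candidates, key=lambda t: cos_dist(t[1], prev_feat))
```

With one candidate this degenerates to the direct assignment of case fig. 18; with several it performs the cosine-distance comparison of fig. 19.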
After determining the protagonist in the latest frame, the electronic device 100 may also record the protagonist's position information and feature information, to facilitate tracking the protagonist in subsequently generated frames.
After determining the protagonist among the multiple targets in the frame, the electronic device 100 may determine the clipping region according to the number of targets, their positions in the frame, and the protagonist's position in the frame, then clip and output. For details, see figs. 16 and 17; the determination of the clipping region and the clipping and output are similar to those described in case 2 and are not repeated here.
3. There are multiple shooting targets, and the user has not turned on protagonist mode.
If there are currently multiple shooting targets, the user turns on the person tracking function by clicking the person tracking control 801 in the shooting setting interface 800 shown in fig. 8, but does not click the protagonist mode control 803 in that interface, so protagonist mode is not turned on. Then, during video capture, the electronic device 100 performs target detection on the frame in the currently captured video stream and determines the number of targets and their positions in the frame.
It then determines the clipping region according to the number of targets and their positions in the frame, clips, and outputs. For details, see figs. 13 and 14; the determination of the clipping region and the clipping and output are similar to those described in case 1 and are not repeated here.
It will be appreciated that, as in the single-target case, when there are multiple shooting targets the electronic device may also downsample before target detection to reduce the frame resolution and thus the processing load, and upsample after clipping to restore the clipped frame to the original frame's resolution, which is not repeated here.
In the embodiment of the present application, in a video shooting scenario, after the electronic device 100 clips the processed frame according to the clipping region, it may output the frame to the display screen 194 of the electronic device 100 for display. In a video conference or video chat scenario, the electronic device 100 may instead output the processed frame through the mobile communication module 150 and/or the wireless communication module 160 to the electronic device at the other end of the communication, so that the frame is displayed on that device's display screen. The specific output destination may be determined by the actual usage scenario and is not limited here.
In some embodiments of the present application, an already-recorded video may also be processed with this person tracking display method: in the order of the frames' generation times, the frame currently being processed is taken as the latest frame, and the frames before it are taken as the historical frames. The specific processing is the same as the person tracking display method described above and is not repeated here.
While the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not depart from the scope of the embodiments of this application.
As used in the above embodiments, the term "when" may be interpreted as "if", "after", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "when determining" or "if (a stated condition or event) is detected" may be interpreted as "if determined", "in response to determining", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)", depending on the context.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., solid-state drive), etc.
Those of ordinary skill in the art will appreciate that all or part of the processes of the above method embodiments may be accomplished by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and when executed may include the processes of the above method embodiments. The storage medium includes ROM, random-access memory (RAM), magnetic disks, optical discs, and various other media capable of storing program code.

Claims (21)

1.一种人物追踪显示方法,其特征在于,包括:1. A method for tracking and displaying a person, comprising: 电子设备响应于第一用户操作,显示视频拍摄界面,所述视频拍摄界面包括通过摄像头拍摄的画面;The electronic device displays a video shooting interface in response to the first user operation, wherein the video shooting interface includes a picture shot by a camera; 所述电子设备在开启人物追踪控件和主角模式后,对视频流中的最新帧画面进行目标检测,得到目标的数量和目标在所述最新帧画面中的位置,所述目标为所述最新帧画面中的人物;After turning on the person tracking control and the protagonist mode, the electronic device performs target detection on the latest frame in the video stream to obtain the number of targets and the positions of the targets in the latest frame, wherein the targets are the people in the latest frame; 所述电子设备根据所述目标的数量和目标在所述最新帧画面中的位置,确定裁剪区域,所述裁剪区域覆盖所述目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围;The electronic device determines a cropping area according to the number of the targets and the positions of the targets in the latest frame, wherein the cropping area covers the positions of the targets in the latest frame and is smaller than the image range of the latest frame; 所述电子设备根据所述裁剪区域,显示裁剪后的帧画面;The electronic device displays the cropped frame image according to the cropped area; 所述电子设备根据所述目标的数量和目标在所述最新帧画面中的位置,确定裁剪区域,具体包括:当所述目标的数量为多个时,若在历史帧画面中未确定出多个目标中的主角,则所述电子设备确定所述多个目标中维持预设主角动作达到预设主角时长的目标为所述多个目标中的主角,所述历史帧画面为所述视频流中在所述最新帧画面之前的帧画面;所述电子设备根据所述多个目标在所述最新帧画面中的位置,以及所述主角在所述最新帧画面中的位置,确定裁剪区域。The electronic device determines the cropping area according to the number of the targets and the positions of the targets in the latest frame, specifically including: when the number of the targets is multiple, if the protagonist among the multiple targets is not determined in the historical frame, the electronic device determines that the target among the multiple targets that maintains a preset protagonist action for a preset protagonist duration is the protagonist among the multiple targets, and the historical frame is a frame in the video stream before the latest frame; the electronic device determines the cropping area 
according to the positions of the multiple targets in the latest frame and the position of the protagonist in the latest frame. 2.根据权利要求1所述的方法,其特征在于,所述电子设备根据所述目标的数量和目标在所述最新帧画面中的位置,确定裁剪区域,具体还包括:2. The method according to claim 1, wherein the electronic device determines the cropping area according to the number of the objects and the positions of the objects in the latest frame, and specifically further comprises: 当所述目标的数量为1个时,所述电子设备根据1个目标在所述最新帧画面中的位置,确定所述裁剪区域;所述裁剪区域以所述1个目标在所述最新帧画面中的位置为中心,覆盖所述1个目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围。When the number of the targets is one, the electronic device determines the cropping area according to the position of the target in the latest frame; the cropping area is centered on the position of the target in the latest frame, covers the position of the target in the latest frame, and is smaller than the screen range of the latest frame. 3.根据权利要求1所述的方法,其特征在于,当所述多个目标中没有主角时,所述裁剪区域覆盖所述多个目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围;当所述多个目标中有主角时,所述裁剪区域以所述主角在所述最新帧画面中的位置为中心,覆盖所述多个目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围。3. The method according to claim 1 is characterized in that, when there is no protagonist among the multiple targets, the cropping area covers the positions of the multiple targets in the latest frame and is smaller than the screen range of the latest frame; when there is a protagonist among the multiple targets, the cropping area is centered on the position of the protagonist in the latest frame, covers the positions of the multiple targets in the latest frame, and is smaller than the screen range of the latest frame. 4.根据权利要求3所述的方法,其特征在于,所述方法还包括:4. 
The method according to claim 3, characterized in that the method further comprises: 若在历史帧画面中已确定出所述多个目标中的主角,则所述电子设备根据所述历史帧画面中主角的位置和特征信息,跟踪确定所述多个目标中的主角。If the protagonist among the multiple targets has been determined in the historical frame, the electronic device tracks and determines the protagonist among the multiple targets according to the position and feature information of the protagonist in the historical frame. 5.根据权利要求4所述的方法,其特征在于,所述电子设备根据所述历史帧画面中主角的位置和特征信息,跟踪确定所述多个目标中的主角,具体包括:5. The method according to claim 4 is characterized in that the electronic device tracks and determines the protagonist among the multiple targets according to the position and feature information of the protagonist in the historical frame, specifically comprising: 所述电子设备确定所述最新帧画面中的候选目标,所述候选目标为所述多个目标中与上一个帧画面中主角的位置在预设距离阈值内的目标;The electronic device determines a candidate target in the latest frame, wherein the candidate target is a target among the multiple targets whose position is within a preset distance threshold from the main character in the previous frame; 当所述候选目标为一个时,所述电子设备确定所述候选目标为所述多个目标中的主角;When there is only one candidate target, the electronic device determines that the candidate target is a protagonist among the multiple targets; 当所述候选目标为多个时,所述电子设备确定多个候选目标中特征信息与历史帧画面中主角的特征信息最接近的候选目标为所述多个目标中的主角。When there are multiple candidate targets, the electronic device determines a candidate target whose feature information is closest to feature information of the protagonist in the historical frame as the protagonist among the multiple targets. 6.根据权利要求3至5中任一项所述的方法,其特征在于,所述方法还包括:6. The method according to any one of claims 3 to 5, characterized in that the method further comprises: 所述电子设备记录所述多个目标中的主角的特征信息。The electronic device records feature information of a protagonist among the multiple targets. 7.根据权利要求1至5中任一项所述的方法,其特征在于,所述电子设备对视频流中的最新帧画面进行目标检测,得到目标的数量和目标在所述最新帧画面中的位置,具体包括:7. 
The method according to any one of claims 1 to 5, characterized in that the electronic device performs target detection on the latest frame in the video stream to obtain the number of targets and the positions of the targets in the latest frame, specifically comprising: 所述电子设备对所述视频流中原始的最新帧画面进行下采样,得到所述最新帧画面,最新帧画面的分辨率小于原始的最新帧画面的分辨率;The electronic device downsamples the original latest frame in the video stream to obtain the latest frame, wherein the resolution of the latest frame is smaller than the resolution of the original latest frame; 所述电子设备对所述最新帧画面进行目标检测,得到目标的数量和目标在所述最新帧画面中的位置;The electronic device performs target detection on the latest frame to obtain the number of targets and the positions of the targets in the latest frame; 所述电子设备根据所述裁剪区域,显示裁剪后的帧画面,具体包括:The electronic device displays the cropped frame image according to the cropped area, specifically including: 所述电子设备按照所述最新帧画面中的裁剪区域,对所述原始的最新帧画面进行裁剪并上采样,得到裁剪后的帧画面,裁剪后的帧画面的分辨率等于所述原始的最新帧画面的分辨率;The electronic device crops and upsamples the original latest frame according to the cropped area in the latest frame to obtain a cropped frame, wherein the resolution of the cropped frame is equal to the resolution of the original latest frame; 所述电子设备显示所述裁剪后的帧画面。The electronic device displays the cropped frame image. 8.根据权利要求7所述的方法,其特征在于,所述电子设备输出所述裁剪后的帧画面的步骤之前,所述方法还包括:8. The method according to claim 7, characterized in that before the step of the electronic device outputting the cropped frame image, the method further comprises: 所述电子设备对所述裁剪后的帧画面进行畸变校正。The electronic device performs distortion correction on the cropped frame image. 9.根据权利要求1-5、8中任一项所述的方法,其特征在于,所述电子设备对视频流中的最新帧画面进行目标检测,得到目标的数量和目标在所述最新帧画面中的位置的步骤之前,所述方法还包括:9. 
The method according to any one of claims 1 to 5 and 8, characterized in that before the step of the electronic device detecting targets on the latest frame in the video stream to obtain the number of targets and the positions of the targets in the latest frame, the method further comprises: 所述电子设备将多个摄像头拍摄的视频流合并,得到所述视频流。The electronic device combines the video streams captured by multiple cameras to obtain the video stream. 10.一种电子设备,其特征在于,所述电子设备包括:摄像头、一个或多个处理器和存储器;10. An electronic device, characterized in that the electronic device comprises: a camera, one or more processors and a memory; 所述摄像头用于拍摄,得到视频流;The camera is used for shooting to obtain a video stream; 所述存储器与所述一个或多个处理器耦合,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,所述一个或多个处理器调用所述计算机指令以使得所述电子设备执行:The memory is coupled to the one or more processors, and the memory is used to store computer program codes, wherein the computer program codes include computer instructions, and the one or more processors call the computer instructions to cause the electronic device to execute: 响应于第一用户操作,通过显示屏显示视频拍摄界面,所述视频拍摄界面包括通过摄像头拍摄的画面;In response to the first user operation, displaying a video shooting interface through the display screen, wherein the video shooting interface includes a picture shot by the camera; 在开启人物追踪控件和主角模式后,对所述视频流中的最新帧画面进行目标检测,得到目标的数量和目标在所述最新帧画面中的位置,所述目标为所述最新帧画面中的人物;After turning on the person tracking control and the protagonist mode, performing target detection on the latest frame in the video stream to obtain the number of targets and the positions of the targets in the latest frame, wherein the targets are the people in the latest frame; 根据所述目标的数量和目标在所述最新帧画面中的位置,确定裁剪区域,所述裁剪区域覆盖所述目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围;Determine a cropping area according to the number of the targets and the positions of the targets in the latest frame, where the cropping area covers the positions of the targets in the latest frame and is smaller than the image range of the latest frame; 根据所述裁剪区域,通过所述显示屏显示裁剪后的帧画面;According to the 
cropping area, displaying the cropped frame image through the display screen; 所述根据所述目标的数量和目标在所述最新帧画面中的位置,确定裁剪区域,具体包括:当所述目标的数量为多个时,若在历史帧画面中未确定出多个目标中的主角,则确定所述多个目标中维持预设主角动作达到预设主角时长的目标为所述多个目标中的主角,所述历史帧画面为所述视频流中在所述最新帧画面之前的帧画面;根据所述多个目标在所述最新帧画面中的位置,以及所述主角在所述最新帧画面中的位置,确定裁剪区域。The method of determining the cropping area according to the number of targets and the positions of the targets in the latest frame specifically includes: when there are multiple targets, if the protagonist among the multiple targets is not determined in the historical frame, then determining the target among the multiple targets that maintains a preset protagonist action for a preset protagonist duration as the protagonist among the multiple targets, and the historical frame is a frame in the video stream before the latest frame; determining the cropping area according to the positions of the multiple targets in the latest frame and the position of the protagonist in the latest frame. 11.根据权利要求10所述的电子设备,其特征在于,所述一个或多个处理器,具体用于调用所述计算机指令以使得所述电子设备执行:11. The electronic device according to claim 10, wherein the one or more processors are specifically configured to call the computer instructions so that the electronic device executes: 当所述目标的数量为1个时,根据1个目标在所述最新帧画面中的位置,确定所述裁剪区域;所述裁剪区域以所述1个目标在所述最新帧画面中的位置为中心,覆盖所述1个目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围。When the number of the targets is 1, the cropping area is determined according to the position of the target in the latest frame; the cropping area is centered on the position of the target in the latest frame, covers the position of the target in the latest frame, and is smaller than the screen range of the latest frame. 12.根据权利要求10所述的电子设备,其特征在于,当所述多个目标中没有主角时,所述裁剪区域覆盖所述多个目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围;当所述多个目标中有主角时,所述裁剪区域以所述主角在所述最新帧画面中的位置为中心,覆盖所述多个目标在所述最新帧画面中的位置,且小于所述最新帧画面的画面范围。12. 
The electronic device according to claim 10 is characterized in that, when there is no protagonist among the multiple targets, the cropping area covers the positions of the multiple targets in the latest frame and is smaller than the screen range of the latest frame; when there is a protagonist among the multiple targets, the cropping area is centered on the position of the protagonist in the latest frame, covers the positions of the multiple targets in the latest frame, and is smaller than the screen range of the latest frame. 13.根据权利要求12所述的电子设备,其特征在于,所述一个或多个处理器,具体还用于调用所述计算机指令以使得所述电子设备执行:13. The electronic device according to claim 12, wherein the one or more processors are further configured to call the computer instruction to enable the electronic device to execute: 当在历史帧画面中已确定出所述多个目标中的主角时,根据所述历史帧画面中主角的位置和特征信息,跟踪确定所述多个目标中的主角。When the protagonist among the multiple targets has been determined in the historical frame, the protagonist among the multiple targets is tracked and determined according to the position and feature information of the protagonist in the historical frame. 14.根据权利要求13所述的电子设备,其特征在于,所述一个或多个处理器,具体用于调用所述计算机指令以使得所述电子设备执行:14. 
The electronic device according to claim 13, wherein the one or more processors are specifically configured to call the computer instructions so that the electronic device executes: 确定所述最新帧画面中的候选目标,所述候选目标为所述多个目标中与上一个帧画面中主角的位置在预设距离阈值内的目标;Determine a candidate target in the latest frame, wherein the candidate target is a target among the multiple targets whose position is within a preset distance threshold from the main character in the previous frame; 当所述候选目标为一个时,确定所述候选目标为所述多个目标中的主角;When there is only one candidate target, determining the candidate target as a protagonist among the multiple targets; 当所述候选目标为多个时,确定多个候选目标中特征信息与历史帧画面中主角的特征信息最接近的候选目标为所述多个目标中的主角。When there are multiple candidate targets, the candidate target whose feature information is closest to the feature information of the protagonist in the historical frame is determined as the protagonist among the multiple targets. 15.根据权利要求12或14所述的电子设备,其特征在于,所述一个或多个处理器,具体用于调用所述计算机指令以使得所述电子设备执行:15. The electronic device according to claim 12 or 14, wherein the one or more processors are specifically configured to call the computer instructions so that the electronic device executes: 记录所述多个目标中的主角的特征信息。The characteristic information of the protagonist among the multiple targets is recorded. 16.根据权利要求10或14中任一项所述的电子设备,其特征在于,所述一个或多个处理器,具体用于调用所述计算机指令以使得所述电子设备执行:16. 
The electronic device according to any one of claims 10 or 14, wherein the one or more processors are specifically configured to invoke the computer instructions so that the electronic device executes: downsampling the original latest frame in the video stream to obtain the latest frame, the resolution of the latest frame being lower than the resolution of the original latest frame; performing target detection on the latest frame to obtain the number of targets and the positions of the targets in the latest frame; cropping and upsampling the original latest frame according to the cropping area in the latest frame to obtain a cropped frame, the resolution of the cropped frame being equal to the resolution of the original latest frame; and displaying the cropped frame through the camera. 17. The electronic device according to claim 16, wherein the one or more processors are further configured to invoke the computer instructions so that the electronic device executes: performing distortion correction on the cropped frame. 18. The electronic device according to any one of claims 10-14 and 17, wherein the one or more processors are further configured to invoke the computer instructions so that the electronic device executes: merging video streams captured by multiple cameras to obtain the video stream. 19.
A chip system, applied to an electronic device, the chip system comprising one or more processors, wherein the processors are configured to invoke computer instructions so that the electronic device executes the method according to any one of claims 1-9. 20. A computer program product comprising instructions, wherein when the computer program product runs on an electronic device, the electronic device is caused to execute the method according to any one of claims 1-9. 21. A computer-readable storage medium comprising instructions, wherein when the instructions run on an electronic device, the electronic device is caused to execute the method according to any one of claims 1-9.
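The candidate rule of claims 13-14 can be sketched in a few lines: targets whose positions fall within a preset distance threshold of the previous protagonist become candidates, and among multiple candidates the one whose feature information best matches the historical protagonist wins. The sketch below is illustrative only and not from the patent; the Euclidean position distance, cosine feature similarity, and the `pos`/`feat` field names are all assumptions.

```python
import math

def track_protagonist(targets, prev_pos, prev_feat, dist_threshold):
    """Illustrative sketch of claims 13-14: select the protagonist among
    detected targets using a position gate plus a feature tie-break."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    # Candidates: targets within the preset distance threshold of the
    # protagonist's position in the previous frame.
    candidates = [t for t in targets
                  if dist(t["pos"], prev_pos) <= dist_threshold]
    if not candidates:
        return None                  # no target near the old protagonist
    if len(candidates) == 1:
        return candidates[0]         # a single candidate is the protagonist
    # Multiple candidates: pick the closest feature match to the
    # protagonist's feature information from the historical frame.
    return max(candidates, key=lambda t: cosine(t["feat"], prev_feat))
```

With two targets inside the gate and one far away, the feature comparison decides between the two nearby candidates, which mirrors the two branches of claim 14.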
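The cropping-area behavior of claim 12 (cover every target; when a protagonist exists, re-center on the protagonist while still covering every target) can likewise be sketched. This is a minimal geometric illustration, not the patent's implementation; the box format `(left, top, right, bottom)` and the `margin` padding are assumptions, and the final clamp to the frame is a simplification.

```python
def crop_region(boxes, frame_w, frame_h, protagonist=None, margin=20):
    """Illustrative sketch of claim 12: a cropping area that covers all
    target boxes and, when a protagonist box is given, stays centered
    on the protagonist while still containing every target."""
    left = min(b[0] for b in boxes) - margin
    top = min(b[1] for b in boxes) - margin
    right = max(b[2] for b in boxes) + margin
    bottom = max(b[3] for b in boxes) + margin
    if protagonist is not None:
        # Extend symmetrically about the protagonist's center so the crop
        # is centered on it yet still covers the bounding box of all targets.
        cx = (protagonist[0] + protagonist[2]) / 2
        cy = (protagonist[1] + protagonist[3]) / 2
        half_w = max(cx - left, right - cx)
        half_h = max(cy - top, bottom - cy)
        left, right = cx - half_w, cx + half_w
        top, bottom = cy - half_h, cy + half_h
    # Clamp to the frame; the claim requires the crop to be smaller than
    # the full frame, with the whole frame as the degenerate upper bound.
    return (max(0, left), max(0, top), min(frame_w, right), min(frame_h, bottom))
```

Note that clamping can shift the center when the protagonist is near a frame edge; the claims only fix the covering and centering behavior, not the edge handling.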
CN202010323761.6A 2020-04-22 2020-04-22 A person tracking display method and electronic device Active CN113536866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010323761.6A CN113536866B (en) 2020-04-22 2020-04-22 A person tracking display method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010323761.6A CN113536866B (en) 2020-04-22 2020-04-22 A person tracking display method and electronic device

Publications (2)

Publication Number Publication Date
CN113536866A CN113536866A (en) 2021-10-22
CN113536866B true CN113536866B (en) 2025-02-25

Family

ID=78094139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010323761.6A Active CN113536866B (en) 2020-04-22 2020-04-22 A person tracking display method and electronic device

Country Status (1)

Country Link
CN (1) CN113536866B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116342639A (en) * 2021-12-22 2023-06-27 华为技术有限公司 Image display method, electronic device and medium thereof
CN116112780B (en) * 2022-05-25 2023-12-01 荣耀终端有限公司 Video recording method and related device
CN116112782B (en) * 2022-05-25 2024-04-02 荣耀终端有限公司 Video recording method and related device
CN117177063B (en) * 2022-05-30 2024-10-15 荣耀终端有限公司 Shooting method and electronic device
CN117714903B (en) * 2024-02-06 2024-05-03 成都唐米科技有限公司 Video synthesis method and device based on follow-up shooting and electronic equipment
CN121217980A (en) * 2024-06-26 2025-12-26 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110876036A (en) * 2018-08-31 2020-03-10 腾讯数码(天津)有限公司 Video generation method and related device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003045045A2 (en) * 2001-11-23 2003-05-30 Vimatix Technologies Ltd. Encoding of geometric modeled images
US10645344B2 (en) * 2010-09-10 2020-05-05 Avigilon Analytics Corporation Video system with intelligent visual display
US11227179B2 (en) * 2019-09-27 2022-01-18 Intel Corporation Video tracking with deep Siamese networks and Bayesian optimization

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110876036A (en) * 2018-08-31 2020-03-10 腾讯数码(天津)有限公司 Video generation method and related device

Also Published As

Publication number Publication date
CN113536866A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113536866B (en) A person tracking display method and electronic device
CN113099146B (en) A video generation method, device and related equipment
US20250088732A1 (en) Method for capturing image in video recording and electronic device
CN111553846B (en) Super-resolution processing method and device
CN113747085A (en) Method and device for shooting video
CN115633255B (en) Video processing method and electronic equipment
CN113709355B (en) Sliding zoom shooting method and electronic equipment
CN114359335B (en) Target tracking method and electronic equipment
CN116152122B (en) Image processing method and electronic device
US20230224574A1 (en) Photographing method and apparatus
CN115643485B (en) Shooting method and electronic equipment
EP4258649A1 (en) Method for determining tracking target, and electronic device
JP7497937B2 (en) Facial expression editing method, electronic device, computer storage medium, and computer program
CN115802145A (en) Shooting method and electronic equipment
CN118349350B (en) Data processing method, electronic device and storage medium
CN116193243B (en) Shooting method and electronic equipment
CN116095405B (en) Video playing method and device
CN116055861B (en) Video editing method and electronic equipment
CN116680431B (en) A visual positioning method, electronic equipment, medium and product
CN119946416B (en) Image processing method, electronic device, and readable storage medium
CN115170441B (en) Image processing method and electronic equipment
CN117395496B (en) A shooting method and related equipment
EP4383191A1 (en) Display method and electronic device
CN117221743A (en) A photographing method and electronic device
CN119251721A (en) Method for identifying wonderful frames, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant