
CN116582637A - Screen splitting method of video conference picture and related equipment - Google Patents


Info

Publication number
CN116582637A
CN116582637A
Authority
CN
China
Prior art keywords
target object
sub-picture
target
speaking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310611376.5A
Other languages
Chinese (zh)
Inventor
王曌
刘冀洋
张才荣
李尚霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202310611376.5A
Publication of CN116582637A
Priority to PCT/CN2024/095004 (published as WO2024245105A1)
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/268 Signal distribution or switching

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure provides a screen-splitting method for a video conference picture and related equipment. The method includes: acquiring a target image captured by a capture unit; detecting target objects in the target image; dividing the video conference picture into at least two sub-pictures according to the target objects; and displaying the target objects from the target image correspondingly in the at least two sub-pictures.

Description

Screen Splitting Method for a Video Conference Picture and Related Equipment

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a screen-splitting method for a video conference picture and related equipment.

Background

In a video conference, face-to-face interaction is crucial. However, when several people sit in one conference room, most conference rooms have only a single camera and therefore produce a single video conference picture, which makes it hard to tell the participants apart, count them, or, in particular, identify the current speaker.

Summary of the Invention

The present disclosure proposes a screen-splitting method for a video conference picture and related equipment to solve, or partially solve, the above problems.

A first aspect of the present disclosure provides a screen-splitting method for a video conference picture, including:

acquiring a target image captured by a capture unit;

detecting a target object in the target image;

dividing the video conference picture into at least two sub-pictures according to the target object; and

displaying the target object from the target image correspondingly in the at least two sub-pictures.

A second aspect of the present disclosure provides a screen-splitting apparatus for a video conference picture, including:

an acquisition module configured to acquire a target image captured by a capture unit;

a detection module configured to detect a target object in the target image;

a division module configured to divide the video conference picture into at least two sub-pictures according to the target object; and

a display module configured to display the target object from the target image correspondingly in the at least two sub-pictures.

A third aspect of the present disclosure provides a computer device, including one or more processors, a memory, and one or more programs, where the one or more programs are stored in the memory and executed by the one or more processors, the programs including instructions for performing the method according to the first aspect.

A fourth aspect of the present disclosure provides a non-volatile computer-readable storage medium containing a computer program that, when executed by one or more processors, causes the processors to perform the method according to the first aspect.

A fifth aspect of the present disclosure provides a computer program product including computer program instructions that, when run on a computer, cause the computer to perform the method according to the first aspect.

In the screen-splitting method and related equipment provided by the embodiments of the present disclosure, a target object is detected in the target image captured by the capture unit, the video conference picture is then divided into at least two sub-pictures according to the target object, and each target object is displayed in its corresponding sub-picture. When the captured target image contains several participants, the video conference picture is thus split automatically, which helps conference room participants interact with other online participants during the video conference and improves the user experience.
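The flow just summarized (acquire, detect, divide, display) can be sketched as a minimal Python pipeline. The detector is a stand-in for the pre-trained model, and all function and field names here are illustrative, not part of the disclosure:

```python
from typing import Callable, List, Tuple

# A detected face box is (x, y, width, height) in image coordinates.
Box = Tuple[int, int, int, int]

def split_into_sub_pictures(boxes: List[Box]) -> List[dict]:
    """Divide the conference picture into at least two sub-pictures,
    one per detected target object, and assign each target object
    to its sub-picture for display."""
    count = max(2, len(boxes))  # "at least two sub-pictures"
    return [
        {"sub_picture": i, "target": boxes[i] if i < len(boxes) else None}
        for i in range(count)
    ]

def conference_pipeline(image, detect: Callable) -> List[dict]:
    """Step 1 supplies the captured image; step 2 runs the supplied
    detector; steps 3-4 split the picture and assign the targets."""
    boxes = detect(image)
    return split_into_sub_pictures(boxes)
```

Even a single detected participant still produces two sub-pictures, matching the "at least two" wording of the first aspect.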

Brief Description of the Drawings

To describe the technical solutions of the present disclosure or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1A is a schematic diagram of an exemplary system provided by an embodiment of the present disclosure.

FIG. 1B is a schematic diagram of a video conference picture captured in the scene shown in FIG. 1A.

FIG. 2 is a schematic diagram of an exemplary process according to an embodiment of the present disclosure.

FIG. 3A is a schematic diagram of an exemplary target image according to an embodiment of the present disclosure.

FIG. 3B is a schematic diagram of displaying detection boxes in the target image according to an embodiment of the present disclosure.

FIG. 3C is a schematic diagram of an exemplary video conference picture according to an embodiment of the present disclosure.

FIG. 3D is a schematic diagram of another exemplary video conference picture according to an embodiment of the present disclosure.

FIG. 3E is a schematic diagram of a split-screen mode according to an embodiment of the present disclosure.

FIG. 3F is a schematic diagram of another split-screen mode according to an embodiment of the present disclosure.

FIG. 3G is a schematic diagram of yet another exemplary video conference picture according to an embodiment of the present disclosure.

FIG. 3H is a schematic diagram of face key point detection.

FIG. 3I is a schematic diagram of a turned face.

FIG. 4 is a schematic diagram of an exemplary method provided by an embodiment of the present disclosure.

FIG. 5 is a schematic diagram of the hardware structure of an exemplary computer device provided by an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of an exemplary apparatus provided by an embodiment of the present disclosure.

Detailed Description

To make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present disclosure have the ordinary meanings understood by a person of ordinary skill in the art to which the present disclosure belongs. Words such as "first" and "second" used in the embodiments of the present disclosure do not denote any order, quantity, or importance, but merely distinguish different components. Words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connect" or "connected" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like indicate only relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.

FIG. 1A shows a schematic diagram of an exemplary system 100 provided by an embodiment of the present disclosure.

As shown in FIG. 1A, the system 100 may include at least one terminal device (for example, terminal devices 102 and 104), a server 106, and a database server 108. A medium providing communication links, such as a network, may connect the terminal devices 102 and 104 with the server 106 and the database server 108; the network may include various connection types, such as wired links, wireless links, or optical fiber cables.

Users 110A-110C may use the terminal device 102 to interact with the server 106 through the network to receive or send messages, and likewise user 112 may use the terminal device 104 to interact with the server 106 through the network. Various applications (apps) may be installed on the terminal devices 102 and 104, for example video conferencing, reading, video, social networking, and payment applications, web browsers, and instant messaging tools. In some embodiments, users 110A-110C and user 112 may use the video conferencing applications installed on the terminal devices 102 and 104, respectively, to access the video conferencing service provided by the server 106. The terminal devices 102 and 104 may capture images 1022 and 1042 through cameras (for example, cameras mounted on the terminal devices 102 and 104), capture live audio through microphones (for example, microphones mounted on the terminal devices 102 and 104), and upload them to the server 106, so that users 110A-110C and user 112 can watch each other's picture and hear each other's voice through the video conferencing applications on the terminal devices 102 and 104.

The terminal devices 102 and 104 here may be hardware or software. When they are hardware, they may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, e-book readers, MP3 players, laptops, and desktop computers (PCs). When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.

The server 106 may be a server that provides various services, for example a back-end server supporting the applications displayed on the terminal devices 102 and 104. The database server 108 may likewise be a database server providing various services. It can be understood that when the server 106 can implement the functions of the database server 108, the system 100 may omit the database server 108.

The server 106 and the database server 108 may likewise be hardware or software. As hardware, each may be implemented as a distributed cluster of servers or as a single server. As software, each may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.

It should be noted that the screen-splitting method for a video conference picture provided in the embodiments of the present application is generally performed by the terminal devices 102 and 104.

It should be understood that the numbers of terminal devices, users, servers, and database servers in FIG. 1A are merely illustrative. There may be any number of terminal devices, users, servers, and database servers as required by the implementation.

FIG. 1B shows a schematic diagram of a video conference picture 120 captured in the scene shown in FIG. 1A.

As shown in FIG. 1B, in the scene of FIG. 1A, the video conference picture 120 may display the pictures captured by the two cameras in two sub-pictures 1202 and 1204, respectively. The camera on the side of the terminal device 102 captures the entire conference room, so the sub-picture 1202 shows a picture containing the target objects (for example, facial images) corresponding to users 110A-110C.

It can be seen that because the position of the camera installed in the conference room is fixed, the target objects of users 110A-110C in the sub-picture 1202 differ in size and orientation depending on each user's position relative to the camera. Moreover, because the conference room camera is usually far from the seats, it is hard to see each participant clearly in the sub-picture 1202. A scheme that automatically splits the picture captured by a single camera is therefore of great practical significance for improving the interaction between the conference room and online participants in a video conference.

In view of this, an embodiment of the present disclosure provides a screen-splitting method for a video conference picture: a target object is detected in the target image captured by the capture unit, the video conference picture is divided into at least two sub-pictures according to the target object, and each target object in the target image is displayed in its corresponding sub-picture. When the captured target image contains several participants, the video conference picture is thus split automatically, which helps conference room participants interact with other online participants during the video conference and improves the user experience.

FIG. 2 shows a schematic flowchart of an exemplary method 200 provided by an embodiment of the present disclosure. The method 200 may be used to automatically split a video conference picture. Optionally, the method 200 may be implemented by the terminal devices 102 and 104 of FIG. 1A or by the server 106 of FIG. 1A. The following description assumes the server 106 implements the method 200.

As shown in FIG. 2, the method 200 may further include the following steps.

In step 202, the server 106 may acquire the target image captured by the capture unit. Taking FIG. 1A as an example, the capture unit may be a camera installed in the terminal devices 102 and 104, and the target images may be the images 1022 and 1042 captured by the cameras. After a camera captures an image, the terminal devices 102 and 104 may upload the captured image to the server 106 for processing.

FIG. 3A shows a schematic diagram of an exemplary target image 300 according to an embodiment of the present disclosure.

The target image 300 may be an image captured by any terminal device participating in the video conference in the system 100. As shown in FIG. 3A, the target image 300 may include target objects 302A-302C of multiple participants.

Next, in step 204, the server 106 may detect target objects in the target image 300. As an optional embodiment, object detection technology may be used to detect the target objects in the target image 300. Optionally, a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300 and obtain the detection boxes 304A-304C corresponding to the target objects 302A-302C, as shown in FIG. 3B.

Furthermore, since a participant's position in the conference room changes over time, fixing a detection box once it is obtained may fail to follow the face when the participant moves. Therefore, in some embodiments, object tracking technology may be used to follow the position of a target object in real time, with the detection box updated accordingly, so that the target object is tracked even when a participant moves during the video conference. Optionally, a pre-trained tracking model may be used to track the target objects 302A-302C in the target image 300. The tracking model may be a deep-learning-based model for real-time detection and tracking of face boxes and face key points; its structure includes, but is not limited to, various convolutional neural networks and various Transformer networks. In this way, through face detection and tracking in the image, each sub-picture of the split screen can follow the corresponding face on the screen in real time.
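A minimal sketch of the tracking idea, assuming detections arrive as (x1, y1, x2, y2) boxes each frame and associating them to existing tracks by greatest overlap. This IoU heuristic is a common stand-in; the disclosure itself only names CNN- and Transformer-based tracking models:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def associate(tracked, detections, threshold=0.3):
    """Greedily match each tracked face box to the new detection with
    the highest IoU, so each sub-picture keeps following its face."""
    assignments, used = {}, set()
    for tid, box in tracked.items():
        best, best_iou = None, threshold
        for i, det in enumerate(detections):
            if i in used:
                continue
            score = iou(box, det)
            if score > best_iou:
                best, best_iou = i, score
        if best is not None:
            assignments[tid] = detections[best]
            used.add(best)
    return assignments
```

Each frame, the updated box for track `tid` replaces the old one, and the crop shown in that track's sub-picture moves with it.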

In some embodiments, after the target objects in the target image 300 are detected, the method may proceed to step 206, in which the server 106 may determine a split-screen layout directly according to the target objects and then divide the video conference picture based on that layout.

FIG. 3C shows a schematic diagram of an exemplary video conference picture 310 according to an embodiment of the present disclosure. As shown in FIG. 3C, the video conference picture 310 is divided into multiple sub-pictures. Specifically, in some embodiments, since a video conference scene involves multi-party interaction, the video conference picture may be divided according to the total number of target objects in the target images captured by all terminal devices currently participating in the conference. For example, taking the scene of FIG. 1A, the number of sub-pictures may be determined from the total number of target objects in the images captured by the terminal devices 102 and 104, which is four in this example. Once the number of sub-pictures is determined, the split-screen layout can be determined, and many layouts are possible. For example, to preserve the basic split-screen layout of an existing video conference, the video conference picture 310 may first be divided into n sub-pictures according to the number n of terminal devices, with each terminal device corresponding to one of the n sub-pictures.
Next, each terminal device's sub-picture is further divided according to the number of target objects in the picture that device captures, yielding the split-screen layout. As shown in FIG. 3C, the left and right sub-pictures correspond to the images captured by the two terminal devices 102 and 104, respectively. The left picture further includes three sub-pictures corresponding to the target objects 302A-302C, while the right sub-picture displays the picture captured by the terminal device 104, which may include the target object 312. Together these form a complete video conference picture 310. Moreover, because the picture 310 divides the sub-pictures corresponding to different terminal devices (for example, into equal sizes), users can tell from the layout how many terminals are participating in the conference.
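Assuming normalized screen coordinates, the per-terminal layout just described can be sketched as follows; the function and field names are illustrative, not taken from the disclosure:

```python
def plan_layout(objects_per_terminal):
    """Sketch of the first layout strategy: split the picture into
    n equal regions (one per terminal), then further split each
    region by that terminal's detected target count."""
    n = len(objects_per_terminal)
    region_width = 1.0 / n  # each terminal gets an equal share of the width
    plan = []
    for i, count in enumerate(objects_per_terminal):
        plan.append({
            "terminal": i,
            "region": (i * region_width, region_width),  # (x offset, width)
            "sub_pictures": max(1, count),
        })
    return plan
```

For the FIG. 3C scene, `plan_layout([3, 1])` splits the screen in half, with the left half subdivided into three sub-pictures and the right half left whole.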

It can be understood that besides the layout of the foregoing embodiment, other split-screen layouts are possible, for example dividing the sub-pictures directly according to the total number of target objects. FIG. 3D shows a schematic diagram of another exemplary video conference picture 320 according to an embodiment of the present disclosure. As shown in FIG. 3D, the video conference picture 320 is divided, according to the total number of target objects, into four equal-sized sub-pictures corresponding to the target objects 302A-302C and the target object 312, respectively. In this way, every participant, whether one of several people in a conference room or someone joining alone, occupies a sub-picture of the same size as everyone else's, so that each participant can interact clearly with every other participant.

In some embodiments, the layout may also vary with the number of target objects: for example, the picture may be divided into a square array of sub-pictures (for example, N×N sub-pictures) or a non-square array (for example, N×M sub-pictures). As shown in FIG. 3D, when the number is four, the picture can be divided into 2×2 sub-pictures. As another example, when the number is three, the picture 320 can be divided into two rows, with one sub-picture in the first row and two in the second, as in the left picture of FIG. 3C.

It can be understood that the same approach applies when there are more target objects. For example, when there are seven target objects, 3×3 sub-pictures can be used to correspond to them, with two sub-pictures left blank, as shown in FIG. 3E.

In some embodiments, since a typical display is not square but a rectangle wider than it is tall (for example, 16:9), more sub-pictures can be placed along the width. For example, suppose the video conference picture supports at most 12 people. When N < 5, the picture can be divided into 1 row of N columns. When N ≤ 8, the picture can be divided into 2 rows, the first with N/2 columns (rounded up or down) and the second with N − N/2 columns (rounded correspondingly down or up). When N ≤ 12, the picture can be divided into 3 rows, the first two with N/3 columns each (rounded up or down) and the last with N − (N/3)×2 columns (rounded correspondingly down or up). Each sub-picture in the last row has the same width as those in the preceding rows and can be centered. FIG. 3F shows a schematic diagram of yet another exemplary video conference picture according to an embodiment of the present disclosure. As shown in FIG. 3F, when there are seven target objects, this approach yields a 4+3 layout.
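The row/column rule above can be sketched directly. The rounding direction is left open in the text ("rounded up or down"), so this sketch rounds the earlier rows up:

```python
import math

def rows_and_columns(n):
    """Column count per row for n faces (at most 12), following the
    rule in the disclosure: fewer than 5 faces sit in one row, up to
    8 faces use two rows, up to 12 use three."""
    if n < 5:
        return [n]
    if n <= 8:
        first = math.ceil(n / 2)  # first row: N/2 columns, rounded up
        return [first, n - first]
    if n <= 12:
        per = math.ceil(n / 3)    # first two rows: N/3 columns each
        return [per, per, n - 2 * per]
    raise ValueError("the described layout supports at most 12 faces")
```

With n = 7 this reproduces the 4+3 layout of FIG. 3F.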

It can thus be seen that by detecting the target objects in the target image, dividing the video conference picture into at least two sub-pictures according to the number of target objects, and displaying each target object in its corresponding sub-picture, the video conference picture can be split automatically when the captured target image contains several participants, which helps conference room participants interact with other online participants during the video conference and improves the user experience.

In some embodiments, besides the number of target objects, the split-screen layout may also take into account whether a participant is speaking, and the sub-picture corresponding to a detected speaking participant may then be processed accordingly. Therefore, as shown in FIG. 2, the method 200 further includes step 208, performing speaker detection on the target objects. Optionally, this step may be processed in parallel with target object detection to increase processing speed.

Optionally, when it is determined that a target object in the target image is speaking (that is, the participant corresponding to the target object is speaking), an indicator may be displayed in that participant's sub-picture, for example an icon indicating that the participant is speaking. As shown in FIG. 3D, a microphone-style icon may be displayed in the speaking participant's sub-picture to remind others that the participant is speaking.

As an optional embodiment, the split-screen layout may also be changed according to whether a speaking participant is detected.

Optionally, if it is determined from the detection result that a target object in the target image is speaking, the video conference picture may be divided into at least two sub-pictures according to a first split-screen mode; if it is determined from the detection result that none of the participants in the video conference is speaking, the video conference picture is divided into at least two sub-pictures according to a second split-screen mode. The second split-screen mode may be any of the split-screen layouts in the foregoing embodiments.

Further, dividing the video conference picture into at least two sub-pictures according to the first split-screen mode includes: displaying a first sub-picture of the at least two sub-pictures in enlarged form, and displaying the other sub-pictures side by side on at least one side of the first sub-picture, where the first sub-picture may be used to display the speaking target object.

FIG. 3G shows a schematic diagram of yet another exemplary video conference picture 330 according to an embodiment of the present disclosure. As shown in FIG. 3G, the picture 330 includes four sub-pictures, corresponding respectively to the target objects 302A-302C and the target object 312. The first sub-picture 3302 is enlarged and corresponds to the target object 302A of the speaking participant 110A, while the other sub-pictures are displayed side by side on one side of the first sub-picture 3302. In this way, according to the speaker detection result, the sub-picture of the speaking participant is arranged in the middle of the picture and occupies a larger area, while the sub-pictures of the non-speakers are arranged at the side and occupy smaller areas, thereby better enhancing interactivity.

In this way, when someone is speaking, a speaker mode (the first split-screen mode) is adopted: the person currently speaking is placed in the largest sub-picture, and the remaining participants are arranged side by side on at least one side of the largest sub-picture (when there are many participants, they can be placed on two or more sides). When no one is speaking, a normal split-screen mode (the second split-screen mode) is adopted, in which every sub-picture has the same size. By selecting different split-screen modes according to the speaker detection result, the interactivity of the video conference can be increased and the user experience improved.

Generally, the video stream and audio stream of each online participant are independent. In the related art, video conferencing software can determine from an audio stream whether someone is speaking in the corresponding video. However, the participants in a conference room share a single video stream and audio stream: from the audio stream collected by the terminal device on the conference room side alone, it is impossible to determine whether any person in the room is speaking, let alone to tell who specifically is speaking, which reduces the interactivity of the conference.

In view of this, in some embodiments, speaker detection is performed by means of image processing, which avoids the problem that the audio stream cannot distinguish who is currently speaking.

As an optional embodiment, the server 106 may perform keypoint detection on each detected target object, and then determine, according to the keypoint detection result, whether the participant corresponding to the target object in the target image is speaking.

FIG. 3H shows a schematic diagram of face keypoint detection.

As shown in FIG. 3H, face keypoint detection may adopt a 68-keypoint detection method, with the keypoints distributed over the various parts of the face: points 0-16 correspond to the jawline, points 17-21 correspond to the right eyebrow (the image is mirrored, so this is the right eyebrow of the person in the picture), points 22-26 correspond to the left eyebrow, points 27-35 correspond to the nose, points 36-41 correspond to the right eye, points 42-47 correspond to the left eye, and points 48-67 correspond to the lips. The face can be recognized by detecting these keypoints, and, according to how the keypoints change across consecutive frames of the target object, it can be determined whether the corresponding participant is speaking.

It should be noted that the 68-keypoint detection method is only an example. It can be understood that face keypoint detection may also use other numbers of keypoints, for example, 21 keypoints, 29 keypoints, and so on.

As an optional embodiment, 106 keypoints may be used for keypoint detection, so that more accurate detection results can be obtained.

In some embodiments, after the keypoints of the target object are detected, the lip height may be determined based on the keypoints of the target object, and the lip width may likewise be determined based on the keypoints of the target object; then, the lip aspect ratio of the target object is obtained from the lip height and the lip width, and whether the target object is speaking is determined based on the change information of the lip aspect ratio. In this way, speaker detection is realized by means of image processing, avoiding the problem that the audio stream cannot distinguish who is currently speaking.
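The lip aspect ratio described above can be sketched minimally as follows. The keypoint indices (98/102 for the vertical lip pair, 96/100 for the horizontal pair, following the 106-point example given later) and the `(x, y)` list format are assumptions for illustration; actual indices depend on the detector's landmark layout.

```python
import math

def lip_aspect_ratio(keypoints, top_idx=98, bottom_idx=102,
                     left_idx=96, right_idx=100):
    """Compute lip height / lip width from face keypoints.

    keypoints: a list of (x, y) tuples. The four default indices
    assume a 106-point landmark layout and are illustrative only.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    height = dist(keypoints[top_idx], keypoints[bottom_idx])
    width = dist(keypoints[left_idx], keypoints[right_idx])
    # Guard against a degenerate (zero-width) detection.
    return height / width if width > 0 else 0.0
```

A wide-open mouth yields a larger ratio, a closed mouth a smaller one; it is the frame-to-frame variation of this value, not any single reading, that drives the speaking decision described below.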

Further, considering that rotation of the face may change the positions of the lip keypoints, in some embodiments the detected keypoints may be corrected based on the rotation angles of the face when performing keypoint detection.

FIG. 3I shows a schematic diagram of a rotated face.

As shown in FIG. 3I, a face in three-dimensional space has three rotation angles: yaw, roll, and pitch.

As an optional embodiment, an affine transformation may be used to cancel the influence of the roll rotation, and the pitch and yaw information from face detection may then be used to cancel the influence of the pitch and yaw rotations.

Specifically, multiple keypoints (for example, the coordinates of 106 keypoints) may be obtained from the target object detection. These keypoints are then put in correspondence with the keypoints of a standard (average) face (that is, the standard keypoints), from which an affine transformation matrix (a mapping relationship) can be derived. By applying this affine transformation matrix to the keypoints obtained from the target object detection, roll correction can be performed on them, yielding the corrected keypoints, that is, the keypoint coordinates of the currently detected target object at Roll = 0.

Further, multiple first keypoints corresponding to the lip height and multiple second keypoints corresponding to the lip width may be selected from the corrected keypoints. Pitch correction is performed on the first keypoints to obtain the corrected lip height, and yaw correction is performed on the second keypoints to obtain the corrected lip width; the corrected lip height and corrected lip width thus cancel the influence of the pitch and yaw rotations. As an optional embodiment, taking 106 keypoints as an example, the length of the line segment between points 98 and 102 may be computed to represent the lip height and divided by cos(Pitch) to cancel the influence of pitch, yielding the corrected lip height. Similarly, the length of the line segment between points 96 and 100 is computed to represent the lip width and divided by cos(Yaw) to cancel the influence of yaw, yielding the corrected lip width. The pitch and yaw angle information may be provided by the face detection module.
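The cos(Pitch)/cos(Yaw) correction above can be sketched as follows (the roll correction via an affine transform to a standard face is omitted here, and the degeneracy guard for near-profile angles is an added assumption, not from the text):

```python
import math

def corrected_lip_dims(raw_height, raw_width, pitch_deg, yaw_deg):
    """Undo foreshortening of the lip distances caused by head rotation.

    Pitch rotation shrinks apparent vertical distances by cos(pitch),
    and yaw shrinks horizontal distances by cos(yaw), so dividing by
    those factors recovers the frontal-view lip height and width.
    """
    pitch = math.radians(pitch_deg)
    yaw = math.radians(yaw_deg)
    # Near +/-90 degrees the cosine vanishes and the correction blows up;
    # such frames are better skipped (an assumption, not from the text).
    if abs(math.cos(pitch)) < 1e-3 or abs(math.cos(yaw)) < 1e-3:
        raise ValueError("head rotation too extreme to correct")
    return raw_height / math.cos(pitch), raw_width / math.cos(yaw)
```

For example, a lip height measured at a 60° pitch appears halved (cos 60° = 0.5), so the correction doubles it back.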

Then, the keypoint detection result (the detection result for the lip height and lip width) is obtained from the corrected lip height and the corrected lip width, so that the aspect ratio of the rectified lips represents how wide the mouth is open.

Considering that speaking is a dynamic process, relying solely on the lip aspect ratio at the current moment may not accurately determine whether the current participant is speaking. Therefore, in some embodiments, the changes of the lip aspect ratio over a period of time may be maintained, and the variance of the aspect ratio over that period may be used to judge whether the subject is currently speaking.

As an optional embodiment, the lip aspect ratio of the target object may be calculated from the corrected lip height and the corrected lip width, and the lip aspect ratio may be stored.

Then, when determining from the keypoint detection result whether the participant corresponding to the target object in the target image is speaking, the change information of the target object's lip aspect ratio may be determined in combination with the keypoint detection result, and whether the participant is speaking is determined from this change information. For example, whether the participant is speaking can be judged by whether the variance of the lip aspect ratio within a preset time period (for example, within 1 s) exceeds a variance threshold; if it does, the participant can be judged to be speaking. In this way, the judgment of whether a person is speaking can be more accurate.
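The variance-over-a-window test above can be sketched as follows; the window length and variance threshold are illustrative values that would in practice be tuned to the frame rate (e.g. roughly one second of frames):

```python
from collections import deque

class LipRatioWindow:
    """Keep recent lip aspect ratios and flag speech by their variance.

    A talking mouth opens and closes, so the ratio fluctuates and its
    variance is high; a still face keeps the variance near zero.
    maxlen and var_threshold are illustrative, not values from the text.
    """
    def __init__(self, maxlen=30, var_threshold=0.004):
        self.ratios = deque(maxlen=maxlen)  # oldest samples drop off
        self.var_threshold = var_threshold

    def push(self, ratio):
        self.ratios.append(ratio)

    def is_speaking(self):
        n = len(self.ratios)
        if n < 2:
            return False
        mean = sum(self.ratios) / n
        var = sum((r - mean) ** 2 for r in self.ratios) / n
        return var > self.var_threshold
```

Called once per processed frame, `is_speaking()` yields the per-frame decision that the counter mechanism described below then smooths over time.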

Considering that a speaker may pause intermittently while speaking, in order to make the speaking detection more stable, in some embodiments a counter may be maintained to record the number of times the speaker has been judged to be speaking within a recent period, and whether the speaker is speaking is then judged according to the relationship between this count and a preset count threshold. Optionally, determining whether the target object is speaking based on the change information of the lip aspect ratio includes: setting a preset time period (for example, within 2 s); counting the number of changes of the lip aspect ratio within the preset time period; and, when the number of changes reaches a preset number, determining that the target object is speaking. In this way, the temporal information of the keypoints is used to enhance the temporal stability of speaker detection and reduce fluctuations in the detection state.

As an optional embodiment, in response to determining from the change information that the participant corresponding to the target object in the target image is speaking (that is, the subject is judged to be speaking in the current frame), the count value is incremented by 1; in response to determining from the change information that the participant is not speaking (that is, the subject is judged not to be speaking in the current frame), the count value is decremented by 1.

Then, whether the participant corresponding to the target object in the target image is speaking can be determined from the count value within a preset time period (for example, within 2 s). For example, when the counter value is greater than a preset count threshold (for example, 2), the effect indicating that the subject is speaking can be displayed (for example, enlarging the speaker's sub-picture and/or displaying a microphone icon); when the counter value is less than the preset count threshold, that effect can be removed (for example, restoring the speaker's sub-picture to the same size as the other sub-pictures and/or hiding the microphone icon).
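The increment/decrement counter above acts as a simple hysteresis filter and can be sketched as follows; the clamping bounds are an added assumption to keep the counter from drifting without limit:

```python
class SpeakingCounter:
    """Hysteresis counter that smooths noisy per-frame speaking decisions.

    The count rises when a frame is judged 'speaking' and falls
    otherwise; the displayed state only flips when the count crosses
    the threshold, so a brief pause mid-sentence does not toggle the
    speaker highlight. threshold/max_count are illustrative values.
    """
    def __init__(self, threshold=2, max_count=10):
        self.count = 0
        self.threshold = threshold
        self.max_count = max_count  # clamp so recovery stays fast

    def update(self, frame_is_speaking):
        if frame_is_speaking:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = max(self.count - 1, 0)
        return self.count > self.threshold
```

A single non-speaking frame after several speaking frames only lowers the count by one, so the speaking state persists through short pauses and disappears only after a sustained silence.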

In this way, the position of the lips is determined from face keypoint detection after screen splitting, and whether the current participant is speaking is judged from the relative positional relationship of the lip keypoints. Meanwhile, to reduce speaking-detection misjudgments caused by face movement, rotation, and other motions, the detected face keypoints are mapped onto an unrotated standard face before the speaking judgment, reducing the influence of face movement. In addition, the temporal information of the keypoints is used to enhance the temporal stability of speaker detection and reduce fluctuations in the detection state.

Considering that speaker judgment based purely on image processing may be subject to error, in some embodiments the audio data of the video conference may additionally be acquired after the number of changes is determined to have reached the preset number, and whether the target object in the target image is speaking is then determined from the keypoint detection result in combination with the audio data of the video conference. Combining the keypoint detection result with the current conference audio in this way can further improve the accuracy of speaker judgment. As an optional embodiment, when the pickup that collects the audio is a two-channel pickup, the speaker can be localized from the two sets of audio data collected by the two channels, further improving the accuracy of speaker judgment.

In the related art, some automatic screen-splitting software for video conferencing determines the positions of the participants in the current picture through human body detection, and performs cropping and layout according to the body positions to realize automatic screen splitting in software. However, when a participant enters or leaves the conference room, such software has difficulty achieving real-time split-screen changes.

Therefore, in order to increase or decrease the number of split screens in real time, after the split-screen layout is determined, as shown in FIG. 2, the method may proceed to step 210: associating the target objects with the detection boxes according to the detection box positions and the split-screen layout.

As described in the foregoing embodiments, target detection or target tracking technology may be used to detect the target objects in the target image and obtain the detection boxes corresponding to them, for example, the detection boxes 304A-304C in FIG. 3B. Each detected target object can thus correspond to one detection box. Then, the position of each target object in the split-screen layout can be determined according to the layout, and the target object's detection box is associated with that position. In this way, by performing target object detection on the target image 300, the relative positional relationship of the different faces is judged from the positions of the detection boxes, the split-screen positions are made to correspond to the face positions, and the detection boxes are associated with the split-screen layout. The number of split screens can then be increased or decreased and the layout changed according to the detection results, so that when someone joins or leaves the conference room, the relative positions of the original participants remain unchanged and the number of split screens can be adjusted in real time.

Optionally, the ROI (Region of Interest) corresponding to each participant in the split-screen layout can be determined from the coordinates of that participant's detection box in the original image (the target image), so that a one-to-one correspondence among person, detection box, and sub-picture is established according to the position of each person's detection box and the split-screen layout.

Then, as shown in FIG. 2, in step 212, the content of each detection box may be matched with a sub-picture. Optionally, the coordinates of the detection box corresponding to the target object and the coordinates of the sub-picture corresponding to the target object may first be determined; then, according to these two sets of coordinates, the image within the detection box is translated and/or scaled into the sub-picture corresponding to the target object. By scaling the target objects, every participant can clearly see the others, improving the interactivity of the conference.

As an optional embodiment, after each detection box is associated with a split-screen sub-picture, the detection box may first be expanded outward by a certain ratio (for example, expanding the height by 20% and the width by 40%). On this basis, the width and height of the detection box are further expanded outward until its aspect ratio equals that of the corresponding sub-screen. If the expanded ROI exceeds the picture boundary, it is translated back inside the picture, thereby matching the content of the detection box with the sub-picture.
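The pad-then-match-aspect-then-clamp procedure above can be sketched as follows. The padding ratios follow the 40%/20% example in the text; centering the growth on the box and the clamping order are one simple choice of implementation, not prescribed by the text:

```python
def expand_roi_to_aspect(box, target_ar, frame_w, frame_h,
                         pad_w=0.4, pad_h=0.2):
    """Expand a detection box into an ROI matching a sub-screen's aspect.

    box: (x1, y1, x2, y2) in image coordinates.
    target_ar: sub-screen width / height.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Step 1: pad the box outward by fixed ratios.
    w *= 1 + pad_w
    h *= 1 + pad_h
    # Step 2: grow one dimension until the aspect ratio matches.
    if w / h < target_ar:
        w = h * target_ar
    else:
        h = w / target_ar
    x1, x2 = cx - w / 2, cx + w / 2
    y1, y2 = cy - h / 2, cy + h / 2
    # Step 3: translate back inside the frame if a border was crossed.
    if x1 < 0:
        x2 -= x1; x1 = 0.0
    if y1 < 0:
        y2 -= y1; y1 = 0.0
    if x2 > frame_w:
        x1 -= x2 - frame_w; x2 = float(frame_w)
    if y2 > frame_h:
        y1 -= y2 - frame_h; y2 = float(frame_h)
    return (x1, y1, x2, y2)
```

Because the ROI ends up with exactly the sub-screen's aspect ratio, rendering it into the sub-picture needs only uniform scaling, with no stretching of the face.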

In some embodiments, when translating and scaling the image corresponding to a detection box, the ROI of the sub-picture corresponding to each detection box at the current moment may be calculated by linear interpolation, producing a smooth pan-and-zoom from the current picture to the target face and realizing, when switching split-screen effects, a pan-tilt-zoom function similar to that of a surveillance camera.

Specifically, suppose the original coordinates of a certain sub-picture, expressed by the upper-left and lower-right vertices of the rectangle, are (x1_0, y1_0, x2_0, y2_0), the pan-and-zoom duration is T, and the coordinates of the ROI of the target face's detection box, likewise expressed by the upper-left and lower-right vertices, are (x1_T, y1_T, x2_T, y2_T). Then:

First, a time interval Δt can be determined, and the number of linear interpolation steps is then determined from the pan-and-zoom duration T and the time interval Δt.

Next, according to the number of interpolation steps, the original coordinates of the sub-picture, and the coordinates of the ROI of the detection box, the updated coordinates of the sub-picture at each interpolation step are determined. At each step, the updated coordinates may change by an equal increment relative to the coordinates of the previous step.

Then, at equal time intervals, the sub-picture is gradually subjected to the pan-tilt-zoom processing according to the updated coordinates of each interpolation step, until the duration reaches T.
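The three steps above can be sketched as a function that produces the sequence of intermediate ROI rectangles; treating each of the four corner coordinates independently with the same interpolation parameter is the straightforward reading of the equal-increment scheme:

```python
def interpolate_rois(start, target, duration, dt):
    """Generate equally spaced ROI rectangles for a smooth pan/zoom.

    start, target: (x1, y1, x2, y2) rectangles in image coordinates.
    duration: total transition time T; dt: per-step interval.
    Returns the intermediate ROIs, ending exactly at target.
    """
    steps = max(1, round(duration / dt))  # number of interpolation steps
    rois = []
    for i in range(1, steps + 1):
        t = i / steps  # interpolation parameter, reaches 1.0 at step T
        rois.append(tuple(s + (e - s) * t for s, e in zip(start, target)))
    return rois
```

Rendering each returned rectangle in turn, one every Δt, moves and resizes the visible region at a constant rate, which is what gives the surveillance-camera-style pan-tilt-zoom feel instead of an abrupt cut.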

In this way, there is a transition of duration T from enabling the split-screen function to completing the pan and zoom of the sub-picture, so that visually the switch resembles the pan-tilt-zoom motion of a surveillance camera. This improves the user experience and remedies the problem in the related art that switching directly, or simply scaling the cropped box, produces an abrupt switching effect.

In some embodiments, the sharpness of a sub-picture processed in the foregoing manner may be affected. Therefore, super-resolution technology may be used to increase the resolution of the sub-picture, thereby improving its sharpness.

In the related art, there is generally no virtual background function that supports sub-pictures. Embodiments of the present disclosure provide a virtual background function to fill this gap.

Therefore, as shown in FIG. 2, in step 214, it may first be determined whether the virtual background function is enabled. If it is enabled, the method proceeds to step 216 to perform semantic segmentation of the current image based on the target objects (which can be processed in parallel with face detection to improve processing efficiency); the semantic segmentation capability is thus used to realize a virtual background for each split screen. Optionally, a pre-trained semantic segmentation model may be used to separate the target objects from the background image. The semantic segmentation model may be a real-time portrait semantic segmentation model based on deep learning; its structure includes, but is not limited to, various forms of convolutional neural networks and various forms of Transformer networks.

In some embodiments, the portrait segmentation function may be applied to the entire current input image. After the split-screen function is enabled, the segmentation result of each sub-screen corresponds to the portion of the whole-image segmentation result within that sub-screen's ROI. Then, for each pixel of each sub-screen, the pixel value of the virtual background result at that pixel is calculated from the value of the segmentation result at that pixel (normalized to [0, 1]), the corresponding value of the input image at that pixel, and the pixel value of the new background to be substituted at that pixel.

Specifically, for each pixel, the first value of the segmentation result at that pixel (normalized to [0, 1]) may be multiplied by the corresponding value of the input image at that pixel, and added to the second value of the segmentation result at that pixel (1 minus the first value) multiplied by the pixel value of the new background at that pixel, to obtain the pixel value of the virtual background result at that pixel.

In this way, the process of replacing the real background with the virtual background is completed.
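The per-pixel formula above is standard alpha blending and can be sketched as follows; the pure-Python nested lists are for clarity only, and a real implementation would operate per color channel on arrays (e.g. with NumPy):

```python
def blend_pixel(alpha, fg_value, bg_value):
    """Blend one channel of one pixel using the segmentation mask.

    alpha: segmentation value at the pixel, normalized to [0, 1]
    (1 = person, 0 = background); fg_value: input-image value;
    bg_value: value of the new background to substitute.
    """
    return alpha * fg_value + (1 - alpha) * bg_value

def replace_background(mask, frame, background):
    """Apply the blend over a single-channel image.

    mask, frame, background: equally sized 2-D lists of numbers.
    Soft mask values between 0 and 1 give smooth edges around hair
    and shoulders instead of a hard cut-out.
    """
    return [
        [blend_pixel(a, f, b) for a, f, b in zip(mr, fr, br)]
        for mr, fr, br in zip(mask, frame, background)
    ]
```

A mask value of 1 keeps the input pixel, 0 takes the new background, and intermediate values mix the two proportionally.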

Next, the method may proceed to step 218 to render the video conference picture.

Optionally, each sub-screen can be rendered with the content of the original-image ROI region corresponding to that sub-picture. If the virtual background function is enabled, background replacement is performed by combining the portrait segmentation result with the background to be substituted. Any special processing needed for a detected speaker is also completed in this step, for example, displaying the target object corresponding to the speaking participant in the first sub-picture 3302 and displaying the virtual background in the second sub-picture 3304.

Processing of the current frame then ends, the flow for the next frame begins, and the next frame can be processed.

As can be seen from the above embodiments, the embodiments of the present disclosure provide an automatic screen-splitting system for video conferences. With the automatic screen-splitting function, people sitting in the same conference room can communicate "face to face" with colleagues attending remotely. In some embodiments, speaker detection can conveniently identify who in the current conference room is speaking and place that speaker's video stream in a prominent position, improving the video conference experience. In some scenarios, for example an outdoor live broadcast, the subject may occupy a very small proportion of the picture because of the venue or shooting distance; the PTZ function implemented in software can then achieve automatic tracking and focusing of the shot without any manual operation.

It should be noted that the foregoing embodiments are described with the server 106 as the executing entity. In fact, the foregoing processing steps need not be limited to a particular executing entity; for example, the terminal devices 102 and 104 may also implement these processing steps, and therefore the terminal devices 102 and 104 may likewise serve as the executing entity of the foregoing embodiments.

The embodiments of the present disclosure also provide a screen-splitting method for a video conference picture. FIG. 4 shows a schematic flowchart of an exemplary method 400 provided by an embodiment of the present disclosure. The method 400 may be applied to the server 106 in FIG. 1A, or to the terminal devices 102 and 104 in FIG. 1A. As shown in FIG. 4, the method 400 may further include the following steps.

In step 402, a target image captured by a capture unit is acquired.

Taking FIG. 1A as an example, the capture unit may be a camera provided in the terminal devices 102 and 104, and the target image may be the images 1022 and 1042 captured by the camera.

In step 404, target objects (for example, the target objects 302A-302C in FIG. 3A) in the target image (for example, the image 300 in FIG. 3A) are detected.

As an optional embodiment, object detection technology may be used to detect the target objects in the target image 300. Optionally, a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300, and the detection boxes 304A-304C corresponding to the target objects 302A-302C can be obtained, as shown in FIG. 3B.

In step 406, the video conference picture is divided into at least two sub-pictures according to the target objects.

In step 408, the target objects in the target image are correspondingly displayed in the at least two sub-pictures, as shown in FIG. 3C to FIG. 3G.

In the screen-splitting method for a video conference picture provided by the embodiments of the present disclosure, the target objects in the target image captured by the capture unit are detected, the video conference picture is divided into at least two sub-pictures according to the target objects, and the corresponding target objects are displayed in the sub-pictures. In this way, when the captured target image contains multiple participants, the video conference picture can be split automatically, which helps to enhance the sense of interaction between the participants in the conference room and the other online participants during the video conference, thereby improving the user experience.

In some embodiments, dividing the video conference picture into at least two sub-pictures according to the number of the target objects includes: determining whether a target object in the target image is speaking; and in response to determining that a target object in the target image is speaking, dividing the video conference picture into at least two sub-pictures according to a first split-screen mode, as shown in FIG. 3G. In this way, when someone is speaking, a speaker mode (the first split-screen mode) is adopted: the person currently speaking is placed in the largest sub-picture, and the remaining participants are arranged side by side on at least one side of the largest sub-picture (when there are many participants, they may be placed on two or more sides). When no one is speaking, an ordinary split-screen mode (the second split-screen mode) is adopted, in which each sub-picture has the same size. Selecting different split-screen modes according to the speaker detection result increases the interactivity of the video conference and improves the user experience.

In some embodiments, dividing the video conference picture into at least two sub-pictures according to the first split-screen mode includes: displaying a first sub-picture (e.g., the sub-picture 3302 of FIG. 3G) of the at least two sub-pictures in an enlarged manner, and displaying the other sub-pictures of the at least two sub-pictures side by side on at least one side of the first sub-picture, where the first sub-picture is used to display the target object that is speaking.

In this way, according to the speaker detection result, the sub-picture of the participant who is speaking is arranged in the middle of the picture and occupies a larger area, while the sub-pictures of non-speakers are arranged at the sides and occupy smaller areas, which better enhances interactivity.
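The speaker-mode layout described above can be sketched as a simple rectangle computation. The proportions below (the speaker's sub-picture taking 75% of the canvas width, the remaining participants stacked in a single column on the right) are illustrative assumptions, not values given in the disclosure:

```python
def speaker_layout(canvas_w, canvas_h, n_others, side_frac=0.25):
    """Compute sub-picture rectangles (x, y, w, h) for speaker mode.

    The enlarged first sub-picture (for the speaking participant) fills
    the left portion of the canvas; the other participants' sub-pictures
    are arranged side by side in a column on the right.
    """
    main_w = int(canvas_w * (1 - side_frac))
    main = (0, 0, main_w, canvas_h)          # first sub-picture
    side_w = canvas_w - main_w
    others = []
    if n_others > 0:
        cell_h = canvas_h // n_others        # equal-height side cells
        for i in range(n_others):
            others.append((main_w, i * cell_h, side_w, cell_h))
    return main, others
```

For a 1920x1080 canvas with two non-speaking participants, this yields one 1440x1080 speaker sub-picture and two 480x540 side sub-pictures.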

In some embodiments, determining whether a target object in the target image is speaking includes: detecting key points of the target object; determining a lip height based on the key points of the target object; determining a lip width based on the key points of the target object; obtaining a lip aspect ratio of the target object from the lip height and the lip width; and determining whether the target object is speaking based on change information of the lip aspect ratio. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream alone cannot identify exactly who is currently speaking.
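The lip aspect ratio described above can be sketched as follows. The landmark indices are hypothetical placeholders; the actual indices depend on which face-landmark model is used:

```python
import math

def lip_aspect_ratio(landmarks, top_idx, bottom_idx, left_idx, right_idx):
    """Lip height / lip width computed from 2-D face key points.

    `landmarks` maps a key-point index to an (x, y) coordinate; the four
    index arguments pick out the top/bottom of the lips (height) and the
    two mouth corners (width).
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    height = dist(landmarks[top_idx], landmarks[bottom_idx])
    width = dist(landmarks[left_idx], landmarks[right_idx])
    return height / width
```

A high ratio corresponds to an open mouth; tracking how this ratio changes over time is what the embodiment uses to decide whether the participant is speaking.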

In some embodiments, determining whether a target object in the target image is speaking includes: performing key point detection on the target object; and determining whether the target object in the target image is speaking according to the key point detection result. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream alone cannot identify exactly who is currently speaking.

In some embodiments, performing key point detection on the target object includes:

detecting a plurality of key points from the target object;

performing roll-angle correction on the plurality of key points according to the correspondence between the key points and standard key points, to obtain a plurality of corrected key points;

selecting, from the plurality of corrected key points, a plurality of first key points corresponding to the lip height, and performing pitch-angle correction on the plurality of first key points to obtain a corrected lip height;

selecting, from the plurality of corrected key points, a plurality of second key points corresponding to the lip width, and performing yaw-angle correction on the plurality of second key points to obtain a corrected lip width; and

obtaining the key point detection result according to the corrected lip height and the corrected lip width.

To reduce speaking-detection errors caused by face movement, rotation, and similar motions, the detected face key points are mapped onto an unrotated standard face before the speaking judgment is made, reducing the influence of face movement.
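As one simplified sketch of the roll-angle correction step, the 2-D key points can be rotated about their centroid so that the line between the two eyes becomes horizontal. This is only an in-plane (roll) normalization; the pitch- and yaw-angle corrections described above, and the exact mapping onto standard key points, are not reproduced here:

```python
import math

def correct_roll(points, left_eye, right_eye):
    """Rotate 2-D key points about their centroid so that the eye line
    is horizontal, removing in-plane head roll before the lip
    measurements are taken."""
    roll = math.atan2(right_eye[1] - left_eye[1],
                      right_eye[0] - left_eye[0])
    c, s = math.cos(-roll), math.sin(-roll)
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return [((p[0] - cx) * c - (p[1] - cy) * s + cx,
             (p[0] - cx) * s + (p[1] - cy) * c + cy) for p in points]
```

After this correction, the vertical distance between lip key points approximates the true lip height even when the head is tilted.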

In some embodiments, the method further includes: calculating the lip aspect ratio of the target object according to the corrected lip height and the corrected lip width, and storing the lip aspect ratio.

Determining whether the target object in the target image is speaking according to the key point detection result includes: determining, in combination with the key point detection result, change information corresponding to the lip aspect ratio of the target object in the target image, and determining whether the target object in the target image is speaking according to the change information.

In this way, the temporal information of the key points is used to enhance the temporal stability of the speaker detection and reduce fluctuations in the detection state.

In some embodiments, determining whether the participant corresponding to the target object in the target image is speaking includes: setting a preset time period; counting the number of changes of the lip aspect ratio within the preset time period; and when the number of changes reaches a preset number, determining that the target object is speaking. In this way, the temporal information of the key points is used to enhance the temporal stability of the speaker detection and reduce fluctuations in the detection state.
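The change-counting idea can be sketched as a small stateful detector. The window length, open-mouth threshold, and minimum number of changes below are illustrative assumptions standing in for the "preset time period" and "preset number":

```python
from collections import deque

class SpeakingDetector:
    """Decide whether a participant is speaking from the recent history
    of lip aspect ratios for one target object."""

    def __init__(self, window=30, open_thresh=0.35, min_changes=4):
        self.open_thresh = open_thresh
        self.min_changes = min_changes
        self.history = deque(maxlen=window)  # open/closed mouth states

    def update(self, ratio):
        """Record one lip aspect ratio; return True if the number of
        open<->closed transitions in the window reaches the preset
        number, i.e. the participant is judged to be speaking."""
        self.history.append(ratio > self.open_thresh)
        states = list(self.history)
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        return changes >= self.min_changes
```

Requiring several transitions within a window, rather than reacting to a single open mouth, is what stabilizes the detection state over time.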

In some embodiments, when the number of changes reaches the preset number, determining that the target object is speaking includes: determining that the number of changes reaches the preset number; acquiring audio data of the video conference; and determining, in combination with the audio data of the video conference, that the target object is speaking. Combining the key point detection result with the audio data of the current video conference in this way can further improve the accuracy of the speaker judgment. As an optional embodiment, when the pickup that collects the audio is a two-channel pickup, the speaker can be localized according to the two sets of audio data respectively collected by the two channels, which can further improve the accuracy of the speaker judgment.

In some embodiments, detecting the target object in the target image includes: detecting the target object in the target image using object detection or object tracking technology, to obtain a detection box of the target object.

Correspondingly displaying the target objects in the target image in the at least two sub-pictures includes: determining the coordinates of the detection box corresponding to the target object; determining the coordinates of the sub-picture corresponding to the target object; and translating and/or scaling the image corresponding to the detection box into the sub-picture corresponding to the target object according to the coordinates of the detection box and the coordinates of the sub-picture.

By scaling the target objects, every participant can clearly see the others, which improves the interactivity of the meeting.
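The translate-and-scale step can be sketched as a coordinate mapping from the detection box to its sub-picture. Centering the box and preserving its aspect ratio are assumptions for illustration; the disclosure only requires that the box image be translated and/or scaled into the sub-picture:

```python
def place_in_subpicture(box, sub):
    """Map a detection box (x, y, w, h) in the source image into a
    sub-picture rectangle (x, y, w, h) on the conference canvas.

    Returns the destination rectangle for the cropped region, scaled to
    fit the sub-picture while preserving aspect ratio, and centered.
    """
    bx, by, bw, bh = box
    sx, sy, sw, sh = sub
    scale = min(sw / bw, sh / bh)            # uniform scale factor
    dw, dh = bw * scale, bh * scale
    dx = sx + (sw - dw) / 2                  # translate to center
    dy = sy + (sh - dh) / 2
    return (dx, dy, dw, dh)
```

The renderer would then crop the detection box from the source frame and draw it into the returned rectangle.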

In some embodiments, correspondingly displaying the target objects in the target image in the at least two sub-pictures further includes: in response to determining that the virtual background function of a sub-picture is turned on, segmenting the target object from the background using segmentation technology to obtain a segmentation result; and displaying a virtual background in the sub-picture according to the segmentation result, which fills a gap in the related art, where no virtual background is displayed in a sub-picture.

In some embodiments, correspondingly displaying the target objects in the target image in the at least two sub-pictures includes: in response to determining that the virtual background function of a second sub-picture of the at least two sub-pictures is turned on, displaying a virtual background in the second sub-picture (e.g., the sub-picture 3304 of FIG. 3G), which fills a gap in the related art, where no virtual background is displayed in a sub-picture.

In some embodiments, the method further includes: segmenting the target object from the background in the target image using semantic segmentation technology, to obtain a segmentation result.

Displaying the virtual background in the second sub-picture includes: displaying the virtual background in the second sub-picture according to the segmentation result.

In this way, semantic segmentation technology is used to separate the target object from the actual background, so that replacement with a virtual background can be realized well.
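Given a per-pixel foreground mask from a segmentation model, the virtual-background replacement reduces to alpha compositing. The nested-list representation below is a minimal sketch; a real implementation would operate on image arrays with vectorized operations:

```python
def apply_virtual_background(frame, mask, background):
    """Composite a participant onto a virtual background.

    `frame`, `mask`, and `background` are same-sized 2-D grids of pixel
    values; `mask` holds segmentation weights in [0, 1], where 1 marks
    the person (foreground) and 0 marks the real background.
    """
    out = []
    for row_f, row_m, row_b in zip(frame, mask, background):
        # Blend: keep the person where mask=1, virtual background where mask=0.
        out.append([f * m + b * (1 - m)
                    for f, m, b in zip(row_f, row_m, row_b)])
    return out
```

Soft mask values between 0 and 1 at the person's silhouette produce a smooth blend instead of a hard cut-out edge.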

In some embodiments, correspondingly displaying the target objects in the target image in the at least two sub-pictures further includes: in response to determining that a target object is speaking, displaying an indicator in the sub-picture corresponding to that target object, thereby reminding others that the participant in the sub-picture corresponding to the indicator is speaking and improving interactivity.

It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of this embodiment may also be applied in a distributed scenario and completed by multiple devices cooperating with one another. In such a distributed scenario, one of the multiple devices may perform only one or more of the steps of the method of the embodiments of the present disclosure, and the multiple devices interact with one another to complete the method.

It should be noted that some embodiments of the present disclosure have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the above embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.

An embodiment of the present disclosure further provides a computer device for implementing the above method 200 or 400. FIG. 5 shows a schematic diagram of the hardware structure of an exemplary computer device 500 provided by an embodiment of the present disclosure. The computer device 500 may be used to implement the server 106 of FIG. 1A, or to implement the terminal devices 102 and 104 of FIG. 1A. In some scenarios, the computer device 500 may also be used to implement the database server 108 of FIG. 1A.

As shown in FIG. 5, the computer device 500 may include a processor 502, a memory 504, a network module 506, a peripheral interface 508, and a bus 510. The processor 502, the memory 504, the network module 506, and the peripheral interface 508 are communicatively connected to one another inside the computer device 500 through the bus 510.

The processor 502 may be a central processing unit (CPU), an image processor, a neural network processing unit (NPU), a microcontroller unit (MCU), a programmable logic device, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or one or more integrated circuits. The processor 502 may be configured to perform functions related to the techniques described in the present disclosure. In some embodiments, the processor 502 may also include multiple processors integrated into a single logical component. For example, as shown in FIG. 5, the processor 502 may include multiple processors 502a, 502b, and 502c.

The memory 504 may be configured to store data (e.g., instructions, computer code, etc.). As shown in FIG. 5, the data stored in the memory 504 may include program instructions (e.g., program instructions for implementing the method 200 or 400 of the embodiments of the present disclosure) and data to be processed (e.g., the memory may store configuration files of other modules, etc.). The processor 502 may also access the program instructions and data stored in the memory 504 and execute the program instructions to operate on the data to be processed. The memory 504 may include a volatile storage device or a non-volatile storage device. In some embodiments, the memory 504 may include random access memory (RAM), read-only memory (ROM), an optical disc, a magnetic disk, a hard disk, a solid-state drive (SSD), flash memory, a memory stick, and the like.

The network interface 506 may be configured to provide the computer device 500 with communication with other external devices via a network. The network may be any wired or wireless network capable of transmitting and receiving data. For example, the network may be a wired network, a local wireless network (e.g., Bluetooth, WiFi, near field communication (NFC), etc.), a cellular network, the Internet, or a combination thereof. It can be understood that the type of network is not limited to the above specific examples.

The peripheral interface 508 may be configured to connect the computer device 500 with one or more peripheral devices to implement information input and output. For example, the peripheral devices may include input devices such as keyboards, mice, touchpads, touch screens, microphones, and various sensors, as well as output devices such as displays, speakers, vibrators, and indicator lights.

The bus 510 may be configured to transfer information between the various components of the computer device 500 (e.g., the processor 502, the memory 504, the network interface 506, and the peripheral interface 508), and may be, for example, an internal bus (e.g., a processor-memory bus), an external bus (a USB port, a PCI-E bus), or the like.

It should be noted that although the above architecture of the computer device 500 shows only the processor 502, the memory 504, the network interface 506, the peripheral interface 508, and the bus 510, in a specific implementation the architecture of the computer device 500 may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the architecture of the computer device 500 may include only the components necessary to implement the solutions of the embodiments of the present disclosure, and need not include all the components shown in the figure.

An embodiment of the present disclosure further provides an interaction apparatus. FIG. 6 shows a schematic diagram of an exemplary apparatus 600 provided by an embodiment of the present disclosure. As shown in FIG. 6, the apparatus 600 may be used to implement the method 200 or 400, and may further include the following modules.

The acquisition module 602 is configured to acquire a target image captured by a capture unit.

Taking FIG. 1A as an example, the capture unit may be a camera provided in the terminal devices 102 and 104, and the target image may be the images 1022 and 1042 captured by the camera.

The detection module 604 is configured to detect target objects (e.g., the target objects 302A-302C of FIG. 3A) in the target image (e.g., the image 300 of FIG. 3A).

As an optional embodiment, object detection technology may be used to detect the target objects in the target image 300. Optionally, a pre-trained object detection model may be used to detect the target objects 302A-302C in the target image 300 and obtain the detection boxes 304A-304C corresponding to the target objects 302A-302C, as shown in FIG. 3B.

The division module 606 is configured to divide the video conference picture into at least two sub-pictures according to the target objects.

The display module 608 is configured to correspondingly display the target objects in the target image in the at least two sub-pictures.

In the screen-splitting scheme for a video conference picture provided by the embodiments of the present disclosure, the target objects in the target image captured by the capture unit are detected, the video conference picture is divided into at least two sub-pictures according to the target objects, and the corresponding target objects are displayed in the sub-pictures. In this way, when the captured target image contains multiple participants, the video conference picture can be split automatically, which helps to enhance the sense of interaction between the participants in the conference room and the other online participants during the video conference, thereby improving the user experience.

In some embodiments, the division module 606 is configured to: determine whether a target object in the target image is speaking; and in response to determining that a target object in the target image is speaking, divide the video conference picture into at least two sub-pictures according to a first split-screen mode, as shown in FIG. 3G. In this way, when someone is speaking, a speaker mode (the first split-screen mode) is adopted: the person currently speaking is placed in the largest sub-picture, and the remaining participants are arranged side by side on at least one side of the largest sub-picture (when there are many participants, they may be placed on two or more sides). When no one is speaking, an ordinary split-screen mode (the second split-screen mode) is adopted, in which each sub-picture has the same size. Selecting different split-screen modes according to the speaker detection result increases the interactivity of the video conference and improves the user experience.

In some embodiments, the division module 606 is configured to display a first sub-picture (e.g., the sub-picture 3302 of FIG. 3G) of the at least two sub-pictures in an enlarged manner, and display the other sub-pictures of the at least two sub-pictures side by side on at least one side of the first sub-picture, where the first sub-picture is used to display the target object that is speaking.

The display module 608 is configured to display, in the first sub-picture, the target object corresponding to the participant who is speaking.

In this way, according to the speaker detection result, the sub-picture of the participant who is speaking is arranged in the middle of the picture and occupies a larger area, while the sub-pictures of non-speakers are arranged at the sides and occupy smaller areas, which better enhances interactivity.

In some embodiments, the detection module 604 is configured to: detect key points of the target object; determine a lip height based on the key points of the target object; determine a lip width based on the key points of the target object; obtain a lip aspect ratio of the target object from the lip height and the lip width; and determine whether the target object is speaking based on change information of the lip aspect ratio. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream alone cannot identify exactly who is currently speaking.

In some embodiments, the detection module 604 is configured to: perform key point detection on the target object; and determine whether the target object in the target image is speaking according to the key point detection result. Speaker detection can thus be performed by means of image processing, avoiding the problem that the audio stream alone cannot identify exactly who is currently speaking.

In some embodiments, the detection module 604 is configured to:

detect a plurality of key points from the target object;

perform roll-angle correction on the plurality of key points according to the correspondence between the key points and standard key points, to obtain a plurality of corrected key points;

select, from the plurality of corrected key points, a plurality of first key points corresponding to the lip height, and perform pitch-angle correction on the plurality of first key points to obtain a corrected lip height;

select, from the plurality of corrected key points, a plurality of second key points corresponding to the lip width, and perform yaw-angle correction on the plurality of second key points to obtain a corrected lip width; and

obtain the key point detection result according to the corrected lip height and the corrected lip width.

To reduce speaking-detection errors caused by face movement, rotation, and similar motions, the detected face key points are mapped onto an unrotated standard face before the speaking judgment is made, reducing the influence of face movement.

In some embodiments, the detection module 604 is configured to: calculate the lip aspect ratio of the target object according to the corrected lip height and the corrected lip width, and store the lip aspect ratio; and

determine, in combination with the key point detection result, change information corresponding to the lip aspect ratio of the target object in the target image, and determine, according to the change information, whether the participant corresponding to the target object in the target image is speaking.

In this way, the temporal information of the key points is used to enhance the temporal stability of the speaker detection and reduce fluctuations in the detection state.

In some embodiments, the detection module 604 is configured to: set a preset time period; count the number of changes of the lip aspect ratio within the preset time period; and when the number of changes reaches a preset number, determine that the target object is speaking.

In this way, the temporal information of the key points is used to enhance the temporal stability of the speaker detection and reduce fluctuations in the detection state.

In some embodiments, the detection module 604 is configured such that, when the number of changes reaches the preset number, determining that the target object is speaking includes: determining that the number of changes reaches the preset number; acquiring audio data of the video conference; and determining, in combination with the audio data of the video conference, that the target object is speaking. Combining the key point detection result with the audio data of the current video conference in this way can further improve the accuracy of the speaker judgment. As an optional embodiment, when the pickup that collects the audio is a two-channel pickup, the speaker can be localized according to the two sets of audio data respectively collected by the two channels, which can further improve the accuracy of the speaker judgment.

In some embodiments, the detection module 604 is configured to detect the target object in the target image using object detection or object tracking technology, to obtain a detection box corresponding to the target object.

The display module 608 is configured to: determine the coordinates of the detection box corresponding to the target object; determine the coordinates of the sub-picture corresponding to the target object; and translate and/or scale the image corresponding to the detection box into the sub-picture corresponding to the target object according to the coordinates of the detection box and the coordinates of the sub-picture.

By scaling the target objects, every participant can clearly see the others, which improves the interactivity of the meeting.

In some embodiments, the display module 608 is configured to: in response to determining that the virtual background function of a sub-picture is turned on, segment the target object from the background using segmentation technology to obtain a segmentation result; and display a virtual background in the sub-picture according to the segmentation result, which fills a gap in the related art, where no virtual background is displayed in a sub-picture.

In some embodiments, the display module 608 is configured to: in response to determining that the virtual background function of a second sub-picture of the at least two sub-pictures is turned on, display a virtual background in the second sub-picture (e.g., the sub-picture 3304 of FIG. 3G), which fills a gap in the related art, where no virtual background is displayed in a sub-picture.

In some embodiments, the display module 608 is configured to: segment the target object from the background in the target image using semantic segmentation to obtain a segmentation result; and display a virtual background in the second sub-picture according to the segmentation result. In this way, semantic segmentation separates the target object from the actual background, so that the background can be cleanly replaced with a virtual one.
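A minimal sketch of the virtual-background replacement described above, assuming a semantic-segmentation model has already produced a per-pixel foreground mask (1.0 for the target object, 0.0 for the original background). The NumPy-based alpha compositing below is illustrative; the disclosure does not prescribe a particular implementation.

```python
import numpy as np

def apply_virtual_background(frame, mask, background):
    """Composite a video frame over a virtual background.

    frame, background: HxWx3 uint8 images of the same size.
    mask: HxW float array from a segmentation model (1.0 = foreground).
    """
    # Add a channel axis so the mask broadcasts over the 3 color channels.
    alpha = mask.astype(np.float32)[..., None]
    out = alpha * frame.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)
```

In practice the mask would typically be soft (values between 0 and 1) near the object's silhouette, which the same compositing formula handles without modification.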

In some embodiments, the display module 608 is configured to: in response to determining that the target object is speaking, display an indicator in the sub-picture corresponding to the target object, reminding other participants that the person in that sub-picture is speaking and improving interactivity.
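The speaking determination referenced above can be sketched as a sliding-window count of lip-aspect-ratio changes, in the spirit of the lip height-to-width logic set out in claims 4 and 5. The class name, window length, and change threshold below are illustrative assumptions.

```python
from collections import deque

class LipSpeakingDetector:
    """Decide whether a person is speaking by counting significant
    changes in the lip height-to-width ratio within a sliding window."""

    def __init__(self, window=30, change_threshold=0.1, min_changes=4):
        self.ratios = deque(maxlen=window)   # recent lip aspect ratios
        self.change_threshold = change_threshold
        self.min_changes = min_changes

    def update(self, lip_height, lip_width):
        """Feed one frame's lip measurements; return True if speaking."""
        self.ratios.append(lip_height / lip_width)
        # Count frame-to-frame ratio changes larger than the threshold.
        changes = sum(
            1 for a, b in zip(self.ratios, list(self.ratios)[1:])
            if abs(b - a) > self.change_threshold
        )
        return changes >= self.min_changes
```

The lip height and width would come from facial key points (for example, face-mesh landmarks); the audio cross-check of claim 6 could then be applied as a second gate before showing the indicator.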

For convenience of description, the above apparatus is described with its functions divided into various modules. Of course, when implementing the present disclosure, the functions of the modules may be implemented in one or more pieces of software and/or hardware.

The apparatus of the foregoing embodiments is used to implement the corresponding method 400 in any of the preceding embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the method 400 described in any of the above embodiments.

The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, in which information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to execute the method 200 or 400 described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which are not repeated here.

Based on the same inventive concept, and corresponding to the method 200 or 400 of any of the above embodiments, the present disclosure further provides a computer program product including a computer program. In some embodiments, the computer program is executable by one or more processors to cause the processors to execute the method 200 or 400. For each step in the embodiments of the method 200 or 400, the processor executing that step may belong to the corresponding execution subject.

The computer program product of the above embodiments is used to cause a processor to execute the method 400 described in any of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present disclosure, technical features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for the sake of brevity.

In addition, for simplicity of illustration and discussion, and so as not to obscure the embodiments of the present disclosure, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present disclosure, and this also takes into account the fact that details regarding the implementation of such block diagram devices are highly dependent on the platform on which the embodiments are to be implemented (that is, such details should be well within the purview of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present disclosure, it will be apparent to those skilled in the art that the embodiments may be practiced without, or with variations of, these specific details. Accordingly, these descriptions should be regarded as illustrative rather than restrictive.

Although the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those of ordinary skill in the art from the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the discussed embodiments.

The embodiments of the present disclosure are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the embodiments of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (13)

1. A split screen method for video conference pictures, comprising:
acquiring a target image collected by an acquisition unit;
detecting a target object in the target image;
dividing a video conference picture into at least two sub-pictures according to the target object;
and correspondingly displaying the target object in the target image in the at least two sub-pictures.
2. The method of claim 1, wherein dividing the video conference picture into at least two sub-pictures according to the target object further comprises:
determining whether a target object in the target image is speaking;
in response to determining that the target object is speaking, dividing the video conference picture into at least two sub-pictures in a first split-screen mode.
3. The method of claim 2, wherein dividing the video conference picture into at least two sub-pictures in a first split-screen mode comprises:
magnifying and displaying a first sub-picture of the at least two sub-pictures, wherein the first sub-picture is used for displaying a speaking target object;
and displaying other sub-pictures in the at least two sub-pictures in parallel on at least one side of the first sub-picture.
4. The method of claim 2, wherein determining whether a target object in the target image is speaking comprises:
detecting key points of the target object;
determining a lip height based on the keypoints of the target object;
determining a lip width based on the keypoints of the target object;
obtaining a lip aspect ratio of the target object according to the lip height and the lip width;
based on the change information of the lip aspect ratio, determining whether the target object is speaking.
5. The method of claim 4, wherein determining whether the target object is speaking based on the change information of the lip aspect ratio comprises:
setting a preset time period;
counting the number of changes in the lip aspect ratio within the preset time period;
and when the number of changes reaches a preset number, determining that the target object is speaking.
6. The method of claim 5, wherein determining that the target object is speaking when the number of changes reaches a preset number comprises:
determining that the number of changes reaches the preset number;
acquiring audio data of the video conference;
determining, in combination with the audio data of the video conference, that the target object is speaking.
7. The method of claim 1, wherein detecting a target object in the target image comprises:
detecting a target object in the target image by utilizing a target detection or target tracking technology to obtain a detection frame of the target object;
and wherein correspondingly displaying the target object in the target image in the at least two sub-pictures comprises:
determining coordinates of a detection frame corresponding to the target object;
determining the coordinates of the sub-picture corresponding to the target object;
and translating and/or scaling the image corresponding to the detection frame into the sub-picture corresponding to the target object according to the coordinates of the detection frame and the coordinates of the sub-picture.
8. The method of claim 1, wherein correspondingly displaying the target object in the target image in the at least two sub-pictures further comprises:
in response to determining that the virtual background function in the sub-picture is enabled, segmenting the target object from the background using a segmentation technique to obtain a segmentation result;
and displaying a virtual background in the sub-picture according to the segmentation result.
9. The method of claim 1, wherein correspondingly displaying the target object in the target image in the at least two sub-pictures further comprises:
and in response to determining that the target object is speaking, displaying an indication identifier in a sub-picture corresponding to the target object.
10. A split screen device for video conference pictures, comprising:
an acquisition module configured to: acquiring a target image collected by an acquisition unit;
a detection module configured to: detecting a target object in the target image;
a partitioning module configured to: dividing a video conference picture into at least two sub-pictures according to the target object;
a display module configured to: and correspondingly displaying the target object in the target image in the at least two sub-pictures.
11. A computer device comprising one or more processors, memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the method of any of claims 1-9.
12. A non-transitory computer readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method of any of claims 1-9.
13. A computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-9.
CN202310611376.5A 2023-05-26 2023-05-26 Screen splitting method of video conference picture and related equipment Pending CN116582637A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310611376.5A CN116582637A (en) 2023-05-26 2023-05-26 Screen splitting method of video conference picture and related equipment
PCT/CN2024/095004 WO2024245105A1 (en) 2023-05-26 2024-05-23 Screen splitting method for video conference picture, and related device

Publications (1)

Publication Number Publication Date
CN116582637A true CN116582637A (en) 2023-08-11


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024245105A1 (en) * 2023-05-26 2024-12-05 北京字跳网络技术有限公司 Screen splitting method for video conference picture, and related device
CN119766954A (en) * 2024-12-27 2025-04-04 成都维海德科技有限公司 Data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN113065534A (en) * 2021-06-02 2021-07-02 全时云商务服务股份有限公司 Method, system and storage medium based on portrait segmentation precision improvement
CN113676693A (en) * 2021-08-19 2021-11-19 京东方科技集团股份有限公司 Picture presentation method, video conference system and readable storage medium
US20220182578A1 (en) * 2020-12-04 2022-06-09 Blackberry Limited Speech Activity Detection Using Dual Sensory Based Learning






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination