CN114299940A - Display device and voice interaction method
- Publication number: CN114299940A (application number CN202110577525.1A)
- Authority: CN (China)
- Prior art keywords: voice, target person, real, display device, face
- Legal status: Pending (assumed status; not a legal conclusion)
- Classification: User Interface of Digital Computer
Description
Technical Field
The present application relates to the technical field of voice interaction, and in particular to a display device and a voice interaction method.
Background
With the rise of smart homes, controlling home devices such as smart TVs through voice interaction has become an increasingly popular control method. The wake-up rate and the speech recognition accuracy are two important indicators that affect the user experience of voice interaction. In the early stage of the development of voice interaction technology, voice interaction was usually near-field interaction; in near-field scenarios, the human-machine distance is small, the impact of noise interference is small, and the wake-up rate and recognition accuracy are high. However, when people watch TV they are usually far away from it, so near-field interaction cannot meet their needs, and far-field voice interaction technology emerged to improve the convenience of voice interaction. In far-field scenarios, the human-machine distance is large and the impact of noise interference grows, so the wake-up rate and recognition accuracy drop, resulting in a poor voice interaction experience.
Summary of the Invention
To solve the technical problem of a poor voice interaction experience, the present application provides a display device and a voice interaction method.
In a first aspect, the present application provides a display device, the display device comprising:
a display, configured to present a user interface;
a camera, configured to capture images;
a controller, connected to the display, the controller being configured to:
collect a voice wake-up instruction;
in response to the voice wake-up instruction, obtain user identity information of a target person and collect a real-time voice instruction, where the target person includes the person who issued the wake-up instruction or a registered user;
detect face information in the images captured by the camera;
if the face information of the target person is detected, perform face tracking and lip movement detection on the target person; if the target person's face shows lip movement and the real-time voice instruction includes the target person's voice, respond to the real-time voice instruction;
if the target person's face shows no lip movement, or the real-time voice instruction does not include the target person's voice, do not respond to the real-time voice instruction.
In some embodiments, detecting face information in the images captured by the camera includes:
performing sound source localization on the voice wake-up instruction to obtain a wake-up sound source position;
rotating the camera toward the wake-up sound source position, detecting face information in the images captured by the camera during the rotation, and controlling the camera to stop rotating if the face information of the target person is detected.
In some embodiments, performing face tracking and lip movement detection on the target person includes:
obtaining a real-time coordinate range of the target person's face in the images captured by the camera;
controlling the camera to rotate according to the change trend of the real-time coordinate range, so that the target person's face stays within a preset region of the images captured by the camera;
performing lip movement detection on the image of the target person's face.
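As a rough illustration of the tracking loop in this embodiment, the Python sketch below pans the camera whenever the tracked face's coordinate range drifts out of a preset central region of the frame, then runs lip movement detection on the tracked face. The frame width, step size, region bounds, and the helper functions (get_frame, detect_target_face, rotate_camera_by, detect_lip_movement) are hypothetical placeholders rather than anything specified by the patent.

```python
# Minimal sketch of the claimed face-tracking loop: nudge the camera
# whenever the tracked face drifts out of a preset central region.
FRAME_W = 1280                      # assumed frame width in pixels
PRESET_REGION = (0.3, 0.7)          # keep the face within the middle 40%

def track_and_check_lips(get_frame, detect_target_face,
                         rotate_camera_by, detect_lip_movement):
    frame = get_frame()
    box = detect_target_face(frame)         # (x_min, y_min, x_max, y_max) or None
    if box is None:
        return False                        # target lost; caller may widen the search
    face_center_x = (box[0] + box[2]) / 2 / FRAME_W
    # Follow the trend of the face's coordinate range: pan toward the face
    # if it has drifted out of the preset region.
    if face_center_x < PRESET_REGION[0]:
        rotate_camera_by(-2.0)              # degrees; fixed step size is an assumption
    elif face_center_x > PRESET_REGION[1]:
        rotate_camera_by(+2.0)
    return detect_lip_movement(frame, box)  # True if the lips moved
```

A proportional controller could replace the fixed step for smoother tracking; the fixed step is used here only to keep the sketch short.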
In a second aspect, the present application provides a voice interaction method, the method comprising:
collecting a voice wake-up instruction;
in response to the voice wake-up instruction, obtaining user identity information of a target person and collecting a real-time voice instruction, where the target person includes the person who issued the wake-up instruction or a registered user;
detecting face information in the images captured by the camera;
if the face information of the target person is detected, performing face tracking and lip movement detection on the target person; if the target person's face shows lip movement and the real-time voice instruction includes the target person's voice, responding to the real-time voice instruction;
if the target person's face shows no lip movement, or the real-time voice instruction does not include the target person's voice, not responding to the real-time voice instruction.
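The response decision shared by both aspects above reduces to a simple conjunction, sketched below; the boolean inputs are assumed to come from the lip movement detection and voiceprint matching steps described later in this document.

```python
def handle_real_time_instruction(face_detected: bool,
                                 lips_moved: bool,
                                 voice_is_target: bool) -> str:
    """Decision rule from the claims: respond only if the target person's
    face was found, the face showed lip movement, and the captured audio
    includes the target person's voice; otherwise ignore the instruction."""
    if not face_detected:
        return "ignore"          # possible false wake-up
    if lips_moved and voice_is_target:
        return "respond"
    return "ignore"
```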
The beneficial effects of the display device and the voice interaction method provided by the present application include the following:
After receiving a voice wake-up instruction, the display device provided by the present application obtains the user identity information of the wake-up source and performs face tracking on the target person corresponding to that identity information, which excludes interference from non-target persons. After the target person's face has been tracked, the device responds to a real-time voice instruction only when the target person's lips have moved and the received real-time voice instruction includes the target person's voice; when the target person's lips have not moved, or the real-time voice instruction does not include the target person's voice, the device does not respond. This reduces the probability of collecting noise, improves speech recognition accuracy, and improves the user experience.
Brief Description of the Drawings
In order to illustrate the technical solutions of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 exemplarily shows a schematic diagram of an operation scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 exemplarily shows a block diagram of the hardware configuration of the control apparatus 100 according to some embodiments;
FIG. 3 exemplarily shows a block diagram of the hardware configuration of the display device 200 according to some embodiments;
FIG. 4 exemplarily shows a schematic diagram of the software configuration of the display device 200 according to some embodiments;
FIG. 5 exemplarily shows a schematic diagram of the principle of voice interaction according to some embodiments;
FIG. 6 exemplarily shows a schematic diagram of a voice interaction scenario according to some embodiments;
FIG. 7 exemplarily shows a schematic diagram of signal processing for voice interaction according to some embodiments;
FIG. 8 exemplarily shows a timing diagram of voice interaction according to some embodiments;
FIG. 9 exemplarily shows an overall flow diagram of a voice interaction method according to some embodiments;
FIG. 10 exemplarily shows a flow diagram of a method for processing a voice wake-up instruction according to some embodiments;
FIG. 11 exemplarily shows a flow diagram of a processing method used when a single face is detected during voice interaction according to some embodiments;
FIG. 12 exemplarily shows a flow diagram of a processing method used when multiple faces are detected during voice interaction according to some embodiments.
Detailed Description
To make the purpose and implementation of the present application clearer, the exemplary embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described exemplary embodiments are only some, not all, of the embodiments of the present application.
It should be noted that the brief explanations of terms in the present application are only intended to facilitate understanding of the embodiments described below, not to limit the embodiments of the present application. Unless otherwise specified, these terms should be understood according to their ordinary and customary meanings.
The terms "first", "second", "third", and the like in the description, the claims, and the above drawings are used to distinguish similar or same-kind objects or entities, and do not necessarily imply a specific order or sequence unless otherwise noted. It should be understood that terms so used are interchangeable where appropriate.
The terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device comprising a series of components is not necessarily limited to the components explicitly listed, but may include other components that are not explicitly listed or that are inherent to the product or device.
The term "module" refers to any known or later-developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code capable of performing the functions associated with the element.
FIG. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in FIG. 1, a user can operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote control. Communication between the remote control and the display device includes infrared protocol communication, Bluetooth protocol communication, and other short-range communication methods, and the display device 200 is controlled in a wireless or wired manner. The user can control the display device 200 by inputting user instructions through keys on the remote control, voice input, control panel input, and the like.
In some embodiments, a smart device 300 (such as a mobile terminal, tablet computer, computer, or notebook computer) can also be used to control the display device 200, for example, through an application running on the smart device.
In some embodiments, the display device 200 can also be controlled in ways other than through the control apparatus 100 and the smart device 300. For example, the user's voice instructions can be received directly through a voice instruction acquisition module configured inside the display device 200, or through a voice control device set up outside the display device 200.
In some embodiments, the display device 200 also performs data communication with a server 400. The display device 200 may be allowed to communicate through a local area network (LAN), a wireless local area network (WLAN), and other networks. The server 400 can provide various content and interactions to the display device 200. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
FIG. 2 exemplarily shows a configuration block diagram of the control apparatus 100 according to an exemplary embodiment. As shown in FIG. 2, the control apparatus 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 can receive the user's input operation instructions and convert them into instructions that the display device 200 can recognize and respond to, serving as an intermediary for the interaction between the user and the display device 200.
FIG. 3 shows a block diagram of the hardware configuration of the display device 200 according to an exemplary embodiment.
In some embodiments, the display device 200 includes at least one of a tuner-demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to n-th interfaces for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display; it receives image signals output from the controller and displays video content, image content, menu manipulation interfaces, and the user manipulation UI interface.
In some embodiments, the display 260 may be a liquid crystal display, an OLED display, or a projection display, and may also be a projection apparatus with a projection screen.
In some embodiments, the communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a WiFi module, a Bluetooth module, a wired Ethernet module or another network communication protocol chip or near-field communication protocol chip, and an infrared receiver. Through the communicator 220, the display device 200 can establish the sending and receiving of control signals and data signals with the external control apparatus 100 or the server 400.
In some embodiments, the user interface may be used to receive control signals from the control apparatus 100 (for example, an infrared remote control).
In some embodiments, the detector 230 is used to collect signals from the external environment or signals of interaction with the outside. For example, the detector 230 includes a light receiver, a sensor for collecting ambient light intensity; or the detector 230 includes an image collector, such as a camera, which can be used to collect external environment scenes, user attributes, or user interaction gestures; or the detector 230 includes a sound collector, such as a microphone, for receiving external sound.
In some embodiments, the external device interface 240 may include, but is not limited to, any one or more of the following: a High-Definition Multimedia Interface (HDMI), an analog or digital high-definition component input interface (Component), a composite video input interface (CVBS), a USB input interface (USB), an RGB port, and the like. It may also be a composite input/output interface formed from several of the above interfaces.
In some embodiments, the tuner-demodulator 210 receives broadcast television signals in a wired or wireless manner, and demodulates audio/video signals, as well as EPG data signals, from multiple wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the tuner-demodulator 210 may be located in separate devices; that is, the tuner-demodulator 210 may also be in an external device, such as an external set-top box, relative to the main device where the controller 250 is located.
In some embodiments, the controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored in the memory. The controller 250 controls the overall operation of the display device 200. For example, in response to receiving a user command for selecting a UI object displayed on the display 260, the controller 250 can perform operations related to the object selected by the user command.
In some embodiments, the object may be any one of the selectable objects, such as a hyperlink, an icon, or another operable control. Operations related to the selected object include displaying an operation connected to a hyperlinked page, document, image, or the like, or executing the program corresponding to the icon.
In some embodiments, the controller includes at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), a RAM (Random Access Memory), a ROM (Read-Only Memory), first to n-th interfaces for input/output, and a communication bus (Bus).
The CPU processor is used to execute operating system and application program instructions stored in the memory, and to execute various applications, data, and content according to various interactive instructions received from external input, so as to finally display and play various audio and video content. The CPU processor may include multiple processors, for example, one main processor and one or more sub-processors.
In some embodiments, the graphics processor is used to generate various graphic objects, such as icons, operation menus, and graphics displayed in response to user input instructions. The graphics processor includes an arithmetic unit, which performs operations by receiving the various interactive instructions input by the user and displays various objects according to display attributes, and a renderer, which renders the various objects obtained by the arithmetic unit; the rendered objects are then displayed on the display.
In some embodiments, the video processor is used to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image synthesis according to the standard codec protocol of the input signal, so as to obtain a signal that can be directly displayed or played on the display device 200.
In some embodiments, the video processor includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module is used to demultiplex the input audio/video data stream. The video decoding module is used to process the demultiplexed video signal, including decoding and scaling. The image synthesis module, such as an image synthesizer, is used to superimpose and mix the GUI signal generated by the graphics generator (according to user input or by itself) with the scaled video image, so as to generate an image signal available for display. The frame rate conversion module is used to convert the frame rate of the input video. The display formatting module is used to convert the received frame-rate-converted video output signal into a signal conforming to the display format, such as an output RGB data signal.
In some embodiments, the audio processor is used to receive an external audio signal and, according to the standard codec protocol of the input signal, perform processing such as decompression and decoding, as well as noise reduction, digital-to-analog conversion, and amplification, to obtain a sound signal that can be played through a loudspeaker.
In some embodiments, the user may input user commands on a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the user input commands through the GUI. Alternatively, the user may input user commands through specific sounds or gestures, and the user input interface receives the user input commands by recognizing the sounds or gestures through sensors.
In some embodiments, a "user interface" is a medium interface for interaction and information exchange between an application or operating system and a user; it implements the conversion between the internal form of information and a form acceptable to the user. A commonly used form of user interface is the graphical user interface (GUI), which refers to a user interface related to computer operations that is displayed graphically. It may be an interface element such as an icon, a window, or a control displayed on the display screen of an electronic device, where controls may include visual interface elements such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets.
In some embodiments, the system of the display device may include a kernel, a command parser (shell), a file system, and applications. The kernel, shell, and file system together form the basic operating system structure; they allow users to manage files, run programs, and use the system. After power-on, the kernel starts, activates the kernel space, abstracts the hardware, initializes hardware parameters, and runs and maintains virtual memory, the scheduler, signals, and inter-process communication (IPC). After the kernel starts, the shell and user applications are loaded. An application is compiled into machine code after startup, forming a process.
Referring to FIG. 4, in some embodiments, the system is divided into four layers, which are, from top to bottom, the applications layer ("application layer"), the application framework layer ("framework layer"), the Android runtime and system library layer ("system runtime library layer"), and the kernel layer.
In some embodiments, at least one application runs in the application layer. These applications may be a window program, a system settings program, or a clock program that comes with the operating system, or they may be applications developed by third-party developers. In specific implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an application programming interface (API) and a programming framework for the applications. The application framework layer includes some predefined functions. The application framework layer acts as a processing center that decides how the applications in the application layer act. Through the API, an application can access the resources in the system and obtain the services of the system during execution.
As shown in FIG. 4, the application framework layer in the embodiments of the present application includes managers, content providers, and the like, where the managers include at least one of the following modules: an Activity Manager, used to interact with all activities running in the system; a Location Manager, used to provide system services or applications with access to the system location service; a Package Manager, used to retrieve various information about the application packages currently installed on the device; a Notification Manager, used to control the display and clearing of notification messages; and a Window Manager, used to manage the icons, windows, toolbars, wallpapers, and desktop widgets on the user interface.
In some embodiments, the Activity Manager is used to manage the life cycle of each application and the usual navigation back-off functions, such as controlling the exit, opening, and back operations of applications. The Window Manager is used to manage all window programs, such as obtaining the display screen size, determining whether there is a status bar, locking the screen, capturing the screen, and controlling changes in display windows (for example, shrinking a display window, shaking a display, or distorting a display).
In some embodiments, the system runtime library layer provides support for the upper layer, i.e., the framework layer. When the framework layer is used, the Android operating system runs the C/C++ libraries contained in the system runtime library layer to implement the functions required by the framework layer.
In some embodiments, the kernel layer is the layer between hardware and software. As shown in FIG. 4, the kernel layer includes at least one of the following drivers: an audio driver, a display driver, a Bluetooth driver, a camera driver, a WiFi driver, a USB driver, an HDMI driver, sensor drivers (such as a fingerprint sensor, a temperature sensor, and a pressure sensor), and a power driver.
The hardware or software architecture in some embodiments may be based on the description in the above embodiments, and in some embodiments may be based on other similar hardware or software architectures, as long as the technical solutions of the present application can be implemented.
To clearly illustrate the embodiments of the present application, a speech recognition network architecture provided by the embodiments of the present application is described below with reference to FIG. 5.
Referring to FIG. 5, FIG. 5 is a schematic diagram of a speech recognition network architecture provided by an embodiment of the present application. In FIG. 5, the smart device is used to receive input information and output the processing result of that information. The speech recognition service device is an electronic device deployed with a speech recognition service, the semantic service device is an electronic device deployed with a semantic service, and the business service device is an electronic device deployed with a business service. The electronic devices here may include servers, computers, and the like. The speech recognition service, the semantic service (also called a semantic engine), and the business service here are web services that can be deployed on electronic devices, where the speech recognition service is used to recognize audio as text, the semantic service is used to perform semantic parsing on the text, and the business service is used to provide specific services such as the weather query service of Moji Weather and the music query service of QQ Music. In one embodiment, in the architecture shown in FIG. 5, there may be multiple entity service devices deployed with different business services, or one or more functional services may be aggregated in one or more entity service devices.
In some embodiments, the process of handling information input to the smart device based on the architecture shown in FIG. 5 is described below by way of example. Taking the information input to the smart device as a query sentence input by voice, the process may include the following three stages:
[Speech Recognition]
After receiving the query sentence input by voice, the smart device can upload the audio of the query sentence to the speech recognition service device, so that the speech recognition service device recognizes the audio as text through the speech recognition service and returns the text to the smart device. In one embodiment, before uploading the audio of the query sentence to the speech recognition service device, the smart device may perform denoising processing on the audio, where the denoising processing may include steps such as removing echo and ambient noise.
[Semantic Understanding]
The smart device uploads the text of the query sentence recognized by the speech recognition service to the semantic service device, so that the semantic service device performs semantic parsing on the text through the semantic service to obtain the business domain, intent, and the like of the text.
[Semantic Response]
According to the semantic parsing result of the text of the query sentence, the semantic service device issues a query instruction to the corresponding business service device to obtain the query result given by the business service. The smart device can obtain the query result from the semantic service device and output it. As an embodiment, the semantic service device can also send the semantic parsing result of the query sentence to the smart device, so that the smart device outputs the feedback sentence in the semantic parsing result.
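A minimal sketch of these three stages chained together is shown below; the asr, nlu, and business callables stand in for the speech recognition, semantic, and business services of FIG. 5, and their interfaces (including the shape of the parse result) are assumptions for illustration only.

```python
# Sketch of the recognition -> understanding -> response pipeline. In the
# architecture of FIG. 5, each stage may live on a separate server or be
# merged into the terminal itself.

def denoise(audio_bytes):
    # Placeholder: a real implementation would remove echo and ambient noise
    # before the audio is uploaded for recognition.
    return audio_bytes

def process_voice_query(audio_bytes, asr, nlu, business):
    audio = denoise(audio_bytes)
    text = asr(audio)                       # [Speech Recognition]: audio -> text
    parse = nlu(text)                       # [Semantic Understanding]: domain, intent, slots
    result = business(parse["domain"],      # [Semantic Response]: query the business service
                      parse["intent"],
                      parse.get("slots", {}))
    return result                           # shown or spoken by the device
```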
It should be noted that the architecture shown in FIG. 5 is only an example and does not limit the protection scope of the present application. In the embodiments of the present application, other architectures may also be used to implement similar functions; for example, all or part of the three stages may be completed by the smart terminal, which will not be detailed here.
In some embodiments, the smart device shown in FIG. 5 may be a display device, such as a smart TV. The function of the speech recognition service device may be implemented jointly by a sound collector and a controller provided on the display device, and the functions of the semantic service device and the business service device may be implemented by the controller of the display device or by the server of the display device.
In some embodiments, a query sentence or other interactive sentence that the user inputs to the display device by voice may be called a voice instruction.
In some embodiments, what the display device obtains from the semantic service device is the query result given by the business service. The display device can analyze the query result, generate response data for the voice instruction, and then control the display device to perform the corresponding action according to the response data.
In some embodiments, what the display device obtains from the semantic service device is the semantic parsing result of the voice instruction. The display device can analyze the semantic parsing result, generate response data, and then control the display device to perform the corresponding action according to the response data.
In some embodiments, a voice control button may be provided on the remote control of the display device. After the user presses and holds the voice control button on the remote control, the controller of the display device can control the display of the display device to show a voice interaction interface, and control a sound collector, such as a microphone, to collect the sound around the display device. At this time, the user can input voice instructions to the display device.
In some embodiments, the display device can support a voice wake-up function, and the sound collector of the display device can be in a state of continuously collecting sound. After the user speaks the wake-up word, the display device performs speech recognition on the voice instruction input by the user, and after recognizing that the voice instruction is the wake-up word, it can control the display of the display device to show the voice interaction interface. At this time, the user can continue to input voice instructions to the display device. The wake-up word may be called a voice wake-up instruction, and the voice instructions the user continues to input may be called real-time voice instructions.
In some embodiments, after the user inputs a voice instruction, while the display device is obtaining the response data of the voice instruction or responding according to the response data, the sound collector of the display device can keep collecting sound. The user can press and hold the voice control button on the remote control at any time to re-input a voice instruction, or speak the wake-up word. At this time, the display device can end the previous voice interaction process and start a new voice interaction process according to the user's newly input voice instruction, thereby ensuring the real-time performance of voice interaction.
In some embodiments, when the current interface of the display device is the voice interaction interface, the display device performs speech recognition on the voice instruction input by the user to obtain the text corresponding to the voice instruction. The display device itself or the server of the display device performs semantic understanding on the text to obtain the user's intent, processes the user's intent to obtain a semantic parsing result, and generates response data according to the semantic parsing result.
Exemplarily, in the voice interaction mode in which the display device starts a voice dialogue according to a received voice wake-up instruction, the user who issues the voice wake-up instruction may be called the target person.
In some embodiments, the target person may also be a registered user on the display device. When registering on the display device, the user can enter voiceprint information and a face image on the display device.
In some voice interaction scenarios, due to environmental noise interference, voice interference from non-target persons, and other causes, the display device may be woken up by mistake, or may be unable to accurately recognize the target person's intent from the collected audio, which will seriously affect the user's voice interaction experience.
To solve the above problems, the embodiments of the present application present a voice interaction solution that, by combining audio signal processing and video signal processing methods, can effectively improve the voice interaction experience in complex voice interaction scenarios.
In some embodiments, to collect the video signals required during voice interaction, the display device may be provided with a camera or connected to an external camera.
Taking a display device provided with a camera as an example, see FIG. 6, which is a schematic diagram of a voice interaction scenario according to some embodiments. As shown in FIG. 6, in some embodiments, the display device 200 may be provided with a camera 201, and the camera 201 can capture images. If the camera is fixed on the display device 200, it can only capture images within a certain field of view, such as the images within area A in FIG. 6; the field-of-view angle of area A may be α, where α is less than 180 degrees. When the user stands in area A, the camera 201 can capture the user; when the user stands in area B or area C, the camera 201 cannot capture the user, where area B is the area to the left of area A and area C is the area to the right of area A.
In some embodiments, to expand the field of view of the camera 201, the camera 201 may be provided with a pan-tilt head or another structure capable of adjusting the camera's field-of-view angle. The pan-tilt head can adjust the field of view of the camera 201, and the controller of the display device can be connected to the camera and dynamically control the camera's field of view through the pan-tilt head, so that after rotation the field of view of the camera 201 can reach 0-180 degrees, allowing it to capture a user anywhere within the 0-180-degree range, where 0 degrees means the user stands to the left of the display device 200 in the same plane as the display device 200, and 180 degrees means the user stands to the right of the display device 200 in the same plane as the display device 200. It can be seen that by rotating the camera, users located in areas B and C can be captured, achieving the effect that as long as the user stands in front of the display device 200, the user can be captured by rotating the camera 201.
In some embodiments, the display device performs image collection through an external camera, which may be a camera provided with a pan-tilt head, so as to realize image collection over a dynamic field of view.
In some embodiments, the camera 201 itself has no pan-tilt head but can be installed on a pan-tilt head communicatively connected to the display device; through the display device's control of the pan-tilt head, image collection over a dynamic field of view can also be realized.
In some embodiments, for the method by which the display device combines audio signal processing and video signal processing, see FIG. 7, a schematic diagram of signal processing during voice interaction according to some embodiments.
As shown in FIG. 7, in some embodiments, the display device obtains an audio signal from the audio collected by the microphone, and the processing of the audio signal includes sound source localization, user attribute recognition, voiceprint recognition, speech recognition, and noise-reduction enhancement.
Sound source localization may include determining the angle between the audio source and the display device. Referring to FIG. 6, if α is 90 degrees, then when the user is in area B the wake-up angle is between 0 and 45 degrees, when the user is in area A the wake-up angle is between 45 and 135 degrees, and when the user is in area C the wake-up angle is between 135 and 180 degrees. Sound source localization can be implemented through a variety of algorithms, such as the time-difference-of-arrival method and the beamforming method. To implement sound source localization, the display device can be provided with a microphone array, which includes multiple microphones arranged at different positions of the display device, each connected to the controller of the display device. The display device collects multiple audio signals through the microphones of the array and obtains the wake-up angle by comprehensively analyzing them. For example, in the time-difference-of-arrival method, the display device can calculate the position of the audio source relative to the display device according to the time differences with which the microphones receive the audio signal and the relative positional relationship between the microphones. In the beamforming method, the display device can filter and apply weighted superposition to the audio signals collected by the microphones to form a sound pressure distribution beam, and obtain the position of the audio source relative to the display device according to the distribution characteristics of the sound pressure beam. From the position of the audio source relative to the display device, the angle between the audio source and the display device is obtained.
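For concreteness, the sketch below estimates the arrival angle for the simplest two-microphone case using the time-difference-of-arrival idea just described: the inter-channel delay is found by cross-correlation, then converted to an angle via cos(θ) = c·τ/d. Real arrays use more microphones and more robust estimators (e.g., GCC-PHAT); the speed of sound and the signal format (NumPy arrays) are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0     # m/s at room temperature (assumption)

def wake_angle_two_mics(sig_left, sig_right, mic_distance_m, sample_rate):
    """Estimate the arrival angle (0-180 deg, as in FIG. 6) from two
    time-aligned microphone channels given as NumPy arrays."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = corr.argmax() - (len(sig_right) - 1)   # delay in samples; sign gives the side
    tau = lag / sample_rate                      # delay in seconds
    cos_theta = np.clip(SPEED_OF_SOUND * tau / mic_distance_m, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```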
The user attributes determined by user attribute recognition may include the user's gender and age, where the age may be an age band, such as 1-10 years old, 11-20 years old, 20-40 years old, 40-60 years old, and so on. User attribute recognition can be implemented based on a pre-trained model. By collecting a large number of audio samples with different user attributes, a model capable of predicting user attributes can be trained based on a neural network; after an audio signal is input into the model, the user attributes are obtained.
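A minimal sketch of the attribute-recognition interface might look as follows; the classifier, its feature input, and the label encoding are placeholders, and only the age bands mirror those named above.

```python
AGE_BANDS = ["1-10", "11-20", "20-40", "40-60"]   # bands listed in the text

def predict_user_attributes(audio_features, model):
    """model: any pretrained classifier whose predict() returns
    (gender_id, band_id); the encoding below is an assumption."""
    gender_id, band_id = model.predict(audio_features)
    gender = "female" if gender_id == 0 else "male"
    return gender, AGE_BANDS[band_id]
```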
Noise-reduction enhancement may include enhancing the target person's voice in the audio signal and performing noise reduction on non-target audio. Noise-reduction enhancement can be implemented through directional speech enhancement technology: the speech enhancement beam is dynamically adjusted to an enhancement beam centered on the target person, enhancing the target person's speech and suppressing sound outside the beam.
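A classic way to realize such a directional enhancement beam is delay-and-sum beamforming, sketched below under the assumption of a uniform linear microphone array; this illustrates the general technique, not the patent's specific implementation, and the wrap-around edge effect of np.roll is ignored for brevity.

```python
import numpy as np

def delay_and_sum(channels, mic_positions_m, steer_angle_deg,
                  sample_rate, speed_of_sound=343.0):
    """Time-align the channels toward the target angle and average them,
    reinforcing sound from the steered direction and attenuating the rest."""
    theta = np.radians(steer_angle_deg)
    out = np.zeros_like(channels[0], dtype=float)
    for sig, x in zip(channels, mic_positions_m):
        delay_s = x * np.cos(theta) / speed_of_sound   # plane-wave delay per mic
        shift = int(round(delay_s * sample_rate))
        out += np.roll(sig, -shift)                    # align toward the target
    return out / len(channels)
```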
It can be seen that the processing of the audio signal realizes the determination of the user's position, the user's identity, and the audio content. After the user's position is determined, the camera can be controlled to rotate to quickly locate the target person. After the user's identity is determined, the target person of the voice interaction can be distinguished from other people. After the audio content is determined, the user's intent can be obtained.
As shown in FIG. 7, in some embodiments, the display device obtains a video signal from the images captured by the camera, and the processing of the video signal includes face detection and tracking, face recognition, lip movement detection, and lip reading.
During face detection and tracking, the camera can be controlled to rotate to ensure that the target person is always within the camera's field of view.
Lip movement detection can determine, from the faces captured by the camera, whether the lips have changed. If the lips have changed, it can be determined that the person is speaking; if the lips have not changed, it can be determined that the person is not speaking. If a person is speaking, face recognition can be used to determine whether the person speaking is the target person. If the persons speaking include the target person, it can be determined that the received audio signal contains the target person's voice; if the persons speaking do not include the target person, it can be determined that the received audio signal does not contain the target person's voice.
Lip reading can identify the speech content of the person who is speaking. This speech content can be compared and analyzed against the audio content obtained from speech recognition of the audio signal: if they are consistent or roughly consistent, the audio content of the audio signal can be considered to come from a person in the images captured by the camera; conversely, if the difference is large, the audio content can be considered to come from a person or environment outside the images captured by the camera. In some embodiments, lip reading may be omitted and only lip movement detection is used to determine whether the audio content of the audio signal comes from a person in the images captured by the camera, which reduces the resource consumption of video signal processing and improves processing efficiency.
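A simple form of lip movement detection, sketched below, computes a mouth-opening ratio from per-frame lip landmarks and flags speech when the ratio varies enough over a short window of frames; the 68-point landmark indexing and the threshold are assumptions, and the landmark extractor itself is assumed to exist.

```python
import numpy as np

def mouth_open_ratio(landmarks):
    # landmarks: array of (x, y) points; indices follow the common
    # 68-point convention (60/64 inner mouth corners, 62/66 inner lips).
    top, bottom = landmarks[62], landmarks[66]
    left, right = landmarks[60], landmarks[64]
    vertical = np.linalg.norm(np.subtract(bottom, top))
    horizontal = np.linalg.norm(np.subtract(right, left)) + 1e-9
    return vertical / horizontal

def lips_are_moving(landmark_frames, threshold=1e-3):
    """landmark_frames: per-frame landmark arrays over a short window.
    A changing mouth-opening ratio suggests the person is speaking."""
    ratios = [mouth_open_ratio(lm) for lm in landmark_frames]
    return np.var(ratios) > threshold
```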
It can be seen that the processing of the video signal realizes tracking of the user's position and determination of whether the user is speaking.
As shown in FIG. 7, in some embodiments, after obtaining the processing result of the audio signal and the processing result of the video signal, the display device can also obtain application scene information. The application scene information may include interaction control information preset by the application running in the foreground, and the interaction control information may include an audio collection control parameter and a video collection control parameter. Exemplarily, an audio collection control parameter value of 1 indicates that audio data can currently be collected, and a value of 0 indicates that it cannot; a video collection control parameter value of 1 indicates that video data can currently be collected, and a value of 0 indicates that it cannot.
For example, when the application running in the foreground is a video chat application, the audio collection control parameter and the video collection control parameter are both 1, and the fusion decision engine can, based on both parameters being 1, comprehensively analyze the processing result of the audio signal and the processing result of the video signal to obtain a recognition result.
When the application running in the foreground is an online teaching application and the display device is the student terminal, there may be moments when students are not allowed to speak; at such times the audio collection control parameter can be 0. The fusion decision engine can determine, according to the audio collection control parameter and the video collection control parameter, whether to adopt the processing results of the audio signal and the video signal to obtain a recognition result, or determine whether to control the display device to collect audio data and video data.
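A sketch of how the fusion decision engine might gate its inputs with these control parameters is given below; the structure of the scene information dict and of the fused output are assumptions for illustration.

```python
def fuse(audio_result, video_result, scene):
    """Gate the audio/video processing results with the foreground app's
    capture-control parameters (1 = collection allowed, 0 = not allowed)."""
    use_audio = scene.get("audio_capture", 0) == 1
    use_video = scene.get("video_capture", 0) == 1
    if not use_audio and not use_video:
        return None                          # nothing may be collected right now
    return {
        "speech": audio_result if use_audio else None,   # adopt audio result only if allowed
        "speaker": video_result if use_video else None,  # adopt video result only if allowed
    }

# e.g. video chat: fuse(asr_out, lip_out, {"audio_capture": 1, "video_capture": 1})
# e.g. muted student terminal: fuse(asr_out, lip_out, {"audio_capture": 0, "video_capture": 1})
```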
In some embodiments, the processing result of the audio signal, the processing result of the video signal, and the application scene information are input into a feature fusion decision engine, which can output a multimodal recognition result. The multimodal recognition result may include the speech content of the audio signal, the persons who are speaking, the persons who are not speaking, and the correspondence between persons and speech content. According to this correspondence, it can be determined whether the target person has spoken and, if the target person has spoken, what was said, thereby improving the accuracy of speech recognition and reducing the probability of false wake-ups.
In some embodiments, both the audio signal processing and the video signal processing shown in FIG. 7 can be implemented locally by the display device; alternatively, the display device can send the audio signal and the video signal to the server, which processes them in the cloud and returns the processing results to the display device; or some functions can be implemented locally by the display device and some by the server.
In some embodiments, the user's voiceprint information and face information are private information. The display device can be configured to display option controls for privacy functions such as voiceprint recognition and face recognition during power-on navigation and when the user uses the voice assistant function for the first time, together with a prompt confirming the use of the camera and far-field voice. After viewing the prompt, the user can click the option control and choose to enable the above privacy functions to improve the voice interaction effect, or choose not to trigger the option control, thereby improving privacy security.
To further describe the signal processing of the voice interaction process shown in FIG. 7, FIG. 8 shows a timing diagram of the voice interaction process according to some embodiments. It should be noted that this timing diagram is only one exemplary timing diagram of the signal processing shown in FIG. 7; in actual embodiments, the signal processing shown in FIG. 7 may also include other timings.
Referring to FIG. 8, during voice interaction, the display device can interact with the user and with the server, respectively, to provide the user with voice control services.
In FIG. 8, the user may be the target person who wants to control the display device. Besides the target person, the microphone of the display device may also collect other users' voices or ambient noise.
In some embodiments, the wake-up word that the user inputs to the display device may be a voice wake-up instruction. After receiving the voice wake-up instruction, the display device can perform sound source localization on it, calculate the wake-up sound source position, and obtain wake-up sound source position information, which may include the wake-up angle.
In some embodiments, after calculating the wake-up angle, the display device can turn on the camera and rotate it toward the wake-up sound source according to the wake-up angle, shooting continuously during the rotation to obtain images over a dynamic field of view. The wake-up sound source is the target person. Of course, if no user input a wake-up word to the display device, i.e., the display device was woken up by mistake, the wake-up sound source may not actually exist.
In some embodiments, after collecting the voice wake-up instruction, to guard against the instruction being a falsely triggered wake-up, the display device can obtain the user identity information associated with the voice wake-up instruction and then detect the target person in the collected dynamic-field-of-view images. If the target person can be located in the collected images, it can be determined that this wake-up is not a false wake-up; if the target person cannot be located, it can be determined that this wake-up is a false wake-up.
In some embodiments, to obtain the user identity information, the display device can generate a wake-up word recognition request containing the wake-up word audio and send the request to the server. After receiving the wake-up word recognition request, the server extracts the wake-up word audio from the request and performs user attribute recognition and voiceprint recognition on it to obtain the target person's user identity information.
In some embodiments, when performing user attribute recognition and voiceprint recognition, the server can obtain voiceprint features, match them against the voiceprint features in a database, and obtain the user identity information of the voice wake-up instruction according to the matching result, where the voiceprint features in the database can be obtained from audio data pre-recorded by users. The user identity information may include the target person's voiceprint mark U1, gender US1, and age UA1, where gender US1 and age UA1 can be user attributes, and age UA1 can represent an age band, such as 1-10 years old, 11-20 years old, 20-40 years old, 40-60 years old, and so on.
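Voiceprint matching of this kind is commonly done by comparing embeddings with cosine similarity; the sketch below assumes an embedding extractor upstream and an enrollment database shaped like the U1/US1/UA1 records above, both of which are illustrative rather than specified by the patent.

```python
import numpy as np

def match_voiceprint(query_embedding, enrolled, threshold=0.75):
    """enrolled: list of dicts like {"mark": "U1", "gender": "US1",
    "age_band": "UA1", "embedding": np.ndarray}. Returns the best-matching
    user record, or None if no similarity clears the threshold."""
    best, best_score = None, -1.0
    for user in enrolled:
        e = user["embedding"]
        score = float(np.dot(query_embedding, e) /
                      (np.linalg.norm(query_embedding) * np.linalg.norm(e) + 1e-9))
        if score > best_score:
            best, best_score = user, score
    return best if best_score >= threshold else None   # None => unknown speaker
```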
在一些实施例中,用户属性识别和声纹识别也可由显示设备在本地实现,用户可预先在显示设备上录入音频,显示设备可通过声纹识别,将语音唤醒指令与用户预先录入的音频进行比对,从而得到语音唤醒指令对应的用户身份信息。In some embodiments, user attribute recognition and voiceprint recognition can also be implemented locally by the display device, the user can pre-record audio on the display device, and the display device can perform voice wake-up instruction with the audio pre-recorded by the user through voiceprint recognition By comparing, the user identity information corresponding to the voice wake-up command is obtained.
在一些实施例中,用户身份信息也可包括目标人的人脸图像,便于后续进行人脸识别,该头像可存储在服务器和/或显示设备中,显示设备在获取语音唤醒指令对应的用户身份信息时,可获取该人脸图像。In some embodiments, the user identity information may also include a face image of the target person to facilitate subsequent face recognition. The avatar may be stored in the server and/or the display device, and the display device is acquiring the user identity corresponding to the voice wake-up command. information, the face image can be obtained.
In some embodiments, while requesting the user identity information from the server, the display device may leave the camera's field of view unchanged, directly control the camera to capture images, and run face recognition on those images locally. At this point, the target person can be photographed only if they happen to be standing within the camera's current field of view; a user outside it will not be captured. To reduce the display device's resource consumption, face recognition at this stage may detect only the age and gender features of a face. After receiving the user identity information from the server, the device can extract the gender US1 and age UA1 from it and compare them against the face recognition result. If no face is detected, or the age or gender of the detected face does not match the user identity information, the device adjusts the camera's field of view and captures new images for face detection. Of course, if the user identity information includes a face image, face recognition may also check whether a face in the captured image matches that face image; compared with matching only age and gender, this improves the accuracy of identifying the target person, but may slow the identification down.
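A minimal sketch of this cheap cross-check, assuming the local face detector reports the same gender and age-band labels used in the user identity information (an assumption; the disclosure does not fix a label format):

```python
def identity_matches_face(identity: dict, face: dict | None) -> bool:
    """Cheap check using only the age band and gender of a detected face.

    identity comes from voiceprint recognition (e.g. gender US1, age UA1);
    face is the local face recognition result, or None if no face found."""
    return (face is not None
            and face.get("gender") == identity.get("gender")
            and face.get("age_band") == identity.get("age_band"))

# On a mismatch (or no face), the device would adjust the camera's field
# of view and re-run detection on fresh frames, as described above.
```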
In some embodiments, if the display device cannot identify the target person in the images captured under the camera's current field of view, then once the wake-up angle has been obtained, the device may adjust the camera's field of view according to that angle so that the target person can be captured.
In some embodiments, the display device may adjust the camera's field of view so that it covers the wake-up angle. For example, if the wake-up angle is 30 degrees and the current field of view spans 40-140 degrees, the camera can be rotated to the left by at least 10 degrees so that the target person can be located in the captured images. During the rotation, the camera may continuously take pictures and perform face recognition; once a face matching the user identity information is recognized, the target person has been located and the camera can stop rotating.
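The rotation in this example can be computed directly from the wake-up angle and the current field-of-view boundaries. A small sketch, with angles in degrees; the convention that a negative result means a left rotation is assumed:

```python
def rotation_to_cover(wake_angle: float, fov_min: float, fov_max: float) -> float:
    """Signed rotation (degrees) so that [fov_min, fov_max] covers wake_angle.

    Negative means rotate left. With wake_angle=30 and a current field
    of view of 40-140 degrees this returns -10, matching the example:
    rotate at least 10 degrees to the left."""
    if wake_angle < fov_min:
        return wake_angle - fov_min  # rotate left
    if wake_angle > fov_max:
        return wake_angle - fov_max  # rotate right
    return 0.0  # the wake-up angle is already inside the field of view
```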
In some embodiments, after the display device adjusts the camera's field of view according to the wake-up angle, the captured images may still contain no face matching the user identity information, in which case either the wake-up angle was computed incorrectly or a false wake-up occurred. To handle an incorrectly computed wake-up angle, the display device may rotate the camera to sweep its maximum field of view, i.e. 0-180 degrees; if the target person still cannot be recognized within that range, the voice wake-up command is confirmed as a false wake-up. Alternatively, the display device may re-estimate the target person's angle relative to the device from the real-time voice commands received after the wake-up command and adjust the camera's field of view accordingly; if the target person still cannot be recognized after the adjustment, a false wake-up is confirmed, and the display device may exit the current voice interaction session.
In some embodiments, once the display device has identified the target person in the images captured by the camera, it may track that person's face. Since the target person may walk around, face tracking allows the camera's field of view to be adjusted dynamically so that the target person always remains inside it. During tracking, a preset region can be defined within the captured image, the real-time coordinate range of the target person's face is obtained for each frame, and the camera is rotated according to the trend of that coordinate range. While the coordinate range stays inside the preset region, the camera need not move; once one of its boundaries coincides with a boundary of the preset region, the camera is rotated to follow the trend of the coordinate range so that the face returns to the preset region. For example, if the coordinate range is drifting to the left, the camera is rotated to the left.
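A minimal sketch of this tracking rule, assuming bounding boxes are given as (left, top, right, bottom) pixel coordinates and that the camera moves in small fixed steps; the 5-degree step is an assumed value:

```python
def tracking_rotation(face_box: tuple, preset: tuple, step: float = 5.0) -> float:
    """Rotate only when the face reaches the preset region's boundary.

    Both boxes are (left, top, right, bottom) pixel coordinates in the
    captured image; the return value is a signed rotation in degrees."""
    face_left, _, face_right, _ = face_box
    preset_left, _, preset_right, _ = preset
    if face_left <= preset_left:    # face drifting left out of the region
        return -step                # follow it: rotate the camera left
    if face_right >= preset_right:  # face drifting right
        return step
    return 0.0                      # face still inside the preset region
```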
In some embodiments, a real-time voice command received by the display device after the voice wake-up command may contain any one or more of the target person's voice, other people's voices, and ambient noise. By running lip movement detection on the target person in the captured images, the device can judge whether the target person is speaking: if so, semantic recognition is performed on the real-time voice command; if not, semantic recognition can be skipped.
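One simple way to decide whether lips are moving is to watch the spread of the mouth opening across a short window of frames. This sketch assumes a landmark detector already supplies a normalized per-frame mouth-opening distance; the detector and the 0.02 threshold are assumptions:

```python
import numpy as np

def lips_moving(mouth_openness: list[float], threshold: float = 0.02) -> bool:
    """Judge lip movement from a short window of frames.

    mouth_openness holds, per frame, the normalized distance between the
    upper- and lower-lip landmarks; lips count as moving when that
    distance varies enough across the window."""
    return float(np.std(mouth_openness)) > threshold

# Semantic recognition is then gated on both checks described above:
# the target's lips moving AND the audio containing the target's voice.
```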
In some embodiments, the display device may generate a speech recognition request containing the real-time audio and send it to the server. On receiving the request, the server performs semantic recognition on the real-time audio it contains and returns the semantic recognition result to the display device.
In some embodiments, the server may first perform voiceprint recognition on the real-time audio and, only if the result shows that it contains the target person's voice, go on to perform semantic recognition and return the result to the display device. Alternatively, the server may perform voiceprint recognition and semantic recognition in parallel and return both results to the display device together.
In some embodiments, to improve the accuracy of semantic recognition, the server may apply noise reduction to the real-time voice command before recognizing it.
In some embodiments, if the images captured by the camera contain other people besides the target person, the noise reduction can be implemented through a voice separation mechanism: the mixed speech is separated into single-speaker tracks, voiceprint recognition is run on each track to judge whether it belongs to the target speaker, semantic recognition is then performed on the target speaker's speech, and the non-target speech is discarded. The separation mechanism uses the voiceprint features from the target person's identity information together with noise source modeling to enhance the target voice and suppress all sound other than the target speaker's speech, optimizing recognition of the target voice in this scenario.
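A sketch of the separation-then-filter step, assuming a source-separation front end has already produced one waveform per speaker and that embed() and similarity() wrap some voiceprint model; these names and the threshold are assumptions, not the patent's own interfaces:

```python
import numpy as np

def keep_target_speech(tracks: list, target_embedding: np.ndarray,
                       embed, similarity, threshold: float = 0.75) -> list:
    """Keep only separated tracks whose voiceprint matches the target.

    tracks:     per-speaker waveforms from a source-separation front end
    embed:      callable mapping a waveform to a voiceprint embedding
    similarity: callable scoring two embeddings (higher = more alike)
    """
    kept = []
    for track in tracks:
        if similarity(embed(track), target_embedding) >= threshold:
            kept.append(track)  # target speaker: forward to semantic recognition
        # non-target tracks are simply discarded
    return kept
```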
In some embodiments, after receiving the recognition result from the server, the display device may respond according to it. For example, the result may be R = {U = U1, V = "turn up the volume"}, where R denotes the recognition result, U the voiceprint, and U1 the target person's voiceprint, the target person's spoken content being "turn up the volume". Based on this result, the display device can increase its volume.
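A minimal sketch of how the device side might dispatch such a result; the field names, the intent vocabulary, the device interface, and the volume step are all illustrative:

```python
def handle_recognition_result(result: dict, device) -> None:
    """React to a result such as {"voiceprint": "U1", "intent": "volume_up"}."""
    if result.get("voiceprint") != device.target_voiceprint:
        return  # speech from a non-target user is ignored
    if result.get("intent") == "volume_up":
        # Step size is an assumed value; the patent only says "turn up".
        device.set_volume(device.volume + 10)
```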
In some embodiments, the detailed flow of the speech recognition method of the present application is shown in FIG. 9 to FIG. 12; the technical solution of the present application is described below with reference to that flow.
Referring to FIG. 9, a schematic diagram of the overall flow of a voice interaction method according to some embodiments. After a user inputs a voice wake-up command to the display device, as shown in FIG. 9, the device collects the command, computes a wake-up angle from it, and then rotates the camera according to that angle so that the camera faces the wake-up angle and its capture range covers it.
After orienting the camera toward the wake-up angle, the display device can run face detection and face recognition on the captured images to judge whether the target person appears in them.
If the target person is not detected in the captured images, the wake-up angle is recalculated from the audio being recorded in real time, achieving real-time sound source localization. The camera is then rotated toward the recalculated angle according to the localization result, faces are detected in the newly captured images, and face recognition is run on them to judge whether any of them is the target person.
If the target person is detected in the captured images, the display device controls the camera to track the target person's face and runs lip movement detection and recognition on the captured images. The images may contain the target person's face together with other people's faces, or only the target person's face.
If the lip movement detected in the captured images belongs to the target person, the real-time recording, i.e. the real-time voice command collected by the display device's audio input device, is obtained. The display device may process the recording's beams using beamforming: the target person's speech beam is enhanced and the other beams are suppressed, the target person's beam being determined from the target person's position in the image.
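The patent does not fix a beamforming algorithm; a delay-and-sum beamformer is one concrete possibility. The sketch below assumes a linear microphone array and integer-sample alignment, and ignores wrap-around at the buffer edges:

```python
import numpy as np

def delay_and_sum(frames: np.ndarray, mic_positions: np.ndarray,
                  target_angle_deg: float, fs: int, c: float = 343.0) -> np.ndarray:
    """Enhance sound arriving from target_angle_deg with a linear array.

    frames:        (num_mics, num_samples) time-aligned microphone signals
    mic_positions: (num_mics,) mic offsets along the array, in meters
    fs:            sample rate in Hz; c is the speed of sound in m/s
    """
    theta = np.deg2rad(target_angle_deg)
    # Per-mic arrival delays (in samples) of a plane wave from theta.
    delays = mic_positions * np.cos(theta) / c * fs
    out = np.zeros(frames.shape[1])
    for mic, delay in zip(frames, delays):
        # Undo each mic's delay so the target direction adds coherently;
        # sound from other directions stays misaligned and averages out.
        out += np.roll(mic, -int(round(delay)))
    return out / frames.shape[0]
```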
If the lip movement detected in the captured images belongs to a non-target person, the real-time recording may contain one or more of noise and non-target speech. The display device may apply noise suppression and non-target voice suppression to the recording to improve the recognition accuracy of the next segment of real-time recording; of course, the device may also simply discard the noise or non-target speech.
Before face detection and recognition, the display device may determine the target person in advance from the voice wake-up command; the target person may be the user who issued the command. Referring to FIG. 10, a schematic flowchart of a method for processing a voice wake-up command according to some embodiments. As shown in FIG. 10, when the display device collects wake-up audio, it determines through wake-up word detection that the audio contains a voice wake-up command, computes the wake-up angle from the audio, saves the audio to a preset path, and then uploads the audio at that path to the cloud so that the cloud server can process it. The server's processing may include voiceprint recognition, which determines which user's voiceprint the wake-up audio matches, and user attribute recognition, which yields the gender and age band of the corresponding user. Finally, the wake-up angle computed locally by the display device, together with the voiceprint and user attributes recognized by the cloud server, is taken as the recognition result of the wake-up audio.
As FIG. 10 shows, after collecting the voice wake-up command the display device computes the wake-up angle from it; rotating the camera toward that angle then increases the probability of detecting the target person in the captured images.
In some embodiments, the images captured by the camera may contain a single face or multiple faces. To analyze these two cases specifically, FIG. 11 is a schematic flowchart of a processing method for when a single face is detected during voice interaction according to some embodiments, and FIG. 12 is a schematic flowchart of a processing method for when multiple faces are detected during voice interaction.
Referring to FIG. 11, if the display device detects a single face within the wake-up angle, it may collect the real-time recording and run lip movement detection on the detected face. Processing the real-time recording includes real-time sound source localization, which computes the recording's real-time sound source angle, and cloud recognition, which covers speech recognition and voiceprint recognition. When requesting voiceprint recognition from the cloud server, the display device may upload the voiceprint of the voice wake-up command together with the real-time recording, so that after extracting the recording's voiceprint the server can confirm whether it belongs to the same user as the wake-up command's voiceprint and include that confirmation in the voiceprint recognition result. From the speech recognition and voiceprint recognition results returned by the server, together with the real-time sound source localization obtained locally, the display device obtains the real-time sound source angle, voiceprint, attributes, speech content, and other information corresponding to the recording.
Based on the voiceprint recognition result, the display device can check whether the real-time recording is the target speaker's voice. If it is not, the camera is rotated again toward the real-time sound source angle determined by the localization, and face detection is performed on the newly captured images. If the recording is the target speaker's voice, the device tracks the detected face and obtains its lip movement detection result; if lip movement has occurred, voice interaction proceeds based on the real-time recording.
Referring to FIG. 12, if the display device detects multiple faces within the wake-up angle, it may collect the real-time recording and run lip movement detection on the detected faces. As above, processing the real-time recording includes real-time sound source localization, which computes the recording's real-time sound source angle, and cloud recognition, which covers speech recognition and voiceprint recognition. When requesting voiceprint recognition from the cloud server, the display device may upload the voiceprint of the voice wake-up command together with the real-time recording, so that after extracting the recording's voiceprints the server can confirm whether the recording contains a voiceprint belonging to the same user as the wake-up command's voiceprint, i.e. whether it contains the target person's voiceprint, and include that confirmation in the voiceprint recognition result. From the speech recognition and voiceprint recognition results returned by the server, together with the real-time sound source localization obtained locally, the display device obtains the real-time sound source angle, voiceprint, attributes, speech content, and other information corresponding to the recording.
Based on the voiceprint recognition result, the display device can check whether the real-time recording contains the target speaker's voice. If it does not, the camera is rotated again toward the real-time sound source angle determined by the localization, and face detection is performed on the newly captured images. If the recording does contain the target speaker's voice, the device locates the target person among the detected faces, tracks the target person's face, and obtains its lip movement detection result; if lip movement has occurred, voice interaction proceeds based on the real-time recording.
In the above embodiments, the camera built into or connected to the display device rotates to obtain a larger field-of-view angle, enabling face detection, tracking, and lip movement detection over a wider range; this increases the probability of successfully locating the target person during voice interaction and thus improves the wake-up rate and speech recognition accuracy. In some embodiments, however, the built-in or connected camera has no pan-tilt mount and cannot rotate. In that case, face detection, tracking, and lip movement detection can still be performed within the images of the camera's fixed field of view: face tracking reduces to obtaining the real-time coordinate range of the target person's face, which may be the coordinate range of a rectangular region containing the face, and lip movement detection is performed only on the image within that range. Face tracking thus narrows the region lip movement detection must examine and improves the response speed of the voice interaction, as sketched below.
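A minimal sketch of this fixed-camera case, where tracking reduces to cropping each frame to the face's coordinate range before lip movement detection; the frame is assumed to be a NumPy-style array and the padding value is illustrative:

```python
import numpy as np

def lip_roi(frame: np.ndarray, face_box: tuple, pad: int = 10) -> np.ndarray:
    """Crop the frame to the target face's real-time coordinate range.

    face_box is (left, top, right, bottom) in pixels; lip movement
    detection then only has to examine this small crop per frame."""
    left, top, right, bottom = face_box
    h, w = frame.shape[:2]
    return frame[max(0, top - pad):min(h, bottom + pad),
                 max(0, left - pad):min(w, right + pad)]
```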
As the above embodiments show, by combining sound source localization, face tracking, voiceprint recognition, lip movement detection, and noise reduction, the present application can effectively improve the voice interaction experience in complex voice interaction scenarios.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced, and such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the present application.
For ease of explanation, the above description has been given in conjunction with specific embodiments. However, the exemplary discussion above is not intended to be exhaustive or to limit the embodiments to the specific forms disclosed. Numerous modifications and variations are possible in light of the above teachings. The embodiments were chosen and described to better explain the principles and practical applications, so that those skilled in the art can make better use of them and of the various modified embodiments suited to particular uses.