CN109766759A - Emotion recognition method and related products
- Publication number
- CN109766759A (application CN201811519898.8A)
- Authority
- CN
- China
- Prior art keywords: emotion, target, behavior, limb, images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Embodiments of the present application provide an emotion recognition method and related products. The method includes: acquiring a video clip and an audio clip of a target user over a specified time period; analyzing the video clip to obtain the target user's target limb behavior and target facial expression; performing parameter extraction on the audio clip to obtain the target user's target speech feature parameters; and determining the target user's target emotion according to the target limb behavior, the facial expression, and the speech feature parameters. Through the embodiments of the present application, limb behavior, facial expression, and speech features can be parsed from the recordings; each of these three dimensions reflects the user's emotion to some extent, so deciding the emotion jointly across all three dimensions allows the user's emotion to be identified accurately.
Description
Technical Field
The present application relates to the technical field of video surveillance, and in particular to an emotion recognition method and related products.
Background
With the rapid development of the economy, society, and culture, and with growing influence at home and abroad, more and more migrants are moving to cities. While this population growth accelerates urbanization, it also poses greater challenges for urban management. Video surveillance provides technical support for urban security: cameras are now deployed throughout cities, where they can effectively monitor public safety and assist the security work of relevant institutions. In practice, however, emotion is mainly analyzed by capturing images of a person with a camera and recognizing facial expressions alone, and the accuracy of emotions recognized by this single-cue approach is low.
Summary of the Invention
Embodiments of the present application provide an emotion recognition method and related products that can accurately recognize a user's emotion.
A first aspect of the embodiments of the present application provides an emotion recognition method, including:
acquiring a video clip and an audio clip of a target user over a specified time period;
analyzing the video clip to obtain a target limb behavior and a target facial expression of the target user;
performing parameter extraction on the audio clip to obtain target speech feature parameters of the target user;
determining a target emotion of the target user according to the target limb behavior, the facial expression, and the speech feature parameters.
Optionally, the method further includes:
authenticating the target user;
acquiring an indoor map after the target user passes identity verification;
marking the location of the target user on the indoor map to obtain a target area where the target user is located;
determining, according to a preset mapping relationship between emotions and control parameters, a target control parameter corresponding to the target emotion for controlling at least one smart home device associated with the target area;
adjusting the at least one smart home device according to the target control parameter.
Further optionally, authenticating the target user includes:
acquiring a first face image of the target user;
performing image quality evaluation on the first face image to obtain a target image quality evaluation value;
determining, according to a preset mapping relationship between image quality evaluation values and matching thresholds, a target matching threshold corresponding to the target image quality evaluation value;
performing contour extraction on the first face image to obtain a first peripheral contour;
performing feature point extraction on the first face image to obtain a first feature point set;
matching the first peripheral contour against a second peripheral contour of a preset face template to obtain a first matching value;
matching the first feature point set against a second feature point set of the preset face template to obtain a second matching value;
determining a target matching value according to the first matching value and the second matching value;
confirming, when the target matching value is greater than the target matching threshold, that the target user passes identity verification.
A second aspect of the embodiments of the present application provides an emotion recognition apparatus, including:
an acquisition unit, configured to acquire a video clip and an audio clip of the target user over a specified time period;
an analysis unit, configured to analyze the video clip to obtain the target limb behavior and the target facial expression of the target user;
an extraction unit, configured to perform parameter extraction on the audio clip to obtain the target speech feature parameters of the target user;
a decision unit, configured to decide the target emotion of the target user according to the target limb behavior, the facial expression, and the speech feature parameters.
In a third aspect, embodiments of the present application provide an emotion recognition apparatus including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product including a non-transitory computer-readable storage medium that stores a computer program operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
Implementing the embodiments of the present application provides the following beneficial effects:
It can be seen that, with the emotion recognition method and related products of the embodiments of the present application, a video clip and an audio clip of a target user over a specified time period are acquired; the video clip is analyzed to obtain the target user's target limb behavior and target facial expression; parameter extraction is performed on the audio clip to obtain the target user's target speech feature parameters; and the target user's target emotion is decided according to the target limb behavior, the facial expression, and the speech feature parameters. In this way, limb behavior, facial expression, and speech features are parsed from the recordings; each of these three dimensions reflects the user's emotion to some extent, so deciding the emotion jointly across the three dimensions allows the user's emotion to be identified accurately.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1A is a schematic flowchart of an emotion recognition method provided by an embodiment of the present application;
FIG. 1B is a schematic diagram illustrating another emotion recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another emotion recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an emotion recognition apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of another emotion recognition apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The terms "first", "second", "third", and "fourth" in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. Furthermore, the terms "including" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment in every case, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The emotion recognition apparatus described in the embodiments of the present application may include a smartphone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a PDA, a notebook computer, a video matrix, a monitoring platform, a mobile Internet device (MID), or a wearable device. These are merely examples rather than an exhaustive list; the apparatus is not limited to the above devices and may, of course, also be a server.
It should be noted that the emotion recognition apparatus in the embodiments of the present application may be connected to multiple cameras, each of which can be used to capture video images and each of which may have a corresponding position mark or a corresponding number. Typically, cameras may be installed in public places such as schools, museums, intersections, pedestrian streets, office buildings, garages, airports, hospitals, subway stations, railway stations, bus stops, supermarkets, hotels, and entertainment venues. After a camera captures a video image, it can save the image to the memory of the system where the emotion recognition apparatus resides. The memory may store multiple image libraries, each of which may contain different video images of the same person; of course, an image library may also be used to store the video images of one area or the video images captured by a specified camera.
Further optionally, in the embodiments of the present application, each frame of video image captured by a camera corresponds to attribute information, which is at least one of the following: the shooting time of the video image, the location of the video image, the attribute parameters of the video image (format, size, resolution, etc.), the number of the video image, and the character attributes of the persons in the video image. The character attributes may include, but are not limited to: the number of persons in the video image, their positions, their angles, their ages, the image quality, and so on.
It should be further noted that the video images collected by each camera are usually dynamic face images. Accordingly, embodiments of the present application may specify requirements on the angle information of a face image, which may include, but is not limited to: horizontal rotation angle, pitch angle, and tilt angle. For example, dynamic face image data may be required to have an eye distance of no less than 30 pixels, with more than 60 pixels recommended; the horizontal rotation angle should not exceed ±30°, the pitch angle should not exceed ±20°, and the tilt angle should not exceed ±45°, while it is recommended that the horizontal rotation angle not exceed ±15°, the pitch angle not exceed ±10°, and the tilt angle not exceed ±15°. Face images may also be screened for occlusion by other objects: normally, accessories such as dark sunglasses, masks, and exaggerated jewelry should not cover the main area of the face; of course, the camera lens itself may also be covered with dust, causing the face image to be occluded. The image formats of the video images in the embodiments of the present application may include, but are not limited to: BMP, JPEG, JPEG2000, and PNG, with sizes between 10 and 30 KB. Each video image may also be associated with information such as a shooting time, the unified number of the camera that captured it, and a link to the corresponding panoramic image (a feature correspondence file is established between the face image and the global image).
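As an illustration of the screening rules above, a minimal sketch might look like the following; the thresholds are taken from the passage, while the function shape and parameter names are assumptions:

```python
def face_image_acceptable(eye_distance_px, yaw_deg, pitch_deg, roll_deg, strict=False):
    """Screen a dynamic face image against the eye-distance and angle limits above.

    strict=True applies the recommended limits; strict=False applies the
    hard limits. The function itself is an assumed illustration.
    """
    if eye_distance_px < 30:  # at least 30 px required; more than 60 px recommended
        return False
    if strict:                # recommended limits
        return abs(yaw_deg) <= 15 and abs(pitch_deg) <= 10 and abs(roll_deg) <= 15
    return abs(yaw_deg) <= 30 and abs(pitch_deg) <= 20 and abs(roll_deg) <= 45
```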
Please refer to FIG. 1A, a schematic flowchart of an emotion recognition method provided by an embodiment of the present application. The emotion recognition method described in this embodiment includes the following steps:
101. Acquire a video clip and an audio clip of a target user over a specified time period.
The specified time period may be set by the user or default to a system value. The video clip and the audio clip may cover the same physical scene and the same specified time period. The target user may be any user: the target user can be tracked and filmed to obtain the video clip while the scene is recorded at the same time to obtain the audio clip.
102. Analyze the video clip to obtain the target limb behavior and the target facial expression of the target user.
The video clip captures a series of the target user's body movements, which can be recognized to obtain the target limb behavior; the target limb behavior may comprise at least one behavior. In the embodiments of the present application, a limb behavior may be at least one of the following: walking, standing with arms akimbo, running, raising a hand, typing on a keyboard, thinking, holding hands, fighting, and so on, without limitation here. Of course, the video clip may also contain a human face, and performing expression recognition on the face yields the target facial expression. In the embodiments of the present application, an expression may be at least one of the following: happiness, anger, sadness, joy, depression, restlessness, surprise, embarrassment, excitement, nervousness, and so on, without limitation here.
Optionally, step 102 above, analyzing the video clip to obtain the limb behavior and the facial expression of the target user, may include the following steps:
21. Parse the video clip to obtain multiple frames of video images;
22. Perform image segmentation on the multiple frames of video images to obtain multiple target images, each target image being a human body image of the target user;
23. Perform behavior recognition on the multiple target images to obtain the target limb behavior;
24. Perform face recognition on the multiple target images to obtain multiple face images;
25. Perform expression recognition on the multiple face images to obtain multiple expressions, and take the most frequently occurring expression among them as the target facial expression.
In a specific implementation, the video clip can be parsed into multiple frames of video images. Of course, not every frame contains a human body image, so image segmentation can be performed on the frames to obtain multiple target images, each containing the user's entire body. Behavior recognition can then be performed on the target images to obtain the target limb behavior: specifically, the target images can be input into a preset neural network model, which may be set by the user or default to a system value (for example, a convolutional neural network model), to obtain at least one limb behavior. Face recognition can also be performed on the target images to obtain multiple face images, on which expression recognition is then performed; since different face images may correspond to different expressions, multiple expressions can be obtained, and the most frequently occurring expression among them is taken as the target facial expression. In this way, limb behavior can be parsed from a series of video images, and the user's facial expression can also be recognized accurately.
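As a minimal sketch of steps 21-25, assuming hypothetical callables for the segmentation, behavior, face detection, and expression models (the patent does not fix their APIs), the per-frame pipeline and the majority vote over expressions might look like this:

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple

def analyze_clip(
    frames: Iterable,
    segment_person: Callable,       # step 22: image segmentation model (assumed)
    recognize_behavior: Callable,   # step 23: limb-behavior model (assumed)
    detect_face: Callable,          # step 24: face detector (assumed)
    classify_expression: Callable,  # step 25: expression classifier (assumed)
) -> Tuple[List, Optional[str]]:
    behaviors, expressions = [], []
    for frame in frames:            # step 21: the clip parsed into frames
        person = segment_person(frame)
        if person is None:          # not every frame contains the user's body
            continue
        behaviors.append(recognize_behavior(person))
        face = detect_face(person)
        if face is not None:
            expressions.append(classify_expression(face))
    # Step 25: the most frequently occurring expression is the target expression.
    target = Counter(expressions).most_common(1)[0][0] if expressions else None
    return behaviors, target
```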
103. Perform parameter extraction on the audio clip to obtain the target speech feature parameters of the target user.
In the embodiments of the present application, the speech feature parameters may include at least one of the following: speech rate, intonation, keywords, timbre, and so on, without limitation here; timbre is related to the frequency of the voice. Speech carries information that indirectly reflects the user's mood: for example, a user saying "delighted" may indicate a mood of happiness or joy, while a user saying "awkward" may indicate embarrassment.
Optionally, the target speech feature parameters include at least one of the following: a target speech rate, a target intonation, and at least one target keyword. Step 103 above, performing parameter extraction on the audio clip to obtain the target speech feature parameters of the target user, may include the following steps:
31. Perform semantic analysis on the audio clip to obtain multiple characters;
32. Determine the pronunciation duration corresponding to the multiple characters, and determine the speech rate according to the multiple characters and the pronunciation duration;
or,
33. Perform character segmentation on the multiple characters to obtain multiple keywords;
34. Match the multiple keywords against a preset keyword set to obtain the at least one target keyword;
or,
35. Extract the waveform of the speech clip;
36. Parse the waveform to obtain the target intonation.
In a specific implementation, semantic analysis can be performed on the audio clip to obtain multiple characters, each of which may correspond to a pronunciation duration; the pronunciation duration corresponding to the multiple characters can then be determined, with speech rate = pronunciation duration corresponding to the characters / total number of characters. Further, character segmentation can be performed on the characters to obtain multiple keywords, each of which can be understood as a word. In the embodiments of the present application, a preset keyword set containing at least one keyword may be stored in advance, and the extracted keywords are matched against it to obtain at least one successfully matched target keyword. As for intonation, the waveform of the speech clip can be extracted and parsed to obtain the target intonation, for example by taking the average amplitude as the target intonation. In this way, feature parameters in multiple dimensions can be obtained by parsing the speech clip.
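A rough sketch of the three extraction branches in steps 31-36, assuming hypothetical transcribe, tokenize, and load_waveform helpers (the patent names no concrete speech recognition or signal-processing interfaces):

```python
def extract_speech_features(audio_clip, preset_keywords, transcribe, tokenize, load_waveform):
    """Sketch of steps 31-36; transcribe, tokenize, and load_waveform are
    assumed callables, not interfaces named by the patent."""
    chars, durations = transcribe(audio_clip)  # steps 31-32: characters + durations
    # Speech rate as defined in the text: total pronunciation duration / character count.
    speech_rate = sum(durations) / max(len(chars), 1)
    # Steps 33-34: segment into words and match against the preset keyword set.
    keywords = [word for word in tokenize(chars) if word in preset_keywords]
    # Steps 35-36: parse the waveform; average absolute amplitude stands in for intonation.
    samples = load_waveform(audio_clip)
    intonation = sum(abs(s) for s in samples) / len(samples)
    return speech_rate, keywords, intonation
```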
104. Decide the target emotion of the target user according to the target limb behavior, the facial expression, and the speech feature parameters.
Since limb behavior, facial expression, and speech feature parameters each reflect the user's emotion to some extent, a decision can be made across these three dimensions to obtain the final target emotion.
Optionally, step 104 above, deciding the target emotion of the target user according to the target limb behavior, the target facial expression, and the target speech feature parameters, may include the following steps:
41. Determine a first emotion set corresponding to the target limb behavior, the first emotion set including at least one emotion, each emotion corresponding to a first emotion probability value;
42. Determine a second emotion set corresponding to the target facial expression, the second emotion set including at least one emotion, each emotion corresponding to a second emotion probability value;
43. Determine a third emotion set corresponding to the target speech feature parameters, the third emotion set including at least one emotion, each emotion corresponding to a third emotion probability value;
44. Acquire a first weight corresponding to limb behavior, a second weight corresponding to facial expression, and a third weight corresponding to the speech feature parameters;
45. Determine a score for each emotion class according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set, and the third emotion set, obtaining multiple scores;
46. Select the maximum of the multiple scores, and take the emotion corresponding to that maximum as the target emotion.
As shown in FIG. 1B, each limb behavior may correspond to at least one emotion, each facial expression may correspond to at least one emotion, and each speech feature parameter may likewise correspond to at least one emotion, with each emotion carrying an emotion probability value. The final target emotion can be decided from all of these emotions and their probability values.
Specifically, different limb behaviors may correspond to different emotions, so the emotion recognition apparatus may pre-store a mapping relationship between limb behaviors and emotions and use it to determine at least one emotion corresponding to the target limb behavior; this at least one emotion constitutes the first emotion set, and each of its emotions corresponds to a first emotion probability value, which may be preset or default to a system value. Likewise, different facial expressions may correspond to different emotions, so the apparatus may pre-store a mapping relationship between facial expressions and emotions and use it to determine at least one emotion corresponding to the target facial expression; these emotions constitute the second emotion set, each with a second emotion probability value that may be preset or system-default. In addition, each speech feature parameter may also correspond to different emotions, so the apparatus may pre-store a mapping relationship between speech feature parameters and emotions and use it to determine at least one emotion corresponding to the target speech feature parameters; these emotions constitute the third emotion set, each with a third emotion probability value that may be preset or system-default. Further, a first weight corresponding to limb behavior, a second weight corresponding to facial expression, and a third weight corresponding to the speech feature parameters can be acquired; these weights may be preset, with first weight + second weight + third weight = 1. The score of each emotion is then the corresponding dimension's weight multiplied by its emotion probability value; the per-emotion scores are aggregated for each emotion class, the maximum among the per-class scores is selected, and the emotion corresponding to that maximum is taken as the target emotion. In this way, the final emotion is decided across the three dimensions of limb behavior, facial expression, and speech: single-dimension emotion recognition carries a large error, whereas multiple dimensions weaken the recognition error introduced by any single dimension and improve the accuracy of emotion recognition.
The first, second, and third probability values above may all be obtained from big data: the limb behaviors, facial expressions, and speech of a large number of users are collected, and the probability value of the emotion corresponding to each limb behavior (and likewise for the other dimensions) is analyzed.
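A minimal sketch of the weighted fusion in steps 41-46, assuming each dimension's emotion set is represented as a dict from emotion label to probability value and that per-class scores are accumulated as a weighted sum (the text fixes only the weight-times-probability form and the arg-max selection):

```python
def decide_emotion(behavior_emotions, expression_emotions, speech_emotions,
                   w_behavior=0.4, w_expression=0.4, w_speech=0.2):
    """Weighted fusion over the three emotion sets (steps 41-46).

    Each *_emotions argument maps an emotion label to its probability value for
    that dimension. The example weights are assumptions; the patent requires
    only that the three weights sum to 1.
    """
    scores = {}
    for weight, emotion_set in ((w_behavior, behavior_emotions),
                                (w_expression, expression_emotions),
                                (w_speech, speech_emotions)):
        for emotion, prob in emotion_set.items():
            # Per-class score accumulated as weight * probability over dimensions.
            scores[emotion] = scores.get(emotion, 0.0) + weight * prob
    # Step 46: the emotion with the maximum score is the target emotion.
    return max(scores, key=scores.get)
```

For example, decide_emotion({"anger": 0.7}, {"anger": 0.6, "nervousness": 0.4}, {"nervousness": 0.8}) scores anger at 0.4*0.7 + 0.4*0.6 = 0.52 against nervousness at 0.4*0.4 + 0.2*0.8 = 0.32 and returns "anger".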
Further optionally, the target limb behavior includes at least one limb behavior. Step 23 above, performing behavior recognition on the multiple target images to obtain the target limb behavior, may be implemented as follows:
inputting the multiple target images into a preset neural network model to obtain the at least one limb behavior and the recognition probability corresponding to each limb behavior.
Then step 41 above, determining the first emotion set corresponding to the target limb behavior, the first emotion set including at least one emotion and each emotion corresponding to a first emotion probability value, may include the following steps:
411. Determine, according to a preset mapping relationship between limb behaviors and emotions, the emotion corresponding to each of the at least one limb behavior, obtaining at least one emotion, each emotion corresponding to a preset probability value;
412. Compute at least one first emotion probability value from the recognition probability corresponding to each limb behavior and the preset probability value corresponding to each of the at least one emotion, each emotion corresponding to one first emotion probability value.
The target limb behavior may include at least one limb behavior: because of the nature of behavior recognition, a single body movement may be recognized as multiple limb behaviors, each with its own recognition probability. In a specific implementation, the multiple target images are input into the preset neural network model, which performs behavior recognition on each target image and outputs at least one limb behavior, each with a corresponding recognition probability. The emotion recognition apparatus may pre-store the preset mapping relationship between limb behaviors and emotions and use it to determine the emotion corresponding to each of the at least one limb behavior, obtaining at least one emotion, each of which corresponds to a preset probability value that can be configured in advance. At least one first emotion probability value is then computed from the recognition probability of each limb behavior and the preset probability value of each emotion, i.e., first emotion probability value = recognition probability * preset probability value. In this way, the final emotion probability value can be determined precisely from both the recognition accuracy of the limb behavior and the emotion's probability value, which helps analyze the user's emotion accurately.
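A small sketch of steps 411-412, assuming dict representations for the model output and the preset behavior-to-emotion mapping; keeping the larger value when several behaviors point to the same emotion is an added assumption, since the text fixes only the product formula:

```python
def first_emotion_probabilities(limb_behaviors, behavior_emotion_map):
    """Steps 411-412: combine recognition probability with preset emotion probability.

    limb_behaviors: dict of behavior -> recognition probability from the model.
    behavior_emotion_map: preset mapping behavior -> {emotion: preset probability}.
    Both structures are assumed representations.
    """
    first_probs = {}
    for behavior, recog_prob in limb_behaviors.items():
        for emotion, preset_prob in behavior_emotion_map.get(behavior, {}).items():
            # First emotion probability = recognition probability * preset probability.
            prob = recog_prob * preset_prob
            # If several behaviors point to the same emotion, keep the larger value.
            first_probs[emotion] = max(first_probs.get(emotion, 0.0), prob)
    return first_probs
```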
Of course, for steps 42-43 above, the face images or the speech can likewise be input into a neural network model that outputs the corresponding emotions; the idea is the same as in steps 411-412 and is not repeated here.
As an example, the emotion recognition apparatus may use a camera to capture information about a test subject, including snapshot images from a capture camera and a video containing the subject over a certain period. The snapshot images are used to recognize the subject's facial expression information, and the video content can be used to recognize the subject's limb behavior as well as the facial expression; specifically, the face information captured by the capture camera is analyzed, and the subject's limb behavior within the video clip is analyzed. A deep-learning pre-trained model then takes the facial expression information and the limb behavior as joint input and outputs multi-class emotion probabilities, which serve as a preliminary emotion judgment for the subject. Of course, the emotion categories may depend on the specific situation and may include, for example, happiness, excitement, nervousness, anger, and so on, without limitation here. In addition, the apparatus may use a recording device (such as a microphone) to capture audio clips of the subject over the same period; the audio content is used to recognize the subject's speech, speech rate, and intonation, which are input into a deep-learning pre-trained model that outputs multi-class emotion probabilities, forming an audio-based preliminary emotion judgment. Finally, the above three multi-class emotion probability judgments are combined, and a deep learning model outputs the final comprehensive emotion judgment.
Further optionally, after step 104 above, the following steps may also be included:
A1. Authenticate the target user;
A2. After the target user passes identity verification, acquire an indoor map;
A3. Mark the location of the target user on the indoor map to obtain the target area where the target user is located;
A4. Determine, according to the preset mapping relationship between emotions and control parameters, the target control parameter corresponding to the target emotion for controlling at least one smart home device associated with the target area;
A5. Adjust the at least one smart home device according to the target control parameter.
The emotion recognition apparatus may pre-store the preset mapping relationship between emotions and control parameters. The smart home device may be at least one of the following: a smart air conditioner, a smart humidifier, a smart speaker, a smart light, a smart curtain, a smart massage chair, a smart television, and so on, without limitation here. In a specific implementation, the target user is authenticated; once identity verification passes, the indoor map of the current scene can be acquired and the target user's location marked on it to obtain the target area where the user is located. From the pre-stored mapping relationship between emotions and control parameters, the target control parameter corresponding to the target emotion for controlling at least one smart home device associated with the target area can be determined. The control parameter may be at least one of the following: a temperature adjustment parameter, a humidity adjustment parameter, a speaker playback parameter (for example, volume, track, or sound effect), a brightness or color temperature adjustment parameter, a curtain control parameter (degree of closure), a massage chair control parameter (massage mode and duration), or a television control parameter (volume, picture, brightness, color temperature, and so on). The at least one smart home device is then adjusted according to the target control parameter. In this way, the environment can be adjusted to suit different emotions, which helps soothe the user's emotions and improves the user experience.
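A minimal sketch of steps A4-A5, where the mapping contents, device names, and send_command transport are all illustrative assumptions (the patent prescribes neither concrete parameter values nor a device protocol):

```python
from typing import Dict

# Illustrative emotion -> per-device control parameters; all values are assumptions.
EMOTION_CONTROL_MAP: Dict[str, Dict[str, dict]] = {
    "nervousness": {
        "smart_light":   {"color_temp_k": 2700, "brightness_pct": 40},
        "smart_speaker": {"track": "calm_playlist", "volume_pct": 25},
    },
    "happiness": {
        "smart_light": {"color_temp_k": 4000, "brightness_pct": 80},
    },
}

def adjust_devices(target_emotion: str, target_area_devices: list, send_command) -> None:
    """Steps A4-A5: look up the control parameters mapped to the emotion and
    apply them to the devices in the target area (send_command is an assumed
    transport to the smart home devices)."""
    params_by_device = EMOTION_CONTROL_MAP.get(target_emotion, {})
    for device in target_area_devices:
        params = params_by_device.get(device)
        if params:
            send_command(device, params)
```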
Further optionally, step A1 above, authenticating the target user, may include the following steps:
A11. Acquire a first face image of the target user;
A12. Perform image quality evaluation on the first face image to obtain a target image quality evaluation value;
A13. Determine, according to the preset mapping relationship between image quality evaluation values and matching thresholds, the target matching threshold corresponding to the target image quality evaluation value;
A14. Perform contour extraction on the first face image to obtain a first peripheral contour;
A15. Perform feature point extraction on the first face image to obtain a first feature point set;
A16. Match the first peripheral contour against the second peripheral contour of the preset face template to obtain a first matching value;
A17. Match the first feature point set against the second feature point set of the preset face template to obtain a second matching value;
A18. Determine a target matching value according to the first matching value and the second matching value;
A19. When the target matching value is greater than the target matching threshold, confirm that the target user passes identity verification.
The emotion recognition apparatus may pre-store a preset face template. In face recognition, success largely depends on face image quality, so the embodiments of the present application may use a dynamic matching threshold: if the quality is good, the matching threshold can be raised, and if the quality is poor, it can be lowered, since images captured in dim lighting do not necessarily have good quality and the threshold can be adjusted accordingly. The apparatus may also store the preset mapping relationship between image quality evaluation values and matching thresholds and use it to determine the target matching threshold corresponding to the target image quality evaluation value. On this basis, contour extraction is performed on the first face image to obtain the first peripheral contour, and feature point extraction is performed to obtain the first feature point set; the first peripheral contour is matched against the second peripheral contour of the preset face template to obtain the first matching value, and the first feature point set is matched against the second feature point set of the template to obtain the second matching value. The target matching value is then determined from the first and second matching values: for example, a mapping between environmental parameters and weight pairs may be stored in advance, yielding a first weight coefficient for the first matching value and a second weight coefficient for the second matching value, with target matching value = first matching value * first weight coefficient + second matching value * second weight coefficient. Finally, when the target matching value is greater than the target matching threshold, the first face image is confirmed to match the preset face template; otherwise, face recognition is confirmed to have failed. Dynamically adjusting the face matching process in this way helps improve face recognition efficiency in the specific environment.
In addition, the contour extraction algorithm may be at least one of the following: the Hough transform, the Canny operator, and so on, without limitation here; the feature point extraction algorithm may be at least one of the following: Harris corner detection, the scale-invariant feature transform (SIFT), and so on, without limitation here.
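A compact sketch of steps A12-A19 with the dynamic threshold, where every extraction and matching callable is an assumed stand-in (Hough/Canny contours and Harris/SIFT feature points, as suggested above, could back them):

```python
def verify_identity(first_face, template, quality_to_threshold, w1, w2,
                    evaluate_quality, extract_contour, extract_points,
                    match_contour, match_points):
    """Steps A12-A19: quality-dependent threshold plus weighted matching.

    template is assumed to be a dict with 'contour' and 'points' entries;
    all callables are stand-ins for the extraction/matching algorithms.
    """
    quality = evaluate_quality(first_face)     # A12: image quality evaluation value
    threshold = quality_to_threshold(quality)  # A13: dynamic matching threshold
    m1 = match_contour(extract_contour(first_face), template["contour"])  # A14 + A16
    m2 = match_points(extract_points(first_face), template["points"])     # A15 + A17
    target = w1 * m1 + w2 * m2                 # A18: weighted target matching value
    return target > threshold                  # A19: pass if above the threshold
```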
Optionally, step A12 above, performing image quality evaluation on the first face image to obtain the target image quality evaluation value, may be implemented as follows:
performing image quality evaluation on the first face image using at least one image quality evaluation index to obtain the target image quality evaluation value.
The image quality evaluation indices may include, but are not limited to: mean gray level, mean square deviation, entropy, edge preservation, signal-to-noise ratio, and so on. It may be defined that the larger the obtained image quality evaluation value, the better the image quality.
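One plausible reading of such a combined score, assuming a weighted sum of the named indices (the text fixes only that larger values mean better quality):

```python
def image_quality_score(metrics, weights):
    """Combine several indices (mean gray level, mean square deviation, entropy,
    edge preservation, SNR, ...) into one evaluation value; larger means better.
    The weighted-sum combination itself is an assumption."""
    return sum(weights[name] * value for name, value in metrics.items())
```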
It can be seen that, with the emotion recognition method of the embodiments of the present application, a video clip and an audio clip of a target user over a specified time period are acquired; the video clip is analyzed to obtain the target user's target limb behavior and target facial expression; parameter extraction is performed on the audio clip to obtain the target user's target speech feature parameters; and the target user's target emotion is decided according to the target limb behavior, the facial expression, and the speech feature parameters. In this way, limb behavior, facial expression, and speech features are parsed from the recordings; each of these three dimensions reflects the user's emotion to some extent, so deciding the emotion jointly across the three dimensions allows the user's emotion to be identified accurately.
Consistent with the above, please refer to FIG. 2, a schematic flowchart of another emotion recognition method provided by an embodiment of the present application. The emotion recognition method described in this embodiment includes the following steps:
201. Acquire a video clip and an audio clip of a target user over a specified time period.
202. Analyze the video clip to obtain the target limb behavior and the target facial expression of the target user.
203. Perform parameter extraction on the audio clip to obtain the target speech feature parameters of the target user.
204. Decide the target emotion of the target user according to the target limb behavior, the facial expression, and the speech feature parameters.
205. Authenticate the target user.
206. After the target user passes identity verification, acquire an indoor map.
207. Mark the location of the target user on the indoor map to obtain the target area where the target user is located.
208. Determine, according to the preset mapping relationship between emotions and control parameters, the target control parameter corresponding to the target emotion for controlling at least one smart home device associated with the target area.
209. Adjust the at least one smart home device according to the target control parameter.
For steps 201-209 above, reference may be made to the corresponding steps of the emotion recognition method described with respect to FIG. 1A.
It can be seen that, with the emotion recognition method of the embodiments of the present application, a video clip and an audio clip of a target user over a specified time period are acquired; the video clip is analyzed to obtain the target user's target limb behavior and target facial expression; parameter extraction is performed on the audio clip to obtain the target user's target speech feature parameters; the target user's target emotion is decided according to the target limb behavior, the facial expression, and the speech feature parameters; the target user is authenticated; after identity verification passes, an indoor map is acquired and the target user's location is marked on it to obtain the target area where the user is located; the target control parameter corresponding to the target emotion for controlling at least one smart home device associated with the target area is determined according to the preset mapping relationship between emotions and control parameters; and the at least one smart home device is adjusted according to the target control parameter. In this way, limb behavior, facial expression, and speech features are parsed from the recordings; each of these three dimensions reflects the user's emotion to some extent, so deciding the emotion jointly across the three dimensions allows the user's emotion to be identified accurately, and the smart home devices in the environment can also be adjusted according to the emotion, which helps soothe the user's emotions and improves the user experience.
Consistent with the above, the following is an apparatus for implementing the above emotion recognition method, specifically as follows:
Please refer to FIG. 3, a schematic structural diagram of an emotion recognition apparatus provided by an embodiment of the present application. The emotion recognition apparatus described in this embodiment includes an acquisition unit 301, an analysis unit 302, an extraction unit 303, and a decision unit 304, specifically as follows:
the acquisition unit 301, configured to acquire a video clip and an audio clip of a target user over a specified time period;
the analysis unit 302, configured to analyze the video clip to obtain the target limb behavior and the target facial expression of the target user;
the extraction unit 303, configured to perform parameter extraction on the audio clip to obtain the target speech feature parameters of the target user;
the decision unit 304, configured to decide the target emotion of the target user according to the target limb behavior, the facial expression, and the speech feature parameters.
The acquisition unit 301 may be used to implement the method described in step 101 above, the analysis unit 302 the method described in step 102, the extraction unit 303 the method described in step 103, and the decision unit 304 the method described in step 104, and so on for what follows.
It can be seen that, with the emotion recognition apparatus of this embodiment of the present application, a video clip and an audio clip of a target user within a specified time period are obtained; the video clip is analyzed to obtain the target user's target limb behavior and target facial expression; parameter extraction is performed on the audio clip to obtain the target user's target speech feature parameters; and the target emotion of the target user is decided according to the target limb behavior, the target facial expression, and the target speech feature parameters. In this way, limb behavior, facial expressions, and speech features are parsed from the captured clips; each of these three dimensions reflects the user's emotion to some extent, so deciding the emotion jointly from all three dimensions allows the user's emotion to be identified accurately.
In a possible example, in terms of analyzing the video clip to obtain the limb behavior and facial expression of the target user, the analyzing unit 302 is specifically configured to:
parse the video clip to obtain multiple frames of video images;
perform image segmentation on the multiple frames of video images to obtain multiple target images, each target image being a human body image of the target user;
perform behavior recognition according to the multiple target images to obtain the target limb behavior;
perform face recognition on the multiple target images to obtain multiple face images; and
perform expression recognition on the multiple face images to obtain multiple expressions, and take the expression that occurs most frequently among the multiple expressions as the target facial expression (a sketch of this pipeline is given below).
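As a rough illustration of the per-frame pipeline just described, the sketch below assumes the frames have already been decoded and treats the segmentation and recognition models as stand-in callables (segment_person, recognize_behavior, detect_face, and recognize_expression are all assumed names, not components of the application); the most frequent expression across frames becomes the target facial expression.

```python
# Hypothetical sketch of the video analysis pipeline: segment out the user in
# each frame, run behavior recognition over the person crops, run per-frame
# face/expression recognition, and majority-vote the target expression.

from collections import Counter

def analyze_video(frames, segment_person, recognize_behavior,
                  detect_face, recognize_expression):
    person_images = [segment_person(f) for f in frames]   # target images
    target_behavior = recognize_behavior(person_images)   # behavior over the clip
    expressions = []
    for img in person_images:
        face = detect_face(img)                           # face image per frame
        if face is not None:
            expressions.append(recognize_expression(face))
    # Majority vote: the expression seen most often becomes the target one.
    target_expression = (Counter(expressions).most_common(1)[0][0]
                         if expressions else None)
    return target_behavior, target_expression

# Toy usage with trivial stand-ins:
behavior, expression = analyze_video(
    ["f1", "f2", "f3"],
    segment_person=lambda f: f,                # pretend the frame is the person crop
    recognize_behavior=lambda imgs: "pacing",
    detect_face=lambda img: img,
    recognize_expression=lambda face: "frown",
)
print(behavior, expression)  # pacing frown
```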
In a possible example, in terms of deciding the target emotion of the target user according to the target limb behavior, the target facial expression, and the target speech feature parameters, the decision unit 304 is specifically configured to:
determine a first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion and each emotion corresponds to a first emotion probability value;
determine a second emotion set corresponding to the target facial expression, where the second emotion set includes at least one emotion and each emotion corresponds to a second emotion probability value;
determine a third emotion set corresponding to the target speech feature parameters, where the third emotion set includes at least one emotion and each emotion corresponds to a third emotion probability value;
obtain a first weight corresponding to limb behavior, a second weight corresponding to facial expression, and a third weight corresponding to speech feature parameters;
determine a score value for each type of emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set, and the third emotion set, to obtain multiple score values; and
select the maximum value from the multiple score values and take the emotion corresponding to that maximum value as the target emotion (see the fusion sketch below).
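A minimal sketch of this weighted fusion follows, assuming each modality has already produced its emotion set as a dictionary of emotion-to-probability values; the weights and probabilities are illustrative assumptions, not values from the application.

```python
# Hypothetical sketch of the three-way weighted fusion: the per-emotion score
# is the weighted sum of the probabilities from the three emotion sets, and
# the highest-scoring emotion is taken as the target emotion.

def decide_emotion(behavior_set, expression_set, speech_set,
                   w_behavior=0.3, w_expression=0.4, w_speech=0.3):
    emotions = set(behavior_set) | set(expression_set) | set(speech_set)
    scores = {
        e: (w_behavior * behavior_set.get(e, 0.0)
            + w_expression * expression_set.get(e, 0.0)
            + w_speech * speech_set.get(e, 0.0))
        for e in emotions
    }
    return max(scores, key=scores.get), scores

target, scores = decide_emotion(
    {"angry": 0.7, "neutral": 0.3},   # first emotion set (limb behavior)
    {"angry": 0.5, "sad": 0.5},       # second emotion set (facial expression)
    {"angry": 0.6, "neutral": 0.4},   # third emotion set (speech features)
)
print(target)  # -> "angry" (score 0.59 vs 0.21 neutral, 0.20 sad)
```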
In a possible example, the target limb behavior includes at least one limb behavior.
In terms of performing behavior recognition according to the multiple target images to obtain the target limb behavior, the analyzing unit 302 is specifically configured to:
input the multiple target images into a preset neural network model to obtain the at least one limb behavior and a recognition probability corresponding to each limb behavior.
In terms of determining the first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion and each emotion corresponds to a first emotion probability value, the decision unit 304 is specifically configured to:
determine, according to a preset mapping relationship between limb behaviors and emotions, the emotion corresponding to each of the at least one limb behavior, to obtain at least one emotion, each emotion corresponding to a preset probability value; and
perform an operation on the recognition probability corresponding to each limb behavior and the preset probability value corresponding to each of the at least one emotion, to obtain at least one first emotion probability value, each emotion corresponding to one first emotion probability value (see the sketch below).
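The following sketch shows one plausible reading of this combination step, assuming the "operation" is a product of the recognition probability and the preset probability, summed per emotion; the BEHAVIOR_EMOTION_MAP table and its values are invented for illustration.

```python
# Hypothetical sketch of deriving the first emotion set: each recognized limb
# behavior carries a recognition probability from the preset neural network
# model; a preset table maps behaviors to emotions with preset probabilities;
# multiplying the two and summing per emotion yields the first emotion
# probability values.

BEHAVIOR_EMOTION_MAP = {  # assumed preset mapping relationship
    "fist_clench":     [("angry", 0.8), ("nervous", 0.2)],
    "slumped_posture": [("sad", 0.7), ("tired", 0.3)],
}

def first_emotion_set(recognized_behaviors):
    """recognized_behaviors: list of (behavior, recognition_probability)
    pairs as produced by the preset neural network model."""
    emotion_probs = {}
    for behavior, rec_prob in recognized_behaviors:
        for emotion, preset_prob in BEHAVIOR_EMOTION_MAP.get(behavior, []):
            emotion_probs[emotion] = (emotion_probs.get(emotion, 0.0)
                                      + rec_prob * preset_prob)
    return emotion_probs

print(first_emotion_set([("fist_clench", 0.9)]))  # {'angry': 0.72, 'nervous': 0.18}
```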
In a possible example, the target speech feature parameters include at least one of the following: a target speech rate, a target intonation, and at least one target keyword.
In terms of performing parameter extraction on the audio clip to obtain the target speech feature parameters of the target user, the extracting unit 303 is specifically configured to:
perform semantic analysis on the audio clip to obtain multiple characters, and
determine the pronunciation duration corresponding to the multiple characters and determine the target speech rate according to the multiple characters and the pronunciation duration;
or,
perform character segmentation on the multiple characters to obtain multiple keywords, and
match the multiple keywords against a preset keyword set to obtain the at least one target keyword;
or,
extract the waveform of the speech segment, and
parse the waveform to obtain the target intonation (see the sketch below).
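The three branches above could look roughly like the sketch below. The transcription, keyword set, and waveform handling are all assumed stand-ins: PRESET_KEYWORDS is invented, character segmentation is approximated by whitespace tokenization, and the intonation estimate is deliberately crude.

```python
# Hypothetical sketch of the three speech feature branches: speech rate from
# character count over pronunciation duration, keyword spotting against a
# preset keyword set, and a coarse intonation estimate from the waveform.

PRESET_KEYWORDS = {"annoying", "great", "tired", "hate"}  # assumed preset set

def speech_rate(characters: str, duration_s: float) -> float:
    """Characters uttered per second over the clip."""
    return len(characters) / duration_s if duration_s > 0 else 0.0

def target_keywords(characters: str) -> set:
    # Whitespace tokenization stands in for character segmentation here.
    return {w for w in characters.split() if w in PRESET_KEYWORDS}

def rough_intonation(samples: list) -> str:
    """Very coarse stand-in: classify intonation by peak amplitude."""
    peak = max(abs(s) for s in samples) if samples else 0.0
    return "raised" if peak > 0.7 else "flat"

print(speech_rate("so tired of this", 2.0))   # 8.0 characters per second
print(target_keywords("so tired of this"))    # {'tired'}
print(rough_intonation([0.1, -0.9, 0.3]))     # 'raised'
```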
It can be understood that the functions of each program module of the emotion recognition apparatus of this embodiment may be specifically implemented according to the methods in the foregoing method embodiments; for the specific implementation process, reference may be made to the related descriptions of the foregoing method embodiments, and details are not repeated here.
Consistent with the above, referring to FIG. 4, FIG. 4 is a schematic structural diagram of an embodiment of an emotion recognition apparatus according to an embodiment of the present application. The emotion recognition apparatus described in this embodiment includes: at least one input device 1000; at least one output device 2000; at least one processor 3000, for example a CPU; and a memory 4000. The input device 1000, the output device 2000, the processor 3000, and the memory 4000 are connected through a bus 5000.
The input device 1000 may specifically be a touch panel, a physical button, or a mouse.
The output device 2000 may specifically be a display screen.
The memory 4000 may be a high-speed RAM memory or a non-volatile memory, for example a magnetic disk memory. The memory 4000 is configured to store a set of program codes, and the input device 1000, the output device 2000, and the processor 3000 are configured to call the program codes stored in the memory 4000 to perform the following operations.
The processor 3000 is configured to:
obtain a video clip and an audio clip of a target user within a specified time period;
analyze the video clip to obtain the target limb behavior and the target facial expression of the target user;
perform parameter extraction on the audio clip to obtain the target speech feature parameters of the target user; and
decide the target emotion of the target user according to the target limb behavior, the target facial expression, and the target speech feature parameters.
It can be seen that, with the emotion recognition apparatus of this embodiment of the present application, a video clip and an audio clip of a target user within a specified time period are obtained; the video clip is analyzed to obtain the target user's target limb behavior and target facial expression; parameter extraction is performed on the audio clip to obtain the target user's target speech feature parameters; and the target emotion of the target user is decided according to the target limb behavior, the target facial expression, and the target speech feature parameters. In this way, limb behavior, facial expressions, and speech features are parsed from the captured clips; each of these three dimensions reflects the user's emotion to some extent, so deciding the emotion jointly from all three dimensions allows the user's emotion to be identified accurately.
In a possible example, in terms of analyzing the video clip to obtain the limb behavior and facial expression of the target user, the processor 3000 is specifically configured to:
parse the video clip to obtain multiple frames of video images;
perform image segmentation on the multiple frames of video images to obtain multiple target images, each target image being a human body image of the target user;
perform behavior recognition according to the multiple target images to obtain the target limb behavior;
perform face recognition on the multiple target images to obtain multiple face images; and
perform expression recognition on the multiple face images to obtain multiple expressions, and take the expression that occurs most frequently among the multiple expressions as the target facial expression.
In a possible example, in terms of deciding the target emotion of the target user according to the target limb behavior, the target facial expression, and the target speech feature parameters, the processor 3000 is specifically configured to:
determine a first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion and each emotion corresponds to a first emotion probability value;
determine a second emotion set corresponding to the target facial expression, where the second emotion set includes at least one emotion and each emotion corresponds to a second emotion probability value;
determine a third emotion set corresponding to the target speech feature parameters, where the third emotion set includes at least one emotion and each emotion corresponds to a third emotion probability value;
obtain a first weight corresponding to limb behavior, a second weight corresponding to facial expression, and a third weight corresponding to speech feature parameters;
determine a score value for each type of emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set, and the third emotion set, to obtain multiple score values; and
select the maximum value from the multiple score values and take the emotion corresponding to that maximum value as the target emotion.
In a possible example, the target limb behavior includes at least one limb behavior.
In terms of performing behavior recognition according to the multiple target images to obtain the target limb behavior, the processor 3000 is specifically configured to:
input the multiple target images into a preset neural network model to obtain the at least one limb behavior and a recognition probability corresponding to each limb behavior.
In terms of determining the first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion and each emotion corresponds to a first emotion probability value, the processor 3000 is specifically configured to:
determine, according to a preset mapping relationship between limb behaviors and emotions, the emotion corresponding to each of the at least one limb behavior, to obtain at least one emotion, each emotion corresponding to a preset probability value; and
perform an operation on the recognition probability corresponding to each limb behavior and the preset probability value corresponding to each of the at least one emotion, to obtain at least one first emotion probability value, each emotion corresponding to one first emotion probability value.
In a possible example, the target speech feature parameters include at least one of the following: a target speech rate, a target intonation, and at least one target keyword.
In terms of performing parameter extraction on the audio clip to obtain the target speech feature parameters of the target user, the processor 3000 is specifically configured to:
perform semantic analysis on the audio clip to obtain multiple characters, and
determine the pronunciation duration corresponding to the multiple characters and determine the target speech rate according to the multiple characters and the pronunciation duration;
or,
perform character segmentation on the multiple characters to obtain multiple keywords, and
match the multiple keywords against a preset keyword set to obtain the at least one target keyword;
or,
extract the waveform of the speech segment, and
parse the waveform to obtain the target intonation.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a program, and when the program is executed, part or all of the steps of any emotion recognition method described in the foregoing method embodiments are performed.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus (device), or a computer program product. Therefore, the present application may take the form of a hardware-only embodiment, a software-only embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code. The computer program may be stored in or distributed on a suitable medium, provided together with other hardware or as a part of hardware, or distributed in other forms, for example over the Internet or another wired or wireless telecommunication system.
Although the present application has been described with reference to specific features and embodiments thereof, it is apparent that various modifications and combinations may be made without departing from the spirit and scope of the present application. Accordingly, the specification and the accompanying drawings are merely exemplary descriptions of the present application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. Obviously, those skilled in the art may make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811519898.8A CN109766759A (en) | 2018-12-12 | 2018-12-12 | Emotion recognition method and related products |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811519898.8A CN109766759A (en) | 2018-12-12 | 2018-12-12 | Emotion recognition method and related products |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109766759A true CN109766759A (en) | 2019-05-17 |
Family
ID=66450503
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811519898.8A Pending CN109766759A (en) | 2018-12-12 | 2018-12-12 | Emotion recognition method and related products |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109766759A (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105829995A (en) * | 2013-10-22 | 2016-08-03 | 谷歌公司 | Capturing media content in accordance with a viewer expression |
KR20160049191A (en) * | 2014-10-27 | 2016-05-09 | 조민권 | Wearable device |
CN106874265A (en) * | 2015-12-10 | 2017-06-20 | 深圳新创客电子科技有限公司 | A kind of content outputting method matched with user emotion, electronic equipment and server |
CN107180179A (en) * | 2017-04-28 | 2017-09-19 | 广东欧珀移动通信有限公司 | Solve lock control method and Related product |
CN107220591A (en) * | 2017-04-28 | 2017-09-29 | 哈尔滨工业大学深圳研究生院 | Multi-modal intelligent mood sensing system |
CN107085380A (en) * | 2017-06-13 | 2017-08-22 | 北京金茂绿建科技有限公司 | A smart home system user position judgment method and electronic equipment |
CN107679473A (en) * | 2017-09-22 | 2018-02-09 | 广东欧珀移动通信有限公司 | Unlock control method and related products |
CN107679481A (en) * | 2017-09-27 | 2018-02-09 | 广东欧珀移动通信有限公司 | Solve lock control method and Related product |
CN107862265A (en) * | 2017-10-30 | 2018-03-30 | 广东欧珀移动通信有限公司 | Image processing method and related product |
CN107678291A (en) * | 2017-10-31 | 2018-02-09 | 珠海格力电器股份有限公司 | control method and device for indoor environment |
CN107808146A (en) * | 2017-11-17 | 2018-03-16 | 北京师范大学 | A kind of multi-modal emotion recognition sorting technique |
CN108764010A (en) * | 2018-03-23 | 2018-11-06 | 姜涵予 | Emotional state determines method and device |
CN108960145A (en) * | 2018-07-04 | 2018-12-07 | 北京蜂盒科技有限公司 | Facial image detection method, device, storage medium and electronic equipment |
CN108985212A (en) * | 2018-07-06 | 2018-12-11 | 深圳市科脉技术股份有限公司 | Face identification method and device |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175565A (en) * | 2019-05-27 | 2019-08-27 | 北京字节跳动网络技术有限公司 | The method and apparatus of personage's emotion for identification |
CN110215218A (en) * | 2019-06-11 | 2019-09-10 | 北京大学深圳医院 | A kind of wisdom wearable device and its mood identification method based on big data mood identification model |
CN110287912A (en) * | 2019-06-28 | 2019-09-27 | 广东工业大学 | Method, device and medium for determining emotional state of target object based on deep learning |
CN110442867A (en) * | 2019-07-30 | 2019-11-12 | 腾讯科技(深圳)有限公司 | Image processing method, device, terminal and computer storage medium |
CN110517085A (en) * | 2019-08-27 | 2019-11-29 | 新华网股份有限公司 | It generates and shows method for reporting, electronic equipment and computer readable storage medium |
CN110517085B (en) * | 2019-08-27 | 2022-06-07 | 新华网股份有限公司 | Display report generation method and device, electronic equipment and storage medium |
CN111077780A (en) * | 2019-12-23 | 2020-04-28 | 湖北理工学院 | Intelligent window adjusting method and device based on neural network |
CN110991427A (en) * | 2019-12-25 | 2020-04-10 | 北京百度网讯科技有限公司 | Emotion recognition method and device for video and computer equipment |
CN113128284A (en) * | 2019-12-31 | 2021-07-16 | 上海汽车集团股份有限公司 | Multi-mode emotion recognition method and device |
CN111311466A (en) * | 2020-01-23 | 2020-06-19 | 深圳市大拿科技有限公司 | Safety control method and device |
CN111311466B (en) * | 2020-01-23 | 2024-03-19 | 深圳市大拿科技有限公司 | Safety control methods and devices |
CN113589697A (en) * | 2020-04-30 | 2021-11-02 | 青岛海尔多媒体有限公司 | Control method and device for household appliance and intelligent household appliance |
CN111476217A (en) * | 2020-05-27 | 2020-07-31 | 上海乂学教育科技有限公司 | Intelligent learning system and method based on emotion recognition |
CN111860451A (en) * | 2020-08-03 | 2020-10-30 | 宿州小马电子商务有限公司 | A game interaction method based on facial expression recognition |
CN111899038A (en) * | 2020-08-11 | 2020-11-06 | 中国工商银行股份有限公司 | Method and device for non-contact loan assistance review based on 5G network |
CN112220479A (en) * | 2020-09-04 | 2021-01-15 | 陈婉婷 | Method, Apparatus and Equipment for Emotion Judgment of Individual Subject to Interrogation Based on Genetic Algorithm |
CN112329586A (en) * | 2020-10-30 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Client return visit method and device based on emotion recognition and computer equipment |
CN112596405A (en) * | 2020-12-17 | 2021-04-02 | 深圳市创维软件有限公司 | Control method, device and equipment of household appliance and computer readable storage medium |
CN112596405B (en) * | 2020-12-17 | 2024-06-04 | 深圳市创维软件有限公司 | Control method, device, equipment and computer readable storage medium for household appliances |
CN114681258B (en) * | 2020-12-25 | 2024-04-30 | 深圳Tcl新技术有限公司 | Method for adaptively adjusting massage mode and massage equipment |
CN114681258A (en) * | 2020-12-25 | 2022-07-01 | 深圳Tcl新技术有限公司 | Method for adaptively adjusting massage mode and massage equipment |
CN112733649A (en) * | 2020-12-30 | 2021-04-30 | 平安科技(深圳)有限公司 | Method for identifying user intention based on video image and related equipment |
CN112733649B (en) * | 2020-12-30 | 2023-06-20 | 平安科技(深圳)有限公司 | Method and related equipment for identifying user intention based on video image |
CN113766710A (en) * | 2021-05-06 | 2021-12-07 | 深圳市杰理微电子科技有限公司 | Intelligent desk lamp control method based on voice detection and related equipment |
CN113766710B (en) * | 2021-05-06 | 2023-12-01 | 深圳市杰理微电子科技有限公司 | Intelligent desk lamp control method based on voice detection and related equipment |
CN113221821A (en) * | 2021-05-28 | 2021-08-06 | 中国工商银行股份有限公司 | Business data pushing method and device and server |
CN113506341A (en) * | 2021-08-03 | 2021-10-15 | 深圳创维-Rgb电子有限公司 | Assisted teaching methods and systems |
CN113705467A (en) * | 2021-08-30 | 2021-11-26 | 平安科技(深圳)有限公司 | Temperature adjusting method and device based on image recognition, electronic equipment and medium |
CN113705467B (en) * | 2021-08-30 | 2024-05-07 | 平安科技(深圳)有限公司 | Temperature adjusting method and device based on image recognition, electronic equipment and medium |
CN113656635B (en) * | 2021-09-03 | 2024-04-09 | 咪咕音乐有限公司 | Video color ring synthesis method, device, equipment and computer readable storage medium |
CN113656635A (en) * | 2021-09-03 | 2021-11-16 | 咪咕音乐有限公司 | Video CRBT synthesis method, apparatus, device and computer-readable storage medium |
CN114119960A (en) * | 2021-11-12 | 2022-03-01 | 深圳依时货拉拉科技有限公司 | Character relation recognition method and device, computer equipment and storage medium |
CN114095782A (en) * | 2021-11-12 | 2022-02-25 | 广州博冠信息科技有限公司 | A video processing method, device, computer equipment and storage medium |
CN114120425A (en) * | 2021-12-08 | 2022-03-01 | 云知声智能科技股份有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
CN115862597A (en) * | 2022-06-17 | 2023-03-28 | 南京地平线集成电路有限公司 | Method and device for determining character type, electronic equipment and storage medium |
CN115131878A (en) * | 2022-08-30 | 2022-09-30 | 深圳市心流科技有限公司 | Method, device, terminal and storage medium for determining motion of intelligent artificial limb |
CN119811566A (en) * | 2025-03-13 | 2025-04-11 | 重庆至道科技股份有限公司 | Data mining technology driven doctor-patient experience optimization suggestion generation system |
CN119811702A (en) * | 2025-03-13 | 2025-04-11 | 重庆至道科技股份有限公司 | Real-time monitoring method of doctor-patient experience based on cloud data analysis platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109766759A (en) | Emotion recognition method and related products | |
CN108009521B (en) | Face image matching method, device, terminal and storage medium | |
US9721156B2 (en) | Gift card recognition using a camera | |
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
US6959099B2 (en) | Method and apparatus for automatic face blurring | |
WO2016172872A1 (en) | Method and device for verifying real human face, and computer program product | |
CN104808794B (en) | lip language input method and system | |
WO2017185630A1 (en) | Emotion recognition-based information recommendation method and apparatus, and electronic device | |
Thabet et al. | Enhanced smart doorbell system based on face recognition | |
CN109063587A (en) | data processing method, storage medium and electronic equipment | |
CN108198159A (en) | A kind of image processing method, mobile terminal and computer readable storage medium | |
WO2020089252A2 (en) | Interactive user verification | |
CN107622246A (en) | Face recognition method and related products | |
CN107341464A (en) | A kind of method, equipment and system for being used to provide friend-making object | |
CN114779922A (en) | Control method for teaching apparatus, control apparatus, teaching system, and storage medium | |
US20250039537A1 (en) | Screenshot processing method, electronic device, and computer readable medium | |
CN113343898A (en) | Mask shielding face recognition method, device and equipment based on knowledge distillation network | |
CN113301382A (en) | Video processing method, device, medium, and program product | |
CN116033259B (en) | Method, device, computer equipment and storage medium for generating short video | |
CN111382655A (en) | Hand-lifting behavior identification method and device and electronic equipment | |
CN110491384B (en) | Voice data processing method and device | |
JP6855737B2 (en) | Information processing equipment, evaluation systems and programs | |
CN114996515A (en) | Training method of video feature extraction model, text generation method and device | |
CN111062337B (en) | People stream direction detection method and device, storage medium and electronic equipment | |
WO2020244076A1 (en) | Face recognition method and apparatus, and electronic device and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190517