CN108334806B

CN108334806B - Image processing method, device and electronic device

Info

Publication number: CN108334806B
Application number: CN201710282661.1A
Authority: CN
Inventors: 吴昊; 张振伟; 欧义挺; 董晓龙; 戚广全; 谢俊驰; 谢斯豪; 梁雪; 段韧; 张新磊
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-04-26
Filing date: 2017-04-26
Publication date: 2021-12-14
Anticipated expiration: 2037-04-26
Also published as: CN108334806A

Abstract

The present invention relates to an image processing method, device and electronic equipment. The method comprises: acquiring image frames collected from a real scene; playing the acquired image frames frame by frame according to the acquired sequence; acquiring and identifying the image frames The facial emotion feature recognition result obtained by the facial image included in the ; according to the facial emotion feature recognition result, find the corresponding emotion feature image; obtain the display position of the emotion feature image in the currently playing image frame; In the display position, the emotional feature image is rendered in the currently playing image frame. The solution provided by this application improves the efficiency of image processing.

Description

Image processing method and device and electronic equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to an image processing method and apparatus, and an electronic device.

Background

With the development of computer technology, image processing technology is also continuously advancing. The user can process the image through professional image processing software, so that the processed image is better in performance. The user can also add materials provided by the image processing software to the image through the image processing software, so that the processed image can transmit more information.

However, in the current image processing method, a user needs to expand a material library of image processing software, browse the material library, select a suitable material from the material library, and adjust the position of the material in an image, so as to confirm modification and complete image processing. Therefore, the current image processing mode needs a large amount of manual operation and is long in time, so that the efficiency of the image processing process is low.

Disclosure of Invention

Based on this, it is necessary to provide an image processing method, an image processing apparatus, and an electronic device, aiming at the problem of low efficiency of the conventional image processing process.

A method of image processing, the method comprising:

acquiring an image frame acquired from a real scene;

playing the acquired image frames frame by frame according to the acquired time sequence;

acquiring a face emotion feature recognition result obtained by recognizing a face image included in the image frame;

searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result;

acquiring the display position of the emotional characteristic image in the currently played image frame;

and rendering the emotional characteristic image in the currently played image frame according to the display position.

An image processing apparatus, the apparatus comprising:

the acquisition module is used for acquiring image frames acquired from a real scene;

the playing module is used for playing the acquired image frames frame by frame according to the acquired time sequence;

the identification result acquisition module is used for acquiring a human face emotion characteristic identification result obtained by identifying a human face image included in the image frame;

the searching module is used for searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result;

the display position acquisition module is used for acquiring the display position of the emotional characteristic image in the currently played image frame;

and the rendering module is used for rendering the emotional characteristic image in the currently played image frame according to the display position.

An electronic device comprising a memory and a processor, the memory having stored therein computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of:

acquiring an image frame acquired from a real scene;

According to the image processing method, the image processing device and the electronic equipment, the image frames reflecting the real scene are played, so that the played image frames can reflect the real scene. The emotion condition of the person in the real scene can be automatically determined by acquiring the face emotion feature recognition result obtained by recognizing the face image included in the image frame. After the display position of the emotional characteristic image in the currently played image frame is obtained, the emotional characteristic image is rendered in the currently played image frame according to the display position, so that the virtual emotional characteristic image can be automatically combined with people in the real scene to reflect the emotional condition of people in the real scene. The complicated steps of manual operation are avoided, and the image processing efficiency is greatly improved.

Drawings

FIG. 1 is a diagram of an exemplary embodiment of an image processing method;

FIG. 2 is a diagram illustrating an internal structure of an electronic device for implementing an image processing method according to an embodiment;

FIG. 3 is a flow diagram illustrating a method for image processing according to one embodiment;

FIG. 4 is a flowchart illustrating an image processing method according to another embodiment;

FIG. 5 is a schematic diagram illustrating a comparison of front and back interfaces for rendering an affective feature image in one embodiment;

FIG. 6 is a diagram illustrating a comparison of a contextual interface based on speech data recognition in accordance with one embodiment;

FIG. 7 is a block diagram showing the configuration of an image processing apparatus according to an embodiment;

fig. 8 is a block diagram showing the configuration of an image processing apparatus according to another embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

FIG. 1 is a diagram of an embodiment of an application environment of an image processing method. Referring to fig. 1, the image processing method is applied to an image processing system. The image processing system includes a terminal 110 and a server 120, and the terminal 110 is connected to the server 120 through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be a separate physical server or a cluster of physical servers. The terminal 110 may be configured to acquire image frames acquired from a real scene, and play the acquired image frames frame by frame according to an acquisition timing sequence. The terminal 110 may obtain a face emotion feature recognition result obtained by recognizing a face image included in an image frame when the image frame is played, search for a corresponding emotion feature image according to the face emotion feature recognition result, obtain a display position of the emotion feature image in the currently played image frame, and render the emotion feature image in the currently played image frame according to the display position. The above process including the recognition of the face image included in the image frame may be performed on the terminal 110 or the server 120.

Fig. 2 is a schematic diagram of an internal structure of an electronic device in one embodiment. The electronic device may be the terminal 110 of fig. 1 described above. As shown in fig. 2, the electronic device includes a processor, a nonvolatile storage medium, an internal memory, and a network interface, a sound collection device, a speaker, a display screen, and an input device, which are connected by a system bus. The non-volatile storage medium of the electronic equipment stores an operating system and further comprises an image processing device, and the image processing device is used for realizing the image processing method. The processor is used for providing calculation and control capability and supporting the operation of the whole terminal. An internal memory in the electronic device provides an environment for operation of the image processing apparatus in the non-volatile storage medium, wherein computer readable instructions are stored in the internal memory, and when executed by the processor, the computer readable instructions cause the processor to execute an image processing method. The network interface is used for performing network communication with the server 120, such as sending the acquired image frames to the server 120, receiving the face emotion feature recognition results returned by the server 120, and the like. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen, and the input device may be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the terminal housing, or an external keyboard, a touch pad or a mouse. The electronic equipment can be a desktop terminal, a mobile terminal or intelligent wearable equipment, and the mobile terminal can be at least one of a mobile phone, a tablet computer or a notebook computer. Those skilled in the art will appreciate that the architecture shown in fig. 2 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the electronic devices to which the subject application may be applied, and that a particular electronic device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, as shown in FIG. 3, an image processing method is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 in fig. 1. Referring to fig. 3, the image processing method specifically includes the steps of:

s302, image frames collected from a real scene are obtained.

Among them, a real scene is a scene existing in the natural world. An image frame is a unit in a sequence of image frames that can form a dynamic picture, used to record a picture in a real scene at a time.

In an embodiment, the terminal may specifically acquire an image frame from a real scene according to a fixed or dynamic frame rate, and acquire the acquired image frame. The fixed or dynamic frame rate can enable the image frames to form continuous dynamic pictures when being played according to the fixed or dynamic frame rate.

In one embodiment, the terminal may acquire, through the camera, an image frame of a real scene in a current field of view of the camera, and acquire the acquired image frame. Wherein, the visual field of the camera can be changed due to the change of the posture and the position of the terminal.

In one embodiment, the terminal may provide an AR (Augmented Reality) shooting mode through a social application, and after the AR shooting mode is selected, acquire an image frame from a real scene, and acquire the acquired image frame. The social application is an application capable of performing network social interaction based on a social network. The Social application includes an instant messaging application, an SNS (Social Network Service) application, a live application, or a photographing application.

In one embodiment, the terminal can receive an image frame collected from a real scene sent by another terminal, and acquire the received image frame. For example, when a terminal establishes a video session through a social application running on the terminal, image frames collected from a real scene and sent by terminals corresponding to other parties of the session are received.

In one embodiment, the terminal can collect image frames from a real scene through a shooting mode provided by a live broadcast application, and the collected image frames are used as live broadcast data so as to carry out live broadcast through the live broadcast application. The terminal can also receive image frames which are sent by another terminal and collected from a real scene through a shooting mode provided by a live broadcast application, and the received image frames are used as live broadcast data so as to play live broadcasts initiated by other users through the live broadcast application.

S304, the collected image frames are played frame by frame according to the collected time sequence.

The time sequence of the acquisition refers to the time sequence when the image frames are acquired, and can be represented by the size relationship of the timestamps recorded when the image frames are acquired. The frame-by-frame playing refers to image frame-by-image frame playing.

Specifically, the terminal may play the acquired image frames one by one according to the frame rate of the acquired image frames and according to the ascending order of the timestamps. The terminal can directly play the collected image frames, and can also store the collected image frames into the buffer area according to the collected time sequence and take out the image frames from the buffer area according to the collected time sequence for playing.

In one embodiment, the terminal may play the received image frames acquired from the real scene sent by the other terminal one by one according to the frame rate of the image frames acquired by the other terminal and according to the ascending order of the timestamps. The terminal can directly play the received image frames, and can also store the received image frames into the buffer area according to the acquired time sequence, and take out the image frames from the buffer area according to the acquired time sequence for playing.

And S306, obtaining a face emotion feature recognition result obtained by recognizing the face image in the image frame.

Wherein the emotional characteristics are characteristics reflecting human or animal emotions. Affective features are features that can be recognized and processed by a computer. Emotional features such as happiness, melancholy or anger, etc. The human face emotional characteristics refer to emotional characteristics reflected by human face expressions.

In one embodiment, the terminal may detect whether a human face image is included in the acquired image frames when the image frames are acquired from a real scene. And if the terminal judges that the acquired image frame comprises the face image, performing expression recognition on the face image in the image frame to acquire a face emotion feature recognition result obtained by recognition.

In one embodiment, the terminal may extract image data included in an image frame acquired after the image frame of the real scene is acquired through the camera under the current visual field of the camera, and detect whether the image data includes face feature data. And if the terminal detects that the image data contains the face feature data, judging that the image frame contains a face image. The terminal can further extract expression characteristic data from the face characteristic data, and according to the extracted expression characteristic data, the face image included in the collected image frame is subjected to expression recognition locally to obtain a face emotion characteristic recognition result. The expression feature data may be one or more kinds of feature information reflecting the outline of the face, the glasses, the nose, the mouth, the distance between the face organs, and the like.

For example, when people feel happy, the mouth corner will be raised, and if the terminal includes expression feature data extracted from the face feature data in the image frame as the mouth corner is raised, it can indicate that the emotional feature reflected by the face in the image frame is happy. When people feel surprise, the mouth opening amplitude is large, and if the expression feature data extracted by the terminal in the image frame and including the face feature data is large in mouth opening amplitude, the emotion feature reflected by the face in the image frame can be shown as surprise.

In an embodiment, the terminal may also send the detected image frame including the face image to the server, after receiving the image frame sent by the terminal, the server performs expression recognition on the face image included in the image frame to obtain a face emotion feature recognition result, and then feeds back the face emotion feature recognition result obtained by recognition to the terminal, and the terminal obtains the face emotion feature recognition result returned by the server.

In one embodiment, the terminal may also detect whether the received image frame includes a human face image after receiving an image frame acquired from a real scene and sent by another terminal. If the terminal judges that the received image frame comprises the face image, the face image in the image frame can be locally subjected to expression recognition to obtain a corresponding face emotion feature recognition result; the image frame can also be sent to a server, so that the server returns a face emotion feature recognition result after recognizing the face image included in the image frame.

And S308, searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result.

The emotional feature image is an image that reflects emotional features. An emotional characteristic image reflecting the injury such as an image including tears or an image including a rainy scene, etc. An emotional characteristic image reflecting anger such as an image including flames, etc. The emotional characteristic image can be an image crawled from the internet by the terminal, and can also be an image shot by the terminal according to the camera equipment included by the terminal. The emotional characteristic image can be a dynamic picture or a static picture.

In one embodiment, the terminal can select the emotional features capable of image processing in advance and configure corresponding emotional feature images corresponding to the selected emotional features. And after the terminal acquires the face emotion feature recognition result, acquiring an emotion feature image corresponding to the emotion feature represented by the face emotion feature recognition result.

In one embodiment, the terminal can establish an emotional feature image library in advance, and map the emotional feature images reflecting the same emotional features in the emotional feature image library to the same emotional features. After the terminal obtains the face emotion feature recognition result, the terminal can search for the emotion feature image which reflects the emotion feature and is matched with the face emotion feature recognition result in the emotion feature image library.

In one embodiment, the emotion feature image library established in advance by the terminal can comprise a plurality of emotion feature image sets, and each emotion feature image set reflects one emotion feature. After the terminal obtains the face emotion feature recognition result, searching an emotion feature image set with the emotion features which are reflected in the emotion feature image library and consistent with the face emotion feature recognition result, and selecting an emotion feature image from the searched emotion feature image set.

S310, acquiring the display position of the emotional characteristic image in the currently played image frame.

And the display position of the emotional characteristic image in the currently played image frame represents the area occupied by the emotional characteristic image in the currently played image frame. The presentation position can be represented by coordinates of the region occupied by the emotional feature image in the currently played image frame in the coordinate system of the currently played image frame.

In one embodiment, the terminal can obtain the display position of the emotional feature image when searching the emotional feature image. The terminal can specifically obtain the drawing mode corresponding to the searched emotional characteristic image from the local, and the display position of the emotional characteristic image is determined according to the obtained drawing mode.

Further, the emotional feature image can be drawn in a manner of dynamically following the reference object. Specifically, the terminal can determine the display position of a searched reference object which needs to be followed dynamically by the emotional characteristic image in the currently played image frame, and then determine the display position of the emotional characteristic image in the currently played image frame according to the display position of the reference object.

The emotional characteristic image can be drawn in a static display mode. Specifically, for the static displayed emotional feature image, the terminal may directly set a display area of the emotional feature image in the currently played image frame in advance, and the terminal may directly obtain the emotional feature image when the terminal needs to draw the emotional feature image.

And S312, rendering the emotional characteristic image in the currently played image frame according to the display position.

Specifically, the terminal can render the emotional feature image at the acquired display position in the currently played image frame. The terminal can acquire the style data corresponding to the emotional characteristic image, and accordingly the emotional characteristic image is rendered in the played image frame according to the style data and the acquired display position. In one embodiment, the affective characteristic image is a dynamic image comprising a set of image frame sequences. The terminal can render the image frames included in the dynamic images one by one according to the frame rate and the display position corresponding to the dynamic images.

In one embodiment, the presentation position may be the position of the emotional feature image relative to a specific region in the currently played image frame; the terminal can track the specific area in the played image frame, so that the position of the emotional feature image in the currently played image frame relative to the tracked specific area is determined according to the display position and the tracked specific area, and the emotional feature image is rendered according to the determined position. The specific region is a region in the image that can represent a specific region in the real scene, and the specific region may be a human face region or the like.

According to the image processing method, the image frame reflecting the real scene is played, so that the played image frame can reflect the real scene. The emotion condition of the person in the real scene can be automatically determined by acquiring the face emotion feature recognition result obtained by recognizing the face image included in the image frame. After the display position of the emotional characteristic image in the currently played image frame is obtained, the emotional characteristic image is rendered in the currently played image frame according to the display position, so that the virtual emotional characteristic image can be automatically combined with people in the real scene to reflect the emotional condition of people in the real scene. The complicated steps of manual operation are avoided, and the image processing efficiency is greatly improved.

In one embodiment, step S306 specifically includes: adjusting the size of the image frame to a preset size; rotating the direction of the adjusted image frame to the direction meeting the emotional feature recognition condition; sending the rotated image frame to a server; and receiving a face emotion feature recognition result which is returned by the server and aims at the sent image frame.

Wherein the preset size refers to a size of a preset image frame. The direction meeting the emotional feature recognition condition is the direction of the image frame when the emotional feature recognition can be carried out.

In one embodiment, the terminal may pull, from the server, image features of preset image frames including a face image, where the image features are features that the image frames capable of performing expression recognition should have. Such as the size of the image frame or the orientation of the image frame, etc.

Specifically, the terminal acquires image frames acquired from a real scene, and after selecting the image frames including the face images, can detect whether the size of the selected image frames including the face images meets a preset size. And if the size of the image frame including the face image which is detected and screened does not accord with the preset size, carrying out size adjustment on the image frame.

The terminal can detect the current direction of the image frame after detecting that the size of the selected image frame including the face image accords with the preset size or the size of the image frame which does not accord with the preset size is adjusted. And if the current direction of the image frame does not accord with the emotional feature recognition condition, rotating the direction of the image frame to the direction which accords with the emotional feature recognition condition.

The terminal can send the image frame to the server when the current direction of the image frame meets the emotional feature recognition condition or after the non-met image frame rotating direction. The server extracts expression characteristic data included in the image frame after receiving the image frame sent by the terminal, performs expression recognition on a face image included in the received image frame according to the extracted expression characteristic data to obtain a face emotion characteristic recognition result, and feeds back the face emotion characteristic recognition result obtained through recognition to the terminal.

In one embodiment, after acquiring image frames acquired from a real scene and selecting the image frames including a face image, the terminal may perform downsizing processing on the image frames, and store the image frames subjected to the downsizing processing in a JPEG (Joint Photographic Experts Group) format. The terminal can detect the direction of the face image included in the image frame again, and rotate the direction of the image frame when the direction of the face image included in the image frame does not accord with the direction of the emotional characteristic identification condition.

The JPEG format is an image format compressed according to the international image compression standard. The direction meeting the emotional feature recognition condition may be specifically a direction in which an included angle between a central axis of the face image in the image frame and the vertical direction is not greater than 45 degrees.

In the embodiment, before the facial image in the image frame is subjected to expression recognition through the server, the size and the direction of the image frame are adjusted, so that the image frame meets the condition of expression recognition, the expression recognition speed and accuracy can be improved, and the hardware resource consumption can be reduced.

In one embodiment, after step S306, the image processing method further includes: extracting voice data recorded when the image frame is collected; and acquiring a voice emotion feature recognition result obtained by recognizing the voice data. Step S308 specifically includes: and searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result and the voice emotional characteristic recognition result.

Specifically, when the terminal collects the image frames from the real scene, the terminal can record the voice data in the real scene at the same time, and when the collected image frames are played, the recorded voice data are played synchronously. The terminal can specifically call a sound collection device to collect voice data formed by environmental sounds, and the voice data is stored in a cache region corresponding to collection time.

The terminal can extract the acquisition time corresponding to the image frame currently carrying out expression recognition when carrying out expression recognition on the face image included in the acquired image frame, intercept the voice data segment with the preset time length from the voice data in the cache region, and the acquisition time interval corresponding to the extracted voice data segment covers the acquired acquisition time. The extracted voice data segment is the voice data recorded when the image frame is collected. The preset time length is a preset time length for intercepting the voice data segment, and the preset time length may be specifically 5 seconds or 10 seconds.

In one embodiment, the terminal may intercept a voice data segment of a preset time length from the voice data in the buffer with the acquired collection time as a midpoint. For example, the acquisition time corresponding to the image frame currently subjected to expression recognition is 2016, 10, 1, 18 hours, 30 minutes and 15 seconds, and the preset time length is 5 seconds, so that a voice data segment with the acquisition time interval of 2016, 10, 1, 18 hours, 30 minutes and 15 seconds as a midpoint can be intercepted, and the voice data segment is from 2016, 10, 1, 18 hours, 30 minutes and 13 seconds to 2016, 10, 1, 18 hours, 30 minutes and 17 seconds.

In one embodiment, when receiving an image frame acquired from a real scene sent by another terminal, the terminal may also receive voice data recorded when acquiring the image frame sent by the other terminal. The terminal can store the received voice data into the buffer area, and when the image frames are played according to the acquired time sequence, the voice data are taken out and played synchronously.

The terminal can extract the acquisition time corresponding to the image frame currently carrying out expression recognition when carrying out expression recognition on the face image included in the received image frame, intercept the voice data segment with the preset time length from the voice data in the buffer area, and the acquisition time interval corresponding to the extracted voice data segment covers the acquired acquisition time. The extracted voice data segment is the voice data recorded when the image frame is collected

And after acquiring the voice data recorded when the current image frame for expression recognition is acquired, the terminal recognizes the acquired voice data to obtain a voice emotion feature recognition result.

In one embodiment, the step of obtaining a speech emotion feature recognition result obtained by recognizing speech data in the image processing method specifically includes: recognizing the extracted voice data as a text; searching for emotional characteristic keywords included in the text; and acquiring a voice emotion characteristic recognition result corresponding to the voice data according to the searched emotion characteristic key words.

Specifically, the terminal can perform feature extraction on voice data to obtain voice feature data to be recognized, then perform voice framing processing on the voice feature data to be recognized based on an acoustic model to obtain a plurality of phonemes, convert the plurality of phonemes obtained through processing into a character sequence according to the corresponding relation between candidate words and phonemes in a candidate word library, and then adjust the converted character sequence by using a language model to obtain a text which accords with a natural language mode.

Where text is a character representation of speech data. Acoustic models such as GMM (Gaussian Mixture Model) or DNN (Deep Neural Network), etc. The candidate word library includes candidate words and phonemes corresponding to the candidate words. The Language Model is used for adjusting the character sequence recognized by the acoustic Model according to the natural Language mode, such as an N-Gram Model (CLM) and the like.

The terminal can set an emotional characteristic keyword library in advance, the emotional characteristic keyword library comprises a plurality of emotional characteristic keywords, and the emotional characteristic keywords reflecting the same emotional characteristics in the emotional characteristic keyword library are mapped to the same emotional characteristics. The emotional characteristic keyword library can be stored in a file, a database or a cache and acquired from the file, the database or the cache when needed. And after the terminal identifies the extracted voice data into a text, comparing characters included in the identified text with each emotional characteristic keyword in the emotional characteristic keyword library. And when the characters exist in the text and are matched with the emotional characteristic keywords in the emotional characteristic keyword library, acquiring the matched emotional characteristic keywords, and acquiring the emotional characteristics corresponding to the emotional characteristic keywords as a voice emotional characteristic recognition result.

For example, if the terminal recognizes the speech data to obtain a text "i happy today", which includes an emotional feature keyword "happy," and the emotional feature mapped to the happy "is" happy, "the speech emotional feature recognition result is" happy. Assuming that the text obtained by the terminal by recognizing the voice data is 'I' very happy ', wherein the text comprises an emotional feature keyword' happy ', and the emotional feature mapped to the happy' is 'happy', then the voice emotional feature recognition result is 'happy'.

In the embodiment, the recorded voice data is subjected to text recognition, and the voice emotion feature recognition result is obtained according to the characters which represent the emotion features and are included in the text, so that the accuracy of the voice emotion feature recognition result is improved.

In one embodiment, the terminal may further obtain a speech emotion feature recognition result according to the acoustic feature corresponding to the speech data. The terminal can extract acoustic features of the voice data, and corresponding emotional features are obtained according to the corresponding relation between the acoustic features and the emotional features established in advance to obtain a voice emotional feature recognition result.

In one embodiment, the acoustic features include timbre and prosodic features. The timbre refers to the characteristic of the sounding body that gives out sound, and different sounding bodies have different timbres due to different materials and structures. Timbre is physically characterized by spectral parameters. The prosodic features refer to basic tones and rhythms of sounds emitted by a sound-emitting body, and are characterized by fundamental frequency parameters, duration distribution and signal strength in physics.

For example, when people feel happy, the prosody appears cheerful when speaking, and if the terminal has higher basic tone and faster rhythm in the prosodic features extracted from the voice data, the terminal can show that the emotional features reflected by the voice data are happy.

In the embodiment, the acoustic feature extraction is performed on the recorded voice data, and the voice emotion feature recognition result is obtained according to the parameters representing the emotion features in the acoustic features, so that the accuracy of the voice emotion feature recognition result is improved.

In an embodiment, the step of searching for a corresponding emotion feature image according to the face emotion feature recognition result and the speech emotion feature recognition result in the image processing method may specifically include: and when the face emotional feature recognition result is matched with the voice emotional feature recognition result, searching a corresponding emotional feature image according to the face emotional feature recognition result.

Specifically, after acquiring a face emotion feature recognition result obtained by facial expression recognition of a face image included in an image frame and a voice emotion feature recognition result obtained by voice data recognition recorded during image frame acquisition, the terminal compares the face emotion feature recognition result with the voice emotion feature recognition result, and when the face emotion feature recognition result is matched with the voice emotion feature recognition result, searches for a corresponding emotion feature image according to the face emotion feature recognition result.

In one embodiment, the searching for the corresponding emotional feature image according to the human face emotional feature recognition result in the image processing method includes: extracting the emotional feature type and the recognition result confidence degree included in the human face emotional feature recognition result; searching an emotional feature image set corresponding to the emotional feature type; and selecting the emotional characteristic image corresponding to the confidence coefficient of the recognition result from the emotional characteristic image set.

The emotional feature type refers to the type of emotional features reflected by the human face. Such as "happy", "sad" or "angry", etc. The confidence coefficient of the recognition result represents the credibility of the human face emotional feature recognition result as the real emotional feature of the human face, and the higher the confidence coefficient of the recognition result is, the higher the possibility that the human face emotional feature recognition result is the real emotional feature of the human face is.

Specifically, the emotion feature image library established in advance by the terminal may include a plurality of emotion feature image sets, and each emotion feature image set reflects an emotion feature type. The terminal can map one emotion feature image corresponding to the confidence coefficient of the face emotion feature recognition result. After the terminal obtains the face emotion feature recognition result, searching an emotion feature image set with the emotion features reflected in the emotion feature image library and the emotion feature types included in the face emotion feature recognition result consistent with each other, and selecting an emotion feature image corresponding to the recognition result confidence included in the face emotion feature recognition result from the searched emotion feature image set.

In the above embodiment, corresponding emotional feature images are respectively set for the confidence degrees of the recognition results included in different human face emotional feature recognition results, and the confidence degree of the human face emotional feature recognition result is visually reflected through the emotional feature images, so that the image processing result is more accurate.

In an embodiment, when the face emotion feature recognition result is matched with the voice emotion feature recognition result, the terminal may also randomly select an emotion feature image from an emotion feature image set in which the emotion features reflected in the searched emotion feature image library are consistent with the emotion feature types included in the face emotion feature recognition result.

In this embodiment, when the face emotion feature recognition result is matched with the speech emotion feature recognition result, the corresponding emotion feature image is searched according to the face emotion feature recognition result, so that image processing is performed according to the face emotion feature recognition result under the guarantee of the speech emotion feature recognition result, and the image processing result is more accurate.

In an embodiment, the step of searching for a corresponding emotion feature image according to the face emotion feature recognition result and the speech emotion feature recognition result in the image processing method may specifically include: and when the face emotional feature recognition result is not matched with the voice emotional feature recognition result, searching a corresponding emotional feature image according to the voice emotional feature recognition result.

Specifically, after acquiring a face emotion feature recognition result obtained by facial expression recognition of a face image included in an image frame and a voice emotion feature recognition result obtained by voice data recognition recorded during image frame acquisition, the terminal compares the face emotion feature recognition result with the voice emotion feature recognition result, and searches for a corresponding emotion feature image according to the voice emotion feature recognition result when the face emotion feature recognition result is not matched with the voice emotion feature recognition result.

In one embodiment, the terminal may further obtain the degree adverb included in the text recognized by the voice data. The adverb is used to indicate the degree of emotion, such as: "very," "very," or "and" the like. The speech emotion feature recognition result obtained by the terminal through speech data recognition specifically comprises an emotion feature type and an emotion intensity degree.

Specifically, the emotion feature image library established in advance by the terminal may include a plurality of emotion feature image sets, and each emotion feature image set reflects an emotion feature type. The terminal can map one emotional characteristic image one by one corresponding to the emotional intensity degree. After the terminal obtains the voice emotion feature recognition result, searching an emotion feature image set with the emotion features reflected in the emotion feature image library and the emotion feature types included in the voice emotion feature recognition result consistent with each other, and selecting an emotion feature image corresponding to the emotion intensity included in the voice emotion feature recognition result from the searched emotion feature image set.

In this embodiment, when the face emotion feature recognition result is not matched with the speech emotion feature recognition result, the corresponding emotion feature image is searched according to the speech emotion feature recognition result, and the image processing is performed by using the emotion feature recognition result expressed by real speech data, so that the image processing result is more accurate.

In the embodiment, the human face emotional feature recognition result and the voice emotional feature recognition result are comprehensively considered, and the emotional feature image reflecting the emotional features expressed in the image frame is searched, so that the image processing result is more accurate.

In one embodiment, step S310 specifically includes determining a display position of the face image in the currently played image frame; inquiring the relative position of the emotional characteristic image and the face image; and determining the display position of the emotional characteristic image in the currently played image frame according to the display position and the relative position.

In this embodiment, the display position of the emotional feature image in the currently played image frame refers to a physical position where the emotional feature image is displayed in the currently played image frame. The terminal can obtain the reference object when the searched emotional characteristic image is drawn when the terminal searches the emotional characteristic image. The reference object may specifically be a face image included in the image frame.

Specifically, the terminal can acquire the display position of the reference object in the currently played image frame and the relative position of the emotional feature image and the reference object, and then the terminal determines the display position of the emotional feature image in the currently played image frame according to the display position of the reference object in the currently played image frame and the relative position of the emotional feature image and the reference object. The display position of the emotional feature image in the currently played image frame may specifically be a pixel coordinate interval or a coordinate interval of another preset positioning manner. A pixel refers to the smallest unit that can be displayed on a computer screen. In the present embodiment, the pixels may be logical pixels or physical pixels.

In the embodiment, the relative position of the emotional characteristic image and the face image is set, so that the position of the emotional characteristic image relative to the face image is displayed, and the display position of the emotional characteristic image is more reasonable.

In one embodiment, after step S312, the image processing method further includes tracking a motion trajectory of the face image in the played image frame; and according to the tracked motion trail, moving the emotional characteristic image along with the face image in the played image frame.

The motion track of the face image refers to a track formed by the face images included in the continuously played image frames. Specifically, the display position of the emotional feature image may be a position of the emotional feature image relative to a face image in a currently played image frame; the terminal can track the face image in the currently played image frame in the played image frame, so that the position of the emotional feature image in the currently played image frame relative to the tracked face image is determined according to the display position and the tracked face image, and the emotional feature image is rendered according to the determined position.

In the embodiment, the emotional characteristic image is displayed along with the face image, so that the emotional characteristic image is intelligently associated with the face in the real scene, and a new interaction mode is provided.

As shown in fig. 4, in a specific embodiment, the image processing method includes:

s402, acquiring an image frame collected from a real scene.

S404, the acquired image frames are played frame by frame according to the acquired time sequence.

S406, adjusting the size of the image frame to a preset size; rotating the direction of the adjusted image frame to the direction meeting the emotional feature recognition condition; sending the rotated image frame to a server; and receiving a face emotion feature recognition result returned by the server.

S408, extracting voice data recorded during image frame acquisition; and acquiring a voice emotion feature recognition result obtained by recognizing the voice data.

S410, judging whether the face emotion feature recognition result is matched with the voice emotion feature recognition result; if yes, go to step S412; if not, go to step S414.

S412, extracting the emotional feature type and the recognition result confidence degree included in the human face emotional feature recognition result; searching an emotional feature image set corresponding to the emotional feature type; and selecting the emotional characteristic image corresponding to the confidence coefficient of the recognition result from the emotional characteristic image set.

And S414, searching a corresponding emotion characteristic image according to the voice emotion characteristic recognition result.

S416, determining the display position of the face image in the currently played image frame; inquiring the relative position of the emotional characteristic image and the face image; and determining the display position of the emotional characteristic image in the currently played image frame according to the display position and the relative position.

And S418, rendering the emotional characteristic image in the currently played image frame according to the display position.

S420, tracking the motion trail of the face image in the played image frame; and according to the tracked motion trail, moving the emotional characteristic image along with the face image in the played image frame.

In this embodiment, image frames are collected from a real scene and played according to the collected time sequence, and an emotional feature image reflecting the emotional features of a person in the face image can be determined and displayed by the face emotional feature recognition result of the face image included in the collected image frames. Therefore, the emotional characteristic images are displayed immediately according to the image frames acquired in the real scene, the workload caused by manually selecting the emotional characteristic images and manually adjusting the emotional characteristic images for displaying can be avoided, the image processing efficiency is improved, and the image processing real-time performance is strong.

In one embodiment, after recognizing the text in the voice data, the terminal may further display the recognized text in the currently played image frame. The terminal may specifically draw a component for displaying text content in the currently played image frame, and display the identified text in the component. In the embodiment, the text obtained by recognition is displayed in the currently played image frame, so that the obstacle of interaction among deaf-mute people can be overcome, and the practicability of image processing is improved.

FIG. 5 is a diagram illustrating a comparison of the front and back interfaces for mapping emotional features in one embodiment. Referring to fig. 5, the left side is an interface schematic diagram before the emotional feature image is drawn, the interface schematic diagram includes a face image 510, and referring to fig. 5, the interface schematic diagram after the emotional feature image is drawn on the right side includes a face image 510 and an emotional feature image 520, and the emotional feature image 520 includes an emotional feature image 521 indicating that the emotional feature is open and an emotional feature image 522 indicating that the emotional feature is casual.

The terminal finds the corresponding emotion characteristic image according to the face emotion characteristic recognition result obtained by performing expression recognition on the face image 510 in the interface before the emotion characteristic image is drawn and the voice emotion characteristic recognition result obtained by recognizing the recorded voice data. If the terminal determines that the emotional feature reflected by the left image including the face image 510 in fig. 5 is happy, the face image 510 is tracked in the currently played image frame, and an emotional feature image 521 representing that the emotional feature is happy is drawn at the corresponding position. If the terminal determines that the emotional feature reflected by the left included face image 510 in fig. 5 is a casualty, the face image 510 is tracked in the currently played image frame, and an emotional feature image 522 representing that the emotional feature is a casualty is drawn at the corresponding position.

FIG. 6 is a diagram illustrating a comparison of a contextual interface resulting from speech data recognition in one embodiment. Referring to fig. 6, the left side is a schematic interface diagram before displaying a text recognized according to voice data, the schematic interface diagram includes a face image 610, and referring to fig. 6, the right side is a schematic interface diagram after displaying a text recognized according to voice data, the schematic interface diagram includes a face image 610, an emotion feature image 620 and a text 630. The text 630 is identified by the terminal according to the voice data recorded when the image frame is collected, specifically, "i'm is good and difficult today", the reflected emotional feature is a casualty, the face image 610 can be tracked in the currently played image frame, the identified text 630 is displayed at the corresponding position, and an emotional feature image 620 representing that the emotional feature is a casualty can be drawn at the corresponding position.

Fig. 7 is a block diagram of an image processing apparatus 700 according to an embodiment. Referring to fig. 7, the image processing apparatus 700 includes: an image frame obtaining module 701, a playing module 702, a recognition result obtaining module 703, a searching module 704, a display position obtaining module 705 and a rendering module 706.

An image frame acquiring module 701, configured to acquire an image frame acquired from a real scene.

And the playing module 702 is configured to play the acquired image frames frame by frame according to the acquired time sequence.

The recognition result obtaining module 703 is configured to obtain a face emotion feature recognition result obtained by recognizing a face image included in the image frame.

And the searching module 704 is used for searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result.

A display position obtaining module 705, configured to obtain a display position of the emotional feature image in the currently played image frame.

And the rendering module 706 is configured to render the emotional feature image in the currently played image frame according to the display position.

The image processing apparatus 700 plays the image frames reflecting the real scene, so that the played image frames can reflect the real scene. The emotion condition of the person in the real scene can be automatically determined by acquiring the face emotion feature recognition result obtained by recognizing the face image included in the image frame. After the display position of the emotional characteristic image in the currently played image frame is obtained, the emotional characteristic image is rendered in the currently played image frame according to the display position, so that the virtual emotional characteristic image can be automatically combined with people in the real scene to reflect the emotional condition of people in the real scene. The complicated steps of manual operation are avoided, and the image processing efficiency is greatly improved.

In one embodiment, the recognition result obtaining module 703 is further configured to adjust the size of the image frame to a preset size; rotating the direction of the adjusted image frame to the direction meeting the emotional feature recognition condition; sending the rotated image frame to a server; and receiving a face emotion feature recognition result which is returned by the server and aims at the sent image frame.

In this embodiment, before the facial image in the image frame is subjected to expression recognition by the server, the size and the direction of the image frame are adjusted, so that the image frame meets the condition of expression recognition, the expression recognition speed and accuracy can be improved, and the hardware resource consumption can be reduced.

In one embodiment, the recognition result obtaining module 703 is further configured to extract voice data recorded when the image frame is collected; and acquiring a voice emotion feature recognition result obtained by recognizing the voice data. The searching module 704 is further configured to search a corresponding emotional feature image according to the face emotional feature recognition result and the voice emotional feature recognition result.

In this embodiment, the human face emotion feature recognition result and the voice emotion feature recognition result are comprehensively considered, and an emotion feature image reflecting the emotion features expressed in the image frame is searched, so that the image processing result is more accurate.

In one embodiment, the recognition result obtaining module 703 is further configured to recognize the extracted voice data as a text; searching for emotional characteristic keywords included in the text; and acquiring a voice emotion characteristic recognition result corresponding to the voice data according to the searched emotion characteristic key words.

In this embodiment, the recorded voice data is subjected to text recognition, and a voice emotion feature recognition result is obtained according to characters which represent emotion features and are included in the text, so that the accuracy of the voice emotion feature recognition result is improved.

In one embodiment, the searching module 704 is further configured to search a corresponding emotional feature image according to the face emotional feature recognition result when the face emotional feature recognition result matches the speech emotional feature recognition result.

In this embodiment, corresponding emotional feature images are respectively set for recognition result confidence levels included in different human face emotional feature recognition results, and the confidence level of the human face emotional feature recognition result is visualized and reflected through the emotional feature images, so that the image processing result is more accurate.

In one embodiment, the finding module 704 is further configured to extract an emotional feature type and a recognition result confidence included in the face emotional feature recognition result; searching an emotional feature image set corresponding to the emotional feature type; and selecting the emotional characteristic image corresponding to the confidence coefficient of the recognition result from the emotional characteristic image set.

In one embodiment, the searching module 704 is further configured to search for an emotion feature image corresponding to the speech emotion feature recognition result when the face emotion feature recognition result does not match the speech emotion feature recognition result.

In one embodiment, the display position obtaining module 705 is further configured to determine a display position of the face image in the currently played image frame; inquiring the relative position of the emotional characteristic image and the face image; and determining the display position of the emotional characteristic image in the currently played image frame according to the display position and the relative position.

In this embodiment, the relative position of the emotional feature image and the face image is set, so that the position of the emotional feature image relative to the face image is displayed, and the display position of the emotional feature image is more reasonable.

As shown in fig. 8, in one embodiment, the image processing apparatus 700 further comprises a render-follow module 707.

A rendering following module 707, configured to track a motion trajectory of the face image in the played image frame; and according to the tracked motion trail, moving the emotional characteristic image along with the face image in the played image frame.

In one embodiment, a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, perform the steps of:

acquiring an image frame acquired from a real scene;

acquiring a face emotion feature recognition result obtained by recognizing a face image included in an image frame;

The computer readable instructions stored on the computer readable storage medium, when executed, play back image frames that reflect a real scene such that the played image frames are capable of reflecting the real scene. The emotion condition of the person in the real scene can be automatically determined by acquiring the face emotion feature recognition result obtained by recognizing the face image included in the image frame. After the display position of the emotional characteristic image in the currently played image frame is obtained, the emotional characteristic image is rendered in the currently played image frame according to the display position, so that the virtual emotional characteristic image can be automatically combined with people in the real scene to reflect the emotional condition of people in the real scene. The complicated steps of manual operation are avoided, and the image processing efficiency is greatly improved.

In one embodiment, the computer readable instructions stored on the computer readable storage medium, when executed, perform the step of obtaining a facial emotion feature recognition result obtained by recognizing a face image included in an image frame, including: adjusting the size of the image frame to a preset size; rotating the direction of the adjusted image frame to the direction meeting the emotional feature recognition condition; sending the rotated image frame to a server; and receiving a face emotion feature recognition result which is returned by the server and aims at the sent image frame.

In one embodiment, after the computer readable instructions stored on the computer readable storage medium are executed to obtain the human face emotion feature recognition result obtained by recognizing the human face image included in the image frame, the following steps may be further executed: extracting voice data recorded when the image frame is collected; and acquiring a voice emotion feature recognition result obtained by recognizing the voice data. The step of searching the corresponding emotional characteristic image according to the face emotional characteristic recognition result comprises the following steps: and searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result and the voice emotional characteristic recognition result.

In one embodiment, computer readable instructions stored on a computer readable storage medium, when executed, perform the step of obtaining a speech emotion feature recognition result obtained by recognizing speech data, comprising: recognizing the extracted voice data as a text; searching for emotional characteristic keywords included in the text; and acquiring a voice emotion characteristic recognition result corresponding to the voice data according to the searched emotion characteristic key words.

In one embodiment, the computer readable instructions stored on the computer readable storage medium, when executed, perform the step of searching for the corresponding emotion feature image according to the face emotion feature recognition result and the speech emotion feature recognition result, including: and when the face emotional feature recognition result is matched with the voice emotional feature recognition result, searching a corresponding emotional feature image according to the face emotional feature recognition result.

In one embodiment, the computer readable instructions stored on the computer readable storage medium, when executed, perform the step of searching for the corresponding emotional feature image according to the result of the human face emotional feature recognition, including: extracting the emotional feature type and the recognition result confidence degree included in the human face emotional feature recognition result; searching an emotional feature image set corresponding to the emotional feature type; and selecting the emotional characteristic image corresponding to the confidence coefficient of the recognition result from the emotional characteristic image set.

In one embodiment, the computer readable instructions stored on the computer readable storage medium, when executed, perform the step of searching for the corresponding emotion feature image according to the face emotion feature recognition result and the speech emotion feature recognition result, including: and when the face emotional feature recognition result is not matched with the voice emotional feature recognition result, searching a corresponding emotional feature image according to the voice emotional feature recognition result.

In one embodiment, the computer readable instructions stored on the computer readable storage medium, when executed, perform the step of obtaining the display position of the emotional feature image in the currently played image frame, including: determining the display position of the face image in the currently played image frame; inquiring the relative position of the emotional characteristic image and the face image; and determining the display position of the emotional characteristic image in the currently played image frame according to the display position and the relative position.

In one embodiment, after the computer readable instructions stored on the computer readable storage medium, when executed, perform the step of rendering the emotional feature image in the currently played image frame according to the presentation position, the following steps may be further performed: tracking the motion trail of the face image in the played image frame; and according to the tracked motion trail, moving the emotional characteristic image along with the face image in the played image frame.

In one embodiment, an electronic device includes a memory and a processor, the memory having computer-readable instructions stored therein, which when executed by the processor, cause the processor to perform the steps of:

acquiring an image frame acquired from a real scene;

When the processor executes the steps, the electronic device plays the image frames reflecting the real scene, so that the played image frames can reflect the real scene. The emotion condition of the person in the real scene can be automatically determined by acquiring the face emotion feature recognition result obtained by recognizing the face image included in the image frame. After the display position of the emotional characteristic image in the currently played image frame is obtained, the emotional characteristic image is rendered in the currently played image frame according to the display position, so that the virtual emotional characteristic image can be automatically combined with people in the real scene to reflect the emotional condition of people in the real scene. The complicated steps of manual operation are avoided, and the image processing efficiency is greatly improved.

In one embodiment, the electronic device, when executing the computer readable instructions by the processor, performs the step of obtaining a face emotion feature recognition result obtained by recognizing a face image included in the image frame, including: adjusting the size of the image frame to a preset size; rotating the direction of the adjusted image frame to the direction meeting the emotional feature recognition condition; sending the rotated image frame to a server; and receiving a face emotion feature recognition result which is returned by the server and aims at the sent image frame.

In one embodiment, after the electronic device executes the computer readable instructions by the processor to perform the step of obtaining the face emotion feature recognition result obtained by recognizing the face image included in the image frame, the electronic device may further perform the following steps: extracting voice data recorded when the image frame is collected; and acquiring a voice emotion feature recognition result obtained by recognizing the voice data. The step of searching the corresponding emotional characteristic image according to the face emotional characteristic recognition result executed by the processor comprises the following steps: and searching a corresponding emotional characteristic image according to the face emotional characteristic recognition result and the voice emotional characteristic recognition result.

In one embodiment, the electronic device, when executing the computer readable instructions by the processor, performs the step of obtaining a speech emotion feature recognition result obtained by recognizing the speech data, including: recognizing the extracted voice data as a text; searching for emotional characteristic keywords included in the text; and acquiring a voice emotion characteristic recognition result corresponding to the voice data according to the searched emotion characteristic key words.

In one embodiment, the step of searching for the corresponding emotion feature image according to the face emotion feature recognition result and the speech emotion feature recognition result executed by the electronic device when the computer readable instructions are executed by the processor comprises: and when the face emotional feature recognition result is matched with the voice emotional feature recognition result, searching a corresponding emotional feature image according to the face emotional feature recognition result.

In one embodiment, when the electronic device executes the computer readable instructions through the processor, the step of searching the corresponding emotional feature image according to the human face emotional feature recognition result includes: extracting the emotional feature type and the recognition result confidence degree included in the human face emotional feature recognition result; searching an emotional feature image set corresponding to the emotional feature type; and selecting the emotional characteristic image corresponding to the confidence coefficient of the recognition result from the emotional characteristic image set.

In one embodiment, the step of searching for the corresponding emotion feature image according to the face emotion feature recognition result and the speech emotion feature recognition result executed by the electronic device when the computer readable instructions are executed by the processor comprises: and when the face emotional feature recognition result is not matched with the voice emotional feature recognition result, searching a corresponding emotional feature image according to the voice emotional feature recognition result.

In one embodiment, the electronic device, when executing the computer readable instructions by the processor, performs the step of obtaining the display position of the emotional feature image in the currently played image frame, including: determining the display position of the face image in the currently played image frame; inquiring the relative position of the emotional characteristic image and the face image; and determining the display position of the emotional characteristic image in the currently played image frame according to the display position and the relative position.

In one embodiment, after the electronic device executes the computer readable instructions by the processor to perform the step of rendering the emotional feature image in the currently played image frame according to the display position, the electronic device may further perform the following steps: tracking the motion trail of the face image in the played image frame; and according to the tracked motion trail, moving the emotional characteristic image along with the face image in the played image frame.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), or the like.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An image processing method, characterized in that the method comprises:

acquiring an image frame acquired from a real scene;

extracting voice data recorded when the image frame is collected;

acquiring a voice emotion feature recognition result obtained by recognizing the voice data;

when the face emotional feature recognition result is matched with the voice emotional feature recognition result, selecting an emotional feature image from an emotional feature image set which is consistent with the emotional feature type included in the face emotional feature recognition result;

when the face emotional feature recognition result is not matched with the voice emotional feature recognition result, selecting an emotional feature image from an emotional feature image set which is consistent with the emotional feature type included in the voice emotional feature recognition result;

tracking the motion track of the face image in the continuously played image frame, and displaying the emotional characteristic image in the continuously played image frame according to the tracked motion track so as to enable the emotional characteristic image to move and display along with the face image.

2. The method according to claim 1, wherein the obtaining of the human face emotion feature recognition result obtained by recognizing the human face image included in the image frame comprises:

adjusting the size of the image frame to a preset size;

rotating the adjusted direction of the image frame to a direction meeting the emotional feature recognition condition;

sending the rotated image frame to a server;

and receiving a face emotion feature recognition result which is returned by the server and aims at the sent image frame.

3. The method of claim 1, wherein obtaining the speech emotion feature recognition result obtained by recognizing the speech data comprises:

recognizing the extracted voice data as text;

searching for emotional characteristic keywords included in the text;

and acquiring a voice emotion feature recognition result corresponding to the voice data according to the searched emotion feature keyword.

4. The method according to claim 1, wherein selecting an emotional feature image from the set of emotional feature images consistent with the types of emotional features included in the face emotional feature recognition result comprises:

extracting the emotional feature type and the recognition result confidence degree included in the human face emotional feature recognition result;

and selecting an emotional feature image corresponding to the confidence coefficient of the recognition result from an emotional feature image set consistent with the emotional feature types included in the face emotional feature recognition result.

5. The method according to any one of claims 1 to 4, wherein the presenting the emotional feature images in the continuously played image frames according to the tracked motion trajectory comprises:

inquiring the relative position of the emotional feature image and the face image;

and displaying the emotional characteristic image in the continuously played image frame according to the tracked motion track and the relative position so as to enable the emotional characteristic image to move and display along with the face image.

6. An image processing apparatus, characterized in that the apparatus comprises:

the image frame acquisition module is used for acquiring image frames acquired from a real scene;

the identification result acquisition module is used for acquiring a human face emotion characteristic identification result obtained by identifying a human face image included in the image frame; extracting voice data recorded when the image frame is collected; acquiring a voice emotion feature recognition result obtained by recognizing the voice data;

the searching module is used for selecting an emotional feature image from an emotional feature image set which is consistent with the emotional feature type included in the face emotional feature recognition result when the face emotional feature recognition result is matched with the voice emotional feature recognition result; when the face emotional feature recognition result is not matched with the voice emotional feature recognition result, selecting an emotional feature image from an emotional feature image set which is consistent with the emotional feature type included in the voice emotional feature recognition result;

and the rendering module is used for tracking the motion track of the face image in the continuously played image frame and displaying the emotional characteristic image in the continuously played image frame according to the tracked motion track so as to enable the emotional characteristic image to move and display along with the face image.

7. The apparatus of claim 6, wherein the identification result obtaining module is further configured to: adjusting the size of the image frame to a preset size; rotating the adjusted direction of the image frame to a direction meeting the emotional feature recognition condition; sending the rotated image frame to a server; and receiving a face emotion feature recognition result which is returned by the server and aims at the sent image frame.

8. The apparatus of claim 6, wherein the identification result obtaining module is further configured to: recognizing the extracted voice data as text; searching for emotional characteristic keywords included in the text; and acquiring a voice emotion feature recognition result corresponding to the voice data according to the searched emotion feature keyword.

9. The apparatus of claim 6, wherein the lookup module is further configured to: extracting the emotional feature type and the recognition result confidence degree included in the human face emotional feature recognition result; and selecting an emotional feature image corresponding to the confidence coefficient of the recognition result from an emotional feature image set consistent with the emotional feature types included in the face emotional feature recognition result.

10. The apparatus according to any one of claims 6 to 9, wherein the image processing apparatus further comprises a presentation position acquisition module configured to: inquiring the relative position of the emotional feature image and the face image; the rendering module is further to: and displaying the emotional characteristic image in the continuously played image frame according to the tracked motion track and the relative position so as to enable the emotional characteristic image to move and display along with the face image.

11. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 5 when executing the computer program.

12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.