CN121170671A - Personality trait prediction methods, systems, devices, media, and program products based on multimodal large language models and contrastive learning
- Publication number
- CN121170671A (application number CN202511276064.9A)
- Authority
- CN
- China
- Prior art keywords
- features
- visual
- feature
- audio
- text
- Prior art date
- Legal status
- Pending
Abstract
A sample video is preprocessed to obtain raw data; visual, audio, and text features of the sample video are extracted from the raw data; multi-level visual association is performed on the visual features; contrastive learning is then used to achieve cross-modal semantic alignment between the visually associated features and the audio and text features; finally, a multi-head attention mechanism further enhances interaction and fusion among the modalities after cross-modal semantic alignment, and a multi-layer perceptron predicts each of the Big Five personality traits. The system, device, and medium are used to implement the method, and the program product comprises a computer program implementing the method. The approach markedly strengthens the expressive power of multimodal feature fusion while improving personality prediction accuracy and model robustness.
Description
Technical Field
The invention belongs to the technical field of multimodal perception, and in particular relates to a personality trait prediction method, system, device, medium, and program product based on a multimodal large language model and contrastive learning.
Background
For multimodal Big Five personality prediction, several problems remain unsolved in existing techniques. First, current methods mainly adopt single-level visual feature extraction and cannot establish hierarchical semantic associations among scene features, facial features, gaze, and action-unit sequence features. In an interview scenario, for example, the candidate's macroscopic environment (such as the meeting-room layout), mesoscopic facial expressions (such as smile frequency), and microscopic eye movements (such as gaze avoidance) together constitute a complete representation of personality traits, yet the prior art lacks the ability to model such multi-level semantics systematically. Second, the prior art relies entirely on speech-to-text transcription and fails to extract richer semantic descriptions from visual information, so the information density of the text modality is insufficient. Finally, existing cross-modal fusion methods generally suffer from coarse semantic alignment. Although some studies attempt multimodal fusion through feature concatenation, they cannot effectively distinguish the semantic contributions of the different modalities; when the audio modality contains ambient noise, for instance, such naive fusion may cause key personality cues to be swamped by the noise.
A prior patent application discloses a Big Five personality trait prediction method based on multimodal fusion, which mainly comprises the steps of: (1) extracting, from a target dialogue video, an image sequence containing the subject's face and an audio file containing the subject's dialogue; (2) extracting a facial expression feature sequence and a head pose feature sequence from the image sequence using a trained facial expression prediction network and a trained head pose estimation network, respectively; (3) extracting an audio feature sequence and text information from the audio file, and extracting text features from the text information; (4) performing multimodal fusion on the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features; and (5) regressing the target fusion features with a trained multi-layer perceptron to obtain a quantitative prediction for each dimension of the subject's Big Five personality. However, because semantic alignment in the multimodal fusion is coarse, the semantic contributions of the different modalities cannot be effectively distinguished.
The patent application published as CN119323002A discloses a multimodal personality perception method and device based on multiple associated features and graph relation attention, which mainly comprises the steps of: (1) extracting, from a target dialogue video, an image sequence containing the subject's face and an audio file containing the subject's dialogue; (2) extracting a facial expression feature sequence and a head pose feature sequence from the image sequence using a trained facial expression prediction network and a trained head pose estimation network, respectively; (3) extracting an audio feature sequence from the audio file and text features from the audio transcription; (4) performing multimodal fusion on the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features, and training the whole network with a label-distribution-based loss function; and (5) performing weighted regression on the target fusion features with a trained multi-layer perceptron to obtain a quantitative prediction for each dimension of the subject's Big Five personality. However, because the text data depend entirely on speech transcription and semantic alignment in the multimodal fusion is coarse, the information density of the text modality is insufficient and the semantic contributions of the different modalities cannot be effectively distinguished.
The patent application published as CN119337325A provides a multimodal personality prediction method based on pre-trained models and pyramid graph fusion, which mainly comprises the steps of: (1) recording a video with a video acquisition device to obtain an original video, and preprocessing it into scene data, audio data, and text data; (2) extracting three types of single-modal features from the scene data, the audio data, and a preset personality descriptor list using pre-trained models, including a pre-trained CLIP model and a pre-trained CLAP model, together with an extended long short-term memory network; (3) computing pairwise similarities among the three single-modal features with a similarity function to obtain cross-modal associated features; (4) fusing the text data with the single-modal and associated features using the pre-trained models, a universal feature encoder, and a double-layer perceptron; (5) further fusing the features through a graph fusion network to obtain joint multimodal features; and (6) feeding the fused features into a pyramid prediction layer to obtain the prediction result. However, its single-level visual feature extraction (scene features only) cannot establish hierarchical semantic associations among scene features, facial features, gaze, and action-unit sequence features, leading to insufficient utilization of visual information.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a personality trait prediction method, system, device, medium, and program product based on a multimodal large language model and contrastive learning. The overall framework comprises a multi-level visual association module based on bidirectional gated recurrent units, a visual-text alignment module based on a multimodal large language model, and a Big Five personality prediction module based on contrastive-learning cross-modal semantic alignment. The method makes effective use of visual information by establishing hierarchical semantic associations among scene features, facial features, action units, and gaze sequence features; it effectively raises the information density of the text modality by enhancing the text data; and, by adopting a contrastive-learning strategy for cross-modal semantic alignment, it achieves semantic mapping and saliency discrimination among the visual, audio, and text modalities and strengthens collaborative modeling across modalities. The method markedly enhances the expressive power of multimodal feature fusion while improving personality prediction accuracy and model robustness, and has good application prospects and promotion value.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A personality trait prediction method based on a multimodal large language model and contrastive learning comprises the following steps:
Step 1, recording a video by using video acquisition equipment to obtain a sample video;
step 2, preprocessing a sample video to obtain original data;
step 3, extracting visual features, audio features and text features of the sample video based on the original data;
step 4, performing multi-level visual correlation on the visual characteristics;
step 5, realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3;
step 6, further enhancing interaction and fusion among the modalities after cross-modal semantic alignment using a multi-head attention mechanism, and finally predicting each of the Big Five personality traits using a multi-layer perceptron.
The preprocessing in the step 2 specifically includes the following operations on the sample video respectively:
Intercepting sample video scene frames using computer vision processing software;
Extracting facial frames, action units and line-of-sight sequence data from the sample video using facial analysis software;
Audio data is extracted from the sample video by using audio-video processing software, and then the audio data is transcribed by using voice recognition software to obtain text data.
The step 3 specifically comprises the following steps:
visual characteristics extraction, wherein the visual characteristics comprise scene characteristics, facial characteristics, action units and sight line sequence data characteristics, and the extraction process is as follows:
Dividing each sample video into n segments, randomly sampling one frame (comprising a scene frame and a face frame) from each segment, resizing the frames to the input resolution of a visual feature extraction model, setting the output feature dimension of the visual feature extraction model to E, and extracting the scene features and facial features with the visual feature extraction model;
Inputting the action-unit and gaze sequence data into a temporal feature extraction model whose output feature dimension is E, and extracting the action-unit and gaze sequence features with this model;
Audio feature extraction: the high-frequency part of the audio data is boosted by pre-emphasis, the human ear's sensitivity to different frequencies is then modelled with Mel-spectrum feature extraction to obtain an audio feature representation conforming to human auditory characteristics, and finally an audio feature aggregation model aggregates this representation into utterance-level emotion and personality feature representations;
Text feature extraction: a multimodal large language model is used, together with the scene frames randomly sampled from the sample video and the audio transcription, to generate a high-quality text that fuses multimodal information; this text is concatenated with the original audio transcription to form a composite text containing explicit linguistic information and implicit visual cues, and the composite text is fed into a language representation model to obtain the text features.
Step 4 builds and applies a multi-level visual feature modeling module G that integrates two sub-processes, temporal modeling and cross-modal feature interaction modeling, to achieve hierarchical fusion of the visual features; the specific steps are as follows:
Step 4.1, constructing the multi-level visual feature modeling module G. The input of module G is two levels of visual features, denoted X and Y; at time step i the corresponding inputs are X_{i-1} and Y_{i-1}, and the temporal modeling is:
h_i^X = GRU(X_{i-1}, h_{i-1}^X), X'_{i-1} = h_i^X
h_i^Y = GRU(Y_{i-1}, h_{i-1}^Y), Y'_{i-1} = h_i^Y
where h_i^X and h_i^Y are the hidden states of the GRUs at the current time step, h_{i-1}^X and h_{i-1}^Y are the hidden states of the GRUs at the previous time step, and X'_{i-1} and Y'_{i-1} are the outputs of the temporal modeling at the current time step;
After the temporal modeling is completed, cross-modal feature interaction modeling is performed as follows:
h_i^→, h_i^← = BiGRU([X'_{i-1}, Y'_{i-1}], h_{i-1})
where h_i^→ and h_i^← are the forward and backward outputs of the bidirectional GRU, [·,·] denotes vector concatenation, and h_{i-1} is the hidden state of the bidirectional GRU at the previous time step;
The forward and backward outputs of the bidirectional GRU are further compressed and nonlinearly transformed by a fully connected layer to generate the visual association feature:
Z_i = σ(W_p · [h_i^→, h_i^←] + b_p)
where W_p and b_p are the weight matrix and bias term of the fully connected layer, · denotes matrix multiplication, σ(·) is the nonlinear activation, and Z_i is the visual association feature of the current time step;
Step 4.2, applying a multi-level visual feature modeling module G, namely firstly, fusing an action unit, a sight line sequence data feature and a facial feature input module G to obtain an intermediate feature, and then further fusing the intermediate feature with a scene feature input module G to finally obtain a multi-level visual associated visual feature, wherein the whole process is shown in the following formula:
V=G(Ds,G(Df,Dg))
where D_s, D_f, and D_g denote the scene features, the facial features, and the action-unit and gaze sequence features, respectively; G is the multi-level visual feature modeling module; and the final output V is the visual feature after multi-level visual association.
The step 5 specifically comprises the following steps:
Step 5.1, mapping the multi-level visually associated visual features, the audio features, and the text features with a projection network; the projection network adopts a three-layer fully connected structure, with batch normalization and a ReLU nonlinear activation after each layer to stabilize training:
h_i^(m) = ReLU(BN(W_i^(m) · h_{i-1}^(m) + b_i^(m))), i ∈ {1, 2, 3}, with h_0^(m) = f_m and z_m = h_3^(m)
where m ∈ {V, A, T} denotes the modality; V, A, and T are the multi-level visually associated visual feature, the audio feature, and the text feature, respectively; f_m is the feature of modality m; W_i^(m) and b_i^(m), i ∈ {1, 2, 3}, are learnable parameters; BN(·) denotes batch normalization; and z_m is the projected feature of modality m;
Step 5.2, further adjusting the projected features of the different modalities with a prediction network; the prediction network adopts a two-layer fully connected structure in which the first layer reduces the feature dimension and the second layer restores it:
P_m = W_2^(p) · σ(W_1^(p) · z_m + b_1^(p)) + b_2^(p)
where P_m is the feature of modality m after adjustment by the prediction network, W_1^(p), b_1^(p), W_2^(p), and b_2^(p) are the learnable parameters of the prediction network, and σ(·) is a nonlinear activation;
Step 5.3, the projected features of the different modalities obtained in step 5.1 are taken as the low-dimensional features of the modalities, and the features adjusted by the prediction network in step 5.2 are taken as the high-dimensional features. Taking the high-dimensional feature of modality m as the anchor sample, the high-dimensional features of the other modalities as positive samples, and the low-dimensional features of all modalities as negative samples, the contrastive loss is:
L_m = -log( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) / ( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) + Σ_{z^-} exp(sim(p_m, z^-)/τ) ) )
where p_m and p_m^+ denote the high-dimensional anchor sample of modality m and its positive samples, sim(·) is the similarity function, τ is the temperature coefficient, and z^- denotes a negative sample;
The final cross-modal semantic alignment loss is expressed as:
L_align = Σ_{m ∈ M} L_m
where M denotes the set of modalities, M = {V, A, T}.
A personality trait prediction system based on a multimodal large language model and contrastive learning, comprising:
the video acquisition module is used for recording videos by using video acquisition equipment to obtain sample videos;
the video preprocessing module is used for preprocessing the sample video to obtain original data;
the feature extraction module is used for extracting visual features, audio features and text features of the sample video based on the original data;
The multi-level visual correlation module is used for carrying out multi-level visual correlation on visual characteristics;
The visual text alignment module is used for realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3;
and the Big Five personality prediction module, which further enhances interaction and fusion among the modalities after cross-modal semantic alignment using a multi-head attention mechanism, and finally predicts each of the Big Five personality traits using a multi-layer perceptron.
A personality trait prediction device based on a multimodal large language model and contrastive learning, comprising:
a memory for storing a computer program implementing the personality trait prediction method based on a multimodal large language model and contrastive learning;
and a processor which, when executing the computer program, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer program product comprising a computer program which, when executed by a processor, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention adopts bidirectional gated recurrent units to associate and fuse visual features from fine to coarse granularity, capturing the contextual dependencies between local frames while fusing global sequence information, which effectively avoids the information loss caused by unidirectional modeling. With this structural design, the model can more fully mine and exploit the dynamic characteristics in visual data, strengthening the expressive power and discriminability of the visual features and further improving personality prediction accuracy.
2. The invention introduces a multimodal large language model (MLLM) to enhance the text, automatically bringing in external knowledge and semantic reasoning on top of the original text information so that the text representation becomes richer and semantically more complete. When interacting with the visual modality, this achieves "soft alignment" at the semantic level and improves the ability of the text to interpret and complement visual information during fusion, effectively increasing the complementarity and information density of the cross-modal representation.
3. The invention adopts a contrastive learning mechanism that constructs positive and negative sample pairs during training; by reducing the semantic distance between the high-dimensional features of different modalities and enlarging the distance between the high-dimensional and low-dimensional features, it achieves precise alignment at the semantic level. This not only improves fusion across modalities but also markedly enhances the prediction accuracy and robustness of the model in complex conditions such as noisy input or missing modalities.
In summary, the invention introduces innovative designs in multi-level visual feature modeling, visual-text fusion, and cross-modal semantic alignment, significantly improving the accuracy and robustness of personality prediction, and has high application value and promotion prospects.
Drawings
FIG. 1 is a diagram of a model overall architecture of the personality characteristic prediction method of the present invention.
Fig. 2 is a flow chart of the invention for extracting facial frames from a sample video.
FIG. 3 is a kernel density plot of the gaze distribution according to the present invention.
FIG. 4 is a diagram of the ViT model structure employed in an embodiment of the present invention.
Fig. 5 is a diagram of a TCN model structure used in an embodiment of the present invention.
Fig. 6 is a residual network configuration diagram of an attention introducing mechanism according to an embodiment of the present invention.
FIG. 7 is a block diagram of a text feature encoding method based on visual language big model assistance in an embodiment of the invention.
Fig. 8 is a diagram illustrating a multi-level visual association structure in accordance with an embodiment of the present invention.
FIG. 9 is a cross-modal semantic alignment block diagram based on contrast learning in an embodiment of the invention.
FIG. 10 is a block diagram of multi-modal feature interaction and fusion based on multi-head attention in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the personality trait prediction method based on a multimodal large language model and contrastive learning comprises the following steps:
Step 1, recording video with a video acquisition device; in this embodiment a Canon EOS 90D is used to record, under natural light in a real social scene, short video clips of multiple subjects expressing themselves freely in front of the camera, yielding the sample videos;
step 2, preprocessing the sample video to obtain original data, specifically, respectively performing the following operations on the sample video:
Intercepting a sample video scene frame by using computer vision processing software, wherein the computer vision processing software adopted by the embodiment is OpenCV;
facial frames (fig. 2), action units and line-of-sight sequence data (fig. 3) are extracted from a sample video by using facial analysis software, wherein the facial analysis software adopted by the embodiment is OpenFace;
extracting audio data from the sample video by using audio and video processing software, wherein the audio and video processing software adopted in the embodiment is FFmpeg, and then transcribing the audio data by using voice recognition software to obtain text data, and the voice recognition software adopted in the embodiment is OpenAI Whisper;
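An illustrative Python sketch of this preprocessing step is given below, assuming OpenCV (cv2), FFmpeg, the openai-whisper package, and OpenFace's FeatureExtraction binary are installed and on PATH; file paths, the Whisper model size, and the OpenFace command-line flags are assumptions, and frames are taken at segment midpoints here rather than randomly as in training.

```python
import subprocess
import cv2
import whisper  # openai-whisper package

def extract_scene_frames(video_path, n_segments=16):
    """Sample one scene frame from each of n_segments equal segments with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for k in range(n_segments):
        idx = int((k + 0.5) * total / n_segments)   # segment midpoint (random index during training)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    return frames

def extract_face_au_gaze(video_path, out_dir="openface_out"):
    """Run OpenFace FeatureExtraction to export face frames, action units and gaze data (CSV)."""
    subprocess.run(["FeatureExtraction", "-f", video_path, "-out_dir", out_dir], check=True)

def extract_audio_and_text(video_path, wav_path="speech.wav"):
    """Demux mono 16 kHz audio with FFmpeg, then transcribe it with Whisper."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", "16000", wav_path], check=True)
    asr = whisper.load_model("base")
    return wav_path, asr.transcribe(wav_path)["text"]

if __name__ == "__main__":
    video = "sample_interview.mp4"                    # illustrative path
    scene_frames = extract_scene_frames(video)
    extract_face_au_gaze(video)
    wav, transcript = extract_audio_and_text(video)
    print(len(scene_frames), "scene frames;", len(transcript), "characters of transcript")
```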
Step 3, extracting visual features, audio features, and text features of the sample video based on the raw data, where the visual features are obtained with a ViT model and a TCN model, the audio features are obtained with pre-emphasis, Mel-spectrum feature extraction, and a residual network, and the text features are obtained with the Qwen-VL vision-language large model and a RoBERTa model; the specific steps are as follows:
visual characteristics extraction, wherein the visual characteristics comprise scene characteristics, facial characteristics, action units and sight line sequence data characteristics, and the extraction process is as follows:
Each sample video is divided into n segments (n = 16 in this embodiment); one frame, comprising a scene frame and a facial frame, is randomly sampled from each segment and resized to the input resolution of the visual feature extraction model (224×224); the output feature dimension of the visual feature extraction model (such as a Mamba or ViT model) is set to E, and the ViT model is used to extract the scene features and facial features with E = 256, as shown in FIG. 4;
The action-unit and gaze sequence data are fed into a temporal feature extraction model, a TCN in this embodiment, with output feature dimension E = 256, to obtain the action-unit and gaze sequence features, as shown in FIG. 5;
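The following is an illustrative PyTorch sketch of the two preceding paragraphs: a pretrained ViT backbone (via the timm library, assumed available) projected to E = 256 for scene/facial frames, and a simplified, non-causal dilated-convolution stack standing in for the TCN over action-unit/gaze sequences. Model names, channel sizes, and the segment count are assumptions for illustration, not the exact embodiment.

```python
import torch
import torch.nn as nn
import timm  # assumed available for the ViT backbone

E = 256  # output feature dimension used in this embodiment

class ViTFeature(nn.Module):
    """Frame-level scene/face features: pretrained ViT backbone + linear projection to E dims."""
    def __init__(self):
        super().__init__()
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.backbone.num_features, E)

    def forward(self, frames):                          # frames: (B, n_segments, 3, 224, 224)
        b, n = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))     # (B*n, backbone_dim)
        return self.proj(feats).view(b, n, E)           # (B, n_segments, E)

class TCNFeature(nn.Module):
    """Temporal features of action-unit / gaze sequences via dilated 1-D convolutions."""
    def __init__(self, in_dim, channels=(128, 256), kernel_size=3):
        super().__init__()
        layers, prev = [], in_dim
        for i, ch in enumerate(channels):
            d = 2 ** i                                   # exponentially growing dilation
            layers += [nn.Conv1d(prev, ch, kernel_size,
                                 padding=d * (kernel_size - 1) // 2, dilation=d),
                       nn.ReLU()]
            prev = ch
        self.net = nn.Sequential(*layers)
        self.proj = nn.Linear(prev, E)

    def forward(self, seq):                              # seq: (B, T, in_dim) AU/gaze sequence
        h = self.net(seq.transpose(1, 2)).transpose(1, 2)  # (B, T, channels[-1])
        return self.proj(h)                                # (B, T, E)

if __name__ == "__main__":
    scene_feat = ViTFeature()(torch.randn(2, 16, 3, 224, 224))   # (2, 16, 256)
    au_gaze_feat = TCNFeature(in_dim=23)(torch.randn(2, 120, 23))  # (2, 120, 256)
    print(scene_feat.shape, au_gaze_feat.shape)
```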
Audio feature extraction: the high-frequency part of the audio data is boosted by pre-emphasis, the human ear's sensitivity to different frequencies is then modelled with Mel-spectrum feature extraction to obtain an audio feature representation conforming to human auditory characteristics, and finally an audio feature aggregation model aggregates this representation into utterance-level emotion and personality feature representations, as shown in FIG. 6;
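A minimal sketch of the pre-emphasis and Mel-spectrum steps described above, using torchaudio; the pre-emphasis coefficient, FFT size, hop length, and number of Mel bins are illustrative assumptions, and the attention-augmented residual aggregation network of FIG. 6 is not reproduced here.

```python
import torch
import torchaudio

def preemphasis(wave, alpha=0.97):
    """Boost the high-frequency part of the waveform: y[t] = x[t] - alpha * x[t-1]."""
    return torch.cat([wave[..., :1], wave[..., 1:] - alpha * wave[..., :-1]], dim=-1)

wave, sr = torchaudio.load("speech.wav")          # mono 16 kHz file from the preprocessing step
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                           hop_length=256, n_mels=64)
mel_feat = mel(preemphasis(wave)).log1p()         # (1, 64, T) log-Mel features
# mel_feat would then be fed to the residual aggregation network to obtain
# utterance-level emotion / personality representations.
print(mel_feat.shape)
```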
Text feature extraction: a multimodal large language model, the Qwen-VL vision-language large model in this embodiment, is combined with the scene frames randomly sampled from the sample video and the audio transcription to generate a high-quality text that fuses multimodal information; this text is concatenated with the original audio transcription to form a composite text containing explicit linguistic information and implicit visual cues, and the composite text is fed into a language representation model, a RoBERTa model in this embodiment, to obtain the text features, as shown in FIG. 7;
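The sketch below illustrates the composite-text construction and RoBERTa encoding, assuming the Hugging Face transformers library and the "roberta-base" checkpoint; the Qwen-VL call is not reproduced (its prompt format is not specified in the text), so a stand-in string takes the place of its visual description, and the sample transcript is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # illustrative checkpoint
encoder = AutoModel.from_pretrained("roberta-base")

transcript = "I really enjoy working with new teams and learning from them."  # ASR output (hypothetical)
# In the embodiment this string would come from prompting Qwen-VL with the sampled
# scene frames and the transcript; here it is a stand-in for that MLLM output.
visual_text = "The speaker smiles frequently and keeps steady eye contact with the camera."

composite = transcript + " " + visual_text        # explicit language + implicit visual cues
tokens = tokenizer(composite, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    text_feat = encoder(**tokens).last_hidden_state[:, 0]   # (1, 768) sentence-level text feature
print(text_feat.shape)
```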
step 4, performing multi-level visual correlation on the visual characteristics;
In this step, a multi-level visual feature modeling module G is constructed and applied. Module G integrates two sub-processes, temporal modeling and cross-modal feature interaction modeling, to achieve hierarchical fusion of the visual features; the specific steps are as follows:
Step 4.1, constructing the multi-level visual feature modeling module G. The input of module G is two levels of visual features, denoted X and Y; taking time step i as an example, the corresponding inputs are X_{i-1} and Y_{i-1}, and the temporal modeling is:
h_i^X = GRU(X_{i-1}, h_{i-1}^X), X'_{i-1} = h_i^X
h_i^Y = GRU(Y_{i-1}, h_{i-1}^Y), Y'_{i-1} = h_i^Y
where h_i^X and h_i^Y are the hidden states of the GRUs at the current time step, h_{i-1}^X and h_{i-1}^Y are the hidden states of the GRUs at the previous time step, and X'_{i-1} and Y'_{i-1} are the outputs of the temporal modeling at the current time step.
After the temporal modeling is completed, cross-modal feature interaction modeling is performed as follows:
h_i^→, h_i^← = BiGRU([X'_{i-1}, Y'_{i-1}], h_{i-1})
where h_i^→ and h_i^← are the forward and backward outputs of the bidirectional GRU, [·,·] denotes vector concatenation, and h_{i-1} is the hidden state of the bidirectional GRU at the previous time step.
The forward and backward outputs of the bidirectional GRU are further compressed and nonlinearly transformed by a fully connected layer to generate the visual association feature:
Z_i = σ(W_p · [h_i^→, h_i^←] + b_p)
where W_p and b_p are the weight matrix and bias term of the fully connected layer, · denotes matrix multiplication, σ(·) is the nonlinear activation, and Z_i is the visual association feature of the current time step;
And 4.2, applying a multi-level visual feature modeling module G, namely firstly fusing the action unit, the sight line sequence data feature and the facial feature input module G to obtain intermediate features, and then further fusing the intermediate features with the scene feature input module G to finally obtain the visual features after multi-level visual association. The whole process is shown in the following formula:
V=G(Ds,G(Df,Dg))
where D_s, D_f, and D_g denote the scene features, the facial features, and the action-unit and gaze sequence features, respectively; G is the multi-level visual feature modeling module; and the final output V is the visual feature after multi-level visual association, as shown in FIG. 8;
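An illustrative PyTorch sketch of module G and its hierarchical application is given below; the feature dimension, batch size, and sequence length are assumptions, and the GRU/BiGRU hidden sizes are chosen only to keep input and output dimensions equal.

```python
import torch
import torch.nn as nn

class VisualAssociation(nn.Module):
    """Multi-level visual feature modeling module G (sketch of step 4):
    per-stream GRU temporal modeling, BiGRU cross-feature interaction, FC fusion."""
    def __init__(self, dim=256):
        super().__init__()
        self.gru_x = nn.GRU(dim, dim, batch_first=True)
        self.gru_y = nn.GRU(dim, dim, batch_first=True)
        self.bigru = nn.GRU(2 * dim, dim, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, y):                  # x, y: (B, T, dim) two levels of visual features
        x_out, _ = self.gru_x(x)              # temporal modeling of each stream
        y_out, _ = self.gru_y(y)
        h, _ = self.bigru(torch.cat([x_out, y_out], dim=-1))   # forward/backward interaction
        return self.fc(h)                     # (B, T, dim) visual association features Z

G = VisualAssociation(dim=256)
Ds = torch.randn(2, 16, 256)   # scene features
Df = torch.randn(2, 16, 256)   # facial features
Dg = torch.randn(2, 16, 256)   # action-unit / gaze sequence features
V = G(Ds, G(Df, Dg))           # hierarchical association: fine-grained features first, then scene
print(V.shape)                 # torch.Size([2, 16, 256])
```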
step 5, realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3;
Step 5.1, mapping the multi-level visually associated visual features, the audio features, and the text features with a projection network; the projection network adopts a three-layer fully connected structure, with batch normalization and a ReLU nonlinear activation after each layer to stabilize training:
h_i^(m) = ReLU(BN(W_i^(m) · h_{i-1}^(m) + b_i^(m))), i ∈ {1, 2, 3}, with h_0^(m) = f_m and z_m = h_3^(m)
where m ∈ {V, A, T} denotes the modality; V, A, and T are the multi-level visually associated visual feature, the audio feature, and the text feature, respectively; f_m is the feature of modality m; W_i^(m) and b_i^(m), i ∈ {1, 2, 3}, are learnable parameters; BN(·) denotes batch normalization; and z_m is the projected feature of modality m;
Step 5.2, further adjusting the projected features of the different modalities with a prediction network, whose main function is to guide feature optimization and prevent feature collapse during training; the prediction network adopts a two-layer fully connected structure in which the first layer reduces the feature dimension and the second layer restores it, and batch normalization is not used so as to keep the information flow stable during optimization:
P_m = W_2^(p) · σ(W_1^(p) · z_m + b_1^(p)) + b_2^(p)
where P_m is the feature of modality m after adjustment by the prediction network, W_1^(p), b_1^(p), W_2^(p), and b_2^(p) are the learnable parameters of the prediction network, and σ(·) is a nonlinear activation;
Step 5.3, the projected features of the different modalities obtained in step 5.1 are taken as the low-dimensional features of the modalities, and the features adjusted by the prediction network in step 5.2 are taken as the high-dimensional features. Taking the high-dimensional feature of modality m as the anchor sample, the high-dimensional features of the other modalities as positive samples, and the low-dimensional features of all modalities as negative samples, the contrastive loss is:
L_m = -log( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) / ( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) + Σ_{z^-} exp(sim(p_m, z^-)/τ) ) )
where p_m and p_m^+ denote the high-dimensional anchor sample of modality m and its positive samples, sim(·) is the similarity function, τ is the temperature coefficient, and z^- denotes a negative sample;
The final cross-modal semantic alignment loss is expressed as:
L_align = Σ_{m ∈ M} L_m
where M denotes the set of modalities, M = {V, A, T}, as shown in FIG. 9;
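The sketch below instantiates steps 5.1–5.3 in PyTorch under stated assumptions: per-modality projection and prediction heads, cosine similarity as sim(·), and a standard InfoNCE-style grouping of the positives and negatives described above. Dimensions, the hidden size of the prediction head, and the temperature are illustrative; the exact loss formulation in the embodiment may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Three-layer projection head with BatchNorm + ReLU after each layer (step 5.1 sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

    def forward(self, f):
        return self.net(f)

class Predictor(nn.Module):
    """Two-layer prediction head: dimension reduction then restoration, no BatchNorm (step 5.2)."""
    def __init__(self, dim=256, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        return self.net(z)

def alignment_loss(p, z, tau=0.07):
    """InfoNCE-style contrastive loss over modalities (step 5.3 sketch).
    p[m]: high-dimensional (predicted) feature of modality m, anchor / positive;
    z[m]: low-dimensional (projected) feature of modality m, negative."""
    mods, loss = list(p.keys()), 0.0
    for m in mods:
        anchor = F.normalize(p[m], dim=-1)
        pos = [F.normalize(p[o], dim=-1) for o in mods if o != m]
        neg = [F.normalize(z[o], dim=-1) for o in mods]
        pos_sim = torch.stack([(anchor * q).sum(-1) / tau for q in pos])   # (|pos|, B)
        neg_sim = torch.stack([(anchor * q).sum(-1) / tau for q in neg])   # (|neg|, B)
        denom = torch.logsumexp(torch.cat([pos_sim, neg_sim], dim=0), dim=0)
        loss = loss - (pos_sim.logsumexp(dim=0) - denom).mean()
    return loss

# usage sketch: features of the three modalities after steps 3-4, each (B, 256)
feats = {m: torch.randn(8, 256) for m in ("V", "A", "T")}
proj = {m: Projector() for m in feats}
pred = {m: Predictor() for m in feats}
z = {m: proj[m](feats[m]) for m in feats}
p = {m: pred[m](z[m]) for m in z}
print(alignment_loss(p, z).item())
```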
Step 6, further enhancing interaction and fusion among the modalities with a multi-head attention mechanism so that information can be fully propagated and exploited across the different modality features, and finally predicting each of the Big Five personality traits with a multi-layer perceptron, as shown in FIG. 10.
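A minimal sketch of step 6 follows: one attention token per modality, multi-head self-attention for cross-modal interaction, and one small MLP head per Big Five trait. Treating each modality as a single token, the number of heads, and the sigmoid output (matching the 0-1 score range used in the experiments) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    """Step 6 sketch: multi-head attention over the three modality tokens,
    then one MLP regression head per Big Five trait."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid())
            for _ in range(5))   # extraversion, neuroticism, agreeableness, conscientiousness, openness

    def forward(self, v, a, t):                     # each (B, dim) aligned modality feature
        tokens = torch.stack([v, a, t], dim=1)      # (B, 3, dim): one token per modality
        fused, _ = self.attn(tokens, tokens, tokens)
        flat = fused.flatten(1)                     # (B, 3*dim)
        return torch.cat([head(flat) for head in self.heads], dim=-1)   # (B, 5) scores in [0, 1]

model = FusionPredictor()
scores = model(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))
print(scores.shape)   # torch.Size([2, 5])
```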
A personality trait prediction system based on a multimodal large language model and contrastive learning, comprising:
the video acquisition module is used for recording videos by using video acquisition equipment to obtain sample videos, and the sample videos are used for realizing the step 1 of the invention;
the video preprocessing module is used for preprocessing the sample video to obtain original data, and the original data are used for realizing the step 2 of the invention;
The feature extraction module is used for extracting visual features, audio features and text features of the sample video based on the original data and is used for realizing the step 3 of the invention;
The multi-level visual correlation module carries out multi-level visual correlation on visual characteristics and is used for realizing the step 4 of the invention;
the visual text alignment module is used for realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3 and is used for realizing the step 5 of the invention;
the Big Five personality prediction module, which further enhances interaction and fusion among the modalities after cross-modal semantic alignment using a multi-head attention mechanism and finally predicts each of the Big Five personality traits using a multi-layer perceptron, and is used to implement step 6 of the invention.
A personality trait prediction device based on a multimodal large language model and contrastive learning, comprising:
a memory for storing a computer program implementing the personality trait prediction method based on a multimodal large language model and contrastive learning;
and a processor which, when executing the computer program, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer program product comprising a computer program which, when executed by a processor, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
Simulation experiment
In the experimental section, the invention was validated on the ChaLearn First Impressions V2 dataset, a large-scale and widely used multimodal dataset dedicated to personality trait prediction. It contains 10,000 video clips taken from 3,000 high-definition YouTube videos in which individuals speak English facing the camera. The dataset is split into training, validation, and test sets at a ratio of 3:1:1, and all experimental results are evaluated on the test set.
On this dataset, the method of the invention was compared with existing representative work. The evaluation metric is the prediction accuracy of the Big Five personality traits, ranging from 0 to 1, with values closer to 1 indicating more accurate prediction. The experimental results are shown in the table below. The representative works are the CRNet, DFNN, and TMIN models, where CRNet is from the paper "CR-Net: A Deep Classification-Regression Network for Multimodal Apparent Personality Analysis", DFNN from "Self-Supervised Learning of Person-Specific Facial Dynamics for Automatic Personality Recognition", and TMIN from "In Your Eyes: Modality Disentangling for Personality Analysis in Short Video".
As can be seen from the table below, the method of the invention performs well across all personality dimensions. By capturing individuals' dynamic behavior and emotional state, it obtains the highest accuracy of 0.9241 on conscientiousness, indicating strong discriminative power in describing the stability and self-discipline of the individual. Although the gains are relatively limited on agreeableness and openness, the method remains at the level of the leading existing approaches, reflecting the overall balance of the model.
More importantly, the average accuracy of the method reaches 0.9202; compared with 0.9188 for CRNet, 0.9168 for DFNN, and 0.9161 for TMIN, this is an improvement of 0.14%, 0.32%, and 0.41%, respectively. The results show that the method not only has advantages on individual metrics but is also clearly superior to the prior art in overall performance, fully verifying the advancement, effectiveness, and robustness of the method in personality trait prediction.
| Method | Extraversion | Neuroticism | Agreeableness | Conscientiousness | Openness | Average accuracy |
|---|---|---|---|---|---|---|
| TMIN | 0.9158 | 0.9142 | 0.9143 | 0.9210 | 0.9153 | 0.9161 |
| DFNN | 0.9183 | 0.9133 | 0.9262 | 0.9182 | 0.9180 | 0.9168 |
| CRNet | 0.9202 | 0.9146 | 0.9177 | 0.9218 | 0.9195 | 0.9188 |
| The invention | 0.9226 | 0.9184 | 0.9170 | 0.9241 | 0.9189 | 0.9202 |
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202511276064.9A CN121170671A (en) | 2025-09-08 | 2025-09-08 | Personality trait prediction methods, systems, devices, media, and program products based on multimodal large language models and contrastive learning. |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN121170671A (en) | 2025-12-19 |
Family
ID=98042584
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202511276064.9A Pending CN121170671A (en) | 2025-09-08 | 2025-09-08 | Personality trait prediction methods, systems, devices, media, and program products based on multimodal large language models and contrastive learning. |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN121170671A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |