CN121170671A - Personality trait prediction methods, systems, devices, media, and program products based on multimodal large language models and contrastive learning
- Publication number
- CN121170671A (application number CN202511276064.9A)
- Authority
- CN
- China
- Prior art keywords
- features
- visual
- feature
- audio
- text
- Prior art date
- Legal status
- Pending
Abstract
A sample video is preprocessed to obtain raw data; visual, audio, and text features of the sample video are extracted from the raw data; multi-level visual association is performed on the visual features; contrastive learning is then used to achieve cross-modal semantic alignment between the visually associated features and the audio and text features; finally, a multi-head attention mechanism further enhances interaction and fusion among the modalities after cross-modal semantic alignment, and a multi-layer perceptron predicts each of the Big Five personality traits. The system, device, and medium are used to implement the method, and the program product comprises a computer program implementing the method. The approach markedly strengthens the expressive power of multimodal feature fusion while improving personality prediction accuracy and model robustness.
Description
Technical Field
The invention belongs to the technical field of multimodal perception, and in particular relates to a personality trait prediction method, system, device, medium, and program product based on a multimodal large language model and contrastive learning.
Background
For multimodal Big Five personality prediction, several problems remain unsolved in existing techniques. First, current methods mainly adopt single-level visual feature extraction and cannot establish hierarchical semantic associations among scene features, facial features, gaze, and action-unit sequence features. In an interview scenario, for example, the candidate's macroscopic environment (such as the meeting-room layout), mesoscopic facial expressions (such as smile frequency), and microscopic eye movements (such as gaze avoidance) together constitute a complete representation of personality traits, yet the prior art lacks the ability to model such multi-level semantics systematically. Second, the prior art relies entirely on speech-to-text transcription and fails to extract richer semantic descriptions from visual information, so the information density of the text modality is insufficient. Finally, existing cross-modal fusion methods generally suffer from coarse semantic alignment. Although some studies attempt multimodal fusion through feature concatenation, they cannot effectively distinguish the semantic contributions of the different modalities; when the audio modality contains ambient noise, for instance, such naive fusion may cause key personality cues to be swamped by the noise.
A prior patent application discloses a Big Five personality trait prediction method based on multimodal fusion, which mainly comprises the steps of: (1) extracting, from a target dialogue video, an image sequence containing the subject's face and an audio file containing the subject's dialogue; (2) extracting a facial expression feature sequence and a head pose feature sequence from the image sequence using a trained facial expression prediction network and a trained head pose estimation network, respectively; (3) extracting an audio feature sequence and text information from the audio file, and extracting text features from the text information; (4) performing multimodal fusion on the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features; and (5) regressing the target fusion features with a trained multi-layer perceptron to obtain a quantitative prediction for each dimension of the subject's Big Five personality. However, because semantic alignment in the multimodal fusion is coarse, the semantic contributions of the different modalities cannot be effectively distinguished.
The patent application published as CN119323002A discloses a multimodal personality perception method and device based on multiple associated features and graph relation attention, which mainly comprises the steps of: (1) extracting, from a target dialogue video, an image sequence containing the subject's face and an audio file containing the subject's dialogue; (2) extracting a facial expression feature sequence and a head pose feature sequence from the image sequence using a trained facial expression prediction network and a trained head pose estimation network, respectively; (3) extracting an audio feature sequence from the audio file and text features from the audio transcription; (4) performing multimodal fusion on the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features, and training the whole network with a label-distribution-based loss function; and (5) performing weighted regression on the target fusion features with a trained multi-layer perceptron to obtain a quantitative prediction for each dimension of the subject's Big Five personality. However, because the text data depend entirely on speech transcription and semantic alignment in the multimodal fusion is coarse, the information density of the text modality is insufficient and the semantic contributions of the different modalities cannot be effectively distinguished.
The patent application published as CN119337325A provides a multimodal personality prediction method based on pre-trained models and pyramid graph fusion, which mainly comprises the steps of: (1) recording a video with a video acquisition device to obtain an original video, and preprocessing it into scene data, audio data, and text data; (2) extracting three types of single-modal features from the scene data, the audio data, and a preset personality descriptor list using pre-trained models, including a pre-trained CLIP model and a pre-trained CLAP model, together with an extended long short-term memory network; (3) computing pairwise similarities among the three single-modal features with a similarity function to obtain cross-modal associated features; (4) fusing the text data with the single-modal and associated features using the pre-trained models, a universal feature encoder, and a double-layer perceptron; (5) further fusing the features through a graph fusion network to obtain joint multimodal features; and (6) feeding the fused features into a pyramid prediction layer to obtain the prediction result. However, its single-level visual feature extraction (scene features only) cannot establish hierarchical semantic associations among scene features, facial features, gaze, and action-unit sequence features, leading to insufficient utilization of visual information.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a personality trait prediction method, system, device, medium, and program product based on a multimodal large language model and contrastive learning. The overall framework comprises a multi-level visual association module based on bidirectional gated recurrent units, a visual-text alignment module based on a multimodal large language model, and a Big Five personality prediction module based on contrastive-learning cross-modal semantic alignment. The method makes effective use of visual information by establishing hierarchical semantic associations among scene features, facial features, action units, and gaze sequence features; it effectively raises the information density of the text modality by enhancing the text data; and, by adopting a contrastive-learning strategy for cross-modal semantic alignment, it achieves semantic mapping and saliency discrimination among the visual, audio, and text modalities and strengthens collaborative modeling across modalities. The method markedly enhances the expressive power of multimodal feature fusion while improving personality prediction accuracy and model robustness, and has good application prospects and promotion value.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A personality trait prediction method based on a multimodal large language model and contrastive learning comprises the following steps:
Step 1, recording a video by using video acquisition equipment to obtain a sample video;
step 2, preprocessing a sample video to obtain original data;
step 3, extracting visual features, audio features and text features of the sample video based on the original data;
step 4, performing multi-level visual correlation on the visual characteristics;
step 5, realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3;
step 6, further enhancing interaction and fusion among the modalities after cross-modal semantic alignment using a multi-head attention mechanism, and finally predicting each of the Big Five personality traits using a multi-layer perceptron.
The preprocessing in the step 2 specifically includes the following operations on the sample video respectively:
Intercepting sample video scene frames using computer vision processing software;
Extracting facial frames, action units and line-of-sight sequence data from the sample video using facial analysis software;
Audio data is extracted from the sample video by using audio-video processing software, and then the audio data is transcribed by using voice recognition software to obtain text data.
The step 3 specifically comprises the following steps:
visual characteristics extraction, wherein the visual characteristics comprise scene characteristics, facial characteristics, action units and sight line sequence data characteristics, and the extraction process is as follows:
Dividing each sample video into n segments, randomly sampling one frame (comprising a scene frame and a face frame) from each segment, resizing the frames to the input resolution of a visual feature extraction model, setting the output feature dimension of the visual feature extraction model to E, and extracting the scene features and facial features with the visual feature extraction model;
Inputting the action-unit and gaze sequence data into a temporal feature extraction model whose output feature dimension is E, and extracting the action-unit and gaze sequence features with this model;
Audio feature extraction: the high-frequency part of the audio data is boosted by pre-emphasis, the human ear's sensitivity to different frequencies is then modelled with Mel-spectrum feature extraction to obtain an audio feature representation conforming to human auditory characteristics, and finally an audio feature aggregation model aggregates this representation into utterance-level emotion and personality feature representations;
Text feature extraction: a multimodal large language model is used, together with the scene frames randomly sampled from the sample video and the audio transcription, to generate a high-quality text that fuses multimodal information; this text is concatenated with the original audio transcription to form a composite text containing explicit linguistic information and implicit visual cues, and the composite text is fed into a language representation model to obtain the text features.
Step 4 builds and applies a multi-level visual feature modeling module G that integrates two sub-processes, temporal modeling and cross-modal feature interaction modeling, to achieve hierarchical fusion of the visual features; the specific steps are as follows:
Step 4.1, constructing the multi-level visual feature modeling module G. The input of module G is two levels of visual features, denoted X and Y; at time step i the corresponding inputs are X_{i-1} and Y_{i-1}, and the temporal modeling is:
h_i^X = GRU(X_{i-1}, h_{i-1}^X), X'_{i-1} = h_i^X
h_i^Y = GRU(Y_{i-1}, h_{i-1}^Y), Y'_{i-1} = h_i^Y
where h_i^X and h_i^Y are the hidden states of the GRUs at the current time step, h_{i-1}^X and h_{i-1}^Y are the hidden states of the GRUs at the previous time step, and X'_{i-1} and Y'_{i-1} are the outputs of the temporal modeling at the current time step;
After the temporal modeling is completed, cross-modal feature interaction modeling is performed as follows:
h_i^→, h_i^← = BiGRU([X'_{i-1}, Y'_{i-1}], h_{i-1})
where h_i^→ and h_i^← are the forward and backward outputs of the bidirectional GRU, [·,·] denotes vector concatenation, and h_{i-1} is the hidden state of the bidirectional GRU at the previous time step;
The forward and backward outputs of the bidirectional GRU are further compressed and nonlinearly transformed by a fully connected layer to generate the visual association feature:
Z_i = σ(W_p · [h_i^→, h_i^←] + b_p)
where W_p and b_p are the weight matrix and bias term of the fully connected layer, · denotes matrix multiplication, σ(·) is the nonlinear activation, and Z_i is the visual association feature of the current time step;
Step 4.2, applying a multi-level visual feature modeling module G, namely firstly, fusing an action unit, a sight line sequence data feature and a facial feature input module G to obtain an intermediate feature, and then further fusing the intermediate feature with a scene feature input module G to finally obtain a multi-level visual associated visual feature, wherein the whole process is shown in the following formula:
V=G(Ds,G(Df,Dg))
where D_s, D_f, and D_g denote the scene features, the facial features, and the action-unit and gaze sequence features, respectively; G is the multi-level visual feature modeling module; and the final output V is the visual feature after multi-level visual association.
The step 5 specifically comprises the following steps:
Step 5.1, mapping the multi-level visually associated visual features, the audio features, and the text features with a projection network; the projection network adopts a three-layer fully connected structure, with batch normalization and a ReLU nonlinear activation after each layer to stabilize training:
h_i^(m) = ReLU(BN(W_i^(m) · h_{i-1}^(m) + b_i^(m))), i ∈ {1, 2, 3}, with h_0^(m) = f_m and z_m = h_3^(m)
where m ∈ {V, A, T} denotes the modality; V, A, and T are the multi-level visually associated visual feature, the audio feature, and the text feature, respectively; f_m is the feature of modality m; W_i^(m) and b_i^(m), i ∈ {1, 2, 3}, are learnable parameters; BN(·) denotes batch normalization; and z_m is the projected feature of modality m;
Step 5.2, further adjusting the projected features of the different modalities with a prediction network; the prediction network adopts a two-layer fully connected structure in which the first layer reduces the feature dimension and the second layer restores it:
P_m = W_2^(p) · σ(W_1^(p) · z_m + b_1^(p)) + b_2^(p)
where P_m is the feature of modality m after adjustment by the prediction network, W_1^(p), b_1^(p), W_2^(p), and b_2^(p) are the learnable parameters of the prediction network, and σ(·) is a nonlinear activation;
Step 5.3, the projected features of the different modalities obtained in step 5.1 are taken as the low-dimensional features of the modalities, and the features adjusted by the prediction network in step 5.2 are taken as the high-dimensional features. Taking the high-dimensional feature of modality m as the anchor sample, the high-dimensional features of the other modalities as positive samples, and the low-dimensional features of all modalities as negative samples, the contrastive loss is:
L_m = -log( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) / ( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) + Σ_{z^-} exp(sim(p_m, z^-)/τ) ) )
where p_m and p_m^+ denote the high-dimensional anchor sample of modality m and its positive samples, sim(·) is the similarity function, τ is the temperature coefficient, and z^- denotes a negative sample;
The final cross-modal semantic alignment loss is expressed as:
L_align = Σ_{m ∈ M} L_m
where M denotes the set of modalities, M = {V, A, T}.
A personality trait prediction system based on a multimodal large language model and contrastive learning, comprising:
the video acquisition module is used for recording videos by using video acquisition equipment to obtain sample videos;
the video preprocessing module is used for preprocessing the sample video to obtain original data;
the feature extraction module is used for extracting visual features, audio features and text features of the sample video based on the original data;
The multi-level visual correlation module is used for carrying out multi-level visual correlation on visual characteristics;
The visual text alignment module is used for realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3;
and the Big Five personality prediction module, which further enhances interaction and fusion among the modalities after cross-modal semantic alignment using a multi-head attention mechanism, and finally predicts each of the Big Five personality traits using a multi-layer perceptron.
A personality trait prediction device based on a multimodal large language model and contrastive learning, comprising:
a memory for storing a computer program implementing the personality trait prediction method based on a multimodal large language model and contrastive learning;
and a processor which, when executing the computer program, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer program product comprising a computer program which, when executed by a processor, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention adopts bidirectional gated recurrent units to associate and fuse visual features from fine to coarse granularity, capturing the contextual dependencies between local frames while fusing global sequence information, which effectively avoids the information loss caused by unidirectional modeling. With this structural design, the model can more fully mine and exploit the dynamic characteristics in visual data, strengthening the expressive power and discriminability of the visual features and further improving personality prediction accuracy.
2. The invention introduces a multimodal large language model (MLLM) to enhance the text, automatically bringing in external knowledge and semantic reasoning on top of the original text information so that the text representation becomes richer and semantically more complete. When interacting with the visual modality, this achieves "soft alignment" at the semantic level and improves the ability of the text to interpret and complement visual information during fusion, effectively increasing the complementarity and information density of the cross-modal representation.
3. The invention adopts a contrastive learning mechanism that constructs positive and negative sample pairs during training; by reducing the semantic distance between the high-dimensional features of different modalities and enlarging the distance between the high-dimensional and low-dimensional features, it achieves precise alignment at the semantic level. This not only improves fusion across modalities but also markedly enhances the prediction accuracy and robustness of the model in complex conditions such as noisy input or missing modalities.
In summary, the invention introduces innovative designs in multi-level visual feature modeling, visual-text fusion, and cross-modal semantic alignment, significantly improving the accuracy and robustness of personality prediction, and has high application value and promotion prospects.
Drawings
FIG. 1 is a diagram of a model overall architecture of the personality characteristic prediction method of the present invention.
Fig. 2 is a flow chart of the invention for extracting facial frames from a sample video.
FIG. 3 is a kernel density plot of the gaze distribution according to the present invention.
FIG. 4 is a diagram of the ViT model structure employed in an embodiment of the present invention.
Fig. 5 is a diagram of a TCN model structure used in an embodiment of the present invention.
Fig. 6 is a residual network configuration diagram of an attention introducing mechanism according to an embodiment of the present invention.
FIG. 7 is a block diagram of a text feature encoding method based on visual language big model assistance in an embodiment of the invention.
Fig. 8 is a diagram illustrating a multi-level visual association structure in accordance with an embodiment of the present invention.
FIG. 9 is a cross-modal semantic alignment block diagram based on contrast learning in an embodiment of the invention.
FIG. 10 is a block diagram of multi-modal feature interaction and fusion based on multi-head attention in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the personality trait prediction method based on a multimodal large language model and contrastive learning comprises the following steps:
Step 1, recording video with a video acquisition device; in this embodiment a Canon EOS 90D is used to record, under natural light in a real social scene, short video clips of multiple subjects expressing themselves freely in front of the camera, yielding the sample videos;
step 2, preprocessing the sample video to obtain original data, specifically, respectively performing the following operations on the sample video:
Intercepting a sample video scene frame by using computer vision processing software, wherein the computer vision processing software adopted by the embodiment is OpenCV;
facial frames (fig. 2), action units and line-of-sight sequence data (fig. 3) are extracted from a sample video by using facial analysis software, wherein the facial analysis software adopted by the embodiment is OpenFace;
extracting audio data from the sample video by using audio and video processing software, wherein the audio and video processing software adopted in the embodiment is FFmpeg, and then transcribing the audio data by using voice recognition software to obtain text data, and the voice recognition software adopted in the embodiment is OpenAI Whisper;
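An illustrative Python sketch of this preprocessing step is given below, assuming OpenCV (cv2), FFmpeg, the openai-whisper package, and OpenFace's FeatureExtraction binary are installed and on PATH; file paths, the Whisper model size, and the OpenFace command-line flags are assumptions, and frames are taken at segment midpoints here rather than randomly as in training.

```python
import subprocess
import cv2
import whisper  # openai-whisper package

def extract_scene_frames(video_path, n_segments=16):
    """Sample one scene frame from each of n_segments equal segments with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for k in range(n_segments):
        idx = int((k + 0.5) * total / n_segments)   # segment midpoint (random index during training)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    return frames

def extract_face_au_gaze(video_path, out_dir="openface_out"):
    """Run OpenFace FeatureExtraction to export face frames, action units and gaze data (CSV)."""
    subprocess.run(["FeatureExtraction", "-f", video_path, "-out_dir", out_dir], check=True)

def extract_audio_and_text(video_path, wav_path="speech.wav"):
    """Demux mono 16 kHz audio with FFmpeg, then transcribe it with Whisper."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", "16000", wav_path], check=True)
    asr = whisper.load_model("base")
    return wav_path, asr.transcribe(wav_path)["text"]

if __name__ == "__main__":
    video = "sample_interview.mp4"                    # illustrative path
    scene_frames = extract_scene_frames(video)
    extract_face_au_gaze(video)
    wav, transcript = extract_audio_and_text(video)
    print(len(scene_frames), "scene frames;", len(transcript), "characters of transcript")
```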
Step 3, extracting visual features, audio features, and text features of the sample video based on the raw data, where the visual features are obtained with a ViT model and a TCN model, the audio features are obtained with pre-emphasis, Mel-spectrum feature extraction, and a residual network, and the text features are obtained with the Qwen-VL vision-language large model and a RoBERTa model; the specific steps are as follows:
visual characteristics extraction, wherein the visual characteristics comprise scene characteristics, facial characteristics, action units and sight line sequence data characteristics, and the extraction process is as follows:
Each sample video is divided into n segments (n = 16 in this embodiment); one frame, comprising a scene frame and a facial frame, is randomly sampled from each segment and resized to the input resolution of the visual feature extraction model (224×224); the output feature dimension of the visual feature extraction model (such as a Mamba or ViT model) is set to E, and the ViT model is used to extract the scene features and facial features with E = 256, as shown in FIG. 4;
The action-unit and gaze sequence data are fed into a temporal feature extraction model, a TCN in this embodiment, with output feature dimension E = 256, to obtain the action-unit and gaze sequence features, as shown in FIG. 5;
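The following is an illustrative PyTorch sketch of the two preceding paragraphs: a pretrained ViT backbone (via the timm library, assumed available) projected to E = 256 for scene/facial frames, and a simplified, non-causal dilated-convolution stack standing in for the TCN over action-unit/gaze sequences. Model names, channel sizes, and the segment count are assumptions for illustration, not the exact embodiment.

```python
import torch
import torch.nn as nn
import timm  # assumed available for the ViT backbone

E = 256  # output feature dimension used in this embodiment

class ViTFeature(nn.Module):
    """Frame-level scene/face features: pretrained ViT backbone + linear projection to E dims."""
    def __init__(self):
        super().__init__()
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.backbone.num_features, E)

    def forward(self, frames):                          # frames: (B, n_segments, 3, 224, 224)
        b, n = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))     # (B*n, backbone_dim)
        return self.proj(feats).view(b, n, E)           # (B, n_segments, E)

class TCNFeature(nn.Module):
    """Temporal features of action-unit / gaze sequences via dilated 1-D convolutions."""
    def __init__(self, in_dim, channels=(128, 256), kernel_size=3):
        super().__init__()
        layers, prev = [], in_dim
        for i, ch in enumerate(channels):
            d = 2 ** i                                   # exponentially growing dilation
            layers += [nn.Conv1d(prev, ch, kernel_size,
                                 padding=d * (kernel_size - 1) // 2, dilation=d),
                       nn.ReLU()]
            prev = ch
        self.net = nn.Sequential(*layers)
        self.proj = nn.Linear(prev, E)

    def forward(self, seq):                              # seq: (B, T, in_dim) AU/gaze sequence
        h = self.net(seq.transpose(1, 2)).transpose(1, 2)  # (B, T, channels[-1])
        return self.proj(h)                                # (B, T, E)

if __name__ == "__main__":
    scene_feat = ViTFeature()(torch.randn(2, 16, 3, 224, 224))   # (2, 16, 256)
    au_gaze_feat = TCNFeature(in_dim=23)(torch.randn(2, 120, 23))  # (2, 120, 256)
    print(scene_feat.shape, au_gaze_feat.shape)
```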
Audio feature extraction: the high-frequency part of the audio data is boosted by pre-emphasis, the human ear's sensitivity to different frequencies is then modelled with Mel-spectrum feature extraction to obtain an audio feature representation conforming to human auditory characteristics, and finally an audio feature aggregation model aggregates this representation into utterance-level emotion and personality feature representations, as shown in FIG. 6;
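A minimal sketch of the pre-emphasis and Mel-spectrum steps described above, using torchaudio; the pre-emphasis coefficient, FFT size, hop length, and number of Mel bins are illustrative assumptions, and the attention-augmented residual aggregation network of FIG. 6 is not reproduced here.

```python
import torch
import torchaudio

def preemphasis(wave, alpha=0.97):
    """Boost the high-frequency part of the waveform: y[t] = x[t] - alpha * x[t-1]."""
    return torch.cat([wave[..., :1], wave[..., 1:] - alpha * wave[..., :-1]], dim=-1)

wave, sr = torchaudio.load("speech.wav")          # mono 16 kHz file from the preprocessing step
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                           hop_length=256, n_mels=64)
mel_feat = mel(preemphasis(wave)).log1p()         # (1, 64, T) log-Mel features
# mel_feat would then be fed to the residual aggregation network to obtain
# utterance-level emotion / personality representations.
print(mel_feat.shape)
```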
Text feature extraction: a multimodal large language model, the Qwen-VL vision-language large model in this embodiment, is combined with the scene frames randomly sampled from the sample video and the audio transcription to generate a high-quality text that fuses multimodal information; this text is concatenated with the original audio transcription to form a composite text containing explicit linguistic information and implicit visual cues, and the composite text is fed into a language representation model, a RoBERTa model in this embodiment, to obtain the text features, as shown in FIG. 7;
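The sketch below illustrates the composite-text construction and RoBERTa encoding, assuming the Hugging Face transformers library and the "roberta-base" checkpoint; the Qwen-VL call is not reproduced (its prompt format is not specified in the text), so a stand-in string takes the place of its visual description, and the sample transcript is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # illustrative checkpoint
encoder = AutoModel.from_pretrained("roberta-base")

transcript = "I really enjoy working with new teams and learning from them."  # ASR output (hypothetical)
# In the embodiment this string would come from prompting Qwen-VL with the sampled
# scene frames and the transcript; here it is a stand-in for that MLLM output.
visual_text = "The speaker smiles frequently and keeps steady eye contact with the camera."

composite = transcript + " " + visual_text        # explicit language + implicit visual cues
tokens = tokenizer(composite, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    text_feat = encoder(**tokens).last_hidden_state[:, 0]   # (1, 768) sentence-level text feature
print(text_feat.shape)
```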
step 4, performing multi-level visual correlation on the visual characteristics;
In this step, a multi-level visual feature modeling module G is constructed and applied. Module G integrates two sub-processes, temporal modeling and cross-modal feature interaction modeling, to achieve hierarchical fusion of the visual features; the specific steps are as follows:
Step 4.1, constructing the multi-level visual feature modeling module G. The input of module G is two levels of visual features, denoted X and Y; taking time step i as an example, the corresponding inputs are X_{i-1} and Y_{i-1}, and the temporal modeling is:
h_i^X = GRU(X_{i-1}, h_{i-1}^X), X'_{i-1} = h_i^X
h_i^Y = GRU(Y_{i-1}, h_{i-1}^Y), Y'_{i-1} = h_i^Y
where h_i^X and h_i^Y are the hidden states of the GRUs at the current time step, h_{i-1}^X and h_{i-1}^Y are the hidden states of the GRUs at the previous time step, and X'_{i-1} and Y'_{i-1} are the outputs of the temporal modeling at the current time step.
After the temporal modeling is completed, cross-modal feature interaction modeling is performed as follows:
h_i^→, h_i^← = BiGRU([X'_{i-1}, Y'_{i-1}], h_{i-1})
where h_i^→ and h_i^← are the forward and backward outputs of the bidirectional GRU, [·,·] denotes vector concatenation, and h_{i-1} is the hidden state of the bidirectional GRU at the previous time step.
The forward and backward outputs of the bidirectional GRU are further compressed and nonlinearly transformed by a fully connected layer to generate the visual association feature:
Z_i = σ(W_p · [h_i^→, h_i^←] + b_p)
where W_p and b_p are the weight matrix and bias term of the fully connected layer, · denotes matrix multiplication, σ(·) is the nonlinear activation, and Z_i is the visual association feature of the current time step;
And 4.2, applying a multi-level visual feature modeling module G, namely firstly fusing the action unit, the sight line sequence data feature and the facial feature input module G to obtain intermediate features, and then further fusing the intermediate features with the scene feature input module G to finally obtain the visual features after multi-level visual association. The whole process is shown in the following formula:
V=G(Ds,G(Df,Dg))
where D_s, D_f, and D_g denote the scene features, the facial features, and the action-unit and gaze sequence features, respectively; G is the multi-level visual feature modeling module; and the final output V is the visual feature after multi-level visual association, as shown in FIG. 8;
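An illustrative PyTorch sketch of module G and its hierarchical application is given below; the feature dimension, batch size, and sequence length are assumptions, and the GRU/BiGRU hidden sizes are chosen only to keep input and output dimensions equal.

```python
import torch
import torch.nn as nn

class VisualAssociation(nn.Module):
    """Multi-level visual feature modeling module G (sketch of step 4):
    per-stream GRU temporal modeling, BiGRU cross-feature interaction, FC fusion."""
    def __init__(self, dim=256):
        super().__init__()
        self.gru_x = nn.GRU(dim, dim, batch_first=True)
        self.gru_y = nn.GRU(dim, dim, batch_first=True)
        self.bigru = nn.GRU(2 * dim, dim, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, y):                  # x, y: (B, T, dim) two levels of visual features
        x_out, _ = self.gru_x(x)              # temporal modeling of each stream
        y_out, _ = self.gru_y(y)
        h, _ = self.bigru(torch.cat([x_out, y_out], dim=-1))   # forward/backward interaction
        return self.fc(h)                     # (B, T, dim) visual association features Z

G = VisualAssociation(dim=256)
Ds = torch.randn(2, 16, 256)   # scene features
Df = torch.randn(2, 16, 256)   # facial features
Dg = torch.randn(2, 16, 256)   # action-unit / gaze sequence features
V = G(Ds, G(Df, Dg))           # hierarchical association: fine-grained features first, then scene
print(V.shape)                 # torch.Size([2, 16, 256])
```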
step 5, realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3;
Step 5.1, mapping the multi-level visually associated visual features, the audio features, and the text features with a projection network; the projection network adopts a three-layer fully connected structure, with batch normalization and a ReLU nonlinear activation after each layer to stabilize training:
h_i^(m) = ReLU(BN(W_i^(m) · h_{i-1}^(m) + b_i^(m))), i ∈ {1, 2, 3}, with h_0^(m) = f_m and z_m = h_3^(m)
where m ∈ {V, A, T} denotes the modality; V, A, and T are the multi-level visually associated visual feature, the audio feature, and the text feature, respectively; f_m is the feature of modality m; W_i^(m) and b_i^(m), i ∈ {1, 2, 3}, are learnable parameters; BN(·) denotes batch normalization; and z_m is the projected feature of modality m;
Step 5.2, further adjusting the projected features of the different modalities with a prediction network, whose main function is to guide feature optimization and prevent feature collapse during training; the prediction network adopts a two-layer fully connected structure in which the first layer reduces the feature dimension and the second layer restores it, and batch normalization is not used so as to keep the information flow stable during optimization:
P_m = W_2^(p) · σ(W_1^(p) · z_m + b_1^(p)) + b_2^(p)
where P_m is the feature of modality m after adjustment by the prediction network, W_1^(p), b_1^(p), W_2^(p), and b_2^(p) are the learnable parameters of the prediction network, and σ(·) is a nonlinear activation;
Step 5.3, the projected features of the different modalities obtained in step 5.1 are taken as the low-dimensional features of the modalities, and the features adjusted by the prediction network in step 5.2 are taken as the high-dimensional features. Taking the high-dimensional feature of modality m as the anchor sample, the high-dimensional features of the other modalities as positive samples, and the low-dimensional features of all modalities as negative samples, the contrastive loss is:
L_m = -log( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) / ( Σ_{p_m^+} exp(sim(p_m, p_m^+)/τ) + Σ_{z^-} exp(sim(p_m, z^-)/τ) ) )
where p_m and p_m^+ denote the high-dimensional anchor sample of modality m and its positive samples, sim(·) is the similarity function, τ is the temperature coefficient, and z^- denotes a negative sample;
The final cross-modal semantic alignment loss is expressed as:
L_align = Σ_{m ∈ M} L_m
where M denotes the set of modalities, M = {V, A, T}, as shown in FIG. 9;
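The sketch below instantiates steps 5.1–5.3 in PyTorch under stated assumptions: per-modality projection and prediction heads, cosine similarity as sim(·), and a standard InfoNCE-style grouping of the positives and negatives described above. Dimensions, the hidden size of the prediction head, and the temperature are illustrative; the exact loss formulation in the embodiment may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Three-layer projection head with BatchNorm + ReLU after each layer (step 5.1 sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

    def forward(self, f):
        return self.net(f)

class Predictor(nn.Module):
    """Two-layer prediction head: dimension reduction then restoration, no BatchNorm (step 5.2)."""
    def __init__(self, dim=256, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        return self.net(z)

def alignment_loss(p, z, tau=0.07):
    """InfoNCE-style contrastive loss over modalities (step 5.3 sketch).
    p[m]: high-dimensional (predicted) feature of modality m, anchor / positive;
    z[m]: low-dimensional (projected) feature of modality m, negative."""
    mods, loss = list(p.keys()), 0.0
    for m in mods:
        anchor = F.normalize(p[m], dim=-1)
        pos = [F.normalize(p[o], dim=-1) for o in mods if o != m]
        neg = [F.normalize(z[o], dim=-1) for o in mods]
        pos_sim = torch.stack([(anchor * q).sum(-1) / tau for q in pos])   # (|pos|, B)
        neg_sim = torch.stack([(anchor * q).sum(-1) / tau for q in neg])   # (|neg|, B)
        denom = torch.logsumexp(torch.cat([pos_sim, neg_sim], dim=0), dim=0)
        loss = loss - (pos_sim.logsumexp(dim=0) - denom).mean()
    return loss

# usage sketch: features of the three modalities after steps 3-4, each (B, 256)
feats = {m: torch.randn(8, 256) for m in ("V", "A", "T")}
proj = {m: Projector() for m in feats}
pred = {m: Predictor() for m in feats}
z = {m: proj[m](feats[m]) for m in feats}
p = {m: pred[m](z[m]) for m in z}
print(alignment_loss(p, z).item())
```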
Step 6, further enhancing interaction and fusion among the modalities with a multi-head attention mechanism so that information can be fully propagated and exploited across the different modality features, and finally predicting each of the Big Five personality traits with a multi-layer perceptron, as shown in FIG. 10.
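A minimal sketch of step 6 follows: one attention token per modality, multi-head self-attention for cross-modal interaction, and one small MLP head per Big Five trait. Treating each modality as a single token, the number of heads, and the sigmoid output (matching the 0-1 score range used in the experiments) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    """Step 6 sketch: multi-head attention over the three modality tokens,
    then one MLP regression head per Big Five trait."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid())
            for _ in range(5))   # extraversion, neuroticism, agreeableness, conscientiousness, openness

    def forward(self, v, a, t):                     # each (B, dim) aligned modality feature
        tokens = torch.stack([v, a, t], dim=1)      # (B, 3, dim): one token per modality
        fused, _ = self.attn(tokens, tokens, tokens)
        flat = fused.flatten(1)                     # (B, 3*dim)
        return torch.cat([head(flat) for head in self.heads], dim=-1)   # (B, 5) scores in [0, 1]

model = FusionPredictor()
scores = model(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))
print(scores.shape)   # torch.Size([2, 5])
```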
A personality trait prediction system based on a multimodal large language model and contrastive learning, comprising:
the video acquisition module is used for recording videos by using video acquisition equipment to obtain sample videos, and the sample videos are used for realizing the step 1 of the invention;
the video preprocessing module is used for preprocessing the sample video to obtain original data, and the original data are used for realizing the step 2 of the invention;
The feature extraction module is used for extracting visual features, audio features and text features of the sample video based on the original data and is used for realizing the step 3 of the invention;
The multi-level visual correlation module carries out multi-level visual correlation on visual characteristics and is used for realizing the step 4 of the invention;
the visual text alignment module is used for realizing cross-modal semantic alignment by adopting contrast learning based on the visual characteristics after multi-level visual association and the audio characteristics and the text characteristics extracted in the step 3 and is used for realizing the step 5 of the invention;
the Big Five personality prediction module, which further enhances interaction and fusion among the modalities after cross-modal semantic alignment using a multi-head attention mechanism and finally predicts each of the Big Five personality traits using a multi-layer perceptron, and is used to implement step 6 of the invention.
A personality trait prediction device based on a multimodal large language model and contrastive learning, comprising:
a memory for storing a computer program implementing the personality trait prediction method based on a multimodal large language model and contrastive learning;
and a processor which, when executing the computer program, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the personality trait prediction method based on a multimodal large language model and contrastive learning.
A computer program product comprising a computer program which, when executed by a processor, implements the personality trait prediction method based on a multimodal large language model and contrastive learning.
Simulation experiment
In the experimental section, the invention was validated on the ChaLearn First Impressions V2 dataset, a large-scale and widely used multimodal dataset dedicated to personality trait prediction. It contains 10,000 video clips taken from 3,000 high-definition YouTube videos in which individuals speak English facing the camera. The dataset is split into training, validation, and test sets at a ratio of 3:1:1, and all experimental results are evaluated on the test set.
On this dataset, the method of the invention was compared with existing representative work. The evaluation metric is the prediction accuracy of the Big Five personality traits, ranging from 0 to 1, with values closer to 1 indicating more accurate prediction. The experimental results are shown in the table below. The representative works are the CRNet, DFNN, and TMIN models, where CRNet is from the paper "CR-Net: A Deep Classification-Regression Network for Multimodal Apparent Personality Analysis", DFNN from "Self-Supervised Learning of Person-Specific Facial Dynamics for Automatic Personality Recognition", and TMIN from "In Your Eyes: Modality Disentangling for Personality Analysis in Short Video".
As can be seen from the table below, the method of the invention performs well across all personality dimensions. By capturing individuals' dynamic behavior and emotional state, it obtains the highest accuracy of 0.9241 on conscientiousness, indicating strong discriminative power in describing the stability and self-discipline of the individual. Although the gains are relatively limited on agreeableness and openness, the method remains at the level of the leading existing approaches, reflecting the overall balance of the model.
More importantly, the average accuracy of the method reaches 0.9202; compared with 0.9188 for CRNet, 0.9168 for DFNN, and 0.9161 for TMIN, this is an improvement of 0.14%, 0.32%, and 0.41%, respectively. The results show that the method not only has advantages on individual metrics but is also clearly superior to the prior art in overall performance, fully verifying the advancement, effectiveness, and robustness of the method in personality trait prediction.
| Method | Extraversion | Neuroticism | Agreeableness | Conscientiousness | Openness | Average accuracy |
|---|---|---|---|---|---|---|
| TMIN | 0.9158 | 0.9142 | 0.9143 | 0.9210 | 0.9153 | 0.9161 |
| DFNN | 0.9183 | 0.9133 | 0.9262 | 0.9182 | 0.9180 | 0.9168 |
| CRNet | 0.9202 | 0.9146 | 0.9177 | 0.9218 | 0.9195 | 0.9188 |
| The invention | 0.9226 | 0.9184 | 0.9170 | 0.9241 | 0.9189 | 0.9202 |
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202511276064.9A CN121170671A (en) | 2025-09-08 | 2025-09-08 | Personality trait prediction methods, systems, devices, media, and program products based on multimodal large language models and contrastive learning. |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN121170671A (en) | 2025-12-19 |
Family
ID=98042584
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202511276064.9A Pending CN121170671A (en) | 2025-09-08 | 2025-09-08 | Personality trait prediction methods, systems, devices, media, and program products based on multimodal large language models and contrastive learning. |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN121170671A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |