
CN118507053A - Psychological health intelligent assessment method and system - Google Patents


Info

Publication number
CN118507053A
CN118507053A (application CN202410664703.8A)
Authority
CN
China
Prior art keywords
video
features
model
standard
evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410664703.8A
Other languages
Chinese (zh)
Inventor
姜怀臣
李宜兵
张东
赵峂
刘金林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Zhongke Huixin Intelligent Technology Co ltd
Original Assignee
Shandong Zhongke Huixin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Zhongke Huixin Intelligent Technology Co ltd filed Critical Shandong Zhongke Huixin Intelligent Technology Co ltd
Priority to CN202410664703.8A priority Critical patent/CN118507053A/en
Publication of CN118507053A publication Critical patent/CN118507053A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, for calculating health indices; for individual health risk assessment
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a psychological health intelligent assessment method and system, comprising the following steps: acquiring a historical behavior video dataset and corresponding scoring tags, and using them as a training set; standardising the historical behavior video dataset via a sliding window according to the difference between the video duration and a standard duration; performing feature extraction on the obtained standard video frames to extract ViViT video global features, expression features, psychological features and physiological features; concatenating these into fusion features and training a pre-constructed evaluation model with the fusion features and the corresponding scoring tags; and acquiring a behavior video file of the subject to be assessed and obtaining an assessment result with the trained evaluation model. The multi-modal features are assessed comprehensively and a multi-modal loss function is added so that the relationships among the features are considered jointly, improving assessment accuracy.

Description

Psychological health intelligent assessment method and system
Technical Field
The invention relates to the technical field of data processing, in particular to a psychological health intelligent assessment method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Traditional mental health assessment mainly comprises observation, interviews, surveys, psychological testing and the like, with a comprehensive assessment result obtained by combining these approaches. Auxiliary examinations such as routine blood tests, electrocardiograms and cranial CT may be introduced as appropriate, so that the subject's mental health is assessed comprehensively by weighing the examination results together with the interview and psychological test results.
The whole process of traditional mental health assessment therefore requires the guidance of a professional physician, places high demands on the objectivity and coverage of the questionnaire content, and requires that the subject fully understands his or her own situation and does not intentionally disguise or conceal it. These factors mean that the assessment result is influenced to a great extent by the subjective factors of everyone involved in the process; its accuracy and reliability are hard to guarantee, and it has little value for application and popularisation in daily life.
Mental health status can also be reflected in physiological signals, so the prior art often analyses the physiological signals of the person under test and then evaluates his or her mental health accordingly. However, although such physiological-signal-based assessment methods have a degree of objectivity, they depend on acquisition quality: when signal quality is poor, the accuracy of the mental health assessment suffers greatly. In short, existing mental health assessment methods have low assessment accuracy and poor robustness.
Furthermore, some mental health evaluation systems analyse users' facial expressions, but these systems have simple functions: they can only analyse facial expressions, rely on a single index, lack judgment accuracy, and cannot produce accurate test results.
In addition, with the rapid progress of artificial intelligence in recent years, more and more deep-learning-based mental health assessment methods have emerged, but they generally suffer from unfounded feature extraction, opaque training processes, and unconvincing predictions.
Disclosure of Invention
In order to solve the above problems, the invention provides a psychological health intelligent assessment method and system that comprehensively assesses multi-modal features (ViViT video global features, expression features, psychological features and physiological features) and adds a multi-modal loss function so that the relationships among the features are considered jointly, improving assessment accuracy.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides a mental health intelligent assessment method, comprising:
acquiring a historical behavior video dataset and corresponding scoring tags, and using them as a training set;
standardising the historical behavior video dataset via a sliding window according to the difference between the video duration and a standard duration, performing feature extraction on the obtained standard video frames to extract ViViT video global features, expression features, psychological features and physiological features, concatenating these into fusion features, and training a pre-constructed evaluation model with the fusion features and the corresponding scoring tags;
and acquiring a behavior video file of the subject to be assessed, and obtaining an assessment result with the trained evaluation model.
As an alternative embodiment, the process of standardising the historical behavior video dataset includes: cleaning and segmenting the videos according to the video duration, the set standard duration and the sliding window, and using the resulting videos within the standard duration together with the segments derived from videos exceeding the standard duration as the video training set.
In an alternative embodiment, the process of cleaning the video includes:
setting a first threshold and a second threshold, wherein the first threshold is smaller than the second threshold;
if the video duration is smaller than the first threshold, deleting the current video;
if the video duration is greater than the first threshold and smaller than the second threshold, retaining the current video;
if the video duration is greater than the second threshold, segmenting the current video according to the set standard duration and the sliding window.
In an alternative embodiment, the segmentation includes: segmenting according to the set standard duration, and applying the set sliding window to the portion exceeding the standard duration, thereby obtaining several segments, each of which is treated as an independent data point.
As an alternative embodiment, the evaluation model adopts a modified ViViT model, in which a fully connected layer is added to the ViViT model as a regression layer, adjusting the model's output layer so that it outputs continuous label scores instead of classification labels;
the evaluation model is trained with a multi-modal loss function of the form L = Σ_{i=1}^{n} α_i · MSE_i, where MSE_i is the loss function of the i-th modality, α_i is the weight of the i-th modality, and n is the number of modalities, i.e. the number of evaluation index dimensions.
As an alternative embodiment, the process of feature extraction includes:
extracting ViViT video global features using a Transformer architecture;
After face recognition is carried out on the standard video frame, a convolutional neural network is adopted to extract expression features from the face information;
Extracting psychological characteristics from continuous standard video frames by adopting an optical flow method so as to reflect behavior and posture information;
Converting a standard video frame into a gray image, and then calculating the average value of pixel intensity of each frame to obtain a time sequence; and performing fast Fourier transform on the time sequence to obtain a frequency domain signal, wherein the frequency domain signal is used as a physiological characteristic.
In a second aspect, the present invention provides a mental health intelligent assessment system comprising:
an acquisition module, configured to acquire a historical behavior video dataset and corresponding scoring tags and use them as a training set;
a training module, configured to standardise the historical behavior video dataset via a sliding window according to the difference between the video duration and a standard duration, perform feature extraction on the obtained standard video frames to extract ViViT video global features, expression features, psychological features and physiological features, concatenate these into fusion features, and train a pre-constructed evaluation model with the fusion features and the corresponding scoring tags;
an evaluation module, configured to acquire a behavior video file of the subject to be assessed and obtain an assessment result with the trained evaluation model.
In a third aspect, the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the invention provides a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a psychological health intelligent assessment method and system that comprehensively assesses multi-modal features such as ViViT video global features, expression features, psychological features and physiological features, and adds a multi-modal loss function to the traditional MSE method so that the relationships among the features are considered jointly, improving the model's assessment accuracy. The multi-modal loss function comprehensively considers the relationships between different modalities, allowing the model to be trained better and multi-modal features to be extracted, and avoiding the low assessment accuracy and poor robustness caused by a single index.
By cleaning and segmenting the training set, the invention provides a more standardised and diversified basis for video analysis, so that the core content of each video is retained, the complexity that overlong videos could introduce into analysis is avoided, and the efficiency and accuracy of subsequent model training are improved. In addition, a dynamic update mechanism is added to the data-update stage: the data-processing pipeline supports periodic updates to incorporate new video data and labels, ensuring the continued completeness and timeliness of the dataset.
By training the ViViT model, the invention analyses the input video to obtain the subject's mental health level. A feature-extraction stage is added before model training to express the video information more comprehensively, and the loss function is optimised during training to improve model accuracy. In addition, the training dataset uses 11-dimensional typical features selected from several industry-standard scales as evaluation items, and the training videos are scored manually, connecting the capability of an expert system and completing the intelligent mental health assessment task.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flowchart of a mental health intelligent assessment method provided in embodiment 1 of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, e.g., processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
The embodiment provides a mental health intelligent assessment method, as shown in fig. 1, including:
acquiring a historical behavior video dataset and corresponding scoring tags, and using them as a training set;
standardising the historical behavior video dataset via a sliding window according to the difference between the video duration and a standard duration, performing feature extraction on the obtained standard video frames to extract ViViT video global features, expression features, psychological features and physiological features, concatenating these into fusion features, and training a pre-constructed evaluation model with the fusion features and the corresponding scoring tags;
and acquiring a behavior video file of the subject to be assessed, and obtaining an assessment result with the trained evaluation model.
In this embodiment, a historical behavior video dataset of subjects is obtained and each video is scored on 11 preset measurement dimensions, yielding corresponding scoring tags that form a tag set. The tag set is produced by professional physicians scoring psychological-characteristic items drawn from several general mental health scales; since each video has a corresponding scoring tag, the tags serve as the entry point for expert-system capability, and training the model with these expert-labelled videos yields more accurate assessment results.
The 11 measurement dimensions are shown in Table 1, and an evaluation questionnaire can be preset on their basis.
Table 1 Measurement dimensions
Field             Meaning                                      Scale source
pious             Conscientiousness (sense of responsibility)  Big Five personality scale
exoteric          Openness                                     Big Five personality scale
pressure          Perceived stress                             Psychological stress scale
depressed         Depression                                   SDS
dysphoric         Anxiety                                      SAS
delightful        Pleasantness                                 Big Five personality scale
deliration        Neuroticism                                  Big Five personality scale
selfrespect       Self-esteem                                  Self-esteem scale
extroversion      Extraversion                                 Big Five personality scale
sensory needs     Psychological needs (sense of security)      Security scale
risk orientation  Comprehensive risk rating                    Diagnostic conclusion (physician, not a scale)
When training the model, the acquired historical video dataset is first cleaned and segmented.
The data cleaning step mainly handles video length, i.e. it screens the video dataset by duration, as follows:
(1) Set a first threshold and a second threshold, the first smaller than the second; for example, a first threshold of 30 seconds and a second threshold of 60 seconds.
If the video duration is smaller than the first threshold, the video is deemed to contain too little information for effective analysis and is deleted.
(2) If the video duration is greater than the first threshold and smaller than the second (e.g. between 30 and 60 seconds), the video is retained, as such videos are deemed most suitable for accurately describing an adolescent's mental health.
(3) If the video duration is greater than the second threshold, the video is considered too long and must be standardised, i.e. reduced to within the second threshold (e.g. 60 seconds) by segmentation. This processing retains the core content of the video while avoiding the complexity that overlong videos can introduce into analysis.
Data cleaning further standardises and optimises the original video dataset, so that every video segment expresses the adolescent's mental health to the same degree.
The segmentation method is as follows: videos whose duration exceeds the second threshold are segmented according to the set standard duration, and the portion exceeding the standard duration is covered by the set sliding window; for example, a 65-second video is split into segments of 0-60 seconds, 1-61 seconds, 2-62 seconds, and so on. Each segment is then treated as an independent data point.
Each new segment is named after the original video name plus its starting and ending frames, ensuring unique names and complete information.
When the original video dataset is augmented in this way, the tag set must be augmented too, to keep the input data consistent: the augmentation rule copies the original video's tag information to each derived segment.
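A short sketch of this cleaning, segmentation and label-copying flow is given below; the 1-second stride and the helper's name are illustrative assumptions, since the text fixes only the example thresholds of 30 and 60 seconds.

```python
# Sketch of the cleaning/segmentation rules described above (assumptions:
# 1-second sliding-window stride; durations handled in whole seconds).
FIRST_THRESHOLD = 30   # seconds: shorter videos are deleted
SECOND_THRESHOLD = 60  # seconds: standard duration; longer videos are segmented
STRIDE = 1             # seconds: sliding-window step (assumed)

def clean_and_segment(video_name, duration, label):
    """Return a list of (segment_name, start_s, end_s, label) tuples."""
    if duration < FIRST_THRESHOLD:
        return []                                  # too little information: delete
    if duration <= SECOND_THRESHOLD:
        return [(video_name, 0, duration, label)]  # keep the video as-is
    segments = []
    start = 0
    while start + SECOND_THRESHOLD <= duration:
        end = start + SECOND_THRESHOLD
        # naming rule: original video name + starting frame + ending frame
        seg_name = f"{video_name}_{start}_{end}"
        # tag-set amplification: each segment copies the original video's label
        segments.append((seg_name, start, end, dict(label)))
        start += STRIDE
    return segments

# A 65-second video yields the 0-60 s, 1-61 s, ..., 5-65 s segments:
print(clean_and_segment("video1", 65, {"pious": 60.42}))
```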
The final updated video dataset contains two parts: the videos in the original dataset whose duration lies between the first and second thresholds (e.g. 30 to 60 seconds) together with their tags, and the segments generated by splitting videos longer than the standard duration (e.g. 60 seconds) together with their tags. The augmented data and the originally qualifying data jointly form the training dataset. The resulting training videos have uniform duration and consistent expressive capacity, so no single segment influences the model disproportionately, which improves the model's generalisation ability.
The updated dataset aims to provide a more standardised and diversified basis for video analysis, which helps improve the efficiency and accuracy of subsequent model training. In addition, a dynamic update mechanism is added to the data-update stage: the data-processing pipeline supports periodic updates to incorporate new video data and tags, ensuring the continued completeness and timeliness of the dataset.
In this embodiment, feature extraction is performed on the updated video dataset, and ViViT model video global features, expression features, psychological features and physiological features are extracted;
Specifically:
The ViViT video global feature F_ViViT comes from a video pre-training model that uses a Transformer architecture and is pre-trained on a large-scale video dataset, so it can extract rich features from video; F_ViViT is therefore used to express global video information.
The expression feature F_face is obtained by performing face detection and facial expression recognition on the video frames. Face detection uses the Dlib algorithm, and the detected face region is fed into a convolutional neural network (CNN) to extract expression features. F_face thus reflects the interviewee's emotional state; emotional fluctuation tends to be larger when the interviewee is in an abnormal state.
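A minimal sketch of this expression branch follows, pairing Dlib face detection with a small CNN; the CNN architecture, the 48x48 crop size and the 128-dimensional output are assumptions, as the text only specifies Dlib followed by a CNN.

```python
# Hedged sketch: Dlib face detection + a small (untrained) CNN feature extractor.
import cv2
import dlib
import torch
import torch.nn as nn

detector = dlib.get_frontal_face_detector()

class ExpressionCNN(nn.Module):
    """Illustrative CNN; the real network and its weights are not specified."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 12 * 12, feat_dim)  # 48 -> 24 -> 12 after pooling

    def forward(self, x):                    # x: (B, 1, 48, 48)
        return self.fc(self.conv(x).flatten(1))

def expression_feature(frame_bgr, model):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)                # upsample once to find small faces
    if not faces:
        return torch.zeros(128)              # no face in this frame
    f = faces[0]
    crop = gray[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
    crop = cv2.resize(crop, (48, 48)).astype("float32") / 255.0
    x = torch.from_numpy(crop)[None, None]   # shape (1, 1, 48, 48)
    with torch.no_grad():
        return model(x).squeeze(0)           # F_face
```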
The psychological feature F_inner is obtained by analysing the action and behaviour of the person in the video: a person who moves little, for instance, tends to be emotionally more stable. Concretely, it is extracted from multi-frame video by the optical flow method. Optical flow is a vector field estimating the motion of each pixel between consecutive video frames; it captures the motion of objects as well as the motion of the camera, and can therefore be used for behaviour recognition and pose estimation. The basic idea of the optical flow method is to assume that pixel brightness remains constant between consecutive frames and, from this assumption, compute the temporal and spatial variation of the pixels, yielding the optical flow, i.e. the psychological feature. F_inner thus reflects the interviewee's behaviour and posture; these are more stable when the interviewee is in a normal psychological state.
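The sketch below computes dense optical flow between consecutive grayscale frames with OpenCV's Farneback method and pools it into simple motion statistics; pooling into a fixed-length vector is an assumption, since the text specifies only the optical flow method itself.

```python
# Hedged sketch of the psychological-feature branch (optical flow statistics).
import cv2
import numpy as np

def psychological_feature(frames):
    """frames: list of HxW uint8 grayscale frames -> fixed-length vector."""
    mags = []
    for prev, curr in zip(frames, frames[1:]):
        # brightness-constancy assumption: estimate per-pixel motion vectors
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mags.append(mag)
    mags = np.stack(mags)                    # (T-1, H, W) motion magnitudes
    # F_inner: simple summary statistics; calmer subjects give smaller,
    # more stable values
    return np.array([mags.mean(), mags.std(), mags.max()])
```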
The physiological feature F_phy is obtained by converting the video into a frequency-domain signal via the fast Fourier transform. Specifically, each video frame is first converted to a grayscale image and the mean pixel intensity of each frame is computed, giving a time series; the fast Fourier transform of this series yields a frequency-domain signal, which is used as the physiological feature F_phy to reflect the interviewee's physiological state. Fluctuations in physiological features are more pronounced when the interviewee is in an abnormal state.
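A sketch of this physiological branch follows: the per-frame mean grayscale intensity forms a time series whose FFT magnitude spectrum serves as F_phy. The frame rate and the number of retained frequency bins are assumptions.

```python
# Hedged sketch of the physiological-feature branch (mean intensity -> FFT).
import numpy as np

def physiological_feature(gray_frames, fps=25.0, n_bins=32):
    """gray_frames: (T, H, W) uint8 array -> low-frequency magnitude spectrum."""
    series = gray_frames.reshape(len(gray_frames), -1).mean(axis=1)
    series = series - series.mean()          # remove the DC component
    spectrum = np.abs(np.fft.rfft(series))   # frequency-domain signal
    freqs = np.fft.rfftfreq(len(series), d=1.0 / fps)
    # F_phy: keep the first n_bins low-frequency magnitudes, where subtle
    # periodic intensity changes (e.g. pulse-related) would appear
    return spectrum[:n_bins], freqs[:n_bins]
```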
Preferably, the ViViT video global feature F_ViViT, the expression feature F_face, the psychological feature F_inner and the physiological feature F_phy are concatenated into the fusion feature F_fusion of the psychological assessment model, giving richer and more comprehensive feature information and enabling accurate analysis of the psychological condition:
F_fusion = [F_ViViT, F_face, F_inner, F_phy].
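The fusion itself is a plain concatenation, exactly as the formula states; a sketch with illustrative (assumed) feature dimensions:

```python
# F_fusion = [F_ViViT, F_face, F_inner, F_phy] as a concatenation of vectors.
import torch

def fuse(f_vivit, f_face, f_inner, f_phy):
    return torch.cat([f_vivit, f_face, f_inner, f_phy], dim=-1)

# Dimensions below are assumptions for illustration only.
f_fusion = fuse(torch.randn(768), torch.randn(128),
                torch.randn(3), torch.randn(32))
print(f_fusion.shape)  # torch.Size([931])
```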
In this embodiment, the evaluation model is trained on the fusion features and the corresponding scoring tags. It adopts the ViViT model as its neural network: video frames are mapped to a token sequence, positional embeddings are added, and the reorganised result forms the input of the Transformer.
The ViViT model is described below.
One method of tokenising the input video is uniform frame sampling: n_t frames are sampled uniformly from the input video clip, each 2D frame is embedded independently using the same method as ViT, and all resulting tokens are concatenated. Specifically, if n_h · n_w non-overlapping image patches are extracted from each frame, the Transformer encoder must process n_t · n_h · n_w tokens.
The other is tubelet embedding: non-overlapping spatio-temporal "tubes" are extracted from the input video and linearly projected to R^d. This method is the three-dimensional extension of ViT's embedding and corresponds to a three-dimensional convolution. For a tubelet of dimension t × h × w, tokens are extracted along the temporal, height and width dimensions respectively, so an input video of dimension T × H × W yields ⌊T/t⌋ · ⌊H/h⌋ · ⌊W/w⌋ tokens. Smaller tubelet dimensions therefore yield more tokens and increase the computational cost. Intuitively, this method fuses spatio-temporal information during tokenisation, whereas uniform frame sampling fuses temporal information from different frames inside the Transformer.
The ViViT model consists of two separate Transformer encoders. The first is a spatial encoder that models interactions only between tokens extracted from the same temporal index; after its layers, a representation h_i ∈ R^d of each frame is obtained. All frame-level representations are concatenated into H ∈ R^{n_t × d}, which is then processed by a temporal encoder consisting of L_t Transformer layers to model interactions between tokens from different temporal indices. The output token of this encoder is finally classified.
This structure corresponds to a "late fusion" of temporal information, and the initial spatial encoder is the same as the encoder used for image classification. It is therefore similar to CNN architectures that first extract per-frame features and then aggregate them into a final representation before classification. Although this model has more Transformer layers, it requires fewer floating-point operations, because its two separate Transformer blocks have a combined complexity of O((n_h · n_w)^2 + n_t^2), versus O((n_t · n_h · n_w)^2) for a single encoder over all tokens.
In this embodiment, the training process of the ViViT model includes preparing the datasets (loading the tag set and the video dataset), modifying the model structure, loading the pre-trained model, logging, saving weights, and splitting off and evaluating on a validation set.
The tag dataset is a JSON-style list of dictionaries in which each key is a video's unique identifier, i.e. the corresponding video file name, and each value is the corresponding 11-dimensional expert scoring result; this representation makes it easy to match videos with their tags quickly when the video data are loaded. By traversing the folders containing the video files and reading the corresponding tag for each video file, all video files are guaranteed to be considered and correctly associated with their tags.
The video dataset is loaded by decoding the video files with the PyAV library, which handles the decoding process efficiently and extracts the video frames. A number of frame indices are sampled from each video: instead of using all frames, frames are selected according to a strategy (e.g. uniform or random sampling). This reduces computation and improves training efficiency while retaining enough information for effective video understanding.
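A hedged sketch of this loading strategy with PyAV and uniform sampling follows; the choice of 32 frames is an assumption, and stream.frames can be zero for some containers.

```python
# Hedged sketch: decode a video with PyAV and keep uniformly spaced frames.
import av
import numpy as np

def sample_frames(path, num_frames=32):
    container = av.open(path)
    stream = container.streams.video[0]
    total = stream.frames                     # assumed available; may be 0
    indices = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = []
    for i, frame in enumerate(container.decode(video=0)):
        if i in indices:
            frames.append(frame.to_ndarray(format="rgb24"))
    container.close()
    return np.stack(frames)                   # (num_frames, H, W, 3)
```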
In the model-structure part, to better fit the mental health assessment task, a fully connected layer is added to the ViViT model as a regression layer, adjusting the model's output layer so that it outputs continuous label scores instead of classification labels.
This is because the ViViT model was not originally designed for video score regression; it could only classify video types, e.g. placing an interviewee's video into a broad category such as happy, sad, or no obvious emotion. In this embodiment, the fully connected layer added to the model structure as a regression layer outputs a series of regression values, yielding an 11-dimensional label scoring result such as {"video1": {"pious": 60.42, "exoteric": 58.75, "pressure": 36.89, ...}}.
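A minimal sketch of this modification: a fully connected regression head on top of a ViViT backbone that emits 11 continuous scores. The backbone interface and hidden size are assumptions for illustration.

```python
# Hedged sketch: ViViT backbone + fully connected regression layer.
import torch
import torch.nn as nn

DIMS = ["pious", "exoteric", "pressure", "depressed", "dysphoric",
        "delightful", "deliration", "selfrespect", "extroversion",
        "sensory_needs", "risk_orientation"]

class ViViTRegressor(nn.Module):
    def __init__(self, backbone, hidden_dim=768, num_scores=11):
        super().__init__()
        self.backbone = backbone     # pre-trained ViViT encoder (assumed to
                                     # return a (B, hidden_dim) tensor)
        self.regressor = nn.Linear(hidden_dim, num_scores)

    def forward(self, video):
        h = self.backbone(video)
        return self.regressor(h)     # (B, 11) continuous label scores

def scores_to_dict(pred):            # pred: (11,) tensor for one video
    return {k: round(float(v), 2) for k, v in zip(DIMS, pred)}
```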
ViViT-b-16x2-kinetics400 is used as the pre-trained model, meaning that before training begins the model has already been pre-trained on Kinetics-400, a widely used video-understanding dataset. The pre-trained model provides a good starting point that helps speed up convergence and improve the final model's performance.
During training, detailed information such as loss, accuracy and regression performance indices is logged. At the same time, the model weights and optimizer state are saved periodically through a checkpoint mechanism, which both facilitates monitoring of the training process and allows recovery from the latest state if training is interrupted.
The dataset is split into training and test sets at a ratio of 8:2, and the model's performance is evaluated on the validation set at the end of each epoch. This helps monitor whether the model is overfitting and adjust the training strategy.
Through these extended steps, the training process of the ViViT model becomes more detailed and concrete. The steps cover not only data preparation and preprocessing but also adjustment and optimisation of the model structure and effective training strategies, ensuring that the model learns effectively and performs well on video-understanding tasks.
The model is tested on test-set data that was never fed to the model during training, so that the data represent the real scenario in which the model is expected to be applied; the test set is processed in the same way as the training data, and the trained model is finally used to run inference on it and generate predictions.
A commonly used loss for evaluating ViViT models is the mean squared error (MSE), which measures prediction accuracy as the average of the squared differences between the model's predicted and true values:
MSE = (1/n) Σ_{i=1}^{n} (Y_i - Ŷ_i)^2,
where Y_i is the true value, Ŷ_i is the predicted value, and n is the number of samples.
However, expert analysis of the scale indices for adolescent psychological videos shows that adolescent psychological features are usually multi-modal, covering the 11 dimensions of responsibility, openness, stress, depression, anxiety and so on. This embodiment therefore adds a multi-modal loss function to the traditional MSE method, so that the relationships between the features are considered jointly and the model's accuracy improves. The multi-modal loss function comprehensively considers the relationships between different modalities, allowing the model to be trained better and multi-modal features to be extracted.
The multi-modal loss function takes the following form. Model performance is optimised by training with a loss term for each dimension's index: for the outputs O_1, O_2, ..., O_11 of the 11 modalities and the true label Y, the loss function is defined as the weighted sum of the per-modality errors:
L = Σ_{i=1}^{n} α_i · L_i,
where L_i is the loss function of the i-th modality, i.e. its MSE, and α_i is the weight of the i-th modality, used to adjust the relative importance of the different modalities. The MSE is computed separately for each modality and the results are combined according to the modality weights, giving the final loss:
L = Σ_{i=1}^{11} α_i · MSE_i.
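This weighted multi-modal loss translates directly into code; in the sketch below, per-dimension MSEs are combined by the modality weights α_i, with equal weights assumed by default.

```python
# Sketch of the multi-modal loss: L = sum_i alpha_i * MSE_i over 11 dimensions.
import torch

def multimodal_loss(pred, target, alphas=None):
    """pred, target: (B, 11) tensors; alphas: (11,) modality weights."""
    if alphas is None:                       # equal weights are an assumption
        alphas = torch.full((pred.shape[1],), 1.0 / pred.shape[1])
    mse_per_mode = ((pred - target) ** 2).mean(dim=0)   # MSE_i, shape (11,)
    return (alphas * mse_per_mode).sum()
```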
Reliability test: to test the model's reliability for psychological assessment, video data of the same person at different time points can be collected and run through the model. The consistency of the results obtained at different time points is then compared; statistical methods such as the Pearson correlation coefficient can quantify this consistency. High consistency indicates good model reliability.
Validity evaluation: the model's credibility can be assessed by comparing its results with expert labels (or standardised test results). The differences can be quantified with the mean absolute error (MAE) or correlation coefficients measuring the agreement between model results and labels. Low error and high correlation indicate good model credibility.
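Both checks reduce to a few lines; in the sketch below, reliability is the Pearson correlation between scores from two time points, and validity is the MAE plus correlation against expert labels.

```python
# Sketch of the reliability and validity checks described above.
import numpy as np
from scipy.stats import pearsonr

def reliability(scores_t1, scores_t2):
    r, _ = pearsonr(scores_t1, scores_t2)    # high r: consistent over time
    return r

def validity(model_scores, expert_labels):
    mae = np.abs(np.asarray(model_scores) - np.asarray(expert_labels)).mean()
    r, _ = pearsonr(model_scores, expert_labels)
    return mae, r                            # low MAE, high r: credible model
```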
In this embodiment, the ViViT model can be deployed as a network service; deployment includes encapsulating the model, deploying the service, defining interfaces, handling data security, managing concurrent access, and stress-testing the service.
Model packaging: the model and its dependent environment are encapsulated with a containerisation technique such as Docker to ensure consistency and portability across different servers.
Framework selection: HTTP requests and responses are handled with the Flask framework, which provides a simple API to receive requests, process data and return predictions.
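A minimal sketch of such a Flask endpoint follows; the route path, the upload field name and the predict() helper are assumptions (the helper is stubbed here).

```python
# Hedged sketch: a single Flask endpoint returning 11-dimensional scores.
import tempfile
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict(video_path):
    # hypothetical inference helper; a real service would load the trained
    # ViViT regressor and run it on the uploaded video
    return {"pious": 0.0, "exoteric": 0.0}   # stubbed result

@app.route("/api/v1/assess", methods=["POST"])
def assess():
    f = request.files.get("video")
    if f is None:
        return jsonify(error="missing 'video' file field"), 400
    with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
        f.save(tmp.name)
        scores = predict(tmp.name)
    return jsonify(scores)                   # e.g. {"pious": 60.42, ...}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```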
Service deployment: according to actual requirements, the containerised application is deployed to a local or cloud server.
Interface definition and invocation: model services are exposed through REST APIs, with well-defined endpoints (Endpoints) for submitting data for analysis and returning the model's predictions.
Data security: HTTPS is used to encrypt communication between the client and the server, protecting privacy and security during data transmission. An OAuth2 authentication mechanism is implemented to verify user identity and control different users' access rights to the API. To keep the original video secure during processing and storage in scenarios such as mental health assessment, data encryption and secure storage solutions are adopted.
Concurrent access management: a load balancer distributes requests across multiple service instances to improve the availability and scalability of the application. Cloud services typically provide auto-scaling, automatically increasing or decreasing the number of service instances according to load.
Service stress testing: scenarios in which many users access the service simultaneously are simulated with the stress-testing tools JMeter and Locust. Key performance indicators such as latency, throughput and error rate are tracked to evaluate the service under high load, and the application and server configuration are optimised according to the test results to improve concurrency handling and resource utilisation.
Model application: the user enters the front-end page by typing the corresponding URL in a browser, or opens the corresponding applet by scanning a QR code; the robot captures video data through the device's camera; the data are uploaded over TCP/IP to the corresponding interface, which calls the inference service; the algorithm model inside the inference service analyses the uploaded video and returns the analysis results, which the terminal interface then displays.
In a specific application, a 60-second video recording is completed according to text prompts and uploaded to a designated path on the storage server; the trained ViViT video-analysis model runs inference, outputs the 11-dimensional expert-system test result, and returns it to the front-end system page for display as the assessment result. In addition, each piece of assessment information can be queried from the system's corresponding back-end interface.
By combining expert participation, data augmentation and artificial intelligence, the method significantly improves detection efficiency, reduces labour costs, and accurately analyses adolescents' mental health.
It has the following advantages:
1. Expert participation: a group of medical experts analyses the 11 established mental health dimensions and quantifies the video content presented by adolescents in different states, forming an expert sample tag set in one-to-one correspondence with the dataset. This effectively guarantees the authority of the training dataset and lays a solid foundation for subsequent learning by the artificial intelligence model.
2. Data augmentation: quality standardisation and sample diversification of the model's input data are achieved through data cleaning, data augmentation and data updating, and the tag contents corresponding to the tag set are expanded by tag matching, achieving the data-augmentation effect.
3. Artificial intelligence: mainstream video analysis methods in the artificial intelligence field were studied and compared, and a ViViT-based neural network model was finally chosen for video analysis.
It should be noted that all data were obtained in compliance with laws and regulations and with user consent, and are applied lawfully.
Example 2
The embodiment provides a mental health intelligent evaluation system, which comprises:
The acquisition module is configured to acquire a historical behavior video data set and a corresponding scoring tag and is used as a training set;
The training module is configured to perform standardized processing on the historical behavior video dataset according to the difference between the video duration and the standard duration through a sliding window, perform feature extraction on the obtained standard video frame, extract ViViT model video global features, expression features, psychological features and physiological features, form fusion features after splicing, and train a pre-constructed evaluation model according to the fusion features and corresponding scoring tags;
the evaluation module is configured to acquire a behavior video file of the object to be evaluated, and an evaluation result is obtained by adopting the trained evaluation model.
It should be noted that the above modules correspond to the steps described in embodiment 1 and share the same examples and application scenarios as those steps, but are not limited to what embodiment 1 discloses. The modules may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
In further embodiments, there is also provided:
An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method described in embodiment 1. For brevity, the description is omitted here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be directly embodied as a hardware processor executing or executed with a combination of hardware and software modules in the processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein.
A computer program product comprising a computer program which, when executed by a processor, implements the method described in embodiment 1.
The present invention also provides at least one computer program product tangibly stored on a non-transitory computer-readable storage medium. The computer program product comprises computer executable instructions, such as instructions comprised in program modules, being executed in a device on a real or virtual processor of a target to perform the processes/methods as described above. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. In various embodiments, the functionality of the program modules may be combined or split between program modules as desired. Machine-executable instructions for program modules may be executed within local or distributed devices. In distributed devices, program modules may be located in both local and remote memory storage media.
Computer program code for carrying out methods of the present invention may be written in one or more programming languages. These computer program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the computer or other programmable data processing apparatus, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server.
In the context of the present invention, computer program code or related data may be carried by any suitable carrier to enable an apparatus, device or processor to perform the various processes and operations described above. Examples of carriers include signals, computer readable media, and the like. Examples of signals may include electrical, optical, radio, acoustical or other form of propagated signals, such as carrier waves, infrared signals, etc.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. A method for intelligently assessing mental health, comprising:
acquiring a historical behavior video dataset and corresponding scoring tags, and using them as a training set;
standardising the historical behavior video dataset via a sliding window according to the difference between the video duration and a standard duration, performing feature extraction on the obtained standard video frames to extract ViViT video global features, expression features, psychological features and physiological features, concatenating these into fusion features, and training a pre-constructed evaluation model with the fusion features and the corresponding scoring tags;
and acquiring a behavior video file of the subject to be assessed, and obtaining an assessment result with the trained evaluation model.
2. The mental health intelligent assessment method according to claim 1, wherein standardising the historical behavior video dataset comprises: cleaning and segmenting the videos according to the video duration, the set standard duration and the sliding window, and using the resulting videos within the standard duration together with the segments derived from videos exceeding the standard duration as the video training set.
3. The intelligent assessment method according to claim 2, wherein the process of cleaning the video file comprises:
setting a first threshold and a second threshold, wherein the first threshold is smaller than the second threshold;
if the video duration is smaller than the first threshold, deleting the current video;
if the video duration is greater than the first threshold and smaller than the second threshold, retaining the current video;
if the video duration is greater than the second threshold, segmenting the current video according to the set standard duration and the sliding window.
4. The method for intelligent assessment of mental health as claimed in claim 3, wherein the segmentation comprises: segmenting according to the set standard duration, and applying the set sliding window to the portion exceeding the standard duration, thereby obtaining several segments, each of which is treated as an independent data point.
5. The intelligent assessment method of mental health according to claim 1, wherein the evaluation model adopts a modified ViViT model, in which a fully connected layer is added to the ViViT model as a regression layer, adjusting the model's output layer so that it outputs continuous label scores instead of classification labels;
the evaluation model is trained with a multi-modal loss function of the form L = Σ_{i=1}^{n} α_i · MSE_i, where MSE_i is the loss function of the i-th modality, α_i is the weight of the i-th modality, and n is the number of modalities, i.e. the number of evaluation index dimensions.
6. The intelligent mental health assessment method according to claim 1, wherein the feature extraction process comprises:
extracting ViViT video global features using a Transformer architecture;
After face recognition is carried out on the standard video frame, a convolutional neural network is adopted to extract expression features from the face information;
Extracting psychological characteristics from continuous standard video frames by adopting an optical flow method so as to reflect behavior and posture information;
Converting a standard video frame into a gray image, and then calculating the average value of pixel intensity of each frame to obtain a time sequence; and performing fast Fourier transform on the time sequence to obtain a frequency domain signal, wherein the frequency domain signal is used as a physiological characteristic.
7. An intelligent mental health assessment system, comprising:
an acquisition module, configured to acquire a historical behavior video dataset and corresponding scoring tags and use them as a training set;
a training module, configured to standardise the historical behavior video dataset via a sliding window according to the difference between the video duration and a standard duration, perform feature extraction on the obtained standard video frames to extract ViViT video global features, expression features, psychological features and physiological features, concatenate these into fusion features, and train a pre-constructed evaluation model with the fusion features and the corresponding scoring tags;
an evaluation module, configured to acquire a behavior video file of the subject to be assessed and obtain an assessment result with the trained evaluation model.
8. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of any one of claims 1-6.
9. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-6.
CN202410664703.8A 2024-05-27 2024-05-27 Psychological health intelligent assessment method and system Pending CN118507053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410664703.8A CN118507053A (en) 2024-05-27 2024-05-27 Psychological health intelligent assessment method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410664703.8A CN118507053A (en) 2024-05-27 2024-05-27 Psychological health intelligent assessment method and system

Publications (1)

Publication Number Publication Date
CN118507053A true CN118507053A (en) 2024-08-16

Family

ID=92238349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410664703.8A Pending CN118507053A (en) 2024-05-27 2024-05-27 Psychological health intelligent assessment method and system

Country Status (1)

Country Link
CN (1) CN118507053A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118800405A (en) * 2024-09-14 2024-10-18 湖南安智网络科技有限公司 Student mental health monitoring and early warning system based on big data and artificial intelligence
CN119296797A (en) * 2024-12-13 2025-01-10 湖南安智网络科技有限公司 Campus mental health risk assessment platform and method based on artificial intelligence
CN119475100A (en) * 2025-01-09 2025-02-18 中南大学 A health status prediction method based on multi-vital sign time series data



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination