
CN118094216B - Multi-modal model optimization retrieval training method and storage medium - Google Patents


Info

Publication number
CN118094216B
CN118094216B
Authority
CN
China
Prior art keywords
sample
points
neighborhood
distance
data
Prior art date
Legal status
Active
Application number
CN202410055834.6A
Other languages
Chinese (zh)
Other versions
CN118094216A (en)
Inventor
许翔
厉向东
邹宁
Current Assignee
Hangzhou Deep Thinking Artificial Intelligence Co ltd
Original Assignee
Hangzhou Deep Thinking Artificial Intelligence Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Deep Thinking Artificial Intelligence Co ltd
Priority to CN202410055834.6A
Publication of CN118094216A
Application granted
Publication of CN118094216B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2131Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on a transform domain processing, e.g. wavelet transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2123/00Data types
    • G06F2123/02Data types in the time domain, e.g. time-series data

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal model optimization retrieval training method and a storage medium, and relates to the technical field of multi-modal identification. For any two data points P and Q, their coordinates in space are (x_P, y_P, z_P) and (x_Q, y_Q, z_Q), respectively; in three-dimensional space, two dimensions (x, y) may be used for video frames. Their coordinates in time are t_P and t_Q, respectively. A spatio-temporal distance metric D_ST(P, Q) is defined as D_ST(P, Q) = sqrt( D_S(P, Q)² + α · D_T(P, Q)² ). By taking spatial-domain characteristics into account, the extended DBSCAN algorithm provided by the invention can more accurately identify dense regions and noise points in space, thereby obtaining more accurate clustering results. By analyzing spatial-domain characteristics, the extended DBSCAN algorithm can also more clearly define the boundaries of clusters.

Description

Multi-modal model optimization retrieval training method and storage medium
Technical Field
The invention relates to the technical field of multi-modal identification, in particular to a multi-modal model optimization retrieval training method and a storage medium.
Background
Multimodal models are a class of AI architecture models that can simultaneously process and integrate multiple (at least two) types of perceptual data, such as text, images, audio, or electronically processed signals. Common forms include, but are not limited to, the CLIP, ViLBERT, LXMERT, VL-BERT, and UNITER models.
Such models can bridge the physical and digital worlds, generating actions directly from basic world-sensing capabilities and enabling the most natural interaction with the physical world. With the continuous progress of computer technology and the expansion of application scenarios, the demand for multimodal models is also increasing. Meanwhile, with the continuous development of artificial intelligence, more and more multi-modal models of different types are applied to fields such as human-computer interaction.
One example is the emotion multi-modal models used in Virtual Reality (VR) and Augmented Reality (AR) applications (e.g., the AO-BERT model, a multi-modal model based on a Transformer encoder structure and multi-modal masked language modeling): this type of model can create a more realistic and immersive user experience based on feedback of user emotion information. By capturing the user's emotional responses in real time, the system can dynamically adjust the virtual environment or augmented-reality elements to better adapt to the user's emotional state and needs. Similarly, emotion multi-modal models can be applied to vehicle navigation, intelligent medical treatment, big data, smart home, and other fields.
Existing multi-modal models have achieved certain results, but when training on multi-modal, large-scale data training sets, the sample selection (retrieval) methods of traditional multi-modal models often face some problems:
(1) Under traditional multi-modal model training, repeated training on many similar data occurs, and the samples most useful to the model are often difficult to retrieve effectively;
(2) Traditional active-learning sample selection methods often face inefficient data scanning and data position adjustment problems.
That is, the technical problem to be solved is: when facing a large-scale data training set, how to formulate a multi-modal model optimization retrieval training method that solves the problems of position adjustment and repeated training.
Therefore, the invention provides a multi-modal model optimization retrieval training method and a storage medium.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a multi-modal model optimization retrieval training method and a storage medium. The technical scheme of the invention is realized as follows:
In a first aspect, a multi-modal model optimization retrieval training method:
(I) Summary:
The invention aims to formulate a multi-modal model optimization retrieval training method that solves the problems of position adjustment and repeated training. On the basis of the traditional DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise), the invention further extends and modifies it for the characteristics of multi-modal models, so that it can adapt to the spatio-temporal characteristics of a multi-modal training set while retaining the advantages of the DBSCAN algorithm. During training, the information quantity and contribution degree of the training set are measured, and low-value samples that are prone to repeated training are filtered out; finally, the multi-modal model performs intelligent retrieval and optimized training on the training set, i.e., compared with the original training set, sample selection and position adjustment operations can be carried out more quickly.
(II) The technical effects specifically expected to be achieved:
First, for a given multimodal training dataset D:
D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)};
where (x_i, y_i) is the i-th input sample x_i and its corresponding label y_i, and N is the maximum value of i.
Sample x_i is audio data, text data, or video data; namely, the training set includes N samples X = {x_1, x_2, ..., x_N}, where any one sample x_i represents one training datum (data point). For the multi-modal dataset D, the following target problems are introduced:
2.1 position adjustment problem:
For a given sample selection method M(D), where M(·) is an adjustment function for selecting samples and adjusting data positions, the goal of this term is to minimize the time complexity of sample selection and of data position adjustment. The optimization objective is defined as minimizing the complexity (Min):
Min Time(M(D));
where Time(M(D)) represents the time complexity of the sample selection method M(D). This term therefore requires an efficient sample selection strategy that reduces the time complexity while maintaining the accuracy of sample selection.
2.2 Retraining problem:
An information metric function I(x_i) is introduced to measure the information quantity of sample x_i, and a model contribution metric function C(x_i) to measure the contribution of sample x_i to the model. The optimization objective is defined as maximizing the information quantity and contribution (Max):
Max Σ_{i=1}^{N} [ I(x_i) + C(x_i) ];
Thus, an accurate sample measurement strategy for I(x_i) and C(x_i) needs to be formulated.
(III) technical scheme:
3.1 step S1, sample selection strategy:
According to the technical scheme, the DBSCAN algorithm is selected as the sample selection strategy. The traditional DBSCAN algorithm defines a cluster as the maximal set of density-connected points, can divide regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a noisy spatial database.
However, as noted in 2.1 above, the traditional DBSCAN algorithm does not directly consider time-domain and spatial-domain characteristics, so some modification or extension is required to better adapt it to data (samples) having these characteristics.
3.1.1 Step S100, define the space-time neighborhood:
For spatio-temporal data, such as continuous video frames or audio signals, this temporal continuity may be exploited to reduce the number of samples that need to be processed.
A spatio-temporal neighborhood (comprising a spatial neighborhood and a temporal neighborhood) is defined via a spatio-temporal distance measure. The neighborhood considers both spatial and temporal proximity, representing the time-domain and spatial-domain properties of sample x_i, respectively; it is a weighted combination of the spatial and temporal distances. This spatio-temporal distance metric is used to determine the ε-neighborhood of each sample (data point).
3.1.1.1 Step S1000, combining the spatial distance and the temporal distance:
For any two data points P and Q, their coordinates in three-dimensional space are (x_P, y_P, z_P) and (x_Q, y_Q, z_Q), respectively; for video frames, two dimensions (x, y) may be used. Their coordinates in time are t_P and t_Q, respectively. The spatio-temporal distance metric D_ST(P, Q) is defined as follows:
D_ST(P, Q) = sqrt( D_S(P, Q)² + α · D_T(P, Q)² );
where D_S(P, Q) is the spatial distance between points P and Q, which can be calculated using the Euclidean distance:
D_S(P, Q) = sqrt( (x_Q − x_P)² + (y_Q − y_P)² + (z_Q − z_P)² );
for two-dimensional space (e.g., video frames), this reduces to:
D_S(P, Q) = sqrt( (x_Q − x_P)² + (y_Q − y_P)² );
D_T(P, Q) is the temporal distance between points P and Q, formulated using the time difference:
D_T(P, Q) = |t_Q − t_P|;
and α is a weight factor for balancing the importance of the spatial and temporal distances. If temporal continuity is more important, α may be set relatively large; if spatial proximity is more important, α may be set relatively small.
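A minimal MATLAB sketch of this metric (the function name is illustrative, and the square-root combination form follows the reconstruction above):
% Minimal sketch: spatio-temporal distance D_ST between two points P and Q.
% sP, sQ: spatial coordinate row vectors; tP, tQ: scalar time coordinates.
function d = spatioTemporalDist(sP, tP, sQ, tQ, alpha)
    dS = norm(sQ - sP);              % Euclidean spatial distance D_S(P,Q)
    dT = abs(tQ - tP);               % temporal distance D_T(P,Q)
    d  = sqrt(dS^2 + alpha * dT^2);  % combined metric D_ST(P,Q)
end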
3.1.1.2 Step S1001, capture and determine the neighborhood range:
Once the spatio-temporal distance metric D_ST(P, Q) is defined, it can be used to determine the ε-neighborhood of each sample x_i (data point). For a given data point P and distance threshold ε, point Q belongs to the ε-neighborhood of P if and only if:
D_ST(P, Q) < ε;
the spatiotemporal neighborhood thus defined can take into account both spatial and temporal proximity of the data points, thereby capturing more accurately the characteristics of the spatiotemporal data.
Here the ε-neighborhood is a core concept of the DBSCAN algorithm: it defines a hypersphere of radius ε centered on a given sample. If one sample point is within the ε-neighborhood of another, the two sample points are considered "density reachable" from each other (the "density-reachable" concept of the DBSCAN algorithm). Through density reachability, the DBSCAN algorithm is able to identify regions of sufficient density and divide them into different clusters.
The neighborhood range of each sample is determined using the spatio-temporal neighborhood defined above. This allows only samples spatio-temporally close to the current sample to be considered, rather than the entire dataset. For each neighborhood, the sample with the highest density or the sample nearest to the other samples is selected as the representative. In this way the number of samples to be processed can be reduced, thereby reducing the time complexity of sample selection.
3.1.2 Step S101, execute the multi-metric DBSCAN algorithm:
The proposed extended DBSCAN algorithm supports multiple distance metrics and is therefore referred to as the multi-metric DBSCAN algorithm. For each sample (data point), its spatial and temporal neighborhoods must be considered simultaneously, and a point must meet the density requirement under both metrics to be considered a core point. Additional parameters are therefore required to balance the importance of the spatial and temporal metrics, and a corresponding algorithm is required to handle the combination of multiple metrics.
3.1.2.1 Step S1010, combine distance functions:
The function combines the spatial and temporal distances of the spatial and temporal neighborhoods, respectively, and uses weights to adjust their influence. Given the spatial distance metric D_S(P, Q) and the temporal distance metric D_T(P, Q), the combined distance metric D_C(P, Q) can be defined as follows:
D_C(P, Q) = w_S · D_S(P, Q) + w_T · D_T(P, Q);
where w_S and w_T are the weights of the spatial distance and the temporal distance, respectively, and their sum is 1; D_S(P, Q) is the spatial distance measure, and D_T(P, Q) is the temporal distance measure.
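A minimal MATLAB sketch of this combined function (the linear-combination form follows the definition above; names are illustrative):
% Minimal sketch: combined distance D_C(P,Q) = wS*D_S + wT*D_T, with wS + wT = 1.
function d = combinedDist(sP, tP, sQ, tQ, wS, wT)
    d = wS * norm(sQ - sP) + wT * abs(tQ - tP);
end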
3.1.2.2 Step S1011, cluster analysis:
After the combined distance metric D_C(P, Q) is obtained, the dataset is clustered with the multi-metric DBSCAN algorithm to generate a number of clusters. By considering both the spatial and temporal metrics, more compact and meaningful clustering results can be obtained. After clustering, the center point or a representative sample of each cluster may be selected as training data. This further reduces the number of samples that need to be trained while maintaining sample diversity. The procedure specifically comprises steps S10110 to S10113.
Step S10110, clustering: the dataset is first clustered using the multi-metric DBSCAN algorithm. In this process, the algorithm determines which samples x_i belong to the same cluster based on the combined distance metric D_C(P, Q). The multi-metric DBSCAN algorithm considers the spatial distance and the temporal distance simultaneously and requires a point to meet the density requirement under both metrics to be regarded as a core point. The specific steps for judging whether the density requirement is met are as follows (a minimal sketch is given after this list):
P1, two distance measures D_S and D_T are chosen, representing the spatial and temporal distances respectively, together with two corresponding neighborhood radii ε_S and ε_T and the minimum point number MinPts.
P2, for any two points P and Q in the dataset, calculate their combined distance D_C(P, Q).
P3, determining a neighborhood: for each point P in the dataset, all points in its neighborhood are found.
P4, marking core points: if the neighborhood of point P contains at least MinPts points, then the point P is marked as a core point.
P5, expanding clusters: starting from any unviewed core point, a new cluster is created. Then, all points within the neighborhood of the core point are recursively searched and added to the same cluster. If the point in the neighborhood is also a core point, the cluster continues to be recursively extended.
P6, repeat P5: until all core points have been accessed and their clusters are fully determined.
P7, processing noise points: points that do not belong to any cluster are marked as noise points.
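A minimal MATLAB sketch of the loop P1-P7 (names are illustrative; the dual-radius neighborhood test of P1/P3 is used here, and the combined metric D_C could be substituted):
% Minimal sketch of the multi-metric DBSCAN loop.
% S: N-by-Ds spatial coordinates; t: N-by-1 time stamps.
function labels = multiMetricDBSCAN(S, t, epsS, epsT, minPts)
    N = size(S, 1);
    labels = zeros(N, 1);                % 0 = unassigned, -1 = noise
    DS = squareform(pdist(S));           % pairwise spatial distances
    DT = abs(t - t');                    % pairwise temporal distances
    neigh = (DS <= epsS) & (DT <= epsT); % neighbor under BOTH metrics (P3)
    isCore = sum(neigh, 2) >= minPts;    % P4: mark core points
    cluster = 0;
    for p = 1:N
        if labels(p) ~= 0 || ~isCore(p), continue; end
        cluster = cluster + 1;           % P5: new cluster from an unvisited core
        queue = p;
        while ~isempty(queue)
            q = queue(1); queue(1) = [];
            if labels(q) > 0, continue; end
            labels(q) = cluster;
            if isCore(q)                 % recursively extend through core points
                queue = [queue; find(neigh(q, :) & labels' == 0)'];
            end
        end
    end
    labels(labels == 0) = -1;            % P7: remaining points are noise
end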
Step S10111, determining the center point: for each cluster formed, the center point r_k may be selected as the representative sample. The center point r_k is obtained by averaging all points within the cluster, which may be taken as a weighted average over the spatial and temporal dimensions:
r_k = (1 / |C_k|) · Σ_{x_i ∈ C_k} x_i;
where C_k denotes (the set of points of) the k-th cluster, |C_k| is the number of points in cluster C_k, and x_i ranges over the points in the cluster. If the data points have dimensions in both space and time, this average is calculated separately for each dimension.
Step S10112 (optional step), selecting a representative sample: besides the center point, other methods may be used to determine the representative sample; for instance, the point closest to the center point may be selected, or the most representative point may be chosen based on a density evaluation criterion.
Step S10113 (if step S10112 is not performed, this step is numbered S10112), optimizing the training set: once the representative samples of each cluster are determined, these samples can be used to construct a reduced training set. Let the original training set be:
X = {x_1, x_2, ..., x_N};
where N is the total number of samples x (data points). K clusters are obtained after the multi-metric DBSCAN clustering, recorded as:
C = {C_1, C_2, ..., C_K};
for each cluster C_k, a representative sample r_k is selected; then the optimized training set X_reduced may be composed of or include these representative samples:
X_reduced = {r_1, r_2, ..., r_K};
here, the optimized training set X_reduced contains one representative sample per cluster, so its size is K, which is much smaller than the size N of the original dataset.
The method of selecting the representative sample may take the center point of the cluster. If the center point is defined as the mean of all points within the cluster (a weighted average over the spatial and temporal dimensions), then the center point r_k of the k-th cluster C_k can be calculated as:
r_k = (1 / |C_k|) · Σ_{x_i ∈ C_k} x_i;
where |C_k| denotes the number of points in cluster C_k.
A large number of data points can thus be reduced by clustering to one representative sample per cluster, significantly reducing the size of the training set; i.e., the representative samples of all clusters are combined to form the reduced training set. The size of this training set is typically much smaller than that of the original dataset because it contains only the representative sample of each cluster, not all data points.
Once the optimized training set X_reduced is obtained, it can be used for subsequent machine-learning model training to increase computational efficiency while possibly maintaining some sample diversity.
3.1.3 Step S102, hierarchical clustering:
The spatial and temporal dimensions of the data of sample x_i are clustered independently, the ε-neighborhoods of all samples x_i are determined, and whether each is a core point is decided; recursion is then performed over all core points. This specifically comprises three substeps, S1020 to S1022.
3.1.3.1 Step S1020, spatiotemporal clustering:
First, the spatial coordinates and the temporal coordinates of the data points are clustered using the DBSCAN algorithm:
1) Spatial clustering: for spatial clustering, the DBSCAN algorithm determines clusters based on the spatial distances between data points. The spatial vector of each sample x_i in the optimized training set X_reduced, representing the coordinates of the data point in space, is read. The steps are as follows:
S10200, the DBSCAN algorithm requires two parameters, the neighborhood radius ε_S and the minimum point number MinPts. For each sample x_i ∈ X_reduced, find all points x_j within its ε-neighborhood, i.e., those satisfying the spatial distance D_S(x_i, x_j) ≤ ε_S.
S10201, if the ε-neighborhood of a sample x_i contains at least MinPts points, the sample x_i (i.e., data point) is marked as a core point.
S10202, starting from any core point, recursively finding all points with reachable densities to form a cluster.
S10203, repeat S10202 until all core points have been accessed and their clusters are fully determined.
The spatial distance metric D_S(x_i, x_j) may be calculated using the Euclidean distance formula:
D_S(x_i, x_j) = sqrt( Σ_{d=1}^{D} (s_id − s_jd)² );
where s_id and s_jd are the coordinates of data points x_i and x_j, respectively, in the d-th spatial dimension, and D is the total number of spatial dimensions.
2) Temporal clustering: for temporal clustering, the DBSCAN algorithm determines clusters based on the temporal distance between data points. Likewise, the optimized training set X_reduced is given, but attention is now paid to the temporal attribute of each sample. The temporal distance measure is a time-difference measure.
The steps of the DBSCAN algorithm for temporal clustering are similar to those for spatial clustering, but a different distance metric and different parameters are used. Specifically:
S10204, for each sample x_i ∈ X_reduced, find all points x_j within its ε-neighborhood, i.e., those satisfying the temporal distance D_T(x_i, x_j) ≤ ε_T.
S10205, if at least MinPts points are contained in the epsilon-neighborhood of one sample x i, marking the sample x i (i.e., the data point) as a core point.
S10206, starting from any core point, recursively finding all points with reachable densities to form a cluster.
S10207, repeating S10206 until all core points have been accessed and their clusters are fully determined.
The temporal distance measure D_T(x_i, x_j) may be calculated according to the specific representation of time. If time is represented as a one-dimensional numerical value (e.g., a timestamp), the temporal distance may be the absolute difference between two time points:
D_T(x_i, x_j) = |t_i − t_j|;
where t_i and t_j are the time attributes of data points x_i and x_j, respectively.
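A minimal MATLAB sketch of the two independent clustering passes (the built-in dbscan function requires the Statistics and Machine Learning Toolbox, R2019a or later; variable names are illustrative):
% Minimal sketch: independent spatial and temporal DBSCAN passes over X_reduced.
% Xs: N-by-Ds spatial coordinates; tt: N-by-1 timestamps.
spatialLabels  = dbscan(Xs, epsS, minPts);   % clusters by the spatial metric D_S
temporalLabels = dbscan(tt, epsT, minPts);   % clusters by D_T(xi,xj) = |ti - tj|
% In both passes, the label -1 marks noise points.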
3.1.3.2 Step S1021, merge strategy:
A merging strategy is defined to merge the independent clustering results (clusters) into a coarse spatio-temporal clustering hierarchy. The strategy considers the similarity or consistency between spatial and temporal clusters. Two schemes are given (a minimal sketch follows this list):
1) Based on the number of shared data points: for each spatial cluster C_S and each temporal cluster C_T, calculate the number of data points shared between them, |C_S ∩ C_T|; if the number of shared data points exceeds a threshold T1, the two clusters are merged into one coarse-granularity spatio-temporal cluster.
2) Based on the distance between cluster centers: for each spatial cluster C_S and each temporal cluster C_T, calculate the distance D(c_S, c_T) between their centers; if this distance is less than a threshold T2, the two clusters are merged into one coarse-granularity spatio-temporal cluster. The center point may be the mean or another statistic of all data points within the cluster.
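A minimal MATLAB sketch of the two merge criteria (the thresholds T1 and T2 follow the text; all other names are illustrative):
% Minimal sketch: decide whether spatial cluster cs and temporal cluster ct merge.
% X: N-by-D data matrix; spatialLabels, temporalLabels: N-by-1 cluster labels.
% Criterion 1: number of data points shared by the two clusters.
shared = sum(spatialLabels == cs & temporalLabels == ct);
mergeByOverlap = shared > T1;
% Criterion 2: distance between the cluster centers (means of member points).
centerS = mean(X(spatialLabels == cs, :), 1);
centerT = mean(X(temporalLabels == ct, :), 1);
mergeByCenters = norm(centerS - centerT) < T2;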
3.1.3.3 Step S1022, constructing a fine space-time clustering hierarchical structure:
P1, intra-cluster fine-grained clustering: inside each coarse-grained cluster, the multi-metric DBSCAN algorithm is applied again to perform fine-grained clustering. This step may be repeated several times (i.e., step S1021 is repeated, with a maximum number of iterations specified) to progressively refine the cluster structure.
P2, obtaining the refined spatio-temporal clustering hierarchy: the clustering results of the different levels are combined again through a merging strategy (e.g., based on the number of shared data points or the distance between cluster centers) to obtain the refined spatio-temporal clustering hierarchy. This structure may be represented as a dendrogram for ease of visualization and understanding.
3.1.4 Step S103, feature engineering:
The statistics (including mean, variance, and covariance) of the data point of each sample x_i and its temporal and spatial neighbors are calculated and added as new features to form a (new) feature set, yielding an extended feature set. This specifically comprises three substeps, S1030 to S1032.
3.1.4.1 Step S1030, defining a neighboring point:
For each data point x_i ∈ X_reduced, the set of its neighboring points must first be defined. This set is denoted N(x_i) and contains all data points that are sufficiently close to data point x_i in space and time.
3.1.4.2 Step S1031, calculating statistics:
For each original data point (sample point) x_i and its set of neighboring points N(x_i), the following statistics are calculated:
1) Spatial mean μ_S(x_i): the mean vector of the neighboring points in the spatial dimensions:
μ_S(x_i) = (1 / |N(x_i)|) · Σ_{x_j ∈ N(x_i)} s_j;
where s_j is the spatial coordinate vector of data point x_j and |N(x_i)| is the size of the set of neighboring points.
2) Temporal mean μ_T(x_i): the mean of the neighboring points in the time dimension:
μ_T(x_i) = (1 / |N(x_i)|) · Σ_{x_j ∈ N(x_i)} t_j;
where t_j is the time coordinate of data point x_j.
3) Spatial variance σ_S²(x_i): measures the spatial dispersion of the neighboring points:
σ_S²(x_i) = (1 / |N(x_i)|) · Σ_{x_j ∈ N(x_i)} ‖s_j − μ_S(x_i)‖²;
where ‖s_j − μ_S(x_i)‖ denotes the Euclidean norm over the set N(x_i).
4) Temporal variance σ_T²(x_i): measures the temporal dispersion of the neighboring points:
σ_T²(x_i) = (1 / |N(x_i)|) · Σ_{x_j ∈ N(x_i)} (t_j − μ_T(x_i))².
5) Covariance Cov(x_i) (optional): if there is correlation between the spatial and temporal dimensions, their covariance can be calculated:
Cov(x_i) = (1 / |N(x_i)|) · Σ_{x_j ∈ N(x_i)} s′_j · t′_j;
here s′_j and t′_j are the spatial and temporal coordinates after centering (i.e., after subtracting the respective means), and the centered means μ′_S(x_i) and μ′_T(x_i) are both effectively 0.
3.1.4.3 Step S1032, build new feature set:
The statistics calculated above (mean, variance, covariance) are taken as new features and, together with the features of the original data points, form an extended feature set. This new feature set becomes the input to the DBSCAN algorithm.
The features of the original data point (sample point) x_i are vectorized as:
F(x_i) = [f_1(x_i), f_2(x_i), ..., f_n(x_i)];
where n is the number of original features, and f_1(x_i), f_2(x_i), ..., f_n(x_i) are feature functions mapping data point x_i to different dimensions of the feature space. The statistics described above (spatial mean, temporal mean, spatial variance, temporal variance, and possibly covariance) are then appended to the end of the original feature vector, forming the extended feature vector F_extended(x_i):
F_extended(x_i) = [f_1(x_i), ..., f_n(x_i), μ_S(x_i), μ_T(x_i), σ_S²(x_i), σ_T²(x_i), Cov(x_i)];
If the scales of the original features and the statistics vary greatly (e.g., some features range from 0 to 1 while others range from 1000 to 10000), it may be beneficial to scale the features (e.g., standardize or normalize them) before combining them. This helps ensure that all features have a similar influence in the clustering algorithm.
An extended feature set ultimately results, in which (the data point of) each sample x_i is represented by the extended feature vector F_extended(x_i). This extended feature set can be used directly as input to the DBSCAN algorithm for re-optimization, or directly for training the multimodal model.
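A minimal MATLAB sketch of this feature extension (variable names are illustrative; the neighbors cell array could come from the ε-neighborhood program given in the embodiment below):
% Minimal sketch: extend each point's features with neighborhood statistics.
% S: N-by-Ds spatial coords; t: N-by-1 times; F: N-by-n original features;
% neighbors: N-by-1 cell array of neighbor index lists.
N = size(F, 1);
Ds = size(S, 2);
Fext = zeros(N, size(F, 2) + Ds + 3);   % original features + muS, muT, varS, varT
for i = 1:N
    idx = neighbors{i};
    if isempty(idx), idx = i; end       % fall back to the point itself
    muS  = mean(S(idx, :), 1);          % spatial mean vector
    muT  = mean(t(idx));                % temporal mean
    varS = mean(sum((S(idx, :) - muS).^2, 2));  % spatial variance
    varT = mean((t(idx) - muT).^2);             % temporal variance
    Fext(i, :) = [F(i, :), muS, muT, varS, varT];
end
Fext = normalize(Fext);                 % z-score scaling before clustering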
The advantage of this is: the spatio-temporal characteristics of the data are extracted by feature engineering techniques and encoded as new features. These features can capture important information of the data in the time domain and the space domain, and simultaneously reduce the dimension of the data. The so-called "extended feature set" can be regarded as a "reduced-dimension data set" which allows for faster sample selection and position adjustment operations (because the number of features that need to be processed is reduced).
3.2 Step S2, information metric strategy:
This step selects the LOF algorithm (Local Outlier Factor, LOF) as the information metric function I(x_i) to measure the information quantity of sample x_i, by calculating the local outlier factor of each sample. The core idea of the LOF algorithm is that the density at an outlier (or anomalous point) should be lower than the density at the other points in its neighborhood. Thus, the algorithm determines whether each point is an outlier by comparing its density with that of its neighbors.
3.2.1 Step S200, define neighborhood:
For a given sample x_i and a radius ε (which can be regarded as a threshold in this step), the ε-neighborhood N_ε(x_i) of sample x_i is defined as the set of all sample points whose distance from x_i is less than or equal to ε.
The scheme does not use the k-distance neighborhood of the traditional LOF algorithm but instead uses the ε-neighborhood, so as to establish a complementary relationship with the DBSCAN algorithm. Because the ε-neighborhood given by the DBSCAN algorithm provides an additional parameter (the radius ε), the algorithm can adapt more flexibly to datasets of different densities. By adjusting ε, the scope of the neighborhood can be controlled, thereby capturing local anomalies at different scales.
Also in some cases, the density of the data set may change, and the use of a fixed k-distance neighborhood may not be well suited to such changes. While epsilon-neighbors allow the algorithm to use neighbors of different sizes in areas of different densities to better capture local anomaly patterns.
3.2.2 Step S201, calculate local reachable density (Local Reachability Density):
For sample x_i and any point o within its ε-neighborhood, the reachable distance reachDist(o, x_i) is defined as the maximum of the distance from point o to x_i and ε, namely:
reachDist(o, x_i) = max(|o − x_i|, ε);
The local reachability density lrd_ε(x_i) of sample x_i is defined as the inverse of the average reachable distance from x_i to the points within its ε-neighborhood, namely:
lrd_ε(x_i) = |N_ε(x_i)| / Σ_{o ∈ N_ε(x_i)} reachDist(o, x_i);
if N_ε(x_i) is empty, lrd_ε(x_i) must be defined as infinity to handle this case.
3.2.3 Step S202, calculating local outliers:
The local outlier factor LOF_ε(x_i) of sample x_i is calculated by comparing the local reachability densities of the points within its ε-neighborhood with the local reachability density of sample x_i.
Specifically, it is the average of the ratios of the local reachability densities lrd_ε(o) of all points o within the ε-neighborhood of sample x_i to the local reachability density lrd_ε(x_i) of sample x_i, namely:
LOF_ε(x_i) = (1 / |N_ε(x_i)|) · Σ_{o ∈ N_ε(x_i)} lrd_ε(o) / lrd_ε(x_i);
where |N_ε(x_i)| denotes the number of points within the ε-neighborhood of x_i.
3.2.4 Step S203, interpreting the LOF score as an information measure:
An LOF score greater than 1 indicates that sample x_i is sparser than the neighbors within its ε-neighborhood and is possibly an outlier; a score close to 1 indicates that sample x_i has a density similar to that of the neighbors within its ε-neighborhood; a score less than 1 indicates that sample x_i is denser than the neighbors within its ε-neighborhood. The LOF score is taken as the information measure of sample x_i: a high LOF score means that sample x_i is anomalous with respect to the neighbors within its ε-neighborhood and therefore carries more information. In this case, maximizing the information quantity is equivalent to finding the samples with the highest LOF scores; the samples can therefore be allocated on this basis to form the reduced training set. A minimal sketch of the computation follows.
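A minimal MATLAB sketch of the ε-neighborhood LOF score (the function and variable names are illustrative):
% Minimal sketch: LOF over eps-neighborhoods.
% D: N-by-N precomputed space-time distance matrix; epsilon: neighborhood radius.
function lof = epsLOF(D, epsilon)
    N = size(D, 1);
    lrd = zeros(N, 1);
    neigh = cell(N, 1);
    for i = 1:N
        neigh{i} = setdiff(find(D(i, :) <= epsilon), i);
        if isempty(neigh{i})
            lrd(i) = Inf;                         % convention for empty neighborhoods
        else
            rd = max(D(i, neigh{i}), epsilon);    % reachDist(o, x_i) = max(|o - x_i|, eps)
            lrd(i) = numel(neigh{i}) / sum(rd);   % local reachability density
        end
    end
    lof = zeros(N, 1);
    for i = 1:N
        if isempty(neigh{i}), continue; end
        lof(i) = mean(lrd(neigh{i})) / lrd(i);    % average density ratio
    end
end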
3.3 Step S3, contribution metric strategy:
The scheme selects Distance-Based Weighting as the contribution metric strategy, serving as the contribution metric function C(x_i); i.e., the contribution of a sample point to the model is inversely proportional to its distance from the model center. In other words, sample points closer to the model center contribute more to the model, while sample points farther away contribute less.
3.3.1 Step S300, determining a model center or reference point:
The traditional distance-based weighting method determines a model center or reference point; for this scheme, however, the ε-neighborhood center of each sample x_i is itself the reference point, so no additional model center or reference point needs to be calculated. This reduces the computational burden and establishes a connection with the DBSCAN algorithm.
3.3.2 Step S301, calculate distance:
For each sample x_i, its distance to all other samples is calculated, i.e., it is determined which points fall within the ε-neighborhood of x_i. For any two sample points x_i and o, if the distance between them satisfies |x_i − o| ≤ ε, then o is said to fall within the ε-neighborhood of x_i.
3.3.3 Step S302, calculate contribution:
A contribution function based on the number of sample points in the ε-neighborhood is defined, and the contribution of every sample x_i is calculated. There are two schemes:
1) Scheme one: for each sample x_i, (the output of) its contribution function C(x_i) is proportional to the number of sample points in its ε-neighborhood:
C(x_i) = |N_ε(x_i)|;
where N_ε(x_i) denotes the set of sample points within the ε-neighborhood of sample x_i, and |·| denotes the number of elements in the set.
2) Scheme two: unlike scheme one, this scheme expects the contribution to be related to the inverse of the distance, so a variant of distance weighting can be further considered:
C(x_i) = Σ_{o ∈ N_ε(x_i)} 1 / (|x_i − o| + δ);
where δ is a small positive number (e.g., 0.001) to avoid division by zero.
3.3.4 Step S303, standardization (optional):
If it is desired to normalize the contributions so that they sum to 1 over all sample points, then:
C′(x_i) = C(x_i) / Σ_{x_j ∈ X} C(x_j);
where X denotes the set of all sample points and C′(x_i) is the normalized contribution function. A minimal sketch of both schemes follows.
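A minimal MATLAB sketch of the two contribution schemes and the normalization (names and the value of delta are illustrative):
% Minimal sketch: eps-neighborhood contribution C(x_i) with optional normalization.
% D: N-by-N distance matrix; epsilon: neighborhood radius; scheme: 1 or 2.
function Cn = contribution(D, epsilon, delta, scheme)
    N = size(D, 1);
    C = zeros(N, 1);
    for i = 1:N
        o = setdiff(find(D(i, :) <= epsilon), i);  % points in the eps-neighborhood
        if scheme == 1
            C(i) = numel(o);                       % scheme one: neighbor count
        else
            C(i) = sum(1 ./ (D(i, o) + delta));    % scheme two: inverse-distance sum
        end
    end
    Cn = C / sum(C);                               % normalization of step S303
end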
3.3.5 Step S304, apply:
Based on the calculated or normalized contributions, it can be determined which samples x_i are most important to the model or should be prioritized. During model training, samples x_i may be assigned different weights according to these contributions, thereby optimizing the learning bias of the model. For example, when training a classifier, sample points with higher contributions can be given higher misclassification costs, so that a reduced training set can be formed at lower cost.
3.4 Step S4, forming an optimized training set:
After steps S1 to S3 have been executed, a training set with reduced samples is formed; the multi-modal model is then trained using this training set.
In a second aspect, a storage medium: the storage medium stores program instructions for performing the multi-modal model optimization retrieval training method described above.
Compared with the prior art, the invention has the beneficial effects that:
1. The extended DBSCAN algorithm as a sample selection strategy: in multi-modal model optimization retrieval, the extended DBSCAN algorithm helps select representative sample points, which are usually located in the core regions of the data distribution and thus better reflect the overall structure of the data. By eliminating noise points and boundary points, interference during model training can be reduced, improving the robustness and generalization ability of the model. Meanwhile, the extension of the DBSCAN algorithm enhances the traditional clustering effect, enabling it to process dynamic data more accurately, identify anomalies and boundaries, and perform time-series analysis and event detection. These characteristics give the extended DBSCAN algorithm broad application prospects in various complex data processing and analysis tasks.
2. Clustering of time-domain and spatial-domain characteristics: although the traditional DBSCAN algorithm can find clusters of arbitrary shape in spatial data, the extended DBSCAN algorithm provided by the invention, by further considering spatial-domain characteristics, can more accurately identify dense regions and noise points in space, obtaining more accurate clustering results. By analyzing spatial-domain characteristics, the extended DBSCAN algorithm can more clearly define cluster boundaries, which helps in understanding the spatial distribution of the data more accurately. In time-varying datasets, the distribution and characteristics of the data may change over time; the extended DBSCAN algorithm can recognize and adapt to such changes, capturing the dynamics of the data more accurately. For time-series data, the extended DBSCAN algorithm can help identify repetitive patterns, abnormal events, or periodic behavior. In applications such as monitoring or sensor networks, the ability to identify specific events in the time domain (e.g., sudden activity or changes in periodic behavior) is critical for timely response and decision-making; the extended DBSCAN algorithm can detect these events more effectively by identifying time-domain characteristics.
3. The LOF algorithm as an information metric strategy: the invention evaluates the information quantity or degree of abnormality carried by each sample point. By giving higher attention to sample points with larger information quantity (i.e., higher LOF scores), the importance of these points to the model can be emphasized during training, improving the model's ability to identify anomalous or rare conditions.
4. The distance-based weighting acts as a contribution metric function and as a contribution metric strategy: in multi-modal model optimization retrieval, the invention uses distance-based weighting as a contribution metric function to quantify the contribution of each sample point to the model. By giving higher weight to sample points with closer distance and higher similarity, the learning of the points can be enhanced in the model training process, so that the retrieval performance of the model on similar examples is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that other drawings can be obtained from them without inventive effort by a person skilled in the art.
FIG. 1 is a schematic flow chart of the method of steps S1 to S4 of the present invention;
FIG. 2 is a flow chart of the method of step S1 of the present invention;
FIG. 3 is a flow chart of the method of the present invention in step S2 and step S3;
FIG. 4 is a flow chart of the method of step S4 of the present invention.
Detailed Description
In order that the above objects, features, and advantages of the invention may be more readily understood, a more particular description of the invention is given below with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be embodied in many other forms than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below.
It should be specifically noted that, in this specification, the terms "point", "sample point", and "data point" are collectively denoted by x or x_i. Specifically, x is used as a generic symbol for any "point", "sample point", or "data point", while x_i refers to a particular one. Although these terms are used interchangeably in the literal expressions of different steps due to structural constraints, they all refer to the same thing in the context of this specification, without substantive distinction.
It should likewise be noted that, in this specification, the terms "training set", "dataset", and "time-varying dataset" refer to the same thing in the context of this specification, although different term forms are used interchangeably in the literal expressions of different steps due to structural constraints.
Embodiment one: referring to fig. 1, aiming at the optimization problem of the multi-modal model training set, this embodiment aims to reduce the time complexity, maintain the accuracy of sample selection, and intelligently measure the contribution degree and information quantity of each sample x_i to determine whether to incorporate it into the training set, thereby avoiding unnecessary repeated execution. To achieve this, this embodiment executes the following steps S1 to S3 in order.
In this embodiment, regarding step S1, the sample selection strategy: this step uses the DBSCAN algorithm as the sample selection strategy, but considering the deficiencies of the traditional DBSCAN algorithm in terms of temporal and spatial characteristics, it needs to be extended to accommodate data with these characteristics. Step S1 includes sub-steps S100 to S103.
In the present embodiment, regarding step S100, a spatiotemporal neighborhood is defined: defining a spatio-temporal neighborhood is a key step in optimizing search training when processing multimodal models involving spatio-temporal data. The present embodiment introduces the concept of a spatiotemporal distance metric by combining spatial and temporal distances and determines the epsilon-neighborhood of each sample based thereon. Step S100 specifically includes two sub-steps S1000 to S1001.
Specifically, with respect to step S1000, the combination of the spatial distance and the temporal distance is performed:
In this substep, a spatio-temporal distance measure D_ST(P, Q) is defined, which considers both the spatial distance D_S(P, Q) and the temporal distance D_T(P, Q). Specifically, for any two data points P and Q, their coordinates in space are (x_P, y_P, z_P) and (x_Q, y_Q, z_Q), respectively, and their coordinates in time are t_P and t_Q, respectively. The spatio-temporal distance metric D_ST(P, Q) is:
D_ST(P, Q) = sqrt( D_S(P, Q)² + α · D_T(P, Q)² );
where D_S(P, Q) is the spatial distance between points P and Q, which can be calculated using the Euclidean distance; for points in three-dimensional space:
D_S(P, Q) = sqrt( (x_Q − x_P)² + (y_Q − y_P)² + (z_Q − z_P)² );
for two-dimensional space (e.g., video frames), this reduces to:
D_S(P, Q) = sqrt( (x_Q − x_P)² + (y_Q − y_P)² );
D_T(P, Q) is the temporal distance between points P and Q, formulated using the time difference:
D_T(P, Q) = |t_Q − t_P|;
where the weight factor α is used to balance the importance of the spatial and temporal distances. If, in practice, the user considers temporal continuity more important than spatial proximity for the trend requirements of model training, α may be set to 0.6-1; otherwise, it may be set to 0.1-0.4. If the spatial and temporal distances are to be balanced, the value is set to 0.5.
Specifically, regarding step S1001, the neighborhood range is captured and determined: after defining the spatio-temporal distance metric, this sub-step will use the metric to determine the epsilon-neighborhood of each sample. An epsilon neighborhood refers to the set of all sample points that have a spatiotemporal distance from a given sample less than or equal to a given threshold epsilon. By setting a suitable threshold epsilon, the size of the neighborhood can be controlled, thus balancing the complexity of the computation with the accuracy of sample selection.
That is, once the spatio-temporal distance metric D_ST(P, Q) is defined, it can be used to determine the ε-neighborhood of each sample x_i (data point). For a given data point P and distance threshold ε, point Q belongs to the ε-neighborhood of P if and only if D_ST(P, Q) < ε. In this way, sample points that are spatio-temporally close to the current sample can be captured, rather than the entire dataset being considered. The spatio-temporal neighborhood thus defined takes into account both the spatial and temporal proximity of the data points, thereby capturing the characteristics of the spatio-temporal data more accurately.
It will be appreciated that after the epsilon neighborhood of each sample is determined, the representative samples in each neighborhood may be further selected.
Illustratively, the sample with the highest density or the sample nearest to other samples is selected as a representative; doing so may reduce the number of samples that need to be processed, thereby reducing the time complexity of sample selection.
In summary, a spatio-temporal neighborhood can be effectively defined by performing steps S1000 and S1001, and an ε -neighborhood for each sample is determined based on a spatio-temporal distance metric. This will provide an important basis for sample selection and position adjustment in subsequent steps, thereby improving the efficiency and accuracy of the multimodal model optimized search training method.
It is noted that the ε-neighborhood is defined by the distance threshold ε. Specifically, for a given data point P, its ε-neighborhood contains all other data points whose spatio-temporal distance from point P is less than or equal to ε. In other words, if the spatio-temporal distance D_ST(P, Q) of a data point Q from point P is less than or equal to ε, then point Q is considered part of the ε-neighborhood of point P. The distance threshold ε effectively controls the size of the ε-neighborhood: a larger ε results in a larger neighborhood range and thus contains more data points, while a smaller ε results in a smaller neighborhood range containing fewer data points. By adjusting the value of ε, the size of the neighborhood and the accuracy of sample selection can be balanced to suit different application scenarios and data characteristics. After the ε-neighborhood is determined, this neighborhood information can be used in subsequent steps such as sample selection, clustering, and information-quantity measurement to realize the optimized retrieval training of the multi-modal model.
Further, an execution program (MATLAB) implementing the threshold ε and the ε-neighborhood of steps S1000-S1001 is as follows.
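The following is a minimal MATLAB sketch of such a program (the function and variable names are illustrative):
function neighbors = epsNeighborhood(data, epsilon, alpha)
% data: N-by-D matrix whose last column is the time coordinate.
% epsilon: eps-neighborhood threshold; alpha: weight factor for the time dimension.
    spatial = data(:, 1:end-1);
    t = data(:, end);
    DS = squareform(pdist(spatial));      % N-by-N spatial distance matrix
    DT = abs(t - t');                     % N-by-N temporal distance matrix
    DST = sqrt(DS.^2 + alpha * DT.^2);    % combined space-time distance
    N = size(data, 1);
    neighbors = cell(N, 1);
    for i = 1:N
        idx = find(DST(i, :) <= epsilon); % points within the eps-neighborhood
        neighbors{i} = setdiff(idx, i);   % exclude the point itself
    end
end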
The principle of the above procedure is as follows: data is an N-by-D matrix, where N is the number of data points x_i and D is the number of dimensions (including the time dimension); epsilon is the threshold of the ε-neighborhood; alpha is the weight factor for the time dimension. The spatial distances between all point pairs are calculated using the pdist function, giving an N-by-N spatial distance matrix; the temporal distances between all point pairs are likewise calculated, giving an N-by-N temporal distance matrix; the spatial and temporal distances are then combined. For each point in the dataset, all points with a space-time distance less than or equal to epsilon are found, and the indices of these points are stored in the corresponding element of the neighbors cell array. Finally, each point must be excluded from its own ε-neighborhood, since a point is not normally considered its own neighbor. The function ultimately returns a cell array neighbors, in which each element is an array containing the indices of all points in the ε-neighborhood of the corresponding point. These indices may be used for further data processing below, such as clustering.
In this embodiment, regarding step S101, the multi-metric DBSCAN algorithm is performed: in the processing of multi-modal data, conventional DBSCAN algorithms often support only a single distance metric, which is inadequate for data that includes both spatial and temporal dimensions. To solve this problem, this embodiment proposes an extended DBSCAN algorithm, the multi-metric DBSCAN algorithm, which considers both the spatial and temporal neighborhoods of a data point and treats the point as a core point only when the density requirement is met under both metrics. Step S101 specifically includes sub-steps S1010 to S1011.
Specifically, regarding step S1010, the distance functions are combined: in processing multimodal data, a combined distance function is introduced, taking into account the importance of both the spatial and temporal dimensions. The function combines the spatial and temporal distances and uses weights to adjust their influence. Specifically, given two data points P and Q, their combined distance metric D_C(P, Q) is defined as follows:
D_C(P, Q) = w_S · D_S(P, Q) + w_T · D_T(P, Q);
where D_S(P, Q) represents the spatial distance between points P and Q, and D_T(P, Q) the temporal distance between them; w_S and w_T are the weights of the spatial and temporal distances, respectively, and their sum is 1. These weights reflect the importance of the spatial and temporal metrics in the combined distance. For the assignment of the weights w_S and w_T, see the technical scheme provided in the second embodiment.
Specifically, regarding step S1011, cluster analysis: after the combined distance metric D C (P, Q) is obtained, the dataset can be partitioned into several compact and meaningful clusters by cluster analysis, and representative samples selected therefrom to construct an optimized training set. The method specifically comprises the steps S10110-S10113:
Wherein, regarding step S10110, clustering: the dataset is first clustered using a multi-metric DBSCAN algorithm. In this process, the algorithm will determine which samples belong to the same cluster based on the combined distance metric D C (P, Q). The multi-metric DBSCAN can consider the space distance and the time distance at the same time, and the point is required to meet the density requirement under the two metrics so as to be regarded as a core point; the method comprises the following specific steps:
P1, two distance measures D_S and D_T are selected, namely a spatial distance measure (e.g., Euclidean distance) and a temporal distance measure (e.g., time difference), together with two corresponding neighborhood radii ε_S and ε_T and the minimum point number MinPts.
P2, for any two points P and Q in the dataset, calculate their combined distance D C (P, Q).
P3, determining a neighborhood: for each point P in the dataset, all points in its neighborhood are found.
P4, marking core points: if the neighborhood of point P contains at least MinPts points, then the point P is marked as a core point.
P5, expanding clusters: starting from any unviewed core point, a new cluster is created. Then, all points within the neighborhood of the core point are recursively searched and added to the same cluster. If the point in the neighborhood is also a core point, the cluster continues to be recursively extended.
P6, repeat P5: until all core points have been accessed and their clusters are fully determined.
P7, processing noise points: points that do not belong to any cluster are marked as noise points.
The assignment of the minimum point number MinPts can be achieved by drawing a k-distance graph: for each data point, its distance to its k-th nearest neighbor is calculated, and these distances are plotted. By observing this graph over several rounds of trial, a suitable k value (i.e., "MinPts − 1" in the traditional sense) can be found: the distance change decreases up to a certain k value and increases once that k value is exceeded; the k value that best reflects this phenomenon is taken as the minimum point number MinPts. This method helps identify density variations in the data and set the value of MinPts accordingly. A minimal sketch of the k-distance graph follows.
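A minimal MATLAB sketch of the k-distance graph (knnsearch requires the Statistics and Machine Learning Toolbox; names are illustrative):
% Minimal sketch: plot the k-distance graph used to choose MinPts.
function plotKDistance(X, k)
    [~, d] = knnsearch(X, X, 'K', k + 1);  % k+1 because the nearest hit is the point itself
    kd = sort(d(:, end), 'descend');       % distance of each point to its k-th neighbor
    plot(kd);
    xlabel('points sorted by k-distance');
    ylabel(sprintf('distance to %d-th nearest neighbor', k));
end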
Wherein, regarding step S10111, the center point is determined: for each cluster formed, the center point r_k may be selected as the representative sample. The center point r_k is obtained by averaging all points within the cluster, which may be taken as a weighted average over the spatial and temporal dimensions:
r_k = (1 / |C_k|) · Σ_{x_i ∈ C_k} x_i;
where C_k denotes (the set of points of) the k-th cluster, |C_k| is the number of points in cluster C_k, and x_i ranges over the points in the cluster.
It will be appreciated that this averaging calculation takes into account each dimension in both the spatial and temporal dimensions.
Wherein, regarding step S10112 (optional step), a representative sample is selected: in addition to the center point, other methods may be selected to determine the representative sample.
Preferably, in step S10112, the point closest to the center point is selected as the representative sample. This scheme has the following advantages:
1) Reduced computational complexity: calculating the center point requires the (weighted) average over all points in the cluster across multiple dimensions, whereas selecting the point closest to the center point as the representative sample only requires computing each point's distance to the center and taking the minimum, simplifying the calculation;
2) Improved sample representativeness: there may be points within a cluster that are far from the center yet still important; selecting the point closest to the center as the representative sample ensures that the selected sample is at least positionally close to the center, thereby better representing the central tendency of the cluster to some extent;
3) Reduced noise influence: there may be noise points or outliers in the dataset that interfere with the calculation of the center point. Selecting the point closest to the center as the representative sample helps reduce the influence of these noise points, since noise points are not usually the points closest to the center.
Wherein, regarding step S10113 (if step S10112 above is not performed, this step is regarded as S10112), the training set is optimized: once the representative samples of each cluster have been determined, these samples can be used to construct a reduced training set. Record the original training set as:
X={x1,x2,...,xN};
Where N is the total number of samples x i (data points). K clusters can be obtained after the clustering of the multi-metric DBSCAN, and the K clusters are recorded as:
C={C1,C2,...,CK};
for each cluster C_i, a representative sample r_i is selected; then, the optimized training set X_reduced may be composed of, or include, these representative samples:
Xreduced={r1,r2,...,rK};
here, the optimal training set X reduced contains a representative sample of each cluster, and thus its size is K, which is much smaller than the size N of the original data set. There are a variety of methods for selecting representative samples.
Preferably, the center point of the cluster is selected as the representative sample; the calculation formula of the center point was already given in S10111. If the center point is defined as the mean of all points within the cluster (a weighted average over the spatial and temporal dimensions), then for the k-th cluster C_k the center point r_k can be calculated as:
r_k = (1/|C_k|) · Σ_{x_i∈C_k} x_i;
Where |C_k| represents the number of points in cluster C_k.
It will be appreciated that a large number of data points can be reduced by clustering to a representative sample for each cluster, thereby significantly reducing the size of the training set. I.e., combining representative samples of each cluster, to form a reduced training set. The size of this training set is typically much smaller than the size of the original data set because it contains only representative samples of each cluster, not all data points.
Specifically, the MATLAB execution procedure in step S1011 is as follows:
% Initialize an empty representative sample set
representativeSamples = [];
% For each cluster
for k = 1:size(centers, 1)
    % Find the indices of all data points belonging to the current cluster
    clusterIndices = find(labels == k);
    % If the cluster has no data points, skip it
    if isempty(clusterIndices)
        continue;
    end
    % Extract all data points belonging to the current cluster
    clusterData = data(clusterIndices, :);
    % Calculate the distance of each point to the cluster center using pdist2
    distances = pdist2(clusterData, centers(k, :));
    % Find the index of the point with the smallest distance
    [~, minIndex] = min(distances);
    % Add this point to the representative sample set
    representativeSample = clusterData(minIndex, :);
    representativeSamples = [representativeSamples; representativeSample];
end
It should be noted that the above procedure relies on the MATLAB pdist2 function to calculate point-to-center distances; therefore, the Statistics and Machine Learning Toolbox needs to be installed.
It will be appreciated that once the optimized training set X_reduced is obtained, it can be used for subsequent machine learning model training to increase computational efficiency while maintaining a degree of sample diversity. Because the optimized training set X_reduced holds representative samples that have been measured in both the time domain and the spatial domain, they can represent important features or cluster centers in the original dataset. As previously mentioned, selecting the point closest to the center point is one strategy for determining a representative sample. In the conventional DBSCAN algorithm, the core point is a key concept, referring to a point whose neighborhood contains enough neighbors. Training set X_reduced contains only core points, as they are sufficient to define the structure of the clusters; areas of higher density in the dataset generally correspond to cluster centers. By identifying these density peaks and including them in training set X_reduced, clustering can be performed more efficiently.
It should be further noted that, the reasons why the training set X reduced has an optimization effect on the retrieval and training of the DBSCAN algorithm mainly include the following points:
1) The calculated amount is reduced: the training set X reduced is used for replacing an original data set, so that the number of points to be processed by a DBSCAN algorithm can be remarkably reduced, and the calculation complexity and the memory requirement are reduced;
2) Noise interference is reduced: the representative sample is selected through the minimized filtering strategy, so that the influence of noise points can be reduced in the training set X reduced, and the algorithm can more easily identify a real clustering structure;
3) And (3) improving the clustering quality: if the samples in training set X reduced are well representative of the distribution of the raw data, running DBSCAN on these samples may result in higher quality clustering results;
4) Acceleration convergence: the DBSCAN algorithm will converge to a stable cluster state faster due to the smaller number of samples in the training set X reduced.
In the present embodiment, regarding step S102, hierarchical clustering: in the optimized retrieval training of the multi-modal model, hierarchical clustering is a key step; the data organization structure is further refined by performing independent clustering operations on the spatial dimension and the temporal dimension of the data, so that retrieval efficiency is improved. This step independently clusters the spatial dimension and the temporal dimension of the data of each sample x_i, examines the ε-neighborhood of every sample x_i to determine whether it is a core point, and recurses over all core points. The method specifically comprises three substeps S1020-S1022.
Specifically, with respect to step S1020, spatio-temporal clustering: the space-time clustering plays a key role in the multi-modal model optimization retrieval training. In the step, the spatial coordinates and the time coordinates of the data points are clustered by using a DBSCAN algorithm, so that the organization structure and the retrieval efficiency of the data are further improved. The following describes in detail the embodiments of spatio-temporal clustering:
(1) Spatial clustering: for spatial clustering, the DBSCAN algorithm will determine clusters based on the spatial distances between the data points. The spatial vector of each sample X i in the optimization training set X reduced is read, representing the coordinates of the data point in space. The method comprises the following steps:
S10200, the DBSCAN algorithm requires two parameters, the neighborhood radius ε_S and the minimum point count MinPts: for each sample x_i ∈ X_reduced, find all points within its ε-neighborhood, i.e. all points x_j satisfying the spatial distance D_S(x_i, x_j) ≤ ε_S.
S10201, if the ε-neighborhood of a sample x_i contains at least MinPts points, mark x_i as a core point.
S10202, starting from any core point, recursively find all density-reachable points to form a cluster.
S10203, repeat S10202 until all core points have been accessed and their clusters are fully determined.
Wherein the spatial distance metric D_S(x_i, x_j) may be calculated using the Euclidean distance formula:
D_S(x_i, x_j) = sqrt( Σ_{d=1}^{D} (s_id - s_jd)² );
Where s_id and s_jd are the coordinates of data points x_i and x_j in the d-th spatial dimension, and D is the total number of spatial dimensions.
(2) Time clustering: for temporal clustering, the DBSCAN algorithm will determine clusters based on the temporal distance between data points. Likewise, the optimal training set X reduced is given, but the time properties of each sample should be paid attention to at this time. The temporal distance measure is a temporal difference measure.
The DBSCAN algorithm is similar in step on temporal clustering to spatial clustering, but uses different distance metrics and parameters. Specifically:
S10204, for each sample x_i ∈ X_reduced, find all points within its ε-neighborhood, i.e. all points x_j satisfying the temporal distance D_T(x_i, x_j) ≤ ε_T.
S10205, if the ε-neighborhood of a sample x_i contains at least MinPts points, mark x_i as a core point.
S10206, starting from any core point, recursively find all density-reachable points to form a cluster.
S10207, repeating S10206 until all core points have been accessed and their clusters are fully determined.
Preferably, the temporal distance measure D T(xi,xj) may be calculated according to a specific temporal representation. If time is a one-dimensional numerical representation (e.g., a timestamp), then the time distance may be the absolute difference between two points in time:
DT(xi,xj)=|ti-tj|;
Where t i and t j are the time attributes of data points x i and x j, respectively.
Specifically, regarding steps S10202 and S10206, the two steps are substantially logically identical:
P1, definition of core points: a point is defined as a core point if there are at least MinPts points (including itself) in its epsilon-neighborhood.
P2, density-reachability: if there is a sequence of points p1, p2, ..., pn, where p1 is a core point and, for any i (1 ≤ i < n), pi+1 is within the ε-neighborhood of pi, then pn is said to be density-reachable from p1. This means that another point can be "reached" through a series of adjacent core points.
P3, cluster formation: starting from any one core point, find all points density-reachable from that point and group them into one cluster. This typically involves recursively finding and adding new density-reachable points until no new points can be added to the cluster.
The above P1 to P3 can be realized by Python pseudocode as follows:
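(The pseudocode figure of the original is not reproduced here; the following Python-style sketch reconstructs the logic of P1 to P3 as described below, with an illustrative region_query helper and a 'NOISE' marker.)
def dbscan(data, eps, min_pts):
    labels = {}                             # point index -> cluster id or 'NOISE'
    cluster_id = 0
    for p in range(len(data)):
        if p in labels:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:        # P1: p is not a core point
            labels[p] = 'NOISE'
        else:                               # P3: start a new cluster from this core point
            cluster_id += 1
            expandCluster(data, labels, p, neighbors, cluster_id, eps, min_pts)
    return labels

def expandCluster(data, labels, p, neighbors, cluster_id, eps, min_pts):
    # P2: add every density-reachable point to the current cluster
    labels[p] = cluster_id
    queue = list(neighbors)
    while queue:
        q = queue.pop()
        if labels.get(q) == 'NOISE':
            labels[q] = cluster_id          # border point: reachable but not a core point
        if q in labels:
            continue
        labels[q] = cluster_id
        q_neighbors = region_query(data, q, eps)
        if len(q_neighbors) >= min_pts:     # q is itself a core point: keep expanding
            queue.extend(q_neighbors)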
In the above pseudocode, the expandCluster function is the logic that actually performs "recursively find all density-reachable points from the core point". It expands the current cluster by starting from a given core point, looking up points within the ε-neighborhood of that point, and recursively performing the same operation on those points. If points in the neighborhood are themselves core points and have new density-reachable points, these points are also added to the cluster. It should be noted that clusters in the DBSCAN algorithm are defined based on density connectivity, i.e. any two points within a cluster must be density-connected. This means that even if two points are within each other's ε-neighborhood, they will not be assigned to the same cluster if there are not enough points between them to form a densely connected path.
Specifically, regarding step S1021, the policy is combined: the independent clustering results (clusters) are combined into a coarse spatio-temporal clustering hierarchy. The strategy considers the similarity or consistency between spatial and temporal clusters. Wherein, two schemes are given as follows:
1) Based on the number of shared data points: for each spatial cluster C S and each temporal cluster C T, calculate the number of data points shared between them |c S∩CT |; if the number of shared data points exceeds a threshold T1, the two clusters are merged into one coarse-granularity spatio-temporal cluster. The steps of the scheme are as follows:
P1, calculating the number of shared data points: for each spatial cluster C S and each temporal cluster C T, the number of data points shared between them, i.e., the intersection size |c S∩CT | of the sets C S and C T, is calculated.
P2, setting a sharing threshold value: a threshold T1 is defined that represents the minimum number of data points that two clusters need to share in order to be considered relevant. If the number of shared data points between a pair of spatial clusters C S and temporal clusters C T exceeds a threshold T1, then the two clusters are combined into one coarse-granularity spatio-temporal cluster.
It will be appreciated that the assignment of threshold T1 is limited by the available computing resources; a person skilled in the art needs to exercise judgment according to the actual environment.
The scheme emphasizes the co-occurrence of data points in space and time dimensions, and is applicable to clustering of events or entities closely related in space and time, for example, a certain video and audio are matched according to the rhythm of a picture (frame point position);
2) Based on the distance between cluster centers: for each spatial cluster C S and each temporal cluster C T, calculating the distance D (C S,cT) between them; if this distance is less than the threshold T2, the two clusters are merged into one coarse-granularity spatio-temporal cluster. The center point may be the mean or other statistic of all data points within the cluster.
Unlike the previous scheme, this scheme focuses on the overall distribution of clusters, not just the shared data points. If the center points of two clusters are close enough, then the two clusters may belong to the same space-time context. The specific implementation mode is as follows:
p1, calculating cluster center: for each spatial cluster C S and temporal cluster C T, its center point is calculated. The center point may be the mean, median, or other statistic of all data points within the cluster, depending on the particular distribution and clustering requirements of the data.
P2, calculating the inter-cluster distance: using a distance metric (euclidean distance as provided above), the distance D (c S,cT) between each pair of spatial and temporal cluster center points is calculated.
P3, setting a distance threshold value: a threshold T2 is defined which represents the maximum distance between two cluster centers that needs to be met in order to be considered relevant.
P4, merging clusters: if the center point distance between a pair of spatial clusters C S and temporal clusters C T is less than the threshold T2, the two clusters are merged into one coarse-granularity spatio-temporal cluster.
The scheme emphasizes the overall similarity and proximity of clusters, applicable to clusters of events or entities that have similarly distributed features in both spatial and temporal dimensions.
It will be appreciated that, likewise, the assignment of threshold T2 is limited by the available computing resources; a person skilled in the art needs to exercise judgment according to the actual environment.
It should be noted that, the threshold T1 and the threshold T2 may be assigned according to the scheme provided in the third embodiment.
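By way of a non-limiting sketch, the two merging strategies can be written as follows (clusters are assumed to be Python sets of data-point indices, data a NumPy array, and the function names are illustrative):
import numpy as np

def merge_by_shared_points(spatial_clusters, temporal_clusters, t1):
    # Scheme 1: merge C_S and C_T when the shared count |C_S ∩ C_T| exceeds T1
    merged = []
    for cs in spatial_clusters:
        for ct in temporal_clusters:
            if len(cs & ct) > t1:
                merged.append(cs | ct)
    return merged

def merge_by_center_distance(spatial_clusters, temporal_clusters, data, t2):
    # Scheme 2: merge C_S and C_T when the distance between cluster centers is below T2
    merged = []
    for cs in spatial_clusters:
        for ct in temporal_clusters:
            c_s = data[list(cs)].mean(axis=0)    # center = mean of the member points
            c_t = data[list(ct)].mean(axis=0)
            if np.linalg.norm(c_s - c_t) < t2:
                merged.append(cs | ct)
    return merged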
Specifically, regarding step S1022, a fine space-time clustering hierarchy is constructed:
P1, intra-cluster fine-grained clustering: inside each coarse-grained cluster, the multi-metric DBSCAN algorithm is applied again to perform fine-grained clustering. This step may be repeated a number of times to progressively refine the structure of the clusters. Specifically, after each fine-grained clustering, the obtained finer clusters can be used as new input, and the multi-metric DBSCAN algorithm is applied again for clustering. This process may be repeated multiple times, but in order to avoid excessive refinement and waste of computing resources, a maximum number of loops needs to be formulated. This maximum number of loops may be determined based on the particular application scenario, computing resource, or data set characteristics.
P2, obtaining a thinned space-time clustering hierarchical structure: and obtaining a plurality of layers of clustering results after the intra-cluster fine-grained clustering is completed. In order to integrate these results into a refined spatio-temporal cluster hierarchy, the merging strategy needs to be used again. The merging strategy here may be the same as or similar to the strategy used in step S1021, e.g. merging based on the number of shared data points or the distance between cluster centers. The clustering results of different levels can be gradually combined into a tree-shaped clustering hierarchical structure through a combining strategy. This structure may be represented as a tree graph (dendrogram) in which each node represents a cluster and the hierarchy of nodes represents the degree of refinement of the cluster. Visual representations of the dendrograms may help us better understand and interpret the results of the clustering.
Illustratively, the above-described tree diagram generation program (MATLAB) is:
% Generate data
rng(1); % for repeatability
data = rand(10, 3); % 10 samples, 3 features per sample
% Execute hierarchical clustering
Z = linkage(data, 'ward'); % hierarchical clustering using the Ward method
% Generate the dendrogram
figure;
dendrogram(Z);
% Customize the dendrogram display
set(gca, 'YTickLabel', {}); % remove Y-axis labels
title('Dendrogram of Hierarchical Clustering');
xlabel('Data Points');
ylabel('Distance');
% Save the dendrogram as an image file
saveas(gcf, 'dendrogram.png');
In the above procedure, the linkage function performs hierarchical clustering and returns a link matrix Z that describes clusters that are merged and distances between them during clustering. The dendrogram function then uses this linking matrix to generate a dendrogram.
In the present embodiment, regarding step S103, feature engineering: the statistics (including mean, variance, and covariance) of the data points of each sample x i and their neighbors in time and space are calculated and constructed as new features into a (new) feature set. An extended feature set is obtained. The method specifically comprises three substeps from S1030 to S1032.
Specifically, regarding step S1030, a neighboring point is defined:
For each data point x i∈Xreduced, it is first necessary to define a set of its neighbors. The set of neighboring points is denoted as N (x i), which contains all data points that are sufficiently close in space and time to data point x i. The definition of the neighboring points is based on the spatial and temporal proximity of the data points. This step is the basis for the subsequent calculation of the statistics, as the statistics are calculated based on the relationship of each data point to its neighbors.
Specifically, regarding step S1031, statistics are calculated: once the neighboring points are defined, a series of statistics may be calculated to describe the relationship between the data points and their neighboring points. These statistics include spatial mean, temporal mean, spatial variance, temporal variance, and optionally covariance:
1) Spatial mean μ_S(x_i): calculate the mean vector of the neighboring points in the spatial dimension:
μ_S(x_i) = (1/|N(x_i)|) · Σ_{x_j∈N(x_i)} s_j;
Where s_j is the spatial coordinate vector of data point x_j, and |N(x_i)| is the size of the set of neighboring points.
2) Temporal mean μ_T(x_i): calculate the average of the neighboring points in the time dimension:
μ_T(x_i) = (1/|N(x_i)|) · Σ_{x_j∈N(x_i)} t_j;
Where t_j is the time coordinate of data point x_j.
3) Spatial variance σ_S²(x_i): measures the degree of dispersion of the neighboring points in space:
σ_S²(x_i) = (1/|N(x_i)|) · Σ_{x_j∈N(x_i)} ‖s_j − μ_S(x_i)‖²;
Where ‖s_j − μ_S(x_i)‖ denotes the Euclidean norm of the vector.
4) Temporal variance σ_T²(x_i): measures the degree of dispersion of the neighboring points in time:
σ_T²(x_i) = (1/|N(x_i)|) · Σ_{x_j∈N(x_i)} (t_j − μ_T(x_i))²;
5) Covariance Cov(x_i) (optional): if correlation is considered to exist between the spatial and temporal dimensions of sample x_i (e.g. weather data, satellite navigation trajectory data, remote sensing image data, geotagged social media data, traffic flow data, etc., which contain not only spatial information but also temporal information), their covariance can be calculated:
Cov(x_i) = (1/|N(x_i)|) · Σ_{x_j∈N(x_i)} s'_j · t'_j;
Here, s'_j and t'_j are the spatial and temporal coordinate values after centering (i.e. subtracting the respective means); the spatial mean μ_S(x_i) and temporal mean μ_T(x_i) of the centered values are in practice 0.
It should be noted that these statistics are added as new features to the feature set, resulting in an extended feature set. This extended feature set contains more information about the relationship between the data point and its neighbors, helping to improve the performance of subsequent retrieval tasks.
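As an illustrative sketch of S1030-S1031 (assuming each row of data stores the spatial coordinates followed by a single time coordinate, that neighbors[i] is a non-empty index list for N(x_i) obtained from the ε-neighborhood, and that the per-dimension covariance variant is used; the names are assumptions):
import numpy as np

def neighborhood_stats(data, neighbors, n_spatial):
    # data: rows of [s_1, ..., s_D, t]; neighbors[i]: indices of N(x_i)
    stats = []
    for idx in neighbors:
        s = data[idx, :n_spatial]                     # spatial coordinates of neighbours
        t = data[idx, n_spatial]                      # time coordinate of neighbours
        mu_s = s.mean(axis=0)                         # spatial mean vector
        mu_t = t.mean()                               # temporal mean
        var_s = (np.linalg.norm(s - mu_s, axis=1) ** 2).mean()    # spatial variance
        var_t = ((t - mu_t) ** 2).mean()                          # temporal variance
        cov = ((s - mu_s) * (t - mu_t)[:, None]).mean(axis=0)     # covariance per spatial dim
        stats.append(np.concatenate([mu_s, [mu_t, var_s, var_t], cov]))
    return np.array(stats)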
Specifically, with respect to step S1032, a new feature set is constructed: the statistics (mean, variance, covariance) calculated above are taken as new features, and an extended feature set is formed with the features of the original data points. This new feature set will be the input to the DBSCAN algorithm. First, the features of the raw data point x i need to be vectorized:
F(xi)=[f1(xi),f2(xi),...,fn(xi)];
Where n is the number of original features. The extension is based on the statistics described above (spatial mean, temporal mean, spatial variance, temporal variance, and possibly covariance).
It will be appreciated that f 1(xi),f2(xi),...,fn(xi) is a feature function of the vector, i.e. the original sample x i is mapped to different dimensions in the feature space.
Illustratively, in image processing, f 1(xi),f2(xi),...,fn(xi) represents a conventional method of extracting features of gray values, color channel values, gradient magnitudes, edge directions, and the like of pixels in an image.
Illustratively, in text processing, f 1(xi),f2(xi),...,fn(xi) represents a conventional method of extracting word frequencies, TF-IDF values, n-gram features, etc., of words in text.
These statistics are now appended to the end of the original feature vector, forming an extended feature vector F_extended(x_i):
F_extended(x_i) = [f_1(x_i), f_2(x_i), ..., f_n(x_i), μ_S(x_i), μ_T(x_i), σ_S²(x_i), σ_T²(x_i), Cov(x_i)];
It will be appreciated that each data point x_i will be represented by an extended feature vector F_extended(x_i). This extended feature vector contains both the original features and the newly calculated statistical features, thus providing more information.
It is noted that if the scale of the original features and statistics varies greatly (e.g., some features range from 0 to 1 and others from 1000 to 10000), it may be beneficial to scale the features before merging them. The feature scaling can be performed by conventional normalization or normalization and the like to scale the value range of the feature to a similar scale. This helps to ensure that all features have similar impact in subsequent clustering algorithms.
Illustratively, the values of each feature are scaled to a distribution with a mean of 0 and a variance of 1 using a normalization method.
Preferably, the normalization method is the Z-Score algorithm (also known as standard score normalization). For a given set of feature values, the Z-Score algorithm performs:
F_extended(x_i)_new = (F_extended(x_i)_old − μ) / σ;
Where F_extended(x_i)_old is the original value of the extended feature vector F_extended(x_i) of sample (data point) x_i; F_extended(x_i)_new is the feature value normalized by the Z-Score algorithm; μ is the mean of the set of feature values; σ is the standard deviation (square root of the variance) of the set of feature values.
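A minimal sketch of this Z-Score scaling, applied column-wise to an extended feature matrix (NumPy assumed):
import numpy as np

def z_score(features):
    # scale each feature column to zero mean and unit variance
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma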
In S1032 of this embodiment, an extended feature set is finally obtained, wherein each sample x i is represented by an extended feature vector F extended(xi). This extended feature set can be used directly as an input to the DBSCAN algorithm for re-optimization or directly for training of the multimodal model.
It will be appreciated that feature engineering techniques are a key step in the data preprocessing of the present embodiment, which involves extracting and constructing new features from the raw data that aid in the learning of the machine learning algorithm. This is particularly important in the context of the multimodal data of the present embodiment, as multimodal data typically contains information from different sources (e.g., text, images, sounds, etc.), and such information may have different representations and characteristics. When the spatio-temporal characteristics of the data are extracted with feature engineering techniques and encoded into new features, this means that the DBSCAN algorithm is looking for and utilizing temporal and spatial patterns in the data.
For example, in video analysis, the movement trajectory of an object may contain both spatial location information and temporal information. By calculating statistics (e.g., mean, variance, etc.) of neighboring points in time and space, these patterns can be captured and encoded as new features.
It will be further appreciated that these new features have several important roles:
1) Information capture: new features are able to capture information in the raw data that may not be obvious but is very important in the analysis. For example, in video analysis, the speed or acceleration of movement of an object may be an important feature, but it does not appear directly as the original pixel value.
2) Dimension reduction: by encoding complex data patterns into a more compact feature set, dimension reduction of the data can be achieved. But this does not mean that the information is lost; instead, critical information is preserved (or even highlighted) by a more efficient representation. In this new, lower-dimensional feature space, the similarities and differences between samples are more pronounced, simplifying subsequent learning tasks.
3) Calculation efficiency: reducing the data dimension can greatly improve the computational efficiency. Many machine learning algorithms (especially those involving extensive mathematical operations or iterative optimization) can be very slow when run on high-dimensional data. By reducing the number of features to be processed, the technical scheme provided by the embodiment can accelerate the operations such as sample selection, position adjustment and the like.
4) Robustness enhancement: some machine learning algorithms may suffer "curse of dimensions" in high dimensional space, i.e., as dimensions increase, the data becomes very sparse, resulting in reduced algorithm performance. The dimension is reduced through the feature engineering, so that the robustness of the extended DBSCAN algorithm featuring the embodiment can be enhanced.
It can be understood that, once the extended DBSCAN algorithm can identify the time-domain and space-domain characteristics of each sample x_i, it retains, in contrast to conventional clustering methods such as the K-means algorithm, its original properties: the number of clusters to be formed does not need to be known in advance, clusters of arbitrary shape can be found, and noise points can be identified. The extended DBSCAN algorithm is also insensitive to the order of the samples x_i, i.e. the input order has little effect on the result. For boundary samples between clusters, however, the assignment depends on which cluster is detected first.
It is noted that the so-called "extended feature set" is actually enriching the original data set by adding new features, but it is understood in a broad sense that it can also be regarded as a kind of dimension reduction if the new feature set more effectively represents the underlying structure of the data and does so with fewer features. "dimension reduction" herein does not mean that the number of features is explicitly reduced by techniques such as Principal Component Analysis (PCA), but rather that the data is re-represented in a more compact, informative manner by feature engineering.
Further, from the standpoint of minimizing sample selection and minimizing the time complexity of data position adjustment required by the multimodal model, the spatio-temporal characteristics of the data are extracted by feature engineering techniques and encoded as new features. This process captures important information of the data in both time and space domains, and reduces the dimensionality of the data by constructing a more efficient, compact representation of the features. Such a dimension reduction process helps to reduce the number of features that need to be considered in subsequent processing, thereby reducing the time complexity of sample selection and position adjustment. By taking the computed statistics (e.g., mean, variance, etc.) as new features, an extended feature set is constructed along with the features of the original data points. This extended feature set provides richer, more differentiated information than the original data, enabling the subsequent clustering or classification algorithms to work more efficiently. Because the quality of the feature set is improved, the algorithm can converge to the optimal solution faster when performing sample selection and position adjustment. And meanwhile, the expanded feature set is finally used as input, and the clustering algorithm can be operated in a data space with lower dimensionality due to the previous feature processing and dimension reduction work, so that the computational complexity is reduced. At the same time, the high quality feature set also enables the clustering algorithm to find structures and patterns in the data faster.
In the present embodiment, regarding step S2, the information metric policy: this step executes the LOF algorithm (Local Outlier Factor) as the information metric function I(x_i) to measure the information content of sample x_i by calculating its local outlier factor. The core idea of the LOF algorithm is that the density at an outlier should be lower than the density of the other points in its neighborhood; the algorithm therefore determines whether each point is an outlier by comparing its density with that of its neighbors. The method specifically comprises four substeps S200-S203.
Specifically, regarding step S200, a neighborhood is defined: for a given sample x i and a radius ε (which can be considered a threshold in this step), the ε -neighborhood of sample x i is first defined. This neighborhood contains all sample points that are less than or equal to epsilon from x i. The use of epsilon-neighborhood is chosen over traditional k-distance neighborhood, mainly because it can be complementary to the DBSCAN algorithm.
It will be appreciated that the epsilon parameter in the DBSCAN algorithm allows the size of the neighborhood to be controlled to accommodate data sets of different densities. The sensitivity of the algorithm can be controlled by adjusting the value of epsilon so that it can capture meaningful cluster structures in data of different scales. This flexibility makes the epsilon-neighborhood a powerful tool that complements the DBSCAN algorithm. While conventional DBSCAN algorithms themselves can identify outliers in the data, epsilon-neighborhood based LOF algorithms further provide a quantitative assessment of these outliers (by computing local outliers). This synergy allows us to more fully understand the structure of the data and effectively use outlier information to optimize the training of the model.
It will be appreciated that the epsilon neighborhood in the DBSCAN algorithm provides a parameter of epsilon that allows the algorithm to more flexibly adapt to data sets of different densities. By adjusting epsilon, the scope of the neighborhood can be controlled, thereby capturing local anomalies of different scales. In addition, the epsilon-neighborhood allows the algorithm to better capture local anomaly patterns using different sized neighbors in areas of different densities in the event of a change in dataset density.
Specifically, with respect to step S201, the local reachability density (Local Reachability Density) is calculated:
For sample x_i and any point o within its ε-neighborhood, the reachable distance reachDist(o, x_i) is defined as the maximum between the distance from point o to point x_i and ε, namely:
reachDist(o, x_i) = max(|o − x_i|, ε);
The local reachability density lrd_ε(x_i) of sample x_i is defined as the inverse of the average reachable distance from x_i to the points within its ε-neighborhood, i.e.:
lrd_ε(x_i) = |N_ε(x_i)| / Σ_{o∈N_ε(x_i)} reachDist(o, x_i);
If N_ε(x_i) is empty, lrd_ε(x_i) is defined as infinity to handle this situation.
Specifically, with respect to step S202, the local outlier factor is calculated: the local outlier factor LOF_ε(x_i) of sample x_i is obtained by comparing the local reachability densities of the points within its ε-neighborhood with the local reachability density of sample x_i.
Further, it is the average of the ratios of the local reachability densities lrd_ε(o) of all points in the ε-neighborhood of sample x_i to the local reachability density lrd_ε(x_i) of sample x_i, namely:
LOF_ε(x_i) = (1/|N_ε(x_i)|) · Σ_{o∈N_ε(x_i)} lrd_ε(o) / lrd_ε(x_i);
Where |N_ε(x_i)| represents the number of points within the ε-neighborhood of sample x_i.
It will be appreciated that this ratio reflects the degree of density difference of sample x i relative to the points in its neighborhood.
Specifically, regarding step S203, the LOF score is interpreted as an information amount measure: a LOF score of greater than 1 indicates that sample x i is more sparse than neighbors within its epsilon-neighborhood, possibly outliers; a near 1 indicates that sample x i has a similar density as neighbors within its epsilon-neighborhood; a value of less than 1 indicates that sample x i is denser than neighbors within its epsilon-neighborhood. The LOF score is taken as an informative measure of sample x i.
It will be appreciated that a high LOF score means that sample x_i is anomalous with respect to the neighbors within its ε-neighborhood and therefore carries more information. In this case, maximizing the amount of information is equivalent to finding the samples with the highest LOF scores; the samples can then be allocated on this basis to form the reduced training set.
Specifically, the above-mentioned steps of allocation include the following P1 to P5:
P1, calculate LOF scores for all samples: first, for each sample in the dataset, its LOF score needs to be calculated. This involves determining the neighborhood size epsilon, calculating the local reachable densities, and finally deriving the local outliers for each sample.
P2, sequencing samples: the calculated LOF scores are ranked from high to low or low to high, depending on whether the multimodal model wishes to focus on outliers or common points. If outliers (with higher LOF scores) are of interest, the scores are ranked in descending order.
P3, select threshold or number of samples: depending on the actual requirements, a threshold for the LOF score is defined, or the number of samples that are desired to be retained in the training set is determined directly.
Illustratively, the training selection of the multimodal model only retains samples with a LOF score of the first N%, or selects samples with a LOF score above a certain threshold.
P4, distributing samples to a training set: based on the selection criteria described above, the eligible samples are assigned to a new training set. These samples are considered to carry more information, possibly because they are outliers or represent important patterns in the dataset.
P5, processing the residual samples: for the unselected samples, the processing can be performed according to the actual situation. They may be omitted entirely or may be retained for subsequent further analysis, such as part of a validation set or test set.
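Purely as a sketch of S200-S203 and the allocation steps P1-P4 above (following the ε-neighborhood variant of LOF described here, including the reachDist and empty-neighborhood conventions of S201; the function names and the top-fraction parameter are assumptions):
import numpy as np

def lof_scores(data, eps):
    n = len(data)
    dists = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1))
    neigh = [np.where((dists[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]

    def lrd(i):
        if len(neigh[i]) == 0:
            return np.inf                              # S201: empty-neighborhood convention
        reach = np.maximum(dists[neigh[i], i], eps)    # reachDist(o, x_i) = max(|o - x_i|, eps)
        return len(neigh[i]) / reach.sum()             # inverse of the mean reachable distance

    lrds = np.array([lrd(i) for i in range(n)])
    scores = np.full(n, np.nan)                        # NaN for isolated points
    for i in range(n):
        if len(neigh[i]) > 0:
            scores[i] = (lrds[neigh[i]] / lrds[i]).mean()   # S202: density ratio
    return scores

def select_top_fraction(data, scores, frac=0.1):
    # P2-P4: keep the samples with the highest LOF scores as the reduced training set
    k = max(1, int(len(data) * frac))
    order = np.argsort(-scores)                        # descending; NaNs sort last
    return data[order[:k]]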
In this embodiment, regarding step S3, the contribution metric policy: this scheme selects distance-based weighting (Distance-Based Weighting) as the contribution metric function C(x_i); i.e. the contribution of a sample point to the model is inversely proportional to its distance from the center of the model. In other words, sample points closer to the center of the model contribute more to the model, while sample points farther away contribute less. The method specifically comprises five substeps S300-S304.
Specifically, with respect to step S300, the model center or reference point is determined: the traditional distance-based weighting method first determines a model center or reference point; in this scheme, however, each sample x_i itself serves as the reference point of its own ε-neighborhood, so no additional model center or reference point needs to be calculated. This reduces the computational burden and establishes a connection with the DBSCAN algorithm.
Specifically, with respect to step S301, the distance is calculated: for each sample x_i, its distance to all other samples is calculated to determine which points fall within the ε-neighborhood of x_i. For any two sample points x_i and o, if the distance between them satisfies |x_i − o| ≤ ε, then o is said to fall within the ε-neighborhood of x_i.
Specifically, regarding step S302, the contribution degree is calculated: a contribution function based on the number of sample points in the epsilon-neighborhood is defined, and the contribution of all samples x i is calculated. There are two options:
1) Scheme one: for each sample x i, (the output of) its contribution function C (x i) is proportional to the number of sample points in its epsilon-neighborhood:
C(xi)=|Nε(xi)|;
Where N ε(xi) represents the set of sample points within epsilon-neighborhood of sample x i, |·| represents the number of elements in the set.
It will be appreciated that the output of the contribution function C (x i) in this scheme is equal to the number of sample points within the epsilon-neighborhood of sample x i. The scheme is simple and visual, and can reflect the density information of the sample points in the local area.
2) Scheme two: unlike scheme one, this scheme expects the contribution to be related to the inverse of the distance, so a variant of distance weighting can be further considered:
C(x_i) = Σ_{o∈N_ε(x_i)} 1 / (|x_i − o| + δ);
Where δ is a small positive number to avoid division by zero.
It can be appreciated that the scheme can more finely describe the contribution degree of the sample points to the model, and the influence of distance factors is considered.
It will be appreciated that in either of the above schemes, the output of the contribution function C (x i) is the contribution value of the present x i.
Specifically, regarding step S303, normalization (optional): if it is desired to normalize the contribution so that the sum of the contributions of all sample points is 1, then:
C'(x_i) = C(x_i) / Σ_{x_j∈X} C(x_j);
Where X represents the set of all sample points, and C'(x_i) is the normalized contribution function.
Specifically, regarding step S304, application: finally, during model training, we can determine which samples are most important or prioritized to the model based on the calculated or normalized contribution.
For example, a reduced training set may be formed by assigning a higher misclassification cost to sample points with higher contribution in training of the classifier. By the method, learning bias of the model can be optimized, and performance and generalization capability of the model are improved. Meanwhile, the sample selection strategy based on the contribution degree can help the multi-mode model to more effectively utilize limited computing resources, and training efficiency is improved.
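A compact sketch of S301-S303, covering both contribution schemes and the optional normalization (δ and the function name are illustrative):
import numpy as np

def contributions(data, eps, delta=1e-6, inverse_distance=False):
    n = len(data)
    dists = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1))
    c = np.empty(n)
    for i in range(n):
        mask = (dists[i] <= eps) & (np.arange(n) != i)     # S301: eps-neighborhood of x_i
        if inverse_distance:
            c[i] = (1.0 / (dists[i][mask] + delta)).sum()  # S302 scheme 2: inverse-distance
        else:
            c[i] = mask.sum()                              # S302 scheme 1: |N_eps(x_i)|
    return c / c.sum()                                     # S303: contributions sum to 1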
In this embodiment, regarding step S4, an optimized training set is formed: after steps S1 to S3 are completed, the complexity of the sample set has been minimized (S1) and each sample has been assigned a corresponding contribution and priority (S2-S3); the objective of step S4 is then to train the multimodal model on the optimized training set produced by S1-S3 (containing the samples most valuable for model training), which improves the training efficiency of the multimodal model.
Embodiment two: the present embodiment further provides, on the basis of the first embodiment, assignment schemes for the weights w s and w T:
The weights are automatically determined by minimizing the cluster quality index, requiring the use of a gradient descent algorithm: the cluster quality index is J, which is regarded as an objective function with respect to the weights w s and w T, and is denoted J (w s,wT). This function is a criterion based on a cluster compactness evaluation. The goal is to find a set of weights such that J is minimized.
It should be noted that the goal of the clustering algorithm is to aggregate similar data points into the same cluster while making the data points as different between clusters as possible. While closeness (Cohesion) is an important aspect of evaluating clustering effects, it measures the similarity or proximity between data points within the same cluster. Therefore, the present embodiment preferably uses the compactness as an evaluation criterion of the objective function J (w s,wT).
Preferably, for compactness as the evaluation criterion of the objective function J(w_S, w_T), the present embodiment selects the DBI (Davies-Bouldin Index), which takes into account the average distance within a cluster as well as the distance between different clusters. The smaller the value of DBI, the better the clustering effect generally is. When DBI is taken as the objective function J(w_S, w_T), the objective is to find weights w_S and w_T such that the DBI value obtained after distance calculation and clustering according to these weights is minimal. The DBI index is obtained as:
DBI = (1/k) · Σ_{i=1}^{k} max_{j≠i} ( (S_i + S_j) / d(c_i, c_j) );
Where k is the number of clusters; S_i is the average distance between all samples in the i-th cluster and the cluster center c_i; S_j is the average distance between all samples in the j-th cluster and the cluster center c_j; d(c_i, c_j) is the distance between cluster center c_i and cluster center c_j.
When using DBI as the objective function, the present embodiment suggests adding an additional step to enhance accuracy: first calculate the Euclidean distances according to the weights w_S and w_T, then cluster, and finally calculate the value of DBI from the clustering result.
In this embodiment, the gradient descent algorithm finds the minimum value by iteratively updating the weights. In each iteration, the weights are updated in the negative gradient direction of the loss function in order to approach a local minimum. The gradient descent algorithm is used here as an update rule, expressed as:
w_S_new = w_S_old − η · ∂J/∂w_S;
w_T_new = w_T_old − η · ∂J/∂w_T;
Where w_S_new and w_T_new are the values of the weights output by each iteration, and w_S_old and w_T_old are the values of the weights from the previous iteration; η is the learning rate, a small positive number used to control the step size of each update; ∂ is the partial differential symbol; ∂J/∂w_S and ∂J/∂w_T are the partial derivatives (which can also be regarded as gradients) of the loss function J with respect to the weights w_S and w_T, respectively.
It should be noted that the choice of the learning rate η has a great influence on the performance and convergence speed of the gradient descent algorithm, and the specific assignment depends on a balanced allocation between the actually available computing resources and η. If the learning rate is too high, the algorithm may oscillate near the minimum and fail to converge; if the learning rate is too small, convergence is too slow. Therefore, in practical application, a person skilled in the art needs to exercise judgment and tune the learning rate η over multiple trials before assigning it.
It will be appreciated that the key issues here are how to define the cluster quality index J and how to calculate ∂J/∂w_S and ∂J/∂w_T, which depends on the specifically chosen cluster quality evaluation criterion. The present technical scheme defines a custom loss function based on the distances between samples.
Illustratively, the loss function is defined as the average over all samples of the squared combined distances to the other samples in their ε-neighborhoods:
J(w_S, w_T) = (1/N) · Σ_{i=1}^{N} Σ_{x_j∈ε(i)} D_C(x_i, x_j)²;
Where ε(i) is the ε-neighborhood of sample x_i;
It will be appreciated that the above equation computes the average of the squared combined distances from all samples to the other sample points in their ε-neighborhoods, which is used to quantify the closeness (or loss) of the clustering. The contributions of all sample points are summed in the calculation of this average and then divided by the total number of sample points N, so the contribution of each sample point to the loss function is taken into account equally.
It is noted that in this case the partial derivatives of the loss function with respect to the weights need to be calculated, since they represent the gradient direction of the loss function in the weight space. With D_C(x_i, x_j) = w_S·D_S(x_i, x_j) + w_T·D_T(x_i, x_j), the partial derivatives are:
∂J/∂w_S = (2/N) · Σ_{i=1}^{N} Σ_{x_j∈ε(i)} D_C(x_i, x_j) · D_S(x_i, x_j);
∂J/∂w_T = (2/N) · Σ_{i=1}^{N} Σ_{x_j∈ε(i)} D_C(x_i, x_j) · D_T(x_i, x_j);
In particular, the partial derivative represents the sensitivity of the loss function value to changes in the weight. If the partial derivative is positive, increasing the weight will increase the loss function value, so the weight should be decreased; if the partial derivative is negative, increasing the weight will decrease the loss function value, so the weight should be increased. By calculating the partial derivatives of the loss function with respect to the weights and using them in the weight update rule of the gradient descent algorithm, the weights can be iteratively adjusted so that the loss function gradually decreases, thereby optimizing the clustering effect.
That is, once these partial derivatives are calculated, the weights can be iteratively updated using a standard gradient descent algorithm, by which the values of the loss function can be gradually reduced, thereby optimizing the clustering effect. The iterative process will continue until a satisfactory clustering effect is achieved or a stopping criterion is met (e.g. the gradient is small enough or a maximum number of iterations is reached).
Illustratively, a gradient is considered "sufficiently small" if the value of the loss function drops very slowly or hardly any more in several consecutive iterations, meaning that the gradient has been sufficiently small, or that the optimization has approached a local minimum.
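The iterative procedure of this embodiment can be sketched as follows (assuming precomputed distance matrices d_s and d_t and ε-neighborhood index lists; the renormalization keeping w_S + w_T = 1 and the stopping tolerance are added assumptions, not taken from the original):
def fit_weights(d_s, d_t, neigh, eta=0.01, iters=200, tol=1e-6):
    # d_s, d_t: spatial / temporal distance matrices; neigh[i]: eps-neighborhood of sample i
    w_s, w_t = 0.5, 0.5
    n = len(neigh)
    for _ in range(iters):
        g_s = g_t = 0.0
        for i in range(n):
            for j in neigh[i]:
                d_c = w_s * d_s[i, j] + w_t * d_t[i, j]   # combined distance D_C
                g_s += 2.0 * d_c * d_s[i, j] / n          # dJ/dw_s
                g_t += 2.0 * d_c * d_t[i, j] / n          # dJ/dw_t
        if max(abs(g_s), abs(g_t)) < tol:                 # gradient "sufficiently small"
            break
        w_s, w_t = w_s - eta * g_s, w_t - eta * g_t       # negative-gradient update
        total = w_s + w_t
        w_s, w_t = w_s / total, w_t / total               # keep w_s + w_t = 1
    return w_s, w_t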
Embodiment III: the present embodiment further provides, on the basis of the first embodiment, assignment schemes of the threshold T1 and the threshold T2:
in the present embodiment, regarding the threshold T1:
p1, first, for each pair of spatial cluster C S and temporal cluster C T, the number of data points shared between them |c S∩CT | is calculated.
P2, collecting the number of shared data points of all the space clusters and the time cluster pairs to form a distribution.
P3, analyzing the distribution, one of the following unified metrics can be selected as the assignment basis of T1:
1) Mean of the distribution (Mean):
T1 = Mean(|C_S ∩ C_T|) over all distributions;
2) Median of the distribution (Median):
T1 = Median(|C_S ∩ C_T|) over all distributions;
3) A certain percentile (Percentile) of the distribution, e.g. the 75th percentile:
T1 = 75th Percentile(|C_S ∩ C_T|) over all distributions;
In the above formulas, "over all distributions" means the operation is taken over all spatial-temporal cluster pairs; Mean is the mean of the distribution, Median is the median of the distribution, and Percentile is the percentile of the distribution.
In the present embodiment, regarding the threshold T2:
P1, first, for each pair of spatial cluster C_S and temporal cluster C_T, the distance D(C_S, C_T) between their cluster centers is calculated.
And P2, collecting the distances between the centers of all the spatial clusters and the center pairs of the time clusters to form a distribution.
P3, analyzing the distribution, one of the following unified metrics can be selected as the assignment basis of T2:
1) Mean of the distribution (Mean):
T2 = Mean(D(C_S, C_T)) over all distributions;
2) Standard deviation of the distribution (Standard Deviation), used in combination with the mean:
T2 = Mean(D(C_S, C_T)) + k · StdDev(D(C_S, C_T));
Where k is an adjustment factor.
3) A certain percentile (Percentile) of the distribution, e.g. the 90th percentile:
T2 = 90th Percentile(D(C_S, C_T)) over all distributions;
Wherein "over all distributions" means the operation is taken over all cluster-center pairs; Mean is the mean of the distribution, StdDev is the standard deviation of the distribution, and Percentile is the percentile of the distribution.
Embodiment four: as shown in fig. 4, the present embodiment executes a D-S evidence theory algorithm based on the first embodiment, and the algorithm further introduces an adaptive correction policy for the solution of the first embodiment (step S5), so that the training set can measure the historical contribution degree and the historical priority of each sample to correct steps S2 and S3; the step S5 specifically includes four sub-steps S500 to S503.
In the present embodiment, with respect to step S500, first, the evidence set A and the evidence set B are collected:
A = [LOF_ε(x_i)_1, LOF_ε(x_i)_2, ..., LOF_ε(x_i)_N];
Where LOF_ε(x_i)_1, LOF_ε(x_i)_2, ..., LOF_ε(x_i)_N are the historical LOF scores of sample x_i, sampled uniformly at a fixed time step, with at most N entries;
B = [C(x_i)_1, C(x_i)_2, ..., C(x_i)_N];
Where C(x_i)_1, C(x_i)_2, ..., C(x_i)_N are the historical contribution values of the same sample x_i, likewise sampled uniformly at the same time step, also with at most N entries.
In this embodiment, with respect to step S501, evidence A and evidence B are combined by Dempster's combination rule: for a given set of evidence A_1, A_2, ..., A_n (and likewise for evidence B), Dempster's combination rule computes the combined belief assignment Bel(C), where C is a possible hypothesis or state space. The combined uncertainty assignment Pl(C) is also calculated:
Bel(C) = Σ_{A_i∩B_j⊆C} mass(A_i)·mass(B_j) / (1 − K);
Pl(C) = Σ_{A_i∩B_j∩C≠∅} mass(A_i)·mass(B_j) / (1 − K);
Where K = Σ_{A_i∩B_j=∅} mass(A_i)·mass(B_j) is the conflict factor; A_i and B_j are subsets of evidence A and evidence B, respectively; mass is the weight (divided by the average of the evidence); C is the "possible set"; ∅ represents the empty set.
Then, a joint trust function Bel(A∪B) is obtained:
Bel(A∪B) = Bel(A) + Bel(B) − Pl(A ∩ B);
Where Bel(A) and Pl(A) are the belief and uncertainty assignments of evidence set A; Bel(B) and Pl(B) are the belief and uncertainty assignments of evidence set B; Pl(A ∩ B) is the uncertainty assignment of the intersection of evidence A and evidence B:
Pl(A ∩ B) = 1 − Σ_{A_i∩B_j=∅} mass(A_i)·mass(B_j);
Where A_i and B_j are subsets of evidence A and evidence B, respectively; the subtracted sum is the total mass of evidence that does not support A ∩ B, and 1 minus this sum yields Pl(A ∩ B), representing the degree of plausibility of A ∩ B.
Wherein the condition A_i ∩ B_j ≠ ∅ indicates that the intersection of a subset A_i of evidence A with a subset B_j of evidence B is not empty; i.e. it looks for an objectively existing subset A_i whose mass in A is not zero while there is a subset B_j whose mass in B is also not zero, the two subsets intersecting. It will be further appreciated that this condition is responsible for finding elements in the hypothesis set C that are supported in both evidence A and evidence B. If such common elements exist, they are considered when computing the merged belief assignment Bel(C), because they contribute to the support of the hypothesis set C.
It can be understood that, in practice, the joint trust function Bel(A∪B) provides comprehensive evidence information reflecting the degree of support and the uncertainty, across the historical data collected up to the current time, of the outputs of the multi-modal model after training. Through this trust function, the degree of support for different hypotheses or states can be quantified and used to correct the priority and importance of each sample x_i in the training set.
In this embodiment, regarding step S502, the joint trust function Bel(A∪B) is mapped by the sigmoid function to a value in the interval [0, 1], which serves as the correction factor γ:
γ = 1 / (1 + e^(−Bel(A∪B)));
Where e is the base of the natural logarithm, approximately equal to 2.718; the input of the sigmoid function is the output of Bel(A∪B). The sigmoid function is characterized by an output range between [0, 1] regardless of its input.
It will be appreciated that the magnitude of the correction factor γ is affected by Bel (a u B), thereby adjusting the state transfer function in the system. The output value of the sigmoid function approaches 1 when Bel (a ≡b) is large and approaches 0 when Bel (a ≡b) is small. The magnitude of the correction factor γ thus reflects the magnitude of Bel (a u B). This mapping retains the causal relationship that the change in γ remains consistent with the change in Bel (a ≡b). By this correction, the system can introduce a certain uncertainty correction when considering the credibility of the historical evidence, thereby being more flexible to adapt to different situations. This design helps the system to be more adaptive and robust in the face of new data and changes.
In the present embodiment, regarding step S503, correction:
1) For S2: the information metric is corrected as LOF_ε(x_i) ← γ · LOF_ε(x_i);
2) For S3: the contribution metric is corrected as C(x_i) ← γ · C(x_i);
It will be appreciated that multiplying by the correction factor γ is in fact a weighted adjustment of the original function. The purpose of this adjustment is to introduce a dependency on the trust level expressed by the correction factor γ, so that the model is adaptively corrected according to the current evidence (reflected by γ). That is, if γ approaches 1 (supported by evidence A), the priority (information content) of sample x_i has proven repeatedly important, and the multi-modal model should continue to be trained on sample x_i for its rich information content; and vice versa.
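Steps S502-S503 reduce to a few lines; a sketch under the assumption that bel_ab is the scalar output of the joint trust function Bel(A∪B) and that lof and contrib hold the current metric values:
import numpy as np

def correction_factor(bel_ab):
    # S502: map the joint trust value into (0, 1) with the sigmoid function
    return 1.0 / (1.0 + np.exp(-bel_ab))

def corrected_metrics(lof, contrib, bel_ab):
    # S503: weight the current information and contribution measures by gamma
    gamma = correction_factor(bel_ab)
    return gamma * lof, gamma * contrib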
The above examples merely illustrate embodiments of the invention that are specific and detailed for relevant practical applications and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
It will be further appreciated by those of skill in the art that the various example elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the various example elements and steps have been described generally in terms of function in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Those skilled in the art will appreciate that all or a portion of the flow in the methods of the embodiments described above may be accomplished by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium which, when executed, may comprise the flows of the method embodiments as described above. Any reference to memory, storage, database, or other medium used in the embodiments provided by the present application may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

Claims (6)

1. The multi-modal model optimization retrieval training method comprises a multi-modal model and a training set thereof, wherein the training set comprises N samples X = x_1, x_2, ..., x_N, and any sample x_i represents audio, text or video training data, characterized in that the training of the multi-modal model is performed with the following optimized retrieval steps:
S1, sample selection strategy: executing a DBSCAN algorithm to combine the time domain characteristics and the space domain characteristics of all samples x i of the training set, capturing and determining the epsilon-neighborhood range of each sample, and combining the spatial distance and the time distance of each data point through a combined distance function; searching the minimum value of the data points of all samples x i through an optimization strategy, then performing hierarchical clustering analysis on the minimum value and selecting representative samples according to the hierarchical clustering analysis to form a reduced training set;
S100, defining a spatio-temporal neighborhood: the spatio-temporal neighborhood comprises a spatial neighborhood and a temporal neighborhood and considers spatial proximity and temporal proximity simultaneously, representing the time-domain and space-domain features of sample x_i respectively; determine the ε-neighborhood of all samples x_i;
S101, executing the multi-metric DBSCAN algorithm: for all samples x_i, the spatial neighborhood and the temporal neighborhood of sample x_i must be considered simultaneously, and a point must meet the density requirement under both metrics to be regarded as a core point;
In S101, the method includes:
S1010, combined distance function D_C(P, Q): combine the spatial distance of the spatial neighborhood and the temporal distance of the temporal neighborhood, using weights to adjust their influence:
D_C(P, Q) = w_S · D_S(P, Q) + w_T · D_T(P, Q);
wherein w_S and w_T are the weights of the spatial distance and the temporal distance, respectively, and their sum is 1; D_S(P, Q) is the spatial distance metric, representing the spatial distance; D_T(P, Q) is the temporal distance metric, representing the temporal distance;
S1011, cluster analysis: after the combined distance metric D_C(P, Q) is obtained, cluster the data set to generate a plurality of clusters, and select the center point or a representative sample of each cluster as training data;
S102, hierarchical clustering: cluster the spatial dimension and the temporal dimension of the data of sample x_i independently, judge the ε-neighborhood of all samples x_i, and determine whether each is a core point;
S1020, spatio-temporal clustering:
in S1020, for spatial clustering: determine clusters based on the spatial distances between data points; read the spatial vector of each sample x_i in the optimized training set X_reduced, representing the coordinates of the data points in space; the spatial clustering includes S10200 to S10203:
S10200, define a neighborhood radius ε_S and a minimum number of points MinPts; for each sample x_i ∈ X_reduced, find all points within its ε-neighborhood, i.e., the data points x_j satisfying the spatial distance D_S(x_i, x_j) ≤ ε_S;
S10201, if the ε-neighborhood of a sample x_i contains at least MinPts points, mark sample x_i as a core point;
S10202, starting from any core point, recursively find all density-reachable points to form a cluster;
S10203, repeat S10202 until all core points have been visited and their clusters are fully determined;
in S1020, for temporal clustering: determine clusters based on the temporal distances between data points; given the optimized training set X_reduced, focus on the time attribute of each sample; the temporal clustering includes S10204 to S10207:
S10204, for each sample x_i ∈ X_reduced, find all points within its ε-neighborhood, i.e., the data points x_j satisfying the temporal distance D_T(x_i, x_j) ≤ ε_T;
S10205, if the ε-neighborhood of a sample x_i contains at least MinPts points, mark sample x_i as a core point;
S10206, starting from any core point, recursively find all density-reachable points to form a cluster;
S10207, repeat S10206 until all core points have been visited and their clusters are fully determined;
S1021, combining strategy: merge the independent clustering results into a coarse spatio-temporal clustering hierarchy based on the number of shared data points or on the distance between cluster centers; the coarse spatio-temporal clustering hierarchy comprises a plurality of coarse-grained spatio-temporal clusters;
S1022, constructing a fine spatio-temporal clustering hierarchy: within each coarse-grained cluster, apply the multi-metric DBSCAN algorithm again for fine-grained clustering, progressively refining the cluster structure; merge the clustering results of the different levels into a spatio-temporal clustering hierarchy again through the combining strategy described in S1021;
recurse over all core points to form clusters and merge them into the spatio-temporal clustering hierarchy;
S103, feature engineering: compute statistics, in time and in space, of the data points of each sample x_i and of their neighboring points, construct these statistics as new features into an expanded feature set, and then use the expanded feature set as the dimension-reduced data set;
S2, information measurement strategy: for any sample x_i, read the range of its ε-neighborhood, execute the LOF algorithm to compute the local reachable density of the ε-neighborhood of each sample x_i and the local outlier factor of that neighborhood, and form a LOF score accordingly; interpret the LOF score as an information-content measure for any sample x_i;
S3, contribution measurement strategy: for any sample x_i, compute the distance from sample x_i to all other samples, weight the distances through a contribution function, and compute the contribution C(x_i) of each sample x_i, thereby assigning different weights to all samples x_i;
S300, determining a reference point: use the ε-neighborhood center of each sample x_i as a reference point;
S301, calculating a distance: for any two samples x_i and o, if the distance between them satisfies |x_i − o| ≤ ε, then sample o falls within the ε-neighborhood of sample x_i;
S302, calculating the contribution: define a contribution function C(x_i) based on the number of sample points in the ε-neighborhood and compute the contribution of all samples x_i, the contribution function C(x_i) being proportional to the number of sample points in the ε-neighborhood of sample x_i:
C(x_i) = |N_ε(x_i)|;
wherein N_ε(x_i) denotes the set of sample points within the ε-neighborhood of sample x_i, and |·| denotes the number of elements in the set;
after S302 is performed, determine which samples x_i are most important to, or have priority for, the model according to the computed contributions, and assign different weights to the samples x_i according to their contributions during model training;
S4, after S1 to S3 are executed, a sample-reduced training set is formed, and the multi-modal model is trained with this training set.
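For illustration only (not part of the claims): a minimal Python sketch of the combined distance function of S1010 and the core-point test of S101, assuming a Euclidean spatial metric and an absolute-difference temporal metric; the names combined_distance and is_core_point and the dict point representation are illustrative assumptions, not taken from the specification.

```python
import numpy as np

def combined_distance(p, q, w_s=0.5, w_t=0.5):
    """D_C(P, Q) = w_S * D_S(P, Q) + w_T * D_T(P, Q), with w_S + w_T = 1 (S1010).

    p, q: dicts with 'xyz' (spatial coordinates) and 't' (timestamp).
    Euclidean D_S and absolute-difference D_T are assumed metric choices.
    """
    d_s = np.linalg.norm(np.asarray(p["xyz"], float) - np.asarray(q["xyz"], float))
    d_t = abs(p["t"] - q["t"])
    return w_s * d_s + w_t * d_t

def is_core_point(i, points, eps, min_pts, w_s=0.5, w_t=0.5):
    """x_i is a core point if its eps-neighborhood under D_C holds at least
    min_pts points (the density requirement of S101)."""
    neighbours = [j for j, q in enumerate(points)
                  if j != i and combined_distance(points[i], q, w_s, w_t) <= eps]
    return len(neighbours) >= min_pts

# Example: three spatio-temporal points, eps = 1.0, MinPts = 2.
pts = [{"xyz": (0, 0, 0), "t": 0.0},
       {"xyz": (0.3, 0, 0), "t": 0.2},
       {"xyz": (0.5, 0.1, 0), "t": 0.4}]
print(is_core_point(0, pts, eps=1.0, min_pts=2))  # True
```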
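Similarly for the per-dimension clustering loops of S10200 to S10207: a sketch of the grow-from-core-point procedure shared by the spatial and temporal passes. The metric is passed in as a function, so the same loop runs once with D_S and ε_S and once with D_T and ε_T; the function name dbscan and the label convention are illustrative.

```python
def dbscan(points, eps, min_pts, dist):
    """Single-metric DBSCAN pass: mark core points (S10201/S10205), grow each
    cluster from a core point via density reachability (S10202/S10206), and
    stop once every core point has been visited (S10203/S10207).
    Returns per-point cluster labels; -1 marks noise."""
    n = len(points)
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        seeds = [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
        if len(seeds) < min_pts:
            continue                      # not a core point; may still join a cluster later
        labels[i] = cluster
        while seeds:                      # recursive expansion, done iteratively
            j = seeds.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            nb = [k for k in range(n) if k != j and dist(points[j], points[k]) <= eps]
            if len(nb) >= min_pts:        # j is itself a core point: keep growing
                seeds.extend(nb)
        cluster += 1
    return labels

# Spatial pass with eps_S; a temporal pass would use dist=lambda a, b: abs(a - b) on timestamps.
spatial = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0)]
d_s = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(dbscan(spatial, eps=0.5, min_pts=2, dist=d_s))  # [0, 0, 0, -1]
```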
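Likewise for S301 and S302, a sketch of the contribution function C(x_i) = |N_ε(x_i)| and its use as per-sample training weights; normalizing the counts into weights is one plausible reading of "assigning different weights", not a formula prescribed by the claims.

```python
import numpy as np

def contribution_weights(X, eps):
    """C(x_i) = |N_eps(x_i)| (S302), with o in N_eps(x_i) iff |x_i - o| <= eps
    (S301). Returns the contributions normalized into training weights.

    X: (N, d) array of sample feature vectors.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)     # pairwise |x_i - o|
    counts = (dist <= eps).sum(axis=1) - 1   # exclude the point itself
    total = counts.sum()
    return counts / total if total > 0 else np.full(len(X), 1.0 / len(X))

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(contribution_weights(X, eps=0.5))  # [0.5, 0.5, 0.0]: the dense pair carries the weight
```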
2. The multi-modal model optimization retrieval training method of claim 1, wherein S100 includes:
S1000, combining the spatial distance and the temporal distance: for any two data points P and Q, their coordinates in space are (x_P, y_P, z_P) and (x_Q, y_Q, z_Q), respectively;
the spatio-temporal distance metric D_ST(P, Q) is defined as:
D_ST(P, Q) = √( D_S(P, Q)² + D_T(P, Q)² );
wherein D_S(P, Q)² is the squared spatial distance between points P and Q; D_T(P, Q) is the temporal distance between points P and Q;
S1001, capturing and determining the neighborhood range: determine the ε-neighborhood of each sample x_i based on the spatio-temporal distance metric D_ST(P, Q); for a given data point P and distance threshold ε, point Q belongs to the ε-neighborhood of P if and only if D_ST(P, Q) < ε.
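For illustration only: a sketch of the spatio-temporal metric of S1000 and the neighborhood test of S1001 under the reconstruction D_ST(P, Q) = √(D_S(P, Q)² + D_T(P, Q)²) used above; that combined form, and the choice D_T = |t_P − t_Q|, are assumptions where the published formula image is unavailable.

```python
import math

def spatiotemporal_distance(p_xyz, p_t, q_xyz, q_t):
    """D_ST(P, Q) = sqrt(D_S(P, Q)^2 + D_T(P, Q)^2), with Euclidean D_S over
    (x, y, z) and D_T = |t_P - t_Q| (assumed forms)."""
    d_s2 = sum((a - b) ** 2 for a, b in zip(p_xyz, q_xyz))
    d_t = abs(p_t - q_t)
    return math.sqrt(d_s2 + d_t ** 2)

def in_eps_neighbourhood(p_xyz, p_t, q_xyz, q_t, eps):
    """Q belongs to the eps-neighborhood of P iff D_ST(P, Q) < eps (S1001)."""
    return spatiotemporal_distance(p_xyz, p_t, q_xyz, q_t) < eps

# D_ST = sqrt(0.25 + 0.25) ~ 0.707 < 1.0, so Q is inside the neighborhood.
print(in_eps_neighbourhood((0, 0, 0), 0.0, (0.3, 0.4, 0.0), 0.5, eps=1.0))  # True
```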
3. The multi-modal model optimization retrieval training method of claim 2, wherein S1011 includes:
S10110, clustering the data set: determine which samples x_i belong to the same cluster based on the combined distance metric D_C(P, Q); consider the spatial distance and the temporal distance simultaneously, requiring a point to meet the density requirement under both metrics to be regarded as a core point;
S10111, determining the center point: for each formed cluster, select a center point c_k as its representative sample; compute c_k as the average of all points within the cluster, performing a weighted average over the spatial and temporal dimensions:
c_k = (1 / |C_k|) · Σ_{x_i ∈ C_k} x_i;
wherein C_k denotes the k-th cluster, |C_k| is the number of points in cluster C_k, and x_i ranges over the points in the cluster;
S10112, optimizing the training set: K clusters are obtained after clustering:
C = {C_1, C_2, ..., C_K};
a representative sample r_k is selected for each cluster C_k, constituting the optimized training set X_reduced:
X_reduced = {r_1, r_2, ..., r_K};
the optimized training set X_reduced contains one representative sample for each cluster.
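For illustration only: a sketch of S10111 and S10112 under the reconstruction above, computing each cluster center c_k as the mean of its members and keeping the member nearest to that center as the representative r_k; choosing the nearest member (rather than the mean point itself) as the representative is an assumption.

```python
import numpy as np

def reduced_training_set(X, labels):
    """Build X_reduced = {r_1, ..., r_K} (S10112): for each cluster C_k,
    c_k = (1 / |C_k|) * sum of its points (S10111), and the member closest
    to c_k is kept as the representative sample r_k.

    X: (N, d) array; labels: (N,) integer cluster ids (-1 marks noise).
    """
    reps = []
    for k in np.unique(labels):
        if k == -1:                                   # skip DBSCAN noise points
            continue
        members = np.where(labels == k)[0]
        center = X[members].mean(axis=0)              # c_k
        nearest = members[np.argmin(np.linalg.norm(X[members] - center, axis=1))]
        reps.append(nearest)
    return X[np.array(reps)]

X = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.1, 4.2]])
labels = np.array([0, 0, 1, 1])
print(reduced_training_set(X, labels))   # one representative per cluster
```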
4. The multi-modal model optimization retrieval training method of claim 3, wherein in S103 the statistics include the spatial mean μ_S(x_i), the temporal mean μ_T(x_i), the spatial variance σ_S²(x_i), and the temporal variance σ_T²(x_i); together with the features of the raw data points, an expanded feature set is constructed; the features of the original data x_i are vectorized:
F(x_i) = [f_1(x_i), f_2(x_i), ..., f_n(x_i)];
wherein f_1(x_i), f_2(x_i), ..., f_n(x_i) are feature functions and n is the number of original features;
these statistics are appended to the end of the original feature vector, forming the extended feature vector F_extended(x_i):
F_extended(x_i) = [f_1(x_i), ..., f_n(x_i), μ_S(x_i), μ_T(x_i), σ_S²(x_i), σ_T²(x_i)];
the expanded feature set is thus obtained, wherein each sample x_i is represented by an extended feature vector F_extended(x_i).
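For illustration only: a sketch of the extended feature vector F_extended(x_i) of claim 4, appending the neighborhood statistics μ_S, μ_T, σ_S², and σ_T² to the original features; representing the spatial component by a per-sample scalar summary and including x_i itself in its own neighborhood statistics are assumptions.

```python
import numpy as np

def extend_features(F, S, T, neighbourhoods):
    """F_extended(x_i) = [f_1(x_i), ..., f_n(x_i),
                          mu_S(x_i), mu_T(x_i), sigma_S^2(x_i), sigma_T^2(x_i)].

    F: (N, n) original feature matrix; S: (N,) spatial summaries; T: (N,)
    timestamps; neighbourhoods[i]: indices of points in the eps-neighborhood
    of x_i.
    """
    rows = []
    for i, nb in enumerate(neighbourhoods):
        idx = np.append(np.asarray(nb, dtype=int), i)   # neighbours plus x_i itself
        stats = [S[idx].mean(), T[idx].mean(), S[idx].var(), T[idx].var()]
        rows.append(np.concatenate([F[i], stats]))
    return np.vstack(rows)

F = np.array([[1.0, 2.0], [1.5, 2.5], [9.0, 9.0]])
S = np.array([0.0, 0.4, 8.0])
T = np.array([0.0, 0.2, 5.0])
print(extend_features(F, S, T, [[1], [0], []]).shape)  # (3, 6): n=2 plus 4 statistics
```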
5. The multi-modal model optimization retrieval training method of claim 4, wherein S2 includes:
S200, defining a neighborhood: for any given sample x_i and a radius ε, the ε-neighborhood N_ε(x_i) of sample x_i is defined as the set of all sample points at a distance less than or equal to ε from x_i;
S201, calculating the local reachable density: for sample x_i and any point o within its ε-neighborhood, the reachable distance reachDist(o, x_i) is the maximum of the distance from point o to x_i and ε:
reachDist(o, x_i) = max(|o − x_i|, ε);
the local reachable density lrd_ε(x_i) of sample x_i is defined as the inverse of the average reachable distance between sample x_i and all points within its ε-neighborhood:
lrd_ε(x_i) = 1 / ( (Σ_{o ∈ N_ε(x_i)} reachDist(o, x_i)) / |N_ε(x_i)| );
S202, calculating the local outlier factor: LOF_ε(x_i) is the average of the ratios of the local reachable density lrd_ε(o) of each point o within the ε-neighborhood of sample x_i to the local reachable density lrd_ε(x_i) of sample x_i:
LOF_ε(x_i) = (1 / |N_ε(x_i)|) · Σ_{o ∈ N_ε(x_i)} lrd_ε(o) / lrd_ε(x_i);
wherein |N_ε(x_i)| denotes the number of points in the ε-neighborhood of sample x_i;
S203, interpreting the LOF score as an information-content measure: use the LOF scores as evidence to select and weight samples, forming the reduced training set.
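For illustration only: a sketch of the ε-neighborhood LOF of claim 5, with reachDist(o, x_i) = max(|o − x_i|, ε), lrd_ε as the inverse mean reachable distance, and LOF_ε as the mean ratio of neighbour densities; treating neighbour-less points as maximally outlying (infinite score) is an assumption.

```python
import numpy as np

def lof_scores(X, eps):
    """LOF per claim 5: reachDist(o, x_i) = max(|o - x_i|, eps);
    lrd_eps(x_i) = 1 / mean reachDist over N_eps(x_i);
    LOF_eps(x_i) = mean(lrd_eps(o) / lrd_eps(x_i)) over o in N_eps(x_i)."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = [np.where((d[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]

    lrd = np.zeros(n)
    for i in range(n):
        if len(nbrs[i]):
            lrd[i] = 1.0 / np.maximum(d[i, nbrs[i]], eps).mean()

    lof = np.full(n, np.inf)             # isolated points: maximally outlying
    for i in range(n):
        if len(nbrs[i]) and lrd[i] > 0:
            lof[i] = (lrd[nbrs[i]] / lrd[i]).mean()
    return lof

X = np.array([[0.0], [0.1], [0.2], [5.0]])
print(lof_scores(X, eps=0.5))  # [1. 1. 1. inf]: clustered points score 1, the isolated point inf
```

Note that because this variant floors the reachable distance at ε while the neighborhood is capped at ε, every in-neighborhood reachable distance equals ε, so non-isolated points score exactly 1; the classical LOF instead floors by the k-distance of o, which yields graded scores.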
6. A storage medium, characterized in that the storage medium stores program instructions for executing the multi-modal model optimization retrieval training method of any one of claims 1 to 5.
CN202410055834.6A 2024-01-15 2024-01-15 Multi-modal model optimization retrieval training method and storage medium Active CN118094216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410055834.6A CN118094216B (en) 2024-01-15 2024-01-15 Multi-modal model optimization retrieval training method and storage medium

Publications (2)

Publication Number Publication Date
CN118094216A (en) 2024-05-28
CN118094216B (en) 2024-09-27

Family

ID=91155547

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118520020B (en) * 2024-07-23 2024-10-22 杭州茼久网络科技有限公司 Artificial Intelligence Data Aggregation Method Based on Big Data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116525129A (en) * 2023-03-03 2023-08-01 天津大学 A visual analysis method for trajectory spatiotemporal co-occurrence patterns
CN117218380A (en) * 2023-08-23 2023-12-12 中水珠江规划勘测设计有限公司 Dynamic target detection tracking method for unmanned ship remote sensing image

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11408978B2 (en) * 2015-07-17 2022-08-09 Origin Wireless, Inc. Method, apparatus, and system for vital signs monitoring using high frequency wireless signals
CN111047182B (en) * 2019-12-10 2021-12-28 北京航空航天大学 Airspace complexity evaluation method based on deep unsupervised learning
CN116186328A (en) * 2023-01-05 2023-05-30 厦门大学 Video text cross-modal retrieval method based on pre-clustering guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant