CN118551280B - Unbalanced spectral data detection method, device, electronic device and storage medium

Info

Publication number: CN118551280B
Application number: CN202410996293.7A
Authority: CN
Inventors: 石壮威; 毕海; 訾剑臣; 苏昀昊; 梁骁翃; 许高
Original assignee: Ji Hua Laboratory
Current assignee: Ji Hua Laboratory
Priority date: 2024-07-24
Filing date: 2024-07-24
Publication date: 2024-11-12
Anticipated expiration: 2044-07-24
Also published as: CN118551280A

Abstract

The present application belongs to the technical field of spectral data detection and discloses an unbalanced spectral data detection method, device, electronic device and storage medium. The method comprises: acquiring unbalanced spectral data; performing dimension reduction on the unbalanced spectral data and clustering the dimension-reduced unbalanced spectral data with a K-means clustering algorithm to obtain dimension-reduced spectral data; sampling the dimension-reduced spectral data with a variational EM algorithm to obtain spectral sampling data; and inputting the spectral sampling data into a preset spectral detection model to calculate the class result of the unbalanced spectral data. Because the spectral sampling data fed to the preset spectral detection model are obtained with the K-means clustering algorithm and the variational EM algorithm, the unbalanced spectral data can be detected and the detection efficiency for unbalanced spectral data is improved.

Description

Unbalanced spectrum data detection method and device, electronic equipment and storage medium

Technical Field

The application relates to the technical field of detecting spectrum data, in particular to an unbalanced spectrum data detection method, an unbalanced spectrum data detection device, electronic equipment and a storage medium.

Background

In spectroscopic detection, the detection of unbalanced spectroscopic data is a common and challenging problem. Unbalanced spectral data refers to spectral data in which the number of spectral data points of a certain class is far greater than that of other classes, or in which the distribution of spectral data points of a certain class in a feature space is extremely sparse.

To process unbalanced spectrum data, sampling techniques from the field of machine learning are applied: a subset of representative spectrum data points is sampled from the majority-class spectrum data point samples so that the unbalanced spectrum data become balanced, which can effectively improve a model's ability to recognise minority-class spectrum data point samples and thus improve classification accuracy. However, existing spectrum detection methods based on machine learning algorithms have difficulty analysing unbalanced spectrum data accurately and cannot effectively select the most representative spectrum data points, so the intrinsic regularities of the unbalanced spectrum data cannot be described accurately.

Therefore, in order to solve the technical problem that the existing spectrum data detection method cannot effectively select the most representative spectrum data point from the unbalanced spectrum data, and it is difficult to accurately analyze the unbalanced spectrum data, there is a need for an unbalanced spectrum data detection method, an unbalanced spectrum data detection device, an electronic device and a storage medium.

Disclosure of Invention

The application aims to provide an unbalanced spectrum data detection method, device, electronic equipment and storage medium in which spectrum sampling data obtained with a K-means clustering algorithm and a variational EM algorithm are input into a preset spectrum detection model to calculate a class result of the unbalanced spectrum data. This solves the problem that existing spectrum data detection methods cannot effectively select the most representative spectrum data points from unbalanced spectrum data and have difficulty analysing such data accurately, so that unbalanced spectrum data can be classified and identified accurately and rapidly and the detection efficiency for unbalanced spectrum data is improved.

In a first aspect, the present application provides a method for unbalanced spectral data detection, comprising:

acquiring unbalanced spectrum data;

performing dimension reduction on the unbalanced spectrum data, and clustering the dimension-reduced unbalanced spectrum data by using a K-means clustering algorithm to obtain dimension-reduced spectrum data;

sampling the dimension-reduced spectrum data through a variational EM algorithm to obtain spectrum sampling data;

and inputting the spectrum sampling data into a preset spectrum detection model, and calculating to obtain a class result of the unbalanced spectrum data.

With the unbalanced spectrum data detection method provided by the application, spectrum sampling data obtained with the K-means clustering algorithm and the variational EM algorithm are input into the preset spectrum detection model and the class result of the unbalanced spectrum data is calculated. This solves the problem that existing spectrum data detection methods cannot effectively select the most representative spectrum data points from unbalanced spectrum data and have difficulty analysing such data accurately, so that unbalanced spectrum data can be classified and identified accurately and rapidly and the detection efficiency for unbalanced spectrum data is improved.

Optionally, the unbalanced spectrum data comprises spectrum data points; performing dimension reduction on the unbalanced spectrum data and then clustering the dimension-reduced unbalanced spectrum data by using a K-means clustering algorithm to obtain dimension-reduced spectrum data comprises:

Performing dimension reduction on the unbalanced spectrum data to obtain dimension reduced unbalanced spectrum data;

Selecting a plurality of spectrum data points from the unbalanced spectrum data after the dimension reduction as a plurality of initial clustering centers;

and dividing all the spectrum data points into a plurality of spectrum data point groups by the K-means clustering algorithm based on the initial clustering center to obtain dimension-reduction spectrum data.

The unbalanced spectrum data detection method provided by the application can be used for detecting unbalanced spectrum data, spectrum data points in the unbalanced spectrum data are divided into a plurality of spectrum data point groups through a K-means clustering algorithm, the dimension-reduced spectrum data are obtained, and the dimension-reduced spectrum data are sampled, so that the detection efficiency of the unbalanced spectrum data is improved.

Optionally, based on the initial clustering center, dividing all the spectrum data points into a plurality of spectrum data point groups through the K-means clustering algorithm to obtain dimension-reduced spectrum data, including:

Step A1, calculating the distances between the spectrum data points other than the initial clustering centers and the plurality of initial clustering centers;

Step A2, assigning each of the other spectrum data points to the initial clustering center closest to it to form spectrum data point groups;

Step A3, recalculating the clustering center of each spectrum data point group, and reassigning all spectrum data points other than the clustering centers according to their distances to the clustering centers;

Step A4, repeating step A3 until the clustering centers no longer change, and determining the plurality of spectrum data point groups obtained by the division as the dimension-reduced spectrum data.

Optionally, the dimension-reduced spectrum data is sampled by a variational EM algorithm to obtain spectrum sampling data, including:

based on a preset energy functional function, constructing a corresponding conditional probability distribution while eliminating abnormal data from the dimension-reduced spectrum data, so as to calculate a variational probability distribution of the dimension-reduced spectrum data by a sampling method;

iteratively updating the dimension-reduced spectrum data by a gradient descent method, and iterating the conditional probability distribution corresponding to the iteratively updated dimension-reduced spectrum data to obtain a target conditional probability distribution having the smallest KL divergence from the variational probability distribution;

and determining the iterated dimensionality reduction spectrum data corresponding to the target conditional probability distribution as the spectrum sampling data.

With the unbalanced spectrum data detection method provided by the application, the variational EM algorithm removes the abnormal data and then calculates the target conditional probability distribution having the smallest KL divergence from the variational probability distribution. The dimension-reduced spectrum data corresponding to the target conditional probability distribution are taken as the spectrum sampling data, and detecting the spectrum sampling data with the spectrum detection model yields the class of the unbalanced spectrum data, which improves the detection efficiency for unbalanced spectrum data.

Optionally, constructing, based on a preset energy functional function, a corresponding conditional probability distribution while eliminating abnormal data from the dimension-reduced spectrum data, so as to calculate a variational probability distribution of the dimension-reduced spectrum data by a sampling method, includes:

setting the conditional probability distribution of the dimension-reduced spectrum data as a normal distribution containing the preset energy functional function;

calculating the envelope of the preset energy functional function, and removing the spectrum data points of the dimension-reduced spectrum data that are located outside the envelope;

and, with equal sampling quantities of spectrum data points for each spectrum data point group as the constraint and maximization of the normal distribution as the target, sampling the dimension-reduced spectrum data and calculating the variational probability distribution of the dimension-reduced spectrum data.

Optionally, iteratively updating the dimension-reduced spectrum data by a gradient descent method, and iterating the conditional probability distribution corresponding to the iteratively updated dimension-reduced spectrum data to obtain the target conditional probability distribution having the smallest KL divergence from the variational probability distribution, includes:

iteratively updating the dimension-reduced spectrum data by a gradient descent method, and recording the dimension-reduced spectrum data after each iteration to obtain dimension-reduced spectrum data after a plurality of iterations;

calculating a plurality of conditional probability distributions corresponding to the dimension-reduced spectrum data after each iteration, and calculating the KL divergences between the plurality of conditional probability distributions and the variational probability distribution to obtain a plurality of KL divergences;

comparing the magnitudes of the plurality of KL divergences to extract the minimum value among them;

and determining the conditional probability distribution corresponding to the minimum value as the target conditional probability distribution having the smallest KL divergence from the variational probability distribution.

Optionally, the preset spectrum detection model is a random forest model.

In a second aspect, the present application provides an unbalanced spectral data detection device comprising:

The acquisition module is used for acquiring unbalanced spectrum data;

The dimension reduction module is used for performing dimension reduction on the unbalanced spectrum data and then clustering the dimension-reduced unbalanced spectrum data by using a K-means clustering algorithm to obtain dimension-reduced spectrum data;

The sampling module is used for sampling the dimension-reduced spectrum data through a variational EM algorithm to obtain spectrum sampling data;

and the calculation module is used for inputting the spectrum sampling data into a preset spectrum detection model and calculating to obtain a class result of the unbalanced spectrum data.

With the unbalanced spectrum data detection device, spectrum sampling data obtained with the K-means clustering algorithm and the variational EM algorithm are input into the preset spectrum detection model and the class result of the unbalanced spectrum data is calculated. This solves the problem that existing spectrum data detection methods cannot effectively select the most representative spectrum data points from unbalanced spectrum data and have difficulty analysing such data accurately, so that unbalanced spectrum data can be classified and identified accurately and rapidly and the detection efficiency for unbalanced spectrum data is improved.

In a third aspect, the application provides an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, wherein the processor, when executing the computer program, performs the steps of the unbalanced spectrum data detection method described above.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of unbalanced spectral data detection as hereinbefore described.

The beneficial effects are as follows: with the unbalanced spectrum data detection method, device, electronic equipment and storage medium, spectrum sampling data obtained with the K-means clustering algorithm and the variational EM algorithm are input into the preset spectrum detection model and the class result of the unbalanced spectrum data is calculated. This solves the problem that existing spectrum data detection methods cannot effectively select the most representative spectrum data points from unbalanced spectrum data and have difficulty analysing such data accurately, so that unbalanced spectrum data can be classified and identified accurately and rapidly and the detection efficiency for unbalanced spectrum data is improved.

Drawings

Fig. 1 is a flowchart of a method for detecting unbalanced spectral data according to an embodiment of the present application.

Fig. 2 is a schematic structural diagram of an unbalanced spectrum data detection device according to an embodiment of the present application.

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Description of the reference numerals: 1. an acquisition module; 2. a dimension reduction module; 3. a sampling module; 4. a computing module; 301. a processor; 302. a memory; 303. a communication bus.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.

Referring to Fig. 1, Fig. 1 shows an unbalanced spectrum data detection method according to some embodiments of the present application, which includes the following steps:

Step S101, unbalanced spectrum data is obtained;

step S102, after the dimension of the unbalanced spectrum data is reduced, clustering the unbalanced spectrum data after the dimension reduction by using a K-means clustering algorithm to obtain the dimension reduced spectrum data;

Step S103, sampling the dimension-reduced spectrum data through a variational EM algorithm to obtain spectrum sampling data;

Step S104, the spectrum sampling data are input into a preset spectrum detection model, and the class result of unbalanced spectrum data is obtained through calculation.

With the unbalanced spectrum data detection method, spectrum sampling data obtained with the K-means clustering algorithm and the variational EM algorithm are input into the preset spectrum detection model and the class result of the unbalanced spectrum data is calculated. This solves the problem that existing spectrum data detection methods cannot effectively select the most representative spectrum data points from unbalanced spectrum data and have difficulty analysing such data accurately, so that unbalanced spectrum data can be classified and identified accurately and rapidly and the detection efficiency for unbalanced spectrum data is improved.

Specifically, in step S101, unbalanced spectrum data is acquired, where the unbalanced spectrum data is composed of spectrum data points, and the unbalanced spectrum data refers to spectrum data in which the number of spectrum data points in a certain class is far greater than that of spectrum data points in other classes, or spectrum data in which the distribution of spectrum data points in a certain class in a feature space is extremely sparse.
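As a small illustration of the first part of this definition, the following Python sketch counts the spectrum data points per class and reports the ratio between the largest and the smallest class. The labels and the function name `imbalance_ratio` are hypothetical and are not taken from the application.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (class counts, size of largest class / size of smallest class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Hypothetical labels: 95 spectra of class "A" and 5 spectra of class "B".
labels = ["A"] * 95 + ["B"] * 5
counts, ratio = imbalance_ratio(labels)
print(dict(counts), ratio)
```

A ratio far above 1 indicates that one class dominates the data set; the threshold at which data should be treated as unbalanced is not specified in the application.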

Specifically, in step S102, after performing dimension reduction on the unbalanced spectrum data, using a K-means clustering algorithm to cluster the unbalanced spectrum data after dimension reduction to obtain dimension-reduced spectrum data, including:

Selecting a plurality of spectrum data points from the unbalanced spectrum data after dimension reduction as a plurality of initial clustering centers;

based on the initial clustering center, all spectrum data points are divided into a plurality of spectrum data point groups through a K-means clustering algorithm, and dimension reduction spectrum data are obtained.

In step S102, redundant data in the unbalanced spectrum data are removed with an existing dimension reduction method such as principal component analysis (PCA) or linear discriminant analysis (LDA), reducing the spectrum data points with the original features to a certain number of spectrum data points with the most important features, so as to obtain the dimension-reduced unbalanced spectrum data. Existing dimension reduction methods such as principal component analysis and linear discriminant analysis are conventional techniques and are not described in detail here.
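The application does not prescribe a particular implementation of the dimension reduction. Purely as a sketch, the following standard-library Python code finds the first principal component by power iteration on the sample covariance matrix and projects the data onto it. The function names, the toy data and the choice of a single component are assumptions made for illustration.

```python
import math


def pca_first_component(points, iterations=200):
    """Return (mean, unit direction) of the first principal component."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the dominant eigenvector of cov.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(points, mean, direction):
    """Reduce each point to its 1-D coordinate along the given direction."""
    return [sum((p[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for p in points]


# Hypothetical 2-D data lying exactly on the line y = 2x.
data = [[float(x), 2.0 * x] for x in range(5)]
mean, direction = pca_first_component(data)
reduced = project(data, mean, direction)
```

For this toy data the recovered direction is proportional to (1, 2), so one coordinate per point is enough to describe the data without loss.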

The dimension-reduced unbalanced spectrum data are clustered with the K-means clustering algorithm as follows. A plurality of spectrum data points are randomly selected from the dimension-reduced unbalanced spectrum data as initial clustering centers, and every other spectrum data point is assigned to the initial clustering center nearest to it, forming spectrum data point groups built around the initial clustering centers (that is, the clustering center of each spectrum data point group is an initial clustering center). After each round of grouping, the clustering center of each spectrum data point group is re-determined by selecting the spectrum data point closest to the center of the group as the new clustering center, and the other spectrum data points are reassigned to the clustering center nearest to them. This is repeated until no spectrum data point changes group on reassignment; the resulting plurality of spectrum data point groups is the dimension-reduced spectrum data.

Specifically, in step S102, based on the initial clustering center, all spectrum data points are divided into a plurality of spectrum data point groups by a K-means clustering algorithm, so as to obtain dimension-reduced spectrum data, which includes:

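Step A1, calculating the distances between the spectrum data points other than the initial clustering centers and the plurality of initial clustering centers;

Step A2, assigning each of the other spectrum data points to the initial clustering center closest to it to form spectrum data point groups;

Step A3, recalculating the clustering center of each spectrum data point group, and reassigning all spectrum data points other than the clustering centers according to their distances to the clustering centers;

Step A4, repeating step A3 until the clustering centers no longer change, and determining the plurality of spectrum data point groups obtained by the division as the dimension-reduced spectrum data.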

In step S102, the distances between each of the other spectrum data points in the dimension-reduced unbalanced spectrum data and the initial clustering centers are calculated, and each of the other spectrum data points is assigned to the group of the initial clustering center nearest to it. After each assignment, the spectrum data point closest to the central position of each spectrum data point group, determined from the positions of the spectrum data points in that group, is recalculated as the clustering center of the group; the distances between the remaining spectrum data points and the clustering centers are recalculated, and each remaining spectrum data point is assigned to the group of the nearest clustering center. The steps of calculating the clustering centers and assigning the spectrum data points are repeated until the clustering centers no longer change or the membership of each spectrum data point group no longer changes, and the spectrum data point groups finally obtained are determined as the dimension-reduced spectrum data.
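The clustering procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: Euclidean distance is assumed, the "central position" of a group is taken to be its arithmetic mean, and the function name, parameters and toy data are hypothetical. As in the description, every clustering center is itself a spectrum data point.

```python
import math
import random


def kmeans_point_centres(points, k, initial_centres=None, seed=0, max_iter=100):
    """Cluster points; each centre is the data point nearest its group's mean."""
    if initial_centres is None:
        # Random selection of k data points as initial centres.
        initial_centres = random.Random(seed).sample(points, k)
    centres = list(initial_centres)
    groups = []
    for _ in range(max_iter):
        # Assign every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            groups[nearest].append(p)
        # New centre of each group: the data point closest to the group mean.
        new_centres = []
        for c, group in enumerate(groups):
            if not group:  # keep the old centre if a group became empty
                new_centres.append(centres[c])
                continue
            mean = tuple(sum(col) / len(group) for col in zip(*group))
            new_centres.append(min(group, key=lambda q: math.dist(q, mean)))
        # Stop once the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, groups


# Hypothetical dimension-reduced data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, groups = kmeans_point_centres(
    points, 2, initial_centres=[(0.0, 0.0), (0.0, 1.0)])
```

The initial centres are passed explicitly here only to make the example reproducible. Like any K-means variant, the result depends on the initial centres, and an unlucky random choice can converge to a poor grouping.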

Specifically, in step S103, the reduced-dimension spectral data is sampled by a variational EM algorithm, so as to obtain spectral sampling data, which includes:

based on a preset energy functional function, constructing a corresponding conditional probability distribution while eliminating abnormal data from the dimension-reduced spectrum data, so as to calculate a variational probability distribution of the dimension-reduced spectrum data by a sampling method;

iteratively updating the dimension-reduced spectrum data by a gradient descent method, and iterating the conditional probability distribution corresponding to the iteratively updated dimension-reduced spectrum data to obtain a target conditional probability distribution having the smallest KL divergence from the variational probability distribution;

and determining the iterated dimensionality reduction spectrum data corresponding to the target conditional probability distribution as spectrum sampling data.

Specifically, in step S103, constructing, based on a preset energy functional function, a corresponding conditional probability distribution while eliminating abnormal data from the dimension-reduced spectrum data, so as to calculate a variational probability distribution of the dimension-reduced spectrum data by a sampling method, includes:

setting the conditional probability distribution of the dimension-reduced spectrum data as a normal distribution containing the preset energy functional function;

calculating the envelope of the preset energy functional function, and removing the spectrum data points of the dimension-reduced spectrum data that are located outside the envelope;

and, with equal sampling quantities of spectrum data points for each spectrum data point group as the constraint and maximization of the normal distribution as the target, sampling the dimension-reduced spectrum data and calculating the variational probability distribution of the dimension-reduced spectrum data.

In step S103, sampling the dimension-reduced spectrum data requires modelling the dimension-reduced spectrum data and solving for the probability distribution p(z|x). Because of the imbalance of the unbalanced spectrum data, p(z|x) is difficult to calculate directly. Therefore, when the dimension-reduced spectrum data are sampled, the variational probability distribution q(z) of the dimension-reduced spectrum data is calculated with the variational EM algorithm to approximate p(z|x). The objective of the variational EM algorithm is to make q(z) and p(z|x) as similar as possible, i.e. to make their KL (Kullback-Leibler) divergence as small as possible.

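The KL divergence calculation formula is specifically:

$$\mathrm{KL}\big(q(z)\,\|\,p(z|x)\big) = \mathbb{E}_{q(z)}\big[\log q(z) - \log p(z|x)\big];$$

where $\mathrm{KL}(q(z)\,\|\,p(z|x))$ is the KL divergence between the probability distribution p(z|x) and the variational probability distribution q(z), and $\mathbb{E}_{q(z)}[\cdot]$ is the expected value under the variational probability distribution q(z).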
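To make the role of the expectation under q(z) concrete, the sketch below evaluates the KL divergence for two small discrete distributions. The discrete setting, the function name and the example probabilities are assumptions chosen for illustration; the application itself works with continuous distributions.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) = sum_z q(z) * log(q(z) / p(z)), an expectation under q."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)


q = [0.5, 0.5]   # hypothetical variational distribution
p = [0.9, 0.1]   # hypothetical target distribution

forward = kl_divergence(q, p)
reverse = kl_divergence(p, q)
print(forward, reverse)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric, which is why the direction of the expectation matters.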

Because p(z|x) = p(x|z)p(z)/p(x), where p(x|z) is the probability of the spectrum data points assigned to the same class in the unbalanced spectrum data x given that the spectrum data points assigned to any class in the dimension-reduced spectrum data z occur, and p(x) is the probability of the spectrum data points assigned to any class in the unbalanced spectrum data x, and after clustering it can be considered that […]. According to Jensen's inequality (a known result that is not described in detail here), the KL divergence calculation formula is optimized to obtain:

；

where p(z) is the conditional probability distribution.

Thus, when the KL divergence between the conditional probability distribution p(z) and the variational probability distribution q(z) is minimal, the divergence between the variational probability distribution q(z) and the probability distribution p(z|x) is also minimal, i.e. q(z) and p(z|x) are closest.

In step S103, it is assumed that the conditional probability distribution p(z) of the dimension-reduced spectrum data satisfies a normal distribution with mean m and variance S, and an energy functional function is set in the normal distribution. The normal distribution is specifically:

$$p(z) = \frac{1}{\sqrt{(2\pi)^{k}\,|S|}}\,\exp\big(-E(z)\big), \qquad E(z) = \tfrac{1}{2}\,(z-m)^{T} S^{-1} (z-m); \quad (1)$$

where p(z) is the normal distribution (i.e. the conditional probability distribution), S is the variance of the spectrum data points in the dimension-reduced spectrum data, z is a spectrum data point in the dimension-reduced spectrum data, m is the mean of the spectrum data points in the dimension-reduced spectrum data, the superscript T denotes the matrix transpose, exp(·) denotes the natural constant e raised to the given power, E(z) is the preset energy functional function, and k is the number of spectrum data point groups.
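The relationship between a normal density and a quadratic energy can be illustrated with the sketch below. It assumes the usual energy 0.5 (z - m)^T S^{-1} (z - m), a diagonal S given as a list of per-dimension variances, and a normalising constant based on the dimension of z; these are illustrative assumptions, and the function names are hypothetical.

```python
import math


def energy(z, m, var):
    """Quadratic energy 0.5 * sum_d (z_d - m_d)^2 / var_d (diagonal S assumed)."""
    return 0.5 * sum((zd - md) ** 2 / vd for zd, md, vd in zip(z, m, var))


def normal_density(z, m, var):
    """Normal density written as exp(-energy) divided by its normaliser."""
    normaliser = math.sqrt(math.prod(2.0 * math.pi * vd for vd in var))
    return math.exp(-energy(z, m, var)) / normaliser


m, var = [0.0, 0.0], [1.0, 1.0]
at_mean = normal_density([0.0, 0.0], m, var)
off_mean = normal_density([1.0, 1.0], m, var)
print(at_mean, off_mean)
```

Because the density is a decreasing function of the energy, maximising the density and minimising the energy select the same points.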

The envelope of the preset energy functional function is calculated from the mean m and the variance S. The calculation formula of the envelope is specifically:

；

where the quantity defined by the formula is the envelope.

As can be seen from the above formula, setting the envelope equal to 0 determines a surface whose center is m and whose radius in each dimension of the dimension-reduced spectrum data is the standard deviation in that dimension.

At the same time, spectrum data points located outside the envelope are removed as abnormal data.
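One way to read the envelope described above is as an ellipsoid centered at m whose semi-axis in each dimension equals the standard deviation in that dimension. The sketch below removes points outside such an ellipsoid. The exact envelope formula is not reproduced in the text, so the inequality used here, the diagonal variance and the function names are assumptions.

```python
def inside_envelope(z, m, var):
    """True when sum_d (z_d - m_d)^2 / var_d <= 1 (assumed envelope form)."""
    return sum((zd - md) ** 2 / vd for zd, md, vd in zip(z, m, var)) <= 1.0


def remove_outliers(points, m, var):
    """Keep only the points lying on or inside the assumed envelope."""
    return [p for p in points if inside_envelope(p, m, var)]


# Hypothetical data: mean (0, 0), variances (4, 1), i.e. standard deviations (2, 1).
m, var = (0.0, 0.0), (4.0, 1.0)
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.5), (1.0, 0.5), (3.0, 0.0)]
kept = remove_outliers(points, m, var)
print(kept)
```

With a radius of one standard deviation the filter is strict, so a substantial share of ordinary points is discarded along with genuine outliers.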

With equal sampling quantities for the spectrum data point groups as the constraint, consider the dimension-reduced spectrum data after the abnormal data have been removed. Let K be the number of spectrum data point groups, m_i the mean of the i-th spectrum data point group and S_i the variance of the i-th spectrum data point group. During sampling of the dimension-reduced spectrum data, the numbers of spectrum data points taken from the groups are constrained to be equal: the i-th spectrum data point group contains n_i spectrum data points before sampling, and every spectrum data point group contains n spectrum data points after sampling.

Maximization of the normal distribution is the target, i.e. the conditional probability distribution p(z) corresponding to the normal distribution is to be maximized, which requires the energy functional function to be minimized, i.e. the minimum value of the energy functional function is to be calculated.

In summary, the minimum value of the energy functional function is:

；

where the left-hand side is the minimum value of the energy functional function, z_i denotes the spectrum data points of the i-th spectrum data point group, m_i is the mean of the spectrum data points of the i-th spectrum data point group, S_i is the variance of the spectrum data points of the i-th spectrum data point group, and s.t. is the constraint symbol, indicating that the minimum of the preceding expression must satisfy the condition that follows; n is the number of spectrum data points of each spectrum data point group obtained after sampling, and n_i is the number of spectrum data points of the i-th spectrum data point group before sampling.

The conditional probability distribution p(z) corresponding to the minimum value of the energy functional function is then calculated; this is the variational probability distribution of the dimension-reduced spectrum data.
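A simple way to satisfy the equal-quantity constraint while keeping the energy low is to retain, in every group, the n points with the smallest energy relative to that group's own mean and variance. The sketch below does exactly that. It is one possible reading of the constrained minimization, with a diagonal variance and hypothetical function names, and is not claimed to be the applicant's procedure.

```python
def mean_and_variance(group):
    """Per-dimension mean and population variance of a group of points."""
    n = len(group)
    columns = list(zip(*group))
    mean = [sum(col) / n for col in columns]
    var = [sum((x - mu) ** 2 for x in col) / n for col, mu in zip(columns, mean)]
    return mean, var


def group_energy(z, m, var, eps=1e-12):
    """Quadratic energy of z; eps guards against a zero variance."""
    return 0.5 * sum((zd - md) ** 2 / (vd + eps) for zd, md, vd in zip(z, m, var))


def balanced_sample(groups, n):
    """Keep the n lowest-energy points of every group (same n for all groups)."""
    sampled = []
    for group in groups:
        m, var = mean_and_variance(group)
        ranked = sorted(group, key=lambda z: group_energy(z, m, var))
        sampled.append(ranked[:n])
    return sampled


# Hypothetical groups of unequal size (n_1 = 5, n_2 = 3), sampled down to n = 2.
groups = [[(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)],
          [(5.0,), (6.0,), (7.0,)]]
sampled = balanced_sample(groups, 2)
print(sampled)
```

After sampling, every group contributes the same number of points, so the class sizes passed to the detection model are balanced.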

Specifically, in step S103, iteratively updating the dimension-reduced spectrum data by a gradient descent method, and iterating the conditional probability distribution corresponding to the iteratively updated dimension-reduced spectrum data to obtain the target conditional probability distribution having the smallest KL divergence from the variational probability distribution, includes:

iteratively updating the dimension-reduced spectrum data by a gradient descent method, and recording the dimension-reduced spectrum data after each iteration to obtain dimension-reduced spectrum data after a plurality of iterations;

calculating a plurality of conditional probability distributions corresponding to the dimension-reduced spectrum data after each iteration, and calculating the KL divergences between the plurality of conditional probability distributions and the variational probability distribution to obtain a plurality of KL divergences;

comparing the magnitudes of the plurality of KL divergences to extract the minimum value among them;

and determining the conditional probability distribution corresponding to the minimum value as the target conditional probability distribution having the smallest KL divergence from the variational probability distribution.

In step S103, the dimension-reduced spectral data is iteratively updated using the gradient descent method so that the conditional probability distribution corresponding to the updated dimension-reduced spectral data is as close as possible to the variational probability distribution, i.e. so that the KL divergence between the two distributions is minimized.

The iteration formula for updating the dimension-reduced spectral data is:

；

Here, the symbols denote: the dimension-reduced spectral data after the (t+1)-th iteration; the dimension-reduced spectral data after the t-th iteration, where t ≥ 0 is an integer and the data at t = 0 is the initial dimension-reduced spectral data; the step size; the gradient operator; and the energy functional corresponding to the dimension-reduced spectral data after the t-th iteration.
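The update described above has the form of a standard gradient-descent step: the new data equals the current data minus the step size times the gradient of the energy functional at the current data. The following is a minimal sketch only. The names `gradient_descent`, `energy_grad` and `eta` are assumed for illustration, and the quadratic energy is a hypothetical stand-in rather than the energy functional of the present application. The defaults follow the typical values given for the device embodiment (step size 0.01, 50 iterations).

```python
from typing import Callable, List

Vector = List[float]


def gradient_descent(x0: Vector,
                     energy_grad: Callable[[Vector], Vector],
                     eta: float = 0.01,
                     n_iter: int = 50) -> List[Vector]:
    """Apply x_{t+1} = x_t - eta * grad E(x_t) and record every iterate."""
    history = [list(x0)]  # t = 0: the initial dimension-reduced data
    x = list(x0)
    for _ in range(n_iter):
        g = energy_grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        history.append(x)
    return history


# Hypothetical energy E(x) = 0.5 * sum((x_i - m)^2); its gradient is x - m.
def quadratic_grad(x: Vector, m: float = 1.0) -> Vector:
    return [xi - m for xi in x]


history = gradient_descent([0.0, 2.0, 4.0], quadratic_grad)
```

Recording every iterate, rather than only the last one, matches the later step in which a distribution is computed for the data after each iteration.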

Using the normal distribution of this step (i.e. formula (1)), the conditional probability distribution corresponding to the dimension-reduced spectral data after each iteration is calculated to obtain a plurality of conditional probability distributions, and the KL divergence between each conditional probability distribution and the variational probability distribution is calculated. The KL divergence is calculated as follows:

；

Here, the left-hand side denotes the KL divergence between the conditional probability distribution and the variational probability distribution.

The magnitudes of the plurality of KL divergences are compared, the minimum value is extracted, and the conditional probability distribution corresponding to the minimum value is determined as the target conditional probability distribution having the smallest KL divergence from the variational probability distribution.

The iterated dimension-reduced spectral data corresponding to the target conditional probability distribution is then obtained and determined as the spectral sampling data.
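The selection procedure above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the present application: each iterate is summarized by a univariate normal distribution fitted to its values, the closed-form KL divergence between two univariate normals is used, and the function names, the direction of the divergence and the toy data are all hypothetical.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def fit_normal(data: Sequence[float]) -> Tuple[float, float]:
    """Return the (mean, variance) of a normal distribution fitted to the data."""
    return fmean(data), pvariance(data)


def kl_normal(m1: float, v1: float, m2: float, v2: float) -> float:
    """KL( N(m1, v1) || N(m2, v2) ) for univariate normals with variances > 0."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def select_sampling_data(iterates: List[Sequence[float]],
                         target: Tuple[float, float]):
    """Pick the iterate whose fitted normal has the smallest KL divergence to the target."""
    m_q, v_q = target
    kls = [kl_normal(*fit_normal(x), m_q, v_q) for x in iterates]
    best = min(range(len(kls)), key=kls.__getitem__)
    return best, iterates[best], kls


# Toy iterates (assumed data) and an assumed target distribution N(1.0, 0.25).
iterates = [[0.0, 2.0, 4.0], [0.5, 1.0, 1.5], [0.9, 1.0, 1.1]]
best, sampling_data, kls = select_sampling_data(iterates, (1.0, 0.25))
```

In this toy example the second iterate is selected: the first is too far from the target mean, and the third is far more concentrated than the target, so both have a larger divergence.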

Specifically, in step S104, the spectral sampling data is input into a preset spectral detection model, which calculates the class result of the spectral sampling data (i.e. all classes contained in the spectral data points of the spectral sampling data). This class result is taken as the class result of the unbalanced spectral data and represents all classes contained in the spectral data points of the unbalanced spectral data. The preset spectral detection model is specifically a random forest model, which is prior art and is not described in detail here.

For example, the unbalanced spectral data detection method of the present application was used to perform Raman spectral detection on a portion of the urine samples undergoing routine urinalysis at a hospital, predicting whether urine protein or urine sugar is negative or positive. First, 3060 spectral data items were collected. Spectral sampling data was then obtained through the variational EM algorithm according to the method and input into a random forest model for detection, achieving 88% accuracy for urine protein detection and 91% accuracy for urine sugar detection. By comparison, when the method of the present application is not used and the spectral data is input directly into the random forest model, the accuracy is lower. The accuracy of the detection results is shown in the following table:

Detection index	Urine protein	Urine sugar	Diagnosis of	Occult blood	White blood cells
Direct input	0.842	0.829	0.820	0.761	0.744
Unbalanced spectral data detection method of the present application	0.881	0.909	0.896	0.825	0.826

As the above table shows, the unbalanced spectral data detection method of the present application detects the class result of the spectral data points more accurately than direct input on every detection index listed.
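The size of the improvement can be read off the table with simple arithmetic. The short sketch below copies the accuracy values from the table, keeping the column labels as they appear there, and computes the per-index gain and the mean gain; the variable names are illustrative only.

```python
# Accuracy values copied from the table above; labels kept as given there.
indices = ["Urine protein", "Urine sugar", "Diagnosis of",
           "Occult blood", "White blood cells"]
direct = [0.842, 0.829, 0.820, 0.761, 0.744]
proposed = [0.881, 0.909, 0.896, 0.825, 0.826]

# Absolute accuracy gain of the proposed method over direct input, per index.
gains = {name: round(p - d, 3) for name, d, p in zip(indices, direct, proposed)}

# Mean gain across the five detection indices.
mean_gain = round(sum(gains.values()) / len(gains), 4)

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
print(f"Mean gain: +{mean_gain:.4f}")
```

The gains range from 0.039 (urine protein) to 0.082 (white blood cells), with a mean of about 0.068.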

In the unbalanced spectral data detection method described above, unbalanced spectral data is acquired; the dimension of the unbalanced spectral data is reduced and the result is clustered using a K-means clustering algorithm to obtain dimension-reduced spectral data; the dimension-reduced spectral data is sampled by a variational EM algorithm to obtain spectral sampling data; and the spectral sampling data is input into a preset spectral detection model to calculate the class result of the unbalanced spectral data. Because the class result is obtained from spectral sampling data produced by the K-means clustering algorithm and the variational EM algorithm, the method solves the problem that existing spectral data detection methods cannot effectively select the most representative spectral data points from unbalanced spectral data and therefore have difficulty analyzing such data accurately. The unbalanced spectral data can thus be classified and identified accurately and quickly, which improves the detection efficiency for unbalanced spectral data.

Referring to Fig. 2, the present application provides an unbalanced spectral data detection device for detecting unbalanced spectral data, comprising:

an acquisition module 1 for acquiring unbalanced spectral data;

a dimension reduction module 2 for reducing the dimension of the unbalanced spectral data and clustering the result using a K-means clustering algorithm to obtain dimension-reduced spectral data;

a sampling module 3 for sampling the dimension-reduced spectral data through a variational EM algorithm to obtain spectral sampling data;

a calculation module 4 for inputting the spectral sampling data into a preset spectral detection model and calculating the class result of the unbalanced spectral data.

Specifically, when executing, the acquisition module 1 acquires unbalanced spectral data composed of spectral data points. Unbalanced spectral data refers to spectral data in which the number of spectral data points of a certain class is far greater than that of other classes, or in which the distribution of the spectral data points of a certain class in the feature space is extremely sparse.

Specifically, when reducing the dimension of the unbalanced spectral data and then clustering the dimension-reduced unbalanced spectral data using a K-means clustering algorithm to obtain the dimension-reduced spectral data, the dimension reduction module 2 performs the following:

When executing, the dimension reduction module 2 removes redundant data from the unbalanced spectral data using an existing dimension reduction method such as principal component analysis (PCA) or linear discriminant analysis (LDA), so that the spectral data points with their original features are reduced to a certain number of spectral data points having the most important features, yielding the dimension-reduced unbalanced spectral data. Existing dimension reduction methods such as PCA and LDA are prior art and are not described in detail here.

Specifically, when dividing all spectral data points into a plurality of spectral data point groups based on the initial cluster centers through the K-means clustering algorithm to obtain the dimension-reduced spectral data, the dimension reduction module 2 performs:

Specifically, when sampling the dimension-reduced spectral data through the variational EM algorithm to obtain the spectral sampling data, the sampling module 3 performs:

Specifically, when constructing the corresponding conditional probability distribution while removing abnormal data from the dimension-reduced spectral data based on a preset energy functional, the sampling module 3 performs:

When executing, the sampling module 3 needs to model the dimension-reduced spectral data and solve for the probability distribution p(z|x) in order to sample the dimension-reduced spectral data. However, because the spectral data is unbalanced, p(z|x) is difficult to calculate directly. A variational EM algorithm is therefore used during sampling to calculate the variational probability distribution q(z) of the dimension-reduced spectral data and thereby infer p(z|x). The objective of the variational EM algorithm is to make q(z) and p(z|x) as similar as possible, i.e. to make their KL (Kullback-Leibler) divergence as small as possible.

The KL divergence is calculated as:

；

Here, p(z) is the conditional probability distribution.
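For reference, the standard textbook definition of the KL divergence used in variational inference is given below, together with the identity that explains why minimizing it is tractable. This is a general result and is not necessarily the exact form used in the present application; the integral form and the direction KL(q‖p) are assumptions.

```latex
\mathrm{KL}\bigl(q(z)\,\|\,p(z\mid x)\bigr)
  = \int q(z)\,\log\frac{q(z)}{p(z\mid x)}\,dz \;\ge\; 0,
\qquad
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log\frac{p(x,z)}{q(z)}\right]}_{\text{evidence lower bound}}
  + \mathrm{KL}\bigl(q(z)\,\|\,p(z\mid x)\bigr).
```

Because log p(x) does not depend on q, making the KL divergence smaller is equivalent to making the lower bound larger, which can be done without evaluating p(z|x) directly.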

When executing, the sampling module 3 assumes that the conditional probability distribution p(z) of the dimension-reduced spectral data follows a normal distribution with mean m and variance S, and sets an energy functional in the normal distribution. The normal distribution is specifically:

；（1）

；

Wherein, Is an envelope.

；

Specifically, when iteratively updating the dimension-reduced spectral data using a gradient descent method and iterating the conditional probability distribution corresponding to the updated dimension-reduced spectral data, to obtain the target conditional probability distribution having the smallest KL divergence from the variational probability distribution, the sampling module 3 performs:

When executing, the sampling module 3 iteratively updates the dimension-reduced spectral data using the gradient descent method so that the conditional probability distribution corresponding to the updated dimension-reduced spectral data is as close as possible to the variational probability distribution, i.e. so that the KL divergence between the two distributions is minimized.

；

Here, the symbols denote: the dimension-reduced spectral data after the (t+1)-th iteration; the dimension-reduced spectral data after the t-th iteration, where t ≥ 0 is an integer and the data at t = 0 is the initial dimension-reduced spectral data; the step size; the gradient operator; and the energy functional corresponding to the dimension-reduced spectral data after the t-th iteration. The step size and the number of iterations can be set as needed; the step size is typically set to 0.01 and the number of iterations to 50.

Using the normal distribution obtained in the previous step (i.e. formula (1)), the conditional probability distribution corresponding to the dimension-reduced spectral data after each iteration is calculated to obtain a plurality of conditional probability distributions, and the KL divergence between each conditional probability distribution and the variational probability distribution is calculated. The KL divergence is calculated as follows:

；

Specifically, when executing, the calculation module 4 inputs the spectral sampling data into a preset spectral detection model, which calculates the class result of the spectral sampling data (i.e. all classes contained in the spectral data points of the spectral sampling data). This class result is taken as the class result of the unbalanced spectral data and represents all classes contained in the spectral data points of the unbalanced spectral data. The preset spectral detection model is specifically a random forest model, which is prior art and is not described in detail here.

For example, the unbalanced spectral data detection device of the present application was used to perform Raman spectral detection on a portion of the urine samples undergoing routine urinalysis at a hospital, predicting whether urine protein or urine sugar is negative or positive. First, 3060 spectral data items were collected. Spectral sampling data was then obtained by the device through the variational EM algorithm and input into a random forest model for detection, achieving 88% accuracy for urine protein detection and 91% accuracy for urine sugar detection. By comparison, when the device of the present application is not used and the spectral data is input directly into the random forest model, the accuracy is lower. The accuracy of the detection results is shown in the following table:

Detection index	Urine protein	Urine sugar	Diagnosis of	Occult blood	White blood cells
Direct input	0.842	0.829	0.820	0.761	0.744
Unbalanced spectral data detection device of the present application	0.881	0.909	0.896	0.825	0.826

As the above table shows, the unbalanced spectral data detection device of the present application detects the class result of the spectral data points more accurately than direct input on every detection index listed.

As can be seen from the above, the unbalanced spectral data detection device acquires unbalanced spectral data; reduces the dimension of the unbalanced spectral data and clusters the result using a K-means clustering algorithm to obtain dimension-reduced spectral data; samples the dimension-reduced spectral data by a variational EM algorithm to obtain spectral sampling data; and inputs the spectral sampling data into a preset spectral detection model to calculate the class result of the unbalanced spectral data. Because the class result is obtained from spectral sampling data produced by the K-means clustering algorithm and the variational EM algorithm, the device solves the problem that existing spectral data detection methods cannot effectively select the most representative spectral data points from unbalanced spectral data and therefore have difficulty analyzing such data accurately. The unbalanced spectral data can thus be classified and identified accurately and quickly, which improves the detection efficiency for unbalanced spectral data.

Referring to Fig. 3, Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device includes a processor 301 and a memory 302, which are interconnected and communicate with each other through a communication bus 303 and/or another form of connection mechanism (not shown). The memory 302 stores a computer program executable by the processor 301. When the electronic device is running, the processor 301 executes the computer program to perform the unbalanced spectral data detection method in any of the alternative implementations of the above embodiments, so as to implement the following functions: acquiring unbalanced spectral data; reducing the dimension of the unbalanced spectral data and clustering the result using a K-means clustering algorithm to obtain dimension-reduced spectral data; sampling the dimension-reduced spectral data by a variational EM algorithm to obtain spectral sampling data; and inputting the spectral sampling data into a preset spectral detection model to calculate the class result of the unbalanced spectral data.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the unbalanced spectral data detection method in any of the alternative implementations of the above embodiments, so as to implement the following functions: acquiring unbalanced spectral data; reducing the dimension of the unbalanced spectral data and clustering the result using a K-means clustering algorithm to obtain dimension-reduced spectral data; sampling the dimension-reduced spectral data by a variational EM algorithm to obtain spectral sampling data; and inputting the spectral sampling data into a preset spectral detection model to calculate the class result of the unbalanced spectral data. The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

In the embodiments provided in the present application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection through some communication interface, device, or unit, and may be electrical, mechanical, or of another form.

Further, the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. An unbalanced spectrum data detection method for detecting unbalanced spectrum data, comprising the steps of:

acquiring unbalanced spectrum data;

inputting the spectrum sampling data into a preset spectrum detection model, and calculating to obtain a class result of the unbalanced spectrum data;

wherein sampling the dimension-reduced spectrum data through a variational EM algorithm to obtain spectrum sampling data comprises:

2. The method of claim 1, wherein the unbalanced spectrum data comprises spectrum data points; and wherein, after reducing the dimension of the unbalanced spectrum data, clustering the dimension-reduced unbalanced spectrum data using a K-means clustering algorithm to obtain dimension-reduced spectrum data comprises:

3. The method for detecting unbalanced spectral data according to claim 2, wherein dividing all the spectral data points into a plurality of spectral data point groups based on the initial clustering center by the K-means clustering algorithm to obtain dimension-reduced spectral data comprises:

4. The method for detecting unbalanced spectrum data according to claim 1, wherein constructing a corresponding conditional probability distribution for calculating a variational probability distribution of the dimension-reduced spectrum data by a sampling method, while eliminating abnormal data in the dimension-reduced spectrum data based on a preset energy functional function, comprises:

5. The method for detecting unbalanced spectrum data according to claim 1, wherein iteratively updating the dimension-reduced spectrum data using a gradient descent method, and iterating the conditional probability distribution corresponding to the iteratively updated dimension-reduced spectrum data, to obtain a target conditional probability distribution having the smallest KL divergence from the variational probability distribution, comprises:

6. The method for detecting unbalanced spectral data of claim 1, wherein the predetermined spectral detection model is a random forest model.

7. An unbalanced spectrum data detection apparatus for detecting unbalanced spectrum data, comprising:

an acquisition module for acquiring unbalanced spectrum data;

a calculation module for inputting the spectrum sampling data into a preset spectrum detection model and calculating to obtain a class result of the unbalanced spectrum data;

a sampling module for sampling the dimension-reduced spectrum data through a variational EM algorithm to obtain spectrum sampling data, comprising:

8. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, wherein the processor, when executing the computer program, performs the steps of the method of unbalanced spectrum data detection as claimed in any one of claims 1 to 6.

9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method of unbalanced spectral data detection according to any one of claims 1 to 6.