CN118626758A - A data processing method and device based on least squares method of clustering - Google Patents

Publication number
CN118626758A
Authority
CN
China
Prior art keywords
data
clustering
program module
square
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410646317.6A
Other languages
Chinese (zh)
Inventor
马雅男
唐小峰
胡宇
牛健行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Jovian Technology Exploitation Co ltd
CETC 10 Research Institute
Original Assignee
Chengdu Jovian Technology Exploitation Co ltd
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Jovian Technology Exploitation Co ltd and CETC 10 Research Institute
Priority to CN202410646317.6A
Publication of CN118626758A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method and device based on a clustering-based least squares method, belonging to the field of data processing. The method comprises: using a computer program to fuse a clustering algorithm with a least squares algorithm, and using the error between the output of the least squares program module and the actual data as feedback to control the admissible range of J(m_k) in the clustering program module, so that the clustering program module removes abnormal data and the least squares fit comes closer to the target values. Here J(m_k) denotes the sum of squared distances from the objects m_k of class k to the cluster center. The data comprise test data of aviation integrated electronic information equipment, which are entered through a computer input unit into the computer running the program. The method is simple and readily implemented, and makes data processing more accurate.

Description

A data processing method and device based on least squares method of clustering

Technical Field

The present invention relates to the field of data processing, and more specifically to a data processing method and device based on a clustering-based least squares method.

Background Art

Data processing is involved in many research areas, such as scientific experiments, computer image processing, and signal processing. Analyzing the relationships between quantities, predicting future trends, and making sound decisions all require fitting the data. Data fitting is used in many fields and has long been an active topic. In essence, it fits a curve or surface to given discrete data points, approximating those points with a smooth curve or surface. It is an important research subject in mathematics and in other disciplines.

The least squares method of curve fitting fits the data with a polynomial function and drops the requirement that the function pass exactly through each given point (x_i, y_i). It only requires that the fitted curve reflect the basic relationship in the data. This removes, to some extent, the errors present in the measured data, and the method is widely used in practical engineering data processing. The core of least squares data fitting is as follows: for a given data set (x_i, y_i) (i = 0, 1, ..., m), choose the form of the approximating function, that is, given a function class Φ, find the function f(x) ∈ Φ that minimizes the sum of squares of the errors v_i = f(x_i) - y_i (i = 0, 1, ..., m):

$$\sum_{i=0}^{m} v_i^2 = \sum_{i=0}^{m} \left[ f(x_i) - y_i \right]^2 = \min$$

Geometrically, this means finding the curve y = f(x) with the smallest sum of squared distances to the given points (x_i, y_i) (i = 0, 1, ..., m), as shown in Figure 1. The function f(x) is called the fitting function or least squares solution, and the method for finding it is called the least squares method of curve fitting. As Figure 1 shows, the original data points are distributed evenly on both sides of the fitted curve f(x), and the curve reflects the trend and characteristics of the data. Analyzing measured data with the least squares method yields an approximate function form fitted to the measurements and a reasonably accurate solution. In practice, because the observation accuracy differs from point to point, a weighted variance is often introduced.
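As an illustration of the definition above, the following is a minimal sketch of a polynomial least squares fit using NumPy. The helper names `fit_polynomial` and `sum_squared_error`, the sample data, and the choice of a degree-1 polynomial are assumptions made for this example only; they do not come from the patent.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Return polynomial coefficients (highest power first) that minimise
    the sum of squared errors sum((f(x_i) - y_i)**2)."""
    return np.polyfit(x, y, degree)

def sum_squared_error(coeffs, x, y):
    """Sum of squares of the errors v_i = f(x_i) - y_i."""
    residuals = np.polyval(coeffs, x) - y
    return float(np.sum(residuals ** 2))

# Hypothetical measurements: a straight line y = 2x + 1 plus small errors.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.0])
y = 2.0 * x + 1.0 + noise

coeffs = fit_polynomial(x, y, degree=1)
sse = sum_squared_error(coeffs, x, y)
print("slope, intercept:", coeffs)
print("sum of squared errors:", sse)
```

The fitted line does not pass through any individual point exactly, but its sum of squared errors is no larger than that of any other straight line, including the line used to generate the data.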

Advantages of the least squares method: least squares curve fitting is a common method for processing measurement data. The best square approximation approximates a function fairly uniformly over an interval. It is simple, fast, widely applicable, and easy to implement with a short computer program. When the regression model satisfies its assumptions, the least squares estimates of the regression parameters have good statistical properties, such as unbiasedness, consistency, and minimum variance.

Disadvantages of the least squares method: the amount of data to be processed in practical engineering tests is large, and if the least squares method is applied to the entire measurement data set in a single fit, the fitting accuracy is hard to guarantee. Gradient-based least squares methods easily fall into local optima, and the fitted function sometimes fails to meet practical requirements. Real test data vary widely, and the purpose of fitting differs from case to case, so the results of least squares fitting often fall short of expectations. For example, when occasional gross errors produce outliers, or when the probability distribution of the data deviates from the normal distribution, the least squares fit loses its good statistical properties. In addition, when the order of the normal equations is high, they are often ill-conditioned and must be handled with care. One effective remedy is to introduce orthogonal polynomials to improve the conditioning.
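The remark on ill-conditioned normal equations can be checked numerically. The sketch below compares the condition number of the normal-equation matrix AᵀA for a monomial basis and for a Legendre (orthogonal polynomial) basis. The degree, the sample grid on [-1, 1], and the helper name `normal_matrix_condition` are assumptions chosen for illustration; the patent does not specify which orthogonal polynomials are meant.

```python
import numpy as np
from numpy.polynomial import legendre

def normal_matrix_condition(basis_matrix):
    """Condition number of A^T A, the matrix of the normal equations."""
    return float(np.linalg.cond(basis_matrix.T @ basis_matrix))

# Hypothetical setup: 50 equally spaced sample points, degree-10 fit.
x = np.linspace(-1.0, 1.0, 50)
degree = 10

monomial_basis = np.vander(x, degree + 1)          # columns x^10, ..., x, 1
legendre_basis = legendre.legvander(x, degree)     # columns P_0(x), ..., P_10(x)

cond_monomial = normal_matrix_condition(monomial_basis)
cond_legendre = normal_matrix_condition(legendre_basis)
print("monomial basis:", cond_monomial)
print("Legendre basis:", cond_legendre)
```

A smaller condition number means the solution of the normal equations is less sensitive to rounding and measurement error, which is the motivation for switching to an orthogonal basis.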

However, the prior art has the following technical problem: its accuracy needs to be improved.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a data processing method and device based on a clustering-based least squares method that make data processing more accurate.

The purpose of the present invention is achieved by the following solution:

A data processing method based on a clustering-based least squares method comprises the following steps:

A computer program fuses a clustering algorithm with a least squares algorithm. In the computer, the error between the output of the least squares program module and the actual data is fed back to control the admissible range of J(m_k) in the clustering program module, so that the clustering program module removes abnormal data and the least squares fit comes closer to the target values. Here J(m_k) denotes the sum of squared distances from the objects m_k of class k to the cluster center. The data include test data of aviation integrated electronic information equipment, which are entered through a computer input unit into the computer running the program.

Further, fusing the clustering algorithm with the least squares algorithm and controlling the admissible range of J(m_k) by error feedback, as described above, specifically comprises the following sub-steps:

Step 1: assume the input data of n samples are (x_1, p_1, ..., q_1), (x_2, p_2, ..., q_2), ..., (x_n, p_n, ..., q_n) and the measured output data are y_1, y_2, ..., y_n, where the index 1, 2, ..., n identifies the sampling point and x, p, ..., q denote m different input feature parameters.

Step 2: the input data (x, p, ..., q) enter the clustering program module through the computer input unit for processing.

Step 3: the output data of the clustering program module enter the least squares program module, which derives the corresponding fitting function z = f(x, p, ..., q) and outputs the fitted values z_1, z_2, ..., z_n.

Step 4: z_1, z_2, ..., z_n and the measured data y_1, y_2, ..., y_n are passed to the mean square error program module, which calculates the mean square error E:

$$E = \frac{1}{n} \sum_{j=1}^{n} \left( z_j - y_j \right)^2$$

Step 5: if E meets the error requirement, the least squares results z_1, z_2, ..., z_n are output directly through the computer output unit. If it does not, E is fed back to the clustering program module, which revises the required range of J(m_k). If E is large, the admissible range of J(m_k) is narrowed so that more points are removed. If E is small, the admissible range of J(m_k) is widened so that more useful data enter the least squares program module for fitting.

Further, processing the input data (x, p, ..., q) in the clustering program module after they enter through the computer input unit specifically comprises the following sub-steps:

Step a): cluster each parameter set of (x, p, ..., q) separately. For the i-th parameter set (i = 1, ..., m, where m is the number of different input feature parameters x, p, ..., q), select k_i data points, denoted [C_1, C_2, ..., C_ki], as the centers of the k_i classes. The class counts of the m parameter sets are k1, k2, ..., km.

Step b): for each parameter set of (x, p, ..., q), calculate the Euclidean distance from each remaining element to the k_i class centers, and assign each data point to the nearest class. The Euclidean distance is:

d_ji = |a_j - C_ki|, a ∈ [x, p, ..., q];

where i = 1, ..., m indexes the input feature parameters and j = 1, ..., n indexes the sampling points.

Step c): for each parameter set of (x, p, ..., q), based on the clustering result, first take the mean of all objects in each class as the new cluster center r_ki, then calculate the sum of squared distances from the objects to the cluster center of their class, that is, the J value:

$$J(m_k) = \sum_{a_j \in m_k} \lVert a_j - r_{ki} \rVert^2$$

Step d): calculate the evaluation function:

$$J = \sum_{k=1}^{kx} \sum_{j=1}^{n} d_{kj} \, \lVert a_j - r_{ki} \rVert^2$$

where:

d_kj = 1 if a_j belongs to the class with center r_ki;

d_kj = 0 if a_j does not belong to the class with center r_ki;

kx = [k1, k2, ..., km];

Step f): determine whether J(m_k) lies within the range required by the user. If it does, the data are output from the module. If it does not, sort the data points by D = ||a_j - r_ki||^2, remove the points with the largest D, and return to step b). Repeat until J(m_k) meets the requirement, then output the data from the clustering program module.
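Sub-steps a) to f) can be sketched for a single feature parameter as follows. This is an illustrative reading, not the patent's implementation: the function name `trim_feature_by_clustering`, the scalar threshold `j_limit` standing in for the user's required range, the rule of removing one point per pass, and the sample values are all assumptions.

```python
import numpy as np

def trim_feature_by_clustering(values, initial_centers, j_limit):
    """Cluster one feature's samples and drop the farthest point until the
    total squared distance J is at most j_limit (assumed acceptance rule).

    Returns (keep_mask, centers, j_total)."""
    values = np.asarray(values, dtype=float)
    centers = np.asarray(initial_centers, dtype=float).copy()   # step a)
    keep = np.ones(values.shape[0], dtype=bool)
    max_removals = values.shape[0] - centers.shape[0]
    removed = 0
    while True:
        idx = np.flatnonzero(keep)
        kept = values[idx]
        # Step b): assign every retained sample to its nearest center.
        labels = np.argmin(np.abs(kept[:, None] - centers[None, :]), axis=1)
        # Step c): the mean of each class becomes the new center.
        for k in range(centers.shape[0]):
            members = kept[labels == k]
            if members.size:
                centers[k] = members.mean()
        # Steps c) and d): squared distances to the own center, summed to J.
        sq_dist = (kept - centers[labels]) ** 2
        j_total = float(sq_dist.sum())
        # Step f): accept, or remove the point with the largest D and repeat.
        if j_total <= j_limit or removed >= max_removals:
            return keep, centers, j_total
        keep[idx[np.argmax(sq_dist)]] = False
        removed += 1

# Hypothetical samples: two tight groups and one gross error (9.0).
samples = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0]
keep, centers, j_total = trim_feature_by_clustering(
    samples, initial_centers=[1.0, 5.0], j_limit=0.1)
print(keep, centers, j_total)
```

In this example the value 9.0 is first assigned to the second class and pulls its center to 6.0; once it is removed, the center returns to 5.0 and J drops below the limit.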

Further, in step 1, the collected data of the relevant scenario are input for the application that requires data fitting. The scenarios include machine learning, digital twin systems, artificial intelligence, and machine vision.

Further, in step a), the k_i data points [C_1, C_2, ..., C_ki] need not be taken from the sample data points.

Further, in step f), the user requirement can be set as a required range.

A data processing device based on a clustering-based least squares method comprises a processor and a memory. The memory stores a program which, when loaded by the processor, executes the data processing method based on a clustering-based least squares method according to any one of the above.

The beneficial effects of the present invention include:

The method is simple and readily implemented. An error feedback control mechanism couples the least squares algorithm closely with the clustering algorithm, and cyclically adjusting the range of J(m_k) brings the output of the least squares algorithm closer to actual requirements, so the final data processing is more accurate.

Brief Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention. A person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a least squares fitting curve;

Figure 2 is a schematic diagram of the method according to an embodiment of the present invention;

Figure 3 is the UVPA noise figure test data curve;

Figure 4 shows the first output of the least squares method;

Figure 5 is a schematic diagram of the clustering algorithm removing points with large errors from the UVPA noise figure test data;

Figure 6 shows the second output of the least squares method;

Figure 7 shows the final output of the least squares method;

Figure 8 is the mean square error curve over 7 iterations.

Detailed Description

All features disclosed in the embodiments of this specification, and all steps of the methods or processes disclosed explicitly or implicitly, may be combined, extended, or replaced in any manner, except for mutually exclusive features and/or steps.

In view of the problems described in the background, the inventors found the following after further creative consideration:

The K-means clustering algorithm is a classic and effective algorithm in cluster analysis, based on the idea of partitioning. Let the original data set be d-dimensional and written as X = {x_i | x_i ∈ R^d, i = 1, 2, ..., N}. Cluster analysis groups the original data set into K classes, denoted m_k (k = 1, 2, ..., K). Each class has a cluster center, denoted c_k (k = 1, 2, ..., K). The sum of squared distances from the objects in each class to its cluster center is:

$$J(m_k) = \sum_{x_i \in m_k} \lVert x_i - c_k \rVert^2$$

Cluster analysis minimizes the sum of squared distances from the objects of all classes to their cluster centers, so the evaluation function is:

$$J = \sum_{k=1}^{K} J(m_k) = \sum_{k=1}^{K} \sum_{i=1}^{N} d_{ki} \, \lVert x_i - c_k \rVert^2$$

where:

d_ki = 1 if x_i belongs to class m_k (with center c_k);

d_ki = 0 otherwise.

When the original data set is preprocessed with the K-means clustering procedure, the data are divided into K classes, and each data point is assigned to the nearest class. The Euclidean distance from each object to its cluster center can then be calculated, and the object with the largest distance is treated as an outlier. In this way, the clustering algorithm groups the data into classes, and data that do not belong to any class are outliers.
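The per-class sum J(m_k), the indicator d_ki, and the farthest-point rule can be written out directly. The sketch below assumes the cluster centers are already known; the function name `kmeans_cost` and the sample points are hypothetical and serve only to make the definitions concrete.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Assign each point to its nearest center and evaluate the K-means cost.

    Returns (labels, per_class_J, total_J, index_of_farthest_point)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every point to every center, shape (N, K).
    sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq.argmin(axis=1)
    # Indicator d_ki: 1 where point i belongs to class k, 0 elsewhere.
    indicator = np.zeros_like(sq)
    indicator[np.arange(points.shape[0]), labels] = 1.0
    own_sq = (indicator * sq).sum(axis=1)        # distance of each point to its own center
    per_class = (indicator * sq).sum(axis=0)     # J(m_k) for each class
    total = float(per_class.sum())               # evaluation function J
    farthest = int(own_sq.argmax())              # outlier candidate
    return labels, per_class, total, farthest

# Hypothetical 2-D data: two compact groups and one distant point.
points = [[0, 0], [0, 1], [4, 4], [4, 5], [10, 10]]
centers = [[0, 0.5], [4, 4.5]]
labels, per_class, total, farthest = kmeans_cost(points, centers)
print(labels, per_class, total, farthest)
```

Here the point (10, 10) contributes almost all of the cost of its class, so it is the one flagged as the outlier candidate.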

The present invention therefore proposes a new data processing method built on a least squares framework that incorporates a clustering algorithm, implemented as a program running on computer hardware. As shown in Figure 2, the method uses a computer program to integrate the clustering algorithm closely with the least squares algorithm. The error between the output of the least squares program module and the actual data is fed back to control the admissible range of J(m_k) in the clustering program module, so that the clustering program module removes abnormal data, the least squares fit comes closer to the target values, and the accuracy of the method improves.

Beyond simply chaining a clustering algorithm and the least squares method, the invention adds a feedback mechanism. The overall workflow is as follows:

Step 1: for an application that requires data fitting, assume the input data of n samples are (x_1, p_1, ..., q_1), (x_2, p_2, ..., q_2), ..., (x_n, p_n, ..., q_n) and the measured output data are y_1, y_2, ..., y_n, where the index 1, 2, ..., n identifies the sampling point and x, p, ..., q denote m different input feature parameters. To obtain an accurate fitted curve for predicting the output at other sampling points, the goal is to find a function that maps the inputs x, p, ..., q to the measured outputs y_1, y_2, ..., y_n.

Step 2: the input data (x, p, ..., q) enter the clustering program module, which works as follows:

Step a): cluster each parameter set of (x, p, ..., q) separately. For the i-th parameter set (i = 1, ..., m, where m is the number of different input feature parameters x, p, ..., q), select k_i data points, denoted [C_1, C_2, ..., C_ki], as the centers of the k_i classes. The class counts of the m parameter sets are k1, k2, ..., km. The k_i data points [C_1, C_2, ..., C_ki] need not be taken from the sample data points.

Step b): for each parameter set of (x, p, ..., q), calculate the Euclidean distance from each remaining element to the k_i class centers, and assign each data point to the nearest class. The Euclidean distance is:

d_ji = |a_j - C_ki|, a ∈ [x, p, ..., q];

where i = 1, ..., m indexes the input feature parameters and j = 1, ..., n indexes the sampling points.

Step c): for each parameter set of (x, p, ..., q), based on the clustering result, first take the mean of all objects in each class as the new cluster center r_ki, then calculate the sum of squared distances from the objects to the cluster center of their class, that is, the J value:

$$J(m_k) = \sum_{a_j \in m_k} \lVert a_j - r_{ki} \rVert^2$$

Step d): calculate the evaluation function:

$$J = \sum_{k=1}^{kx} \sum_{j=1}^{n} d_{kj} \, \lVert a_j - r_{ki} \rVert^2$$

where:

d_kj = 1 if a_j belongs to the class with center r_ki;

d_kj = 0 if a_j does not belong to the class with center r_ki;

kx = [k1, k2, ..., km];

Step f): determine whether J(m_k) lies within the range required by the user (the user can set this range). If it does, the data are output from the module. If it does not, sort the data points by D = ||a_j - r_ki||^2, remove the points with the largest D, and return to step b). Repeat until J(m_k) meets the requirement, then output the data from the clustering module.

Step 3: the output data of the clustering module enter the least squares module, which derives the corresponding fitting function z = f(x, p, ..., q) and outputs the fitted values z_1, z_2, ..., z_n.

Step 4: z_1, z_2, ..., z_n and the measured data y_1, y_2, ..., y_n are passed to the mean square error module, which calculates the mean square error E:

$$E = \frac{1}{n} \sum_{j=1}^{n} \left( z_j - y_j \right)^2$$

Step 5: if E meets the error requirement, the least squares results z_1, z_2, ..., z_n are output directly. If it does not, E is fed back to the clustering module, which revises the required range of J(m_k). If E is large, the admissible range of J(m_k) is narrowed so that more points are removed. If E is small, the admissible range of J(m_k) is widened so that more useful data enter the least squares module for fitting.
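The feedback loop of steps 2 to 5 can be sketched as below, under several assumptions that the text leaves open: the fitting function is taken to be linear with an intercept, E is computed on the retained samples only, the J range is a single upper limit that is halved whenever E is too large, and the cluster module is replaced by a simplified stand-in with one class per feature. The names `single_cluster_filter`, `fit_with_feedback`, `j_limit`, `mse_tol`, and `shrink` are hypothetical, and only the narrowing branch of step 5 is shown.

```python
import numpy as np

def single_cluster_filter(features, j_limit):
    """Stand-in cluster module (one class per feature): drop the sample farthest
    from the feature mean until every feature's J is at most j_limit."""
    keep = np.ones(features.shape[0], dtype=bool)
    while keep.sum() > features.shape[1] + 1:
        kept = features[keep]
        sq = (kept - kept.mean(axis=0)) ** 2       # squared distance to each center
        j_values = sq.sum(axis=0)                  # one J value per feature
        worst = int(np.argmax(j_values))
        if j_values[worst] <= j_limit:
            break
        keep[np.flatnonzero(keep)[np.argmax(sq[:, worst])]] = False
    return keep

def fit_with_feedback(features, targets, j_limit, mse_tol,
                      shrink=0.5, max_rounds=10,
                      cluster_filter=single_cluster_filter):
    """Alternate outlier removal and least squares fitting until E <= mse_tol."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    history = []
    for _ in range(max_rounds):
        keep = cluster_filter(features, j_limit)                      # step 2
        design = np.column_stack([features[keep], np.ones(keep.sum())])
        coeffs, *_ = np.linalg.lstsq(design, targets[keep], rcond=None)  # step 3
        mse = float(np.mean((design @ coeffs - targets[keep]) ** 2))  # step 4
        history.append(mse)
        if mse <= mse_tol:                                            # step 5
            break
        j_limit *= shrink          # E too large: narrow the admissible J range
    return coeffs, keep, history

# Hypothetical data: y = 2x + 1 for x = 0..9, plus one gross error at x = 30.
x = np.append(np.arange(10.0), 30.0).reshape(-1, 1)
y = np.append(2.0 * np.arange(10.0) + 1.0, 0.0)
coeffs, keep, history = fit_with_feedback(x, y, j_limit=1000.0, mse_tol=1e-6)
print(coeffs, keep, history)
```

In the first round the J limit is loose enough that the gross error survives and the fit is poor; after the limit is halved, the point is removed and the second fit recovers the underlying line.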

Aviation integrated electronic information equipment is highly integrated, and each functional unit is realized by several modules working together. The core indicators of most functional units of the complete equipment are therefore determined jointly by the indicators of several cascaded modules. The indicator parameters of each module are complex, the modules in different functional units work in quite different ways, and the results are affected by the working mode, the working environment, the test instruments, and other factors. As a result, the mapping between core functional indicators and module indicators is complex and varied. It is therefore worthwhile to analyze the relationship between core functional indicators and module indicators in depth, to study their physical characteristics, and to construct functions that calculate the core functional indicators from the module indicators. Building on existing knowledge, this gives a better understanding of how the modules are connected and, by taking the characteristics of the different modules into account, allows the relationship between core indicators and module indicators to be explored further.

The method is further explained below with reference to the drawings and data from an example scenario. The collected sensitivity test data are shown in Table 1. The target data of the fitted curve are the sensitivity (denoted y). The input test data are the frequency (denoted a), the UVPA gain data (denoted b), the UVPA noise figure data (denoted c), and the UVRT noise figure data (denoted d).

Table 1 Correspondence between the UV function sensitivity indicator and the module indicators

Clustering: the input test data, namely the UVPA gain data (denoted b), the UVPA noise figure data (denoted c), and the UVRT noise figure data (denoted d), are clustered separately. The UVPA noise figure data are shown in Figure 3. In the first pass, the circled point, which is farthest from the cluster center and has the largest value of D = ||a_j - r_ki||^2, is removed.

Least squares: Figure 4 shows the output of the least squares method and compares it with the test results.

Mean square error: the calculated mean square error is 0.2709. If this does not meet the requirement, the error is fed back to the clustering module.

Clustering: the clustering algorithm removes the second point, which is now the farthest from the cluster center and has the largest value of D = ||a_j - r_ki||^2, as shown in Figure 5. The remaining data are sent to the least squares module.

Least squares: Figure 6 shows the second output of the least squares method and compares it with the test results.

Mean square error: the calculated mean square error is 0.113. If this does not meet the requirement, the error is fed back to the clustering module.

After 7 iterations, the output of the least squares method is as shown in Figure 7 and agrees with the measured data. The mean square error curve is shown in Figure 8. It shows the calculated results moving steadily closer to the measured data as the model becomes more accurate.

描述于本发明实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元也可以设置在处理器中。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定。The units involved in the embodiments of the present invention may be implemented by software or hardware, and the units described may also be arranged in a processor. The names of these units do not, in some cases, limit the units themselves.

According to one aspect of the embodiments of the present invention, a computer program product or computer program is provided that comprises computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various optional implementations described above.

In addition to the above examples, those skilled in the art may derive other embodiments from the above disclosure or by applying knowledge or techniques of the relevant field, and the features of the embodiments may be interchanged or substituted. Changes and modifications made by those skilled in the art that do not depart from the spirit and scope of the present invention shall fall within the scope of protection of the appended claims.

Claims (7)

1. A data processing method based on a clustering least squares method, characterized by comprising the following steps:
a clustering algorithm and a least squares algorithm are fused by means of a computer program, and, in the computer, the error between the output of the least squares algorithm program module and the actual data is fed back to control the selection range of the $J(m_k)$ value of the clustering algorithm program module, so that the clustering algorithm program module eliminates abnormal data and the least squares fitting result is closer to the target value; wherein $J(m_k)$ represents the sum of the squared distances from the objects $m_k$ of class k to the cluster center; the data include avionics equipment test data, and the avionics equipment test data are input through a computer input unit to the computer running the program.
2. The data processing method based on the clustering least squares method according to claim 1, wherein fusing the clustering algorithm and the least squares algorithm by means of a computer program, and feeding back, in the computer, the error between the output of the least squares algorithm program module and the actual data to control the selection range of the $J(m_k)$ value of the clustering algorithm program module, so that the clustering algorithm program module eliminates abnormal data and the least squares fitting result is closer to the target value, specifically comprises the following sub-steps:
Step 1, assuming that the input data of the n samples are $(x_1, p_1, \ldots, q_1), (x_2, p_2, \ldots, q_2), \ldots, (x_n, p_n, \ldots, q_n)$ and the test output data are $y_1, y_2, \ldots, y_n$, where $n = [1, 2, \ldots]$ denotes the sampling points and x, p, …, q denote m different input characteristic parameters;
Step 2, inputting the data (x, p, …, q) through a computer input unit into the clustering algorithm program module for processing;
Step 3, passing the output data of the clustering algorithm program module to the least squares algorithm program module, which obtains the corresponding fitting function $z = f(x, p, q)$ according to the algorithm principle and outputs the results $z_1, z_2, \ldots, z_n$;
Step 4, inputting $z_1, z_2, \ldots, z_n$ and the test data $y_1, y_2, \ldots, y_n$ into a mean square error calculation program module to calculate the mean square error E:
Step 5, when the value of E meets the error requirement, directly outputting the least squares result $z_1, z_2, \ldots, z_n$ through a computer output unit; when the error requirement is not met, feeding E back to the clustering algorithm program module and revising the required range of $J(m_k)$: if the value of E is relatively large, narrowing the accepted range of $J(m_k)$ so that more points are rejected; if the value of E is relatively small, widening the accepted range of $J(m_k)$ so that more useful data enter the least squares program module for fitting.
3. The data processing method based on the clustering least squares method according to claim 2, wherein inputting the data (x, p, …, q) through a computer input unit into the clustering algorithm program module for processing specifically comprises the following sub-steps:
Step a), clustering each parameter set of (x, p, …, q) separately, namely selecting $k_1, k_2, \ldots, k_m$ data points respectively, denoted $[C_1, C_2, \ldots, C_{k_i}]$, where $i = 1, \ldots, m$ and m is the number of different input characteristic parameters x, p, …, q, and taking these points as the centers of the $k_i$ classes;
Step b), for each parameter set of (x, p, …, q), calculating the Euclidean distance from each remaining element to the centers of the $k_i$ classes and assigning each data point to the nearest class, the Euclidean distance being calculated as:
$d_{ji} = |a_j - C_{k_i}|, \quad a \in [x, p, \ldots, q]$;
where $i = 1, \ldots, m$ indexes the different input characteristic parameters and $j = 1, \ldots, n$ indexes the sampling points;
Step c), for each parameter set of (x, p, …, q), according to the clustering result, first calculating the mean of all objects in each class as the new cluster center $r_{k_i}$, and then calculating the sum of the squared distances of all objects from the cluster center of the class to which they belong, namely the J value:
Step d), calculating the function used to judge the clustering effect:
wherein:
$d_{kj} = 1, \quad a_j \in r_{k_i}$;
$d_{kj} = 0, \quad a_j \notin r_{k_i}$;
$k_x = [k_1, k_2, \ldots, k_m]$;
Step f), judging whether $J(m_k)$ meets the user requirement, and if so, outputting the data from the module; if not, sorting the data points by $D = \|a_j - r_{k_i}\|^2$, eliminating the data point with the largest value of D, and returning to step b), until $J(m_k)$ meets the requirement and the data are output from the clustering algorithm module.
4. The data processing method based on the clustering least squares method according to claim 2, wherein in step 1, for an application scenario requiring data fitting, the acquired data of the corresponding scenario are input; the corresponding scenarios include machine learning, digital twin systems, artificial intelligence, and machine vision.
5. The data processing method based on the clustering least squares method according to claim 3, wherein in step a) the $k_i$ data points $[C_1, C_2, \ldots, C_{k_i}]$ are not required to be taken from the sample data points.
6. The data processing method based on the clustering least squares method according to claim 2, wherein in step f), the user requirement may be set as a required range.
7. A data processing apparatus based on a clustering least squares method, comprising a processor and a memory, wherein a program is stored in the memory, and the data processing method based on the clustering least squares method according to any one of claims 1 to 6 is executed when the program is loaded by the processor.
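The formulas for steps c) and d) of claim 3 are not reproduced in the text above. Based on the verbal definitions given there (the sum of squared distances of all objects from the center of their own class, and the indicator $d_{kj}$), they would conventionally take the following form. This is a reconstruction from the standard within-class sum of squares, offered as an assumption rather than as the wording of the claim.

```latex
% Step c): sum of squared distances within the class whose center is r_{k_i}
J = \sum_{a_j \in \text{class } k} \lVert a_j - r_{k_i} \rVert^2

% Step d): the same quantity accumulated over classes with the indicator d_{kj}
J(m_k) = \sum_{k} \sum_{j=1}^{n} d_{kj} \, \lVert a_j - r_{k_i} \rVert^2,
\qquad
d_{kj} =
\begin{cases}
1, & a_j \text{ belongs to the class with center } r_{k_i} \\
0, & \text{otherwise}
\end{cases}
```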
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410646317.6A CN118626758A (en) 2024-05-23 2024-05-23 A data processing method and device based on least squares method of clustering

Publications (1)

Publication Number Publication Date
CN118626758A true CN118626758A (en) 2024-09-10

Family

ID=92594831

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination