
CN118094417A - CTD data quality monitoring method and device, electronic equipment and medium - Google Patents

Publication number
CN118094417A
Authority
CN
China
Prior art keywords
ctd
data
target
ctd data
parameter
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311720647.7A
Other languages
Chinese (zh)
Inventor
杨帅
应宗权
李嘉民
朱海威
丁平祥
刘梅梅
赵娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCCC Fourth Harbor Engineering Institute Co Ltd
Guangzhou Harbor Engineering Quality Inspection Co Ltd
Southern Marine Science and Engineering Guangdong Laboratory Guangzhou
Original Assignee
CCCC Fourth Harbor Engineering Institute Co Ltd
Guangzhou Harbor Engineering Quality Inspection Co Ltd
Southern Marine Science and Engineering Guangdong Laboratory Guangzhou
Application filed by CCCC Fourth Harbor Engineering Institute Co Ltd, Guangzhou Harbor Engineering Quality Inspection Co Ltd, Southern Marine Science and Engineering Guangdong Laboratory Guangzhou
Priority to CN202311720647.7A
Publication of CN118094417A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01D MEASURING NOT SPECIALLY ADAPTED FOR A SPECIFIC VARIABLE; ARRANGEMENTS FOR MEASURING TWO OR MORE VARIABLES NOT COVERED IN A SINGLE OTHER SUBCLASS; TARIFF METERING APPARATUS; MEASURING OR TESTING NOT OTHERWISE PROVIDED FOR
    • G01D21/00 Measuring or testing not otherwise provided for
    • G01D21/02 Measuring two or more variables by means not covered by a single other subclass
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Testing Or Calibration Of Command Recording Devices (AREA)

Abstract

An embodiment of the invention provides a CTD data quality monitoring method and device, electronic equipment, and a medium. The method comprises: collecting CTD data of a water area, the CTD data comprising temperature-salinity-depth data, biological environment information, and other auxiliary parameter data; analyzing the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and selecting target CTD data according to the correlation to construct a target CTD data set; performing abnormality judgment on the target CTD data with an isolation forest algorithm; and determining from the outlying target CTD data whether the current batch of CTD data is abnormal, and sending abnormality alarm information when it is. The method detects and identifies anomalies in massive, diversified CTD data. This way of processing the data can identify and handle anomalies in CTD data efficiently and accurately, improves the reliability, accuracy, and analysis efficiency of CTD data quality monitoring, and allows abnormal conditions in the marine environment to be discovered promptly and reported to researchers.

Description

A CTD data quality monitoring method, device, electronic device and medium

Technical Field

The present invention relates to the technical field of CTD data quality control, and in particular to a CTD data quality monitoring method, a CTD data quality monitoring device, an electronic device, and a computer-readable medium.

Background Art

Marine environmental monitoring data not only occupy an important position in marine scientific research, but are also an indispensable data source for marine observation and forecasting, safety assurance, and the construction and maintenance of marine infrastructure.

The marine environment is complex and changeable, and many factors affect CTD monitoring. Besides the main monitoring data such as temperature, salinity and depth, modern CTD instruments also record a variety of biological environment information, auxiliary parameters, and so on. These data inevitably contain a large amount of "noise" such as errors, mistakes, and missing values. Moreover, because CTD monitoring often spans several months or years, the data volume is huge and the accuracy requirements are high, so a comprehensive and accurate understanding of data quality is needed. Data quality detection has therefore become a crucial part of CTD data quality control.

To deal with this, researchers in the marine field currently tend to use multiple range checks: each feature in a group of data is screened in turn against its own range, feature codes are marked during the process, and a judgment of overall data quality is output after the feature codes are consolidated.

The ranges used in this kind of quality control are often derived from globally applicable data sets and cannot reflect the characteristics of the data with high precision and accuracy. Range checks are effective only for a single data feature, so it is difficult to verify the accuracy and validity of diversified information within a time period.

Summary of the Invention

In view of the above problems, embodiments of the present invention are proposed to provide a CTD data quality monitoring method, and a corresponding CTD data quality monitoring device, electronic device, and computer-readable medium, that overcome the above problems or at least partially solve them.

An embodiment of the present invention discloses a CTD data quality monitoring method, the method comprising:

collecting CTD data of a water area, the CTD data comprising temperature-salinity-depth data, biological environment information, and other auxiliary parameter data;

separately analyzing the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and selecting target CTD data according to the correlation to construct a target CTD data set;

performing abnormality judgment on the target CTD data using an isolation forest algorithm;

determining, from the outlying target CTD data, whether the current batch of CTD data is abnormal, and issuing abnormality alarm information when the current batch of CTD data is determined to be abnormal.

Optionally, the temperature-salinity-depth data comprise conductivity, temperature, and pressure, and the step of separately analyzing the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and selecting target CTD data according to the correlation to construct a target CTD data set, comprises:

de-dimensionalizing and normalizing the CTD data to obtain preprocessed CTD data;

taking conductivity, temperature, and pressure as target reference items, calculating the grey relational coefficient between each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item, and arranging all CTD parameter features in descending order of grey relational coefficient to obtain three ordered sequences of CTD parameter features;

selecting a preset number of top-ranked CTD parameter features from each ordered sequence of CTD parameter features;

forming the target CTD data set from conductivity, temperature, pressure, and the selected top-ranked CTD parameter features as the target CTD data.
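
The normalization and ranking steps above can be sketched as follows. This is a minimal illustration, not the applicant's code: it assumes min-max scaling for the normalization and the classical Deng form of the grey relational coefficient with a distinguishing coefficient rho = 0.5, neither of which is fixed by the text. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def grey_relational_grades(
    reference: Sequence[float],
    features: Dict[str, Sequence[float]],
    rho: float = 0.5,  # distinguishing coefficient (assumed value)
) -> Dict[str, float]:
    """Mean grey relational coefficient of each feature against one reference."""
    ref = min_max_normalize(reference)
    # Absolute differences between the normalized reference and each feature.
    deltas = {
        name: [abs(r - f) for r, f in zip(ref, min_max_normalize(series))]
        for name, series in features.items()
    }
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:  # every feature tracks the reference exactly
        return {name: 1.0 for name in deltas}
    return {
        name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in ds) / len(ds)
        for name, ds in deltas.items()
    }


def rank_features(grades: Dict[str, float]) -> List[str]:
    """Feature names in descending order of grey relational grade."""
    return sorted(grades, key=grades.get, reverse=True)
```

Calling `grey_relational_grades` once per reference item (conductivity, temperature, pressure) and passing each result to `rank_features` yields the three ordered sequences described above.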

Optionally, the step of performing abnormality judgment on the target CTD data using an isolation forest algorithm comprises:

calculating, with the isolation forest algorithm, the abnormality score of each item of target CTD data in the target CTD data set;

determining the outlying target CTD data from the abnormality score of each item of target CTD data together with either a preset abnormality score threshold or the average abnormality score as the threshold.
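
A minimal sketch of this thresholding step follows. It assumes that a higher score means "more abnormal" and that a point is outlying when its score is strictly above the cutoff; the text does not state the comparison direction, and `find_outliers` is a hypothetical name.

```python
from statistics import mean
from typing import List, Optional, Sequence


def find_outliers(scores: Sequence[float], threshold: Optional[float] = None) -> List[int]:
    """Indices of outlying points.

    With threshold=None the mean of all scores is used as the cutoff,
    otherwise the preset threshold is used.
    """
    cutoff = mean(scores) if threshold is None else threshold
    return [i for i, s in enumerate(scores) if s > cutoff]
```

For example, `find_outliers(scores, 0.6)` applies a preset threshold, while `find_outliers(scores)` falls back to the average score.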

Optionally, the step of calculating, with the isolation forest algorithm, the abnormality score of each item of target CTD data in the target CTD data set comprises:

S1: assuming that the total number of elements in the target CTD data set is n (n ≥ 5000), define the number of isolation trees t (1 ≤ t ≤ n), the average depth h, the maximum depth H, and the capacity k of the root-node sub-sample set;

S2: sample from the target CTD data set multiple times with replacement, drawing a specified number of elements purely at random each time, to build m sub-sample sets (a1, a2, ..., ai, ..., am);

S3: select one sub-sample set ai as the root node of a tree, and randomly select one CTD parameter feature P;

S4: for a single value q of the CTD parameter feature, split the tree in two: traverse all elements of the sub-sample set ai, and if an element's recorded value r of CTD parameter feature P is less than or equal to q, place the element in the left child node of the tree, otherwise place it in the right child node;

S5: apply step S4 recursively to the left and right child nodes to build a binary tree; the stopping condition is that every element of ai has been isolated or that the height of the tree has reached the preset height h, and once splitting stops the binary tree is regarded as an isolation tree;

S6: repeat S3 to S5 until every sub-sample set has been processed, at which point there are m isolation trees, forming an isolation forest of capacity m;

S7: compute the average path length of each element in the isolation forest and calculate the abnormality score.

The average path length of a tree is calculated as:

$$c(\Psi) = 2H(\Psi - 1) - \frac{2(\Psi - 1)}{\Psi}$$

where $c(\Psi)$ is the average path length for a given number of samples $\Psi$; $H(x) = \ln(x) + \gamma$, where $x$ here stands for $\Psi - 1$ and $\gamma$ is the Euler constant, whose value is 0.5772156649.

The abnormality score is calculated as:

$$s(x, \Psi) = 2^{-\frac{E(h(x))}{c(\Psi)}}$$

where $h(x)$ is the depth of the node at which sample point $x$ is found in an isolation tree, and $E(h(x))$ is the expected value of $h(x)$ over all isolation trees.
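
Steps S1 to S7 and the two formulas can be sketched in plain Python as below. This is an illustrative reading, not the applicant's implementation: it assumes the split value q is drawn uniformly between the minimum and maximum of feature P in the current node, that each sub-sample is drawn without repeats while the full data set stays intact between draws, and that the depth limit is ceil(log2(k)). All names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence

EULER_GAMMA = 0.5772156649  # Euler constant as quoted in the text

Row = Sequence[float]


def c(psi: int) -> float:
    """Average path length c(psi); 0 for psi <= 1 and 1 for psi == 2."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    return 2.0 * (math.log(psi - 1) + EULER_GAMMA) - 2.0 * (psi - 1) / psi


class Node:
    """Internal node (feature, split, children) or leaf (size only)."""

    def __init__(self, size: int, feature: Optional[int] = None,
                 split: float = 0.0, left: "Optional[Node]" = None,
                 right: "Optional[Node]" = None) -> None:
        self.size, self.feature, self.split = size, feature, split
        self.left, self.right = left, right


def build_tree(rows: List[Row], depth: int, max_depth: int,
               rng: random.Random) -> Node:
    """S3 to S5: pick a random feature P and value q, split, recurse."""
    if depth >= max_depth or len(rows) <= 1:
        return Node(len(rows))
    p = rng.randrange(len(rows[0]))
    lo, hi = min(r[p] for r in rows), max(r[p] for r in rows)
    if lo == hi:  # feature cannot separate these rows
        return Node(len(rows))
    q = rng.uniform(lo, hi)
    left = [r for r in rows if r[p] <= q]
    right = [r for r in rows if r[p] > q]
    if not right:  # q landed on the maximum; nothing was separated
        return Node(len(rows))
    return Node(len(rows), p, q,
                build_tree(left, depth + 1, max_depth, rng),
                build_tree(right, depth + 1, max_depth, rng))


def path_length(x: Row, node: Node, depth: int = 0) -> float:
    """h(x): depth of the leaf reached, plus c(size) for unsplit leaves."""
    if node.feature is None:
        return depth + c(node.size)
    child = node.left if x[node.feature] <= node.split else node.right
    return path_length(x, child, depth + 1)


def build_forest(data: List[Row], n_trees: int = 100,
                 sample_size: int = 256, seed: int = 0) -> List[Node]:
    """S2 and S6: one sub-sample and one isolation tree per iteration."""
    rng = random.Random(seed)
    k = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(max(k, 2)))
    return [build_tree(rng.sample(data, k), 0, max_depth, rng)
            for _ in range(n_trees)]


def anomaly_score(x: Row, forest: List[Node], sample_size: int) -> float:
    """S7: s(x, psi) = 2 ** (-E(h(x)) / c(psi))."""
    mean_path = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_path / c(sample_size))
```

The `sample_size` passed to `anomaly_score` should be the sub-sample size actually used when the forest was built, since it normalizes the path length.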

Optionally, the step of determining, from the outlying target CTD data, whether the current batch of CTD data is abnormal, and issuing abnormality alarm information when the current batch of CTD data is determined to be abnormal, comprises:

judging whether the number of outlying target CTD data items exceeds a preset outlier count threshold;

if the number of outlying target CTD data items exceeds the preset outlier count threshold, determining that the current batch of CTD data is abnormal and issuing abnormality alarm information.
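
The batch-level decision reduces to a count comparison, sketched below. The `notify` callback stands in for whatever alarm channel is used (the text does not specify one), and `check_batch` is a hypothetical name.

```python
from typing import Callable, Sequence


def check_batch(
    outlier_indices: Sequence[int],
    outlier_limit: int,
    notify: Callable[[str], None] = print,  # placeholder alarm channel
) -> bool:
    """Return True and raise an alarm when the outlier count exceeds the limit."""
    abnormal = len(outlier_indices) > outlier_limit
    if abnormal:
        notify(
            f"CTD batch abnormal: {len(outlier_indices)} outliers "
            f"exceed the limit of {outlier_limit}"
        )
    return abnormal
```

Passing a list's `append` method as `notify` is a simple way to capture alarms in tests instead of printing them.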

Optionally, the method further comprises:

generating visual charts from the CTD data of the water area to show the trends and changes of the CTD data intuitively, the visual charts comprising trend charts, bar charts, and pie charts;

performing statistical analysis on the CTD data of the water area to obtain indicator data that reflect the characteristics and regularities of the CTD data, the indicator data comprising the mean, variance, and standard deviation.
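
The three indicators named above can be computed with Python's standard `statistics` module, as in this small sketch. Population variance and standard deviation are used here as an assumption; the text does not say whether the population or sample form is intended.

```python
from statistics import fmean, pstdev, pvariance
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Mean, population variance and population standard deviation of one series."""
    return {
        "mean": fmean(values),
        "variance": pvariance(values),
        "std": pstdev(values),
    }
```

Applied per CTD parameter, for example `summarize(temperature_series)`, this gives one row of indicator data for each monitored feature.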

An embodiment of the present invention further discloses a CTD data quality monitoring device, the device comprising:

an acquisition module, configured to collect CTD data of a water area, the CTD data comprising temperature-salinity-depth data, biological environment information, and other auxiliary parameter data;

a correlation analysis and selection module, configured to separately analyze the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and to select target CTD data according to the correlation to construct a target CTD data set;

an abnormality judgment module, configured to perform abnormality judgment on the target CTD data using an isolation forest algorithm;

an abnormality alarm module, configured to determine, from the outlying target CTD data, whether the current batch of CTD data is abnormal, and to issue abnormality alarm information when the current batch of CTD data is determined to be abnormal.
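
One way to picture how the four modules fit together is a small pipeline object that holds one callable per module. This is only an architectural sketch under assumed interfaces: the `Table` type, the field names, and the callable signatures are hypothetical and are not prescribed by the device description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Table = Dict[str, List[float]]  # feature name -> series (assumed layout)


@dataclass
class CTDQualityMonitor:
    acquire: Callable[[], Table]        # acquisition module
    select: Callable[[Table], Table]    # correlation analysis and selection module
    judge: Callable[[Table], List[int]] # abnormality judgment module
    alarm: Callable[[List[int]], bool]  # abnormality alarm module

    def run(self) -> bool:
        """Run one batch through the four modules; True means an alarm was raised."""
        raw = self.acquire()
        target = self.select(raw)
        outliers = self.judge(target)
        return self.alarm(outliers)
```

Keeping each module behind a plain callable makes it easy to swap in a different selection rule or alarm channel without touching the rest of the pipeline.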

Optionally, the temperature-salinity-depth data comprise conductivity, temperature, and pressure, and the correlation analysis and selection module comprises:

a preprocessing submodule, configured to de-dimensionalize and normalize the CTD data to obtain preprocessed CTD data;

a correlation analysis and ranking submodule, configured to take conductivity, temperature, and pressure as target reference items, calculate the grey relational coefficient between each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item, and arrange all CTD parameter features in descending order of grey relational coefficient to obtain three ordered sequences of CTD parameter features;

a selection submodule, configured to select a preset number of top-ranked CTD parameter features from each ordered sequence of CTD parameter features;

a target CTD data set construction submodule, configured to form the target CTD data set from conductivity, temperature, pressure, and the selected top-ranked CTD parameter features as the target CTD data.

Optionally, the abnormality judgment module comprises:

an abnormality score calculation submodule, configured to calculate, with the isolation forest algorithm, the abnormality score of each item of target CTD data in the target CTD data set;

an outlying target CTD data determination submodule, configured to determine the outlying target CTD data from the abnormality score of each item of target CTD data together with either a preset abnormality score threshold or the average abnormality score as the threshold.

Optionally, the abnormality score calculation submodule comprises:

a definition unit, configured to assume that the total number of elements in the target CTD data set is n (n ≥ 5000) and to define the number of isolation trees t (1 ≤ t ≤ n), the average depth h, the maximum depth H, and the capacity k of the root-node sub-sample set;

a sub-sample set construction unit, configured to sample from the target CTD data set multiple times with replacement, drawing a specified number of elements purely at random each time, to build m sub-sample sets (a1, a2, ..., ai, ..., am);

a root node construction unit, configured to select one sub-sample set ai as the root node of a tree and to randomly select one CTD parameter feature P;

a binary tree construction unit, configured to split the tree in two for a single value q of the CTD parameter feature: traverse all elements of the sub-sample set ai, and if an element's recorded value r of CTD parameter feature P is less than or equal to q, place the element in the left child node of the tree, otherwise place it in the right child node;

an isolation tree construction unit, configured to use the binary tree construction unit recursively on the left and right child nodes to build a binary tree; the stopping condition is that every element of ai has been isolated or that the height of the tree has reached the preset height h, and once splitting stops the binary tree is regarded as an isolation tree;

a loop execution unit, configured to run the root node construction unit, the binary tree construction unit, and the isolation tree construction unit repeatedly until every sub-sample set has been processed, at which point there are m isolation trees, forming an isolation forest of capacity m;

an abnormality score calculation unit, configured to compute the average path length of each element in the isolation forest and calculate the abnormality score.

The average path length of a tree is calculated as:

$$c(\Psi) = 2H(\Psi - 1) - \frac{2(\Psi - 1)}{\Psi}$$

where $c(\Psi)$ is the average path length for a given number of samples $\Psi$; $H(x) = \ln(x) + \gamma$, where $x$ here stands for $\Psi - 1$ and $\gamma$ is the Euler constant, whose value is 0.5772156649.

The abnormality score is calculated as:

$$s(x, \Psi) = 2^{-\frac{E(h(x))}{c(\Psi)}}$$

where $h(x)$ is the depth of the node at which sample point $x$ is found in an isolation tree, and $E(h(x))$ is the expected value of $h(x)$ over all isolation trees.
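
As a quick numeric check of the average path length and abnormality score formulas, take a sub-sample size of $\Psi = 256$ and an expected path length $E(h(x)) = 5$. Both values are illustrative assumptions, not figures from this embodiment:

$$c(256) = 2(\ln 255 + 0.5772156649) - \frac{2 \cdot 255}{256} \approx 12.2370 - 1.9922 \approx 10.2448$$

$$s(x, 256) = 2^{-5 / 10.2448} \approx 0.713$$

A point whose expected path length equals $c(\Psi)$ scores exactly $2^{-1} = 0.5$, so scores well above 0.5 indicate points that are isolated unusually quickly.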

Optionally, the abnormality alarm module comprises:

a count threshold judgment submodule, configured to judge whether the number of outlying target CTD data items exceeds a preset outlier count threshold;

an abnormality alarm submodule, configured to determine that the current batch of CTD data is abnormal and issue abnormality alarm information if the number of outlying target CTD data items exceeds the preset outlier count threshold.

Optionally, the device further comprises:

a visual chart generation module, configured to generate visual charts from the CTD data of the water area to show the trends and changes of the CTD data intuitively, the visual charts comprising trend charts, bar charts, and pie charts;

a statistical analysis module, configured to perform statistical analysis on the CTD data of the water area to obtain indicator data that reflect the characteristics and regularities of the CTD data, the indicator data comprising the mean, variance, and standard deviation.

An embodiment of the present invention further discloses an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the CTD data quality monitoring method described in the embodiments of the present invention when executing the program stored in the memory.

An embodiment of the present invention further discloses one or more computer-readable media storing instructions which, when executed by one or more processors, cause the processors to perform the CTD data quality monitoring method described in the embodiments of the present invention.

Embodiments of the present invention have the following advantages:

The CTD data quality monitoring method of the embodiments collects CTD data of a water area, the CTD data comprising temperature-salinity-depth data, biological environment information, and other auxiliary parameter data; separately analyzes the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and selects target CTD data according to the correlation to construct a target CTD data set; performs abnormality judgment on the target CTD data using an isolation forest algorithm; and determines from the outlying target CTD data whether the current batch of CTD data is abnormal, issuing abnormality alarm information when it is. The method detects and identifies anomalies in massive, diversified CTD data. This way of processing the data can identify and handle anomalies in CTD data efficiently and accurately, improves the reliability, accuracy, and analysis efficiency of CTD data quality monitoring, and allows abnormal conditions in the marine environment to be discovered promptly and reported to researchers.

Brief Description of the Drawings

FIG. 1 is a flow chart of the steps of a CTD data quality monitoring method provided in an embodiment of the present invention;

FIG. 2 is a schematic diagram of a target CTD data set A provided in an embodiment of the present invention;

FIG. 3 is a structural block diagram of a CTD data quality monitoring system provided in an embodiment of the present invention;

FIG. 4 is a structural block diagram of a CTD data quality monitoring device provided in an embodiment of the present invention;

FIG. 5 is a block diagram of an electronic device provided in an embodiment of the present invention;

FIG. 6 is a schematic diagram of a computer-readable medium provided in an embodiment of the present invention.

Detailed Description

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Referring to FIG. 1, a flow chart of the steps of a CTD data quality monitoring method provided in an embodiment of the present invention is shown. The method may specifically include the following steps:

Step 101: collect CTD data of a water area, the CTD data comprising temperature-salinity-depth data, biological environment information, and other auxiliary parameter data;

The CTD data quality monitoring method of the present invention monitors the quality of massive, diversified CTD data, detects and identifies abnormal data points, and promptly discovers abnormal conditions in the marine environment. In an embodiment of the present invention, a CTD sensor may be used to collect the CTD data of the water area. A CTD sensor is an instrument that measures water-area information in real time and can measure CTD data in seawater at different depths. The CTD data of the water area may include temperature-salinity-depth data, biological environment information, and other auxiliary parameter data. The temperature-salinity-depth data may be the basic measurements of temperature, conductivity, and pressure. The biological environment information may include dissolved oxygen, chlorophyll, oxidation-reduction potential (ORP), photosynthetically active radiation (PAR), acidity (pH), turbidity, transmittance, voltage range, and other useful information. The other auxiliary parameter data may include pump performance parameters, monitoring time, and the like.

Step 102: separately analyze the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and select target CTD data according to the correlation to construct a target CTD data set;

Under different latitudes, seasons, and ocean current conditions, the CTD parameter features of a water area (temperature-salinity-depth data, biological environment information, and other auxiliary parameter data) carry different weights of influence. A dimensionality reduction can therefore be performed to raise the proportion of high-influence features and improve the efficiency and effectiveness of the anomaly detection computation.

In an embodiment of the present invention, the raw data may be reduced in dimension by feature correlation analysis. Specifically, the correlation between the biological environment information and the temperature-salinity-depth data, and the correlation between the other auxiliary parameter data and the temperature-salinity-depth data, are analyzed separately, and target CTD data are then selected according to these two correlations to construct the target CTD data set. This reduces the number of CTD parameter feature dimensions that have to be judged, improves computational efficiency, and further improves the accuracy with which abnormal values in the CTD data are identified.

In one embodiment of the present invention, the temperature-salinity-depth data comprise conductivity, temperature, and pressure, and the step of separately analyzing the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and selecting target CTD data according to the correlation to construct a target CTD data set, comprises:

de-dimensionalizing and normalizing the CTD data to obtain preprocessed CTD data;

taking conductivity, temperature, and pressure as target reference items, calculating the grey relational coefficient between each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item, and arranging all CTD parameter features in descending order of grey relational coefficient to obtain three ordered sequences of CTD parameter features;

selecting a preset number of top-ranked CTD parameter features from each ordered sequence of CTD parameter features;

forming the target CTD data set from conductivity, temperature, pressure, and the selected top-ranked CTD parameter features as the target CTD data.

在采集CTD数据后,可以先对CTD数据进行预处理,比如去量纲和归一化,使预处理后的CTD数据变成更加适合孤立森林算法处理的数据形式,从而促进孤立森林算法对CTD数据异常值判断的准确性。接着可以以电导率、温度、压力为目标参考项,计算经过预处理的生物环境信息和其他辅助参数数据中的每一个CTD参数特征分别与每一个目标参考项的灰色关联度系数,并按照灰色关联度系数从高到低的顺序排列所有CTD参数特征,得到三组CTD参数特征有序序列,再从每一组CTD参数特征有序序列中选取排序靠前的预设数量的CTD参数特征,采用电导率、温度、压力和排序靠前的预设数量的CTD参数特征作为目标CTD数据构成目标CTD数据集合。参照图2,示出了本发明实施例中提供的目标CTD数据集合A的示意图。After collecting CTD data, the CTD data can be preprocessed, such as de-dimensionalization and normalization, so that the preprocessed CTD data becomes a data form more suitable for processing by the isolation forest algorithm, thereby promoting the accuracy of the isolation forest algorithm in judging the abnormal value of the CTD data. Then, conductivity, temperature, and pressure can be used as target reference items, and the gray correlation coefficient of each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item is calculated, and all CTD parameter features are arranged in order from high to low according to the gray correlation coefficient, and three groups of CTD parameter feature ordered sequences are obtained, and then a preset number of CTD parameter features with a high ranking are selected from each group of CTD parameter feature ordered sequences, and conductivity, temperature, pressure, and a preset number of CTD parameter features with a high ranking are used as target CTD data to form a target CTD data set. Referring to Figure 2, a schematic diagram of a target CTD data set A provided in an embodiment of the present invention is shown.

For example, conductivity, temperature, and pressure (the important target parameters) are each taken as a target reference item. The grey relational coefficient between every remaining feature and each reference item is calculated, and sorting yields three ordered lists whose elements are features. The 6 features with the highest coefficients are taken from each of the three lists, the 18 extracted features are deduplicated, and the original data array A corresponding to the deduplicated features is obtained. The original data set is then tidied: null or zero-like values are converted to 0, the temperature-salinity-depth data features are converted to a standard floating-point format, and the temperature-salinity-depth feature vectors within a given time range are extracted in order into the same element of array A. Array A is regarded as the total sample set for this round of data quality monitoring, with a total of n elements (n ≥ 5000). The present invention does not limit the number of features selected from each ordered list: after the features are ranked by influence, the top 40% can be selected as the data sequence for subsequent processing, or the 3 features with the highest coefficients can be selected, and so on.

The grey relational coefficient is calculated as the mean absolute difference between the grey value of a specific column (col) of each CTD parameter feature and the grey value of the target reference item (ref_gray). The specific code is:

feature_gray = np.abs(normalized_data[col] - ref_gray).mean()
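The selection procedure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses plain Python lists instead of the NumPy arrays in the one-liner above, min-max scaling is assumed as the normalization, and the column names and the helper names (`min_max_normalize`, `mean_abs_difference`, `rank_features`, `select_target_features`) are hypothetical.

```python
from statistics import mean

# Reference items named in the source; the column names themselves are assumed.
REFERENCES = ("conductivity", "temperature", "pressure")


def min_max_normalize(values):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_abs_difference(feature, reference):
    """Plain-Python form of np.abs(normalized_data[col] - ref_gray).mean()."""
    return mean(abs(f - r) for f, r in zip(feature, reference))


def rank_features(data, reference_name, top_k):
    """Rank the non-reference columns against one reference item.

    The ordering follows the source text (coefficient from high to low).
    """
    normalized = {name: min_max_normalize(col) for name, col in data.items()}
    ref_gray = normalized[reference_name]
    scores = {
        name: mean_abs_difference(col, ref_gray)
        for name, col in normalized.items()
        if name not in REFERENCES
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


def select_target_features(data, top_k):
    """Merge the three top-k lists, deduplicate, and prepend the references."""
    selected = list(REFERENCES)
    for reference_name in REFERENCES:
        for name in rank_features(data, reference_name, top_k):
            if name not in selected:
                selected.append(name)
    return selected
```

Calling `select_target_features(data, top_k=6)` on a dictionary of equal-length columns would mirror the "top 6 from each list, then deduplicate" example. Because the measure is a mean absolute difference, a smaller value means the two normalized series are closer; the sketch keeps the source's high-to-low ordering, and `reverse=True` is the only setting to change if the opposite ordering is intended.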

Step 103: use the isolation forest algorithm to perform abnormality judgment on the target CTD data;

After the target CTD data set is obtained, the isolation forest algorithm can be used to perform abnormality judgment on each target CTD data item in the set in order to identify the outlier target CTD data. The algorithm is highly adaptive: it places no requirements on the dimensionality or linearity of the monitoring data and performs well on strongly nonlinear, high-dimensional CTD data.

In one embodiment of the present invention, the step of using the isolation forest algorithm to perform abnormality judgment on the target CTD data includes:

Using the isolation forest algorithm to calculate the anomaly score of each target CTD data item in the target CTD data set;

Determining the outlier target CTD data according to the anomaly score of each target CTD data item and either a preset anomaly score threshold or an average anomaly score threshold.

To judge with the isolation forest algorithm whether target CTD data is an outlier, the anomaly score of each target CTD data item in the target CTD data set is calculated first, and then the anomaly scores, together with a preset anomaly score threshold or an average anomaly score threshold, determine which target CTD data items are outliers. The preset anomaly score threshold can be chosen and set for the application scenario at hand, such as marine environmental protection, marine meteorological warning, or marine resource development.
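The threshold rule above can be sketched as follows. The sketch assumes that the "average anomaly score threshold" means the mean of the scores in the current batch, which the source does not spell out; the function name `find_outliers` is hypothetical.

```python
from statistics import mean


def find_outliers(scores, threshold=None):
    """Return the indices of items whose anomaly score exceeds the threshold.

    If no preset threshold is given, the mean score of the batch is used
    (an assumed reading of the "average anomaly score threshold").
    """
    if threshold is None:
        threshold = mean(scores)
    return [i for i, s in enumerate(scores) if s > threshold]
```

For instance, `find_outliers(scores, threshold=0.6)` applies a preset threshold, while `find_outliers(scores)` falls back to the batch mean.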

In one embodiment of the present invention, the step of using the isolation forest algorithm to calculate the anomaly score of each target CTD data item in the target CTD data set includes:

S1: assuming the total number of elements in the target CTD data set is n (n ≥ 5000), define the number of isolation trees t (1 ≤ t ≤ n), the average depth h, the maximum depth H, and the capacity k of the root-node subsample set;

S2: sample from the target CTD data set multiple times with replacement, drawing a specified number of elements purely at random to construct m subsample sets (a1, a2, ..., ai, ..., am);

S3: select a subsample set ai as the root node of a tree and randomly select a CTD parameter feature P;

S4: for a single value q of the CTD parameter feature, perform a binary split of the tree by traversing all elements of the subsample set ai; if an element's recorded value r of feature P is less than or equal to q, place the element in the left child node, otherwise place it in the right child node;

S5: apply step S4 recursively to the left and right child nodes to build a binary tree; the stopping condition is that every element in array ai has been isolated or the height of the tree has reached the preset height h, and when splitting stops the binary tree is regarded as an isolation tree;

S6: repeat S3-S5 until all subsample sets have been processed, at which point there are m isolation trees in total, forming an isolation forest of capacity m;

S7: compute the average path length of any element in the isolation forest and calculate the anomaly score;
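Steps S2 to S6 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patent's code: each element is assumed to be a tuple of numeric feature values, the split value q is drawn uniformly between the minimum and maximum of feature P in the node, and the feature is re-drawn at every node as in the usual isolation forest (the source can also be read as drawing P once per tree). All function names are hypothetical.

```python
import random


def build_tree(samples, depth, max_depth, rng):
    """S3-S5: split on a random feature P at a random value q until isolated."""
    n_features = len(samples[0]) if samples else 0
    # Only features that still vary inside this node can separate its elements.
    varying = [
        j for j in range(n_features)
        if min(s[j] for s in samples) < max(s[j] for s in samples)
    ]
    if len(samples) <= 1 or depth >= max_depth or not varying:
        return {"size": len(samples)}  # leaf: isolated or height limit reached
    p = rng.choice(varying)
    lo = min(s[p] for s in samples)
    hi = max(s[p] for s in samples)
    q = rng.uniform(lo, hi)
    left = [s for s in samples if s[p] <= q]   # r <= q goes left (S4)
    right = [s for s in samples if s[p] > q]   # otherwise goes right
    if not right:  # q landed exactly on hi, so nothing was separated
        return {"size": len(samples)}
    return {
        "feature": p,
        "split": q,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def build_forest(data, n_trees, subsample_size, max_depth, seed=0):
    """S2 and S6: draw one subsample per tree, sampling with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        subsample = rng.choices(data, k=subsample_size)
        forest.append(build_tree(subsample, 0, max_depth, rng))
    return forest


def path_length(tree, x):
    """Depth of the leaf that x reaches in one tree, i.e. h(x)."""
    depth = 0
    node = tree
    while "feature" in node:
        if x[node["feature"]] <= node["split"]:
            node = node["left"]
        else:
            node = node["right"]
        depth += 1
    return depth


def mean_path_length(forest, x):
    """S7 (first half): average of h(x) over all trees in the forest."""
    return sum(path_length(tree, x) for tree in forest) / len(forest)
```

A point far from the bulk of the data tends to be separated after only a few splits, so its mean path length is shorter than that of a point inside a dense cluster; this is the quantity the anomaly score is built on.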

The average path length of a tree is calculated as:

$$c(\Psi) = 2H(\Psi - 1) - \frac{2(\Psi - 1)}{\Psi}$$

where $c(\Psi)$ is the average path length for a given number of samples $\Psi$; $H(x) = \ln(x) + \gamma$, with $x$ standing for $\Psi - 1$ here; and $\gamma$ is Euler's constant, whose value is 0.5772156649;

The anomaly score is calculated as:

$$s(x, \Psi) = 2^{-\frac{E(h(x))}{c(\Psi)}}$$

where $h(x)$ is the depth of the node reached by sample point $x$ in an isolation tree, and $E(h(x))$ is the expected value of $h(x)$ over all isolation trees.

In an embodiment of the present invention, the isolation forest algorithm is used to calculate the anomaly score of the target CTD data as follows. S1: define the number of isolation trees t (1 ≤ t ≤ n), the average depth h, the maximum depth H, and the capacity k of the root-node subsample set. S2: sample from the target CTD data set multiple times with replacement, drawing a specified number of elements purely at random to construct m subsample sets (a1, a2, ..., ai, ..., am). S3: select a subsample set ai as the root node of a tree and randomly select a CTD parameter feature P. S4: for a single value q of the CTD parameter feature, perform a binary split of the tree by traversing all elements of ai; if an element's recorded value r of feature P is less than or equal to q, place the element in the left child node, otherwise place it in the right child node. S5: apply step S4 recursively to the left and right child nodes to build a binary tree; the stopping condition is that every element in array ai has been isolated or the height of the tree has reached the preset height h, and when splitting stops the binary tree is regarded as an isolation tree. S6: repeat S3-S5 until all subsample sets have been processed, at which point there are m isolation trees in total, forming an isolation forest of capacity m. S7: finally, compute the average path length of any element in the isolation forest and calculate the anomaly score.

The average path length of a tree is calculated as:

$$c(\Psi) = 2H(\Psi - 1) - \frac{2(\Psi - 1)}{\Psi}$$

where $c(\Psi)$ is the average path length for a given number of samples $\Psi$; $H(x) = \ln(x) + \gamma$, with $x$ standing for $\Psi - 1$ here; and $\gamma$ is Euler's constant, whose value is 0.5772156649;

The anomaly score is calculated as:

$$s(x, \Psi) = 2^{-\frac{E(h(x))}{c(\Psi)}}$$

where $h(x)$ is the depth of the node reached by sample point $x$ in an isolation tree, and $E(h(x))$ is the expected value of $h(x)$ over all isolation trees.

The value of $s$ lies in (0, 1]; the closer it is to 1, the more likely the point is an outlier. When $E(h(x)) \to 0$, $s \to 1$; when $E(h(x)) \to \Psi - 1$, $s \to 0$; when $E(h(x)) \to c(\Psi)$, $s \to 0.5$. In other words, an anomaly score closer to 1 means the target CTD data item is more likely to be an outlier, and if the anomaly scores of most target CTD data items are close to 0.5, the current batch of CTD data contains no obvious outliers.
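The two formulas and the limiting cases can be checked numerically with a short sketch. It assumes the standard isolation forest normalization, $c(\Psi) = 2H(\Psi - 1) - 2(\Psi - 1)/\Psi$ with $H(x) = \ln(x) + \gamma$; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, value as given in the text


def harmonic(x):
    """H(x) approximated as ln(x) + gamma."""
    return math.log(x) + EULER_GAMMA


def average_path_length(psi):
    """c(psi): average path length for a subsample of psi elements."""
    if psi < 2:
        raise ValueError("psi must be at least 2")
    return 2.0 * harmonic(psi - 1) - 2.0 * (psi - 1) / psi


def anomaly_score(expected_depth, psi):
    """s = 2 ** (-E(h(x)) / c(psi)); values near 1 suggest an outlier."""
    return 2.0 ** (-expected_depth / average_path_length(psi))
```

With a subsample of 256 elements, an expected depth of 0 gives a score of 1, an expected depth equal to `average_path_length(256)` gives 0.5, and an expected depth of 255 gives a score very close to 0, matching the three limiting cases stated above.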

Step 104: determine from the outlier target CTD data whether the current batch of CTD data is abnormal, and issue an abnormality alarm when it is.

Once the outlier target CTD data has been identified, it can be further determined whether the current batch of CTD data is abnormal. A large number of consecutive outliers in the batch indicates an abnormal condition, so the outlier target CTD data items can be counted and the count used to judge the batch. If the number of outlier target CTD data items exceeds a certain number, the batch is determined to be abnormal and an abnormality alarm can be issued so that researchers can carry out follow-up processing in time; otherwise no alarm is needed.

In one embodiment of the present invention, the step of determining from the outlier target CTD data whether the current batch of CTD data is abnormal and issuing an abnormality alarm when it is includes:

Judging whether the number of outlier target CTD data items exceeds a preset outlier count threshold;

If the number of outlier target CTD data items exceeds the preset outlier count threshold, determining that the current batch of CTD data is abnormal and issuing an abnormality alarm.

In an embodiment of the present invention, an outlier count threshold can be preset, and whether the current batch of CTD data is abnormal is determined by checking whether the number of outlier target CTD data items exceeds it. When the count exceeds the threshold, the batch is abnormal and an abnormality alarm can be issued. The outlier count threshold can be chosen and set for the application scenario at hand, such as marine environmental protection, marine meteorological warning, or marine resource development.
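The batch-level decision can be sketched as follows. The function names and the wording of the alarm message are hypothetical; how the alarm is actually delivered is outside this sketch.

```python
def count_outliers(is_outlier):
    """Number of target CTD data items flagged as outliers."""
    return sum(1 for flag in is_outlier if flag)


def check_batch(is_outlier, count_threshold):
    """Return an alarm message if the batch is abnormal, otherwise None."""
    n_outliers = count_outliers(is_outlier)
    if n_outliers > count_threshold:
        return (
            f"CTD batch abnormal: {n_outliers} outliers "
            f"exceed the threshold of {count_threshold}"
        )
    return None
```

A caller would pass one boolean per target CTD data item and forward any non-`None` result to the alarm channel; the comparison is strict, so a count equal to the threshold does not raise an alarm.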

In one embodiment of the present invention, the method further includes:

Generating visual charts from the CTD data of the water area to show the trends and changes of the CTD data intuitively; the visual charts include trend charts, bar charts, and pie charts;

Performing statistical analysis on the CTD data of the water area to obtain indicator data that reflects the characteristics and regularities of the CTD data; the indicator data includes the mean, variance, and standard deviation.

In an embodiment of the present invention, after the CTD data of the water area is collected, it can also be used to generate visual charts that show the trends and changes of the CTD data intuitively. The visual charts may include trend charts, bar charts, and pie charts; the present invention places no limitation on this. In addition, the CTD data can be analyzed statistically to obtain indicator data that reflects its characteristics and regularities. The indicator data may include the mean, variance, and standard deviation; the present invention places no limitation on this either.
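The indicator data can be computed with the Python standard library, as in the sketch below. Population variance and standard deviation are assumed here; the source does not say whether population or sample statistics are intended, and the function name `summarize` is hypothetical.

```python
from statistics import mean, pstdev, pvariance


def summarize(values):
    """Mean, population variance, and population standard deviation."""
    return {
        "mean": mean(values),
        "variance": pvariance(values),
        "std": pstdev(values),
    }
```

Swapping `pvariance` and `pstdev` for `variance` and `stdev` would give the sample versions instead.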

The technical effects of the CTD data quality monitoring method of the present invention are as follows:

1. Efficient and accurate:

Time complexity is low, and the data threshold checks and data dimensionality reduction effectively reduce the server's computing load.

2. Highly adaptive:

No data features beyond the CTD data are required, and there is no specific requirement on the sensor model. The algorithm is highly adaptive: it places no requirements on the dimensionality or linearity of the monitoring data and performs well on strongly nonlinear, high-dimensional CTD data.

3. High precision:

Advanced ocean sensors and data processing technology enable high-precision collection and processing of multivariate ocean monitoring data, improving the reliability and accuracy of subsequent tasks.

4. Wide monitoring range:

The method can be applied broadly across marine environments and fields, such as marine environmental protection, marine meteorological warning, and marine resource development. It also adapts well to larger numbers of CTD sensors, giving a larger monitoring range.

5. Timely problem handling:

Monitoring runs on a regular schedule and researchers are notified in real time through multiple channels, so environmental anomalies and device disconnections are handled more promptly.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments and that the actions involved are not necessarily required by the embodiments of the present invention.

Referring to FIG. 3, a structural block diagram of a CTD data quality monitoring system provided in an embodiment of the present invention is shown, which may specifically include:

(1) CTD sensor

The CTD sensor is an instrument that measures water-area information in real time. At different depths of seawater it can collect basic data such as temperature, conductivity, and depth (pressure, digiquartz [db]; temperature, ITS-90, degC; conductivity, mS/cm ...), biological environment data (oxygen, SBE 43 [mg/l]; fluorescence, Seapoint; turbidity, Seapoint [FTU] ...), auxiliary parameters (pump status, time, Elapsed (Julian days) ...), and so on.

(2) Data acquisition module

The data acquisition module transmits the data collected by the CTD sensor to the data storage module in real time for subsequent data analysis and anomaly detection.

(3) Data storage module

The data storage module stores the raw data collected by the CTD sensor and the data processed by the data processing module on non-volatile media for later data analysis and research.

(4) Data processing module

The data processing module of the present invention performs feature simplification, data cleaning, analysis, monitoring, and anomaly detection on the data in the data storage module. First, it checks the raw data for missing values and out-of-range values to determine whether the sensor service is online. When the service is online, it reduces the dimensionality of the raw data by feature correlation analysis and uses the isolation forest algorithm to screen and mark abnormal points in the raw CTD data.

When a preset abnormal condition is detected, such as the service being offline or a large number of abnormal values appearing, the data processing module passes a message through the data service layer module, which processes it and notifies the alarm module.
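The missing-value and range check that decides whether the sensor service is online can be sketched as follows. Everything specific here is an assumption for illustration: records are taken to be dictionaries keyed by parameter name, `valid_ranges` and the 50% tolerance `max_invalid_ratio` are hypothetical, and the source does not state the actual criterion.

```python
def record_is_valid(record, valid_ranges):
    """A record is valid if every checked field is present and within range."""
    for name, (lo, hi) in valid_ranges.items():
        value = record.get(name)
        if value is None or not (lo <= value <= hi):
            return False
    return True


def sensor_online(records, valid_ranges, max_invalid_ratio=0.5):
    """Treat the sensor service as online if enough recent records are valid.

    An empty batch is treated as offline (an assumed convention).
    """
    if not records:
        return False
    invalid = sum(1 for r in records if not record_is_valid(r, valid_ranges))
    return invalid / len(records) <= max_invalid_ratio
```

Only when `sensor_online` returns `True` would the dimensionality reduction and isolation forest steps run; otherwise the offline condition itself is reported.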

(5) Data service layer module

The data service layer module of the present invention provides three main functions:

1) Handling abnormality messages;

2) Basic statistical analysis of data (mean, variance, standard deviation, and other indicators);

3) Visual charts (trend charts, bar charts, pie charts, etc.);

The data service layer module determines the type of notification passed on by the data processing module and calls the corresponding method for that case. When a large number of consecutive abnormal values appear, it notifies the alarm module through a network interface so that the alarm module can promptly push the alarm to alarm devices, a WeChat official account, or a standalone application. The data service layer module can also use a chart library to generate visual web pages showing the trend and number of abnormal points, so that users can see the abnormal condition of the data more intuitively. This design improves the efficiency of data management and analysis and helps ensure the integrity and accuracy of the data.

(6) Alarm module

The alarm module processes alarm messages and sends the alarm for the abnormal condition to researchers through multiple channels, such as a WeChat interface or an SMS interface, so that they can take appropriate measures.

Referring to FIG. 4, a structural block diagram of a CTD data quality monitoring device provided in an embodiment of the present invention is shown, which may specifically include the following modules:

An acquisition module 401, configured to collect CTD data of a water area, the CTD data including temperature-salinity-depth data, biological environment information, and other auxiliary parameter data;

A correlation analysis and selection module 402, configured to analyze the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, and to select target CTD data according to the correlation to construct a target CTD data set;

An abnormality judgment module 403, configured to use the isolation forest algorithm to perform abnormality judgment on the target CTD data;

An abnormality alarm module 404, configured to determine from the outlier target CTD data whether the current batch of CTD data is abnormal and to issue an abnormality alarm when it is.

Optionally, the temperature-salinity-depth data includes conductivity, temperature, and pressure, and the correlation analysis and selection module includes:

A preprocessing submodule, configured to de-dimensionalize and normalize the CTD data to obtain preprocessed CTD data;

A correlation analysis and ranking submodule, configured to take conductivity, temperature, and pressure as target reference items, calculate the grey relational coefficient between each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item, and sort all CTD parameter features by grey relational coefficient from high to low to obtain three ordered sequences of CTD parameter features;

A selection submodule, configured to select a preset number of top-ranked CTD parameter features from each ordered sequence of CTD parameter features;

A target CTD data set construction submodule, configured to form the target CTD data set using conductivity, temperature, pressure, and the selected top-ranked CTD parameter features as the target CTD data.

Optionally, the abnormality judgment module includes:

An anomaly score calculation submodule, configured to use the isolation forest algorithm to calculate the anomaly score of each target CTD data item in the target CTD data set;

An outlier target CTD data determination submodule, configured to determine the outlier target CTD data according to the anomaly score of each target CTD data item and either a preset anomaly score threshold or an average anomaly score threshold.

Optionally, the anomaly score calculation submodule includes:

A definition unit, configured to, assuming the total number of elements in the target CTD data set is n (n ≥ 5000), define the number of isolation trees t (1 ≤ t ≤ n), the average depth h, the maximum depth H, and the capacity k of the root-node subsample set;

A subsample set construction unit, configured to sample from the target CTD data set multiple times with replacement, drawing a specified number of elements purely at random to construct m subsample sets (a1, a2, ..., ai, ..., am);

A root node construction unit, configured to select a subsample set ai as the root node of a tree and randomly select a CTD parameter feature P;

A binary tree construction unit, configured to, for a single value q of the CTD parameter feature, perform a binary split of the tree by traversing all elements of the subsample set ai; if an element's recorded value r of feature P is less than or equal to q, the element is placed in the left child node, otherwise in the right child node;

An isolation tree construction unit, configured to use the binary tree construction unit to recursively construct the left and right child nodes and build a binary tree; the stopping condition is that every element in array ai has been isolated or the height of the tree has reached the preset height h, and when splitting stops the binary tree is regarded as an isolation tree;

A loop execution unit, configured to repeatedly run the root node construction unit, the binary tree construction unit, and the isolation tree construction unit until all subsample sets have been processed, at which point there are m isolation trees in total, forming an isolation forest of capacity m;

An anomaly score calculation unit, configured to compute the average path length of any element in the isolation forest and calculate the anomaly score;

The average path length of a tree is calculated as:

$$c(\Psi) = 2H(\Psi - 1) - \frac{2(\Psi - 1)}{\Psi}$$

where $c(\Psi)$ is the average path length for a given number of samples $\Psi$; $H(x) = \ln(x) + \gamma$, with $x$ standing for $\Psi - 1$ here; and $\gamma$ is Euler's constant, whose value is 0.5772156649;

The anomaly score is calculated as:

$$s(x, \Psi) = 2^{-\frac{E(h(x))}{c(\Psi)}}$$

where $h(x)$ is the depth of the node reached by sample point $x$ in an isolation tree, and $E(h(x))$ is the expected value of $h(x)$ over all isolation trees.

Optionally, the abnormality alarm module includes:

A count threshold judgment submodule, configured to judge whether the number of outlier target CTD data items exceeds a preset outlier count threshold;

An abnormality alarm submodule, configured to, if the number of outlier target CTD data items exceeds the preset outlier count threshold, determine that the current batch of CTD data is abnormal and issue an abnormality alarm.

Optionally, the device further includes:

A visual chart generation module, configured to generate visual charts from the CTD data of the water area to show the trends and changes of the CTD data intuitively; the visual charts include trend charts, bar charts, and pie charts;

A statistical analysis module, configured to perform statistical analysis on the CTD data of the water area to obtain indicator data that reflects the characteristics and regularities of the CTD data; the indicator data includes the mean, variance, and standard deviation.

Because the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding description of the method embodiment.

In addition, an embodiment of the present invention further provides an electronic device, as shown in FIG. 5, including a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with one another through the communication bus 504;

The memory 503 is configured to store a computer program;

The processor 501 is configured to implement the CTD data quality monitoring method described in the above embodiments when executing the program stored in the memory 503.

The communication bus of the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, it is drawn as a single thick line in the figure, which does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The memory may include random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

如图6所示,在本发明提供的又一实施例中,还提供了一种计算机可读存储介质601,该计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述实施例中所述的CTD数据质量监测方法。As shown in FIG. 6 , in another embodiment provided by the present invention, a computer-readable storage medium 601 is also provided, in which instructions are stored. When the computer-readable storage medium is run on a computer, the computer executes the CTD data quality monitoring method described in the above embodiment.

Yet another embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the CTD data quality monitoring method described in the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner. Identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is essentially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the corresponding description of the method embodiment.

The above describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A CTD data quality monitoring method, characterized in that the method comprises:
collecting CTD data of a water area, the CTD data including temperature-salinity-depth data, biological environment information, and other auxiliary parameter data;
analyzing the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, respectively, and selecting target CTD data according to the correlation to construct a target CTD data set;
performing anomaly judgment on the target CTD data using an isolation forest algorithm;
judging, according to the outlier target CTD data, whether the current batch of CTD data is abnormal, and issuing abnormality alarm information when the current batch of CTD data is determined to be abnormal.

2. The method according to claim 1, characterized in that the temperature-salinity-depth data include conductivity, temperature, and pressure, and the step of analyzing the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, respectively, and selecting target CTD data according to the correlation to construct a target CTD data set comprises:
de-dimensionalizing and normalizing the CTD data to obtain preprocessed CTD data;
taking conductivity, temperature, and pressure as target reference items, calculating the grey relational coefficient between each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item, and arranging all CTD parameter features in descending order of grey relational coefficient to obtain three ordered sequences of CTD parameter features;
selecting a preset number of top-ranked CTD parameter features from each ordered sequence of CTD parameter features;
using conductivity, temperature, pressure, and the preset number of top-ranked CTD parameter features as the target CTD data to form the target CTD data set.

3. The method according to claim 1, characterized in that the step of performing anomaly judgment on the target CTD data using an isolation forest algorithm comprises:
calculating, using the isolation forest algorithm, an anomaly score for each target CTD data item in the target CTD data set;
determining the outlier target CTD data according to the anomaly score of each target CTD data item and either a preset anomaly score threshold or an average anomaly score threshold.

4. The method according to claim 3, characterized in that the step of calculating, using the isolation forest algorithm, an anomaly score for each target CTD data item in the target CTD data set comprises:
S1: assuming that the total number of elements in the target CTD data set is n (n ≥ 5000), defining the number of isolation trees t (1 ≤ t ≤ n), the average depth h, the maximum depth H, and the capacity k of the root-node subsample set;
S2: sampling the target CTD data set multiple times with replacement, drawing a specified number of elements purely at random to construct m subsample sets (a1, a2, ..., ai, ..., am);
S3: selecting one subsample set ai as the root node of a tree, and randomly selecting a CTD parameter feature P;
S4: for a single value q of the CTD parameter feature, performing a binary split of the tree by traversing all elements of the subsample set ai; when the recorded value r of the CTD parameter feature P for an element is less than or equal to the value q, placing the element in the left child node of the tree, and otherwise placing it in the right child node;
S5: applying step S4 recursively to construct the left and right child nodes and build a binary tree, the stopping condition being that every element of ai has been isolated or that the height of the tree has reached the preset height h; when splitting stops, the binary tree is regarded as an isolation tree;
S6: repeating S3 to S5 until all subsample sets have been processed, yielding m isolation trees that form an isolation forest of capacity m;
S7: computing the average path length of each element in the isolation forest and calculating its anomaly score.
The average path length of a tree is calculated as:
$$c(\Psi) = 2H(\Psi - 1) - \frac{2(\Psi - 1)}{\Psi}$$
where $c(\Psi)$ is the average path length for a given number of samples $\Psi$, and $H(i) = \ln(i) + \xi$, where $\xi$ is Euler's constant, with value 0.5772156649.
The anomaly score is calculated as:
$$s(x, \Psi) = 2^{-\frac{E(h(x))}{c(\Psi)}}$$
where $h(x)$ is the depth of the node at which sample point $x$ is retrieved in an isolation tree, and $E(h(x))$ is the expected value of $h(x)$ over all isolation trees.

5. The method according to claim 3, characterized in that the step of judging, according to the outlier target CTD data, whether the current batch of CTD data is abnormal, and issuing abnormality alarm information when the current batch of CTD data is determined to be abnormal comprises:
judging whether the number of outlier target CTD data items exceeds a preset outlier count threshold;
if the number of outlier target CTD data items exceeds the preset outlier count threshold, determining that the current batch of CTD data is abnormal and issuing abnormality alarm information.
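Steps S1 to S7 and the batch-level check of claim 5 can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function names (`c_factor`, `build_tree`, `path_length`, `anomaly_scores`, `batch_is_abnormal`), the defaults of 100 trees and 64 rows per subsample, the depth cap of ceil(log2(psi)), and the synthetic data are all hypothetical. Each subsample is drawn independently, so a row may appear in several subsamples, which is one possible reading of step S2.

```python
import math
import random

EULER = 0.5772156649  # Euler's constant, value as quoted in claim 4


def c_factor(psi):
    """Average path length c(psi) used to normalise tree depths."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    return 2.0 * (math.log(psi - 1) + EULER) - 2.0 * (psi - 1) / psi


def build_tree(rows, depth, max_depth, rng):
    """S3-S5: split on a random feature p at a random value q until isolated."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    # Only features that still vary inside this node can separate its rows.
    splittable = [j for j in range(len(rows[0]))
                  if min(r[j] for r in rows) < max(r[j] for r in rows)]
    if not splittable:
        return {"size": len(rows)}
    p = rng.choice(splittable)
    q = rng.uniform(min(r[p] for r in rows), max(r[p] for r in rows))
    left = [r for r in rows if r[p] <= q]
    right = [r for r in rows if r[p] > q]
    return {"p": p, "q": q,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(tree, row, depth=0):
    """h(x): depth reached by row, plus c(size) for a leaf holding several rows."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    child = tree["left"] if row[tree["p"]] <= tree["q"] else tree["right"]
    return path_length(child, row, depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """S2, S6, S7: build the forest, then score each row as 2 ** (-E[h] / c)."""
    psi = min(sample_size, len(data))
    if psi < 2:
        raise ValueError("need at least two rows to build a tree")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(psi))  # assumed depth cap
    forest = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
              for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(t, row) for t in forest) / n_trees / norm)
            for row in data]


def batch_is_abnormal(scores, score_threshold, max_outliers):
    """Claim 5: flag the batch when the outlier count exceeds a preset limit."""
    return sum(s > score_threshold for s in scores) > max_outliers


# Hypothetical batch: 200 well-behaved rows plus one synthetic spike.
demo_rng = random.Random(1)
batch = [[demo_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
batch.append([50.0, 50.0, 50.0])
scores = anomaly_scores(batch)
print(batch_is_abnormal(scores, score_threshold=0.7, max_outliers=0))
```

Scores close to 1 indicate rows that are isolated after very few splits; rows deep inside the bulk of the data score near 0.5 or below. The score threshold and the outlier count limit are the two tunable values that claims 3 and 5 leave as presets.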
6. The method according to claim 1, further comprising:
generating visual charts from the CTD data of the water area to display the trends and changes of the CTD data intuitively, the visual charts including a trend chart, a bar chart, and a pie chart;
performing statistical analysis on the CTD data of the water area to obtain indicator data that reflect the characteristics and regularities of the CTD data, the indicator data including the mean, variance, and standard deviation.

7. A CTD data quality monitoring device, characterized in that the device comprises:
an acquisition module, configured to collect CTD data of a water area, the CTD data including temperature-salinity-depth data, biological environment information, and other auxiliary parameter data;
a correlation analysis and selection module, configured to analyze the correlation of the biological environment information and of the other auxiliary parameter data with the temperature-salinity-depth data, respectively, and to select target CTD data according to the correlation to construct a target CTD data set;
an anomaly judgment module, configured to perform anomaly judgment on the target CTD data using an isolation forest algorithm;
an abnormality alarm module, configured to judge, according to the outlier target CTD data, whether the current batch of CTD data is abnormal, and to issue abnormality alarm information when the current batch of CTD data is determined to be abnormal.

8. The device according to claim 7, characterized in that the temperature-salinity-depth data include conductivity, temperature, and pressure, and the correlation analysis and selection module comprises:
a preprocessing submodule, configured to de-dimensionalize and normalize the CTD data to obtain preprocessed CTD data;
a correlation analysis and ranking submodule, configured to take conductivity, temperature, and pressure as target reference items, calculate the grey relational coefficient between each CTD parameter feature in the preprocessed biological environment information and other auxiliary parameter data and each target reference item, and arrange all CTD parameter features in descending order of grey relational coefficient to obtain three ordered sequences of CTD parameter features;
a selection submodule, configured to select a preset number of top-ranked CTD parameter features from each ordered sequence of CTD parameter features;
a target CTD data set construction submodule, configured to use conductivity, temperature, pressure, and the preset number of top-ranked CTD parameter features as the target CTD data to form the target CTD data set.

9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the CTD data quality monitoring method according to any one of claims 1 to 6 when executing the program stored in the memory.
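The grey relational ranking recited in claims 2 and 8 can be illustrated with a short Python sketch. The claims do not spell out the coefficient formula, so the code below assumes the commonly used Deng formulation with a distinguishing coefficient rho = 0.5 and min-max normalization; the function names and the feature names `oxygen` and `turbidity` are hypothetical placeholders, not taken from the patent.

```python
def min_max(series):
    """Normalise a series to [0, 1], removing its physical unit."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]


def grey_relational_grades(reference, candidates, rho=0.5):
    """Mean grey relational coefficient of each candidate against one reference.

    Assumes Deng's formula: (d_min + rho * d_max) / (d + rho * d_max),
    with d_min and d_max taken over all candidates and all samples.
    """
    ref = min_max(reference)
    deltas = {name: [abs(a - b) for a, b in zip(ref, min_max(values))]
              for name, values in candidates.items()}
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:
        return {name: 1.0 for name in deltas}
    return {name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in ds) / len(ds)
            for name, ds in deltas.items()}


def select_features(references, candidates, top_k, rho=0.5):
    """Rank candidates per reference item and keep the top_k of each ranking."""
    selected = []
    for ref_values in references.values():
        grades = grey_relational_grades(ref_values, candidates, rho)
        ranked = sorted(grades, key=grades.get, reverse=True)
        for name in ranked[:top_k]:
            if name not in selected:
                selected.append(name)
    # The reference items themselves always belong to the target data set.
    return list(references) + selected


# Hypothetical readings: five samples of one reference item and two candidates.
references = {"temperature": [1.0, 2.0, 3.0, 4.0, 5.0]}
candidates = {"oxygen": [10.0, 20.0, 30.0, 40.0, 50.0],
              "turbidity": [5.0, 1.0, 4.0, 2.0, 3.0]}
print(select_features(references, candidates, top_k=1))
```

In a full pipeline the `references` dictionary would hold conductivity, temperature, and pressure, giving the three ordered sequences the claims describe, and the returned feature list would define the columns passed on to the isolation forest.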
10. One or more computer-readable media having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the CTD data quality monitoring method according to any one of claims 1 to 6.
CN202311720647.7A 2023-12-13 2023-12-13 CTD data quality monitoring method and device, electronic equipment and medium Pending CN118094417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311720647.7A CN118094417A (en) 2023-12-13 2023-12-13 CTD data quality monitoring method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311720647.7A CN118094417A (en) 2023-12-13 2023-12-13 CTD data quality monitoring method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN118094417A true CN118094417A (en) 2024-05-28

Family

ID=91157266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311720647.7A Pending CN118094417A (en) 2023-12-13 2023-12-13 CTD data quality monitoring method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN118094417A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118332168A (en) * 2024-06-13 2024-07-12 北京华城工程管理咨询有限公司 Intelligent supervision method and device based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination