CN119621855A - Industrial equipment time series data storage and preprocessing method

Info

Publication number: CN119621855A
Application number: CN202510156909.4A
Authority: CN
Inventors: 许晋瑞; 来健强; 王永宗; 商广勇
Original assignee: Inspur Industrial Internet Co Ltd
Current assignee: Inspur Industrial Internet Co Ltd
Priority date: 2025-02-13
Filing date: 2025-02-13
Publication date: 2025-03-14

Abstract

The invention provides a method for storing and preprocessing time-series data of industrial equipment, belonging to the field of data storage and data mining. The method comprises: step 1, configuring a sensor to collect raw data of the industrial equipment in real time and sending the raw data to a message queue; step 2, setting up a data import module, parsing the topic of the message queue with a data synchronization tool to obtain first data, and storing the first data in table A; step 3, designing a table B corresponding to table A, and preprocessing and normalizing the first data in table A to obtain standard data; step 4, extracting time features from the standard data to obtain time-domain features, and partitioning the standard data according to the time-domain features to obtain a partition data list; and step 5, obtaining the dimension attributes corresponding to the unit data of each partition data list, and storing the partition data list in table C according to a storage logic order. The method achieves acquisition and storage of industrial data and improves data-processing efficiency and manageability.

Description

Industrial equipment time sequence data storage and preprocessing method

Technical Field

The invention relates to the field of data storage and data mining, in particular to a time sequence data storage and preprocessing method for industrial equipment.

Background

At present, in the fields of intelligent manufacturing and the industrial Internet, the log data or measurement data generated by operating industrial equipment has a time-series attribute; an example is the vibration amplitude of a fan sampled at a fixed frequency over a certain period. Data mining yields frequency-domain feature values such as peak value, mean, variance, and waveform, which can be used for vibration-signal classification, fault diagnosis, and fault prediction, and thus for predicting the service life of the equipment and scheduling periodic maintenance. The traditional approach stores the raw data files on a file server or in cloud object storage, collects them into a message queue, or stores them in a time-series database, and then performs data mining with Python or Spark. This approach requires a large amount of storage space and occupies a large amount of memory when the data is read by Python code; alternatively, it relies on the SQL capability of a time-series database such as TDengine, which analyzes the time-series data through preset functions.

Therefore, the invention provides a time sequence data storage and preprocessing method for industrial equipment.

Disclosure of Invention

The invention provides a method for storing and preprocessing time-series data of industrial equipment. Industrial equipment data is collected in real time and sent to a message queue, and a data synchronization tool stores the data in table A of HBase. A corresponding table B is then designed, and the data is preprocessed and normalized to obtain standard data. Time features are extracted from the standard data, and the data is partitioned to generate a partition data list. Finally, dimension attributes are obtained and the partition data is stored in table C according to a storage logic order. This optimizes data processing and storage efficiency and supports efficient management and analysis of the data.

In one aspect, the present invention provides a method for storing and preprocessing time-series data of industrial equipment, comprising:

Step 1, configuring a sensor to collect raw data of industrial equipment in real time, and sending the raw data to a message queue;

Step 2, setting up a data import module, parsing the topic of the message queue based on a data synchronization tool, obtaining first data, and storing the first data in table A of an HBase database;

Step 3, designing a table B corresponding to table A, and preprocessing and normalizing the first data of table A to obtain standard data;

Step 4, extracting time features from the standard data to obtain time-domain features, and partitioning the standard data according to the time-domain features to obtain a partition data list;

Step 5, obtaining the dimension attributes corresponding to the unit data of each partition data list, and storing the partition data list in table C according to a storage logic order.

In another aspect, configuring a sensor to collect raw data of an industrial device in real time includes:

Acquiring the working environment and monitoring requirements of industrial equipment, selecting the type of a sensor, and configuring a unique first number for the sensor;

determining the installation position of the sensor according to the original design drawing of the industrial equipment and the surrounding environment, and configuring a unique second number for the installation position;

and configuring and installing a sensor according to the corresponding relation between the first number and the second number, and initializing and starting the sensor based on a preset time sequence sampling frequency, wherein the sensor is used for acquiring the original data of the industrial equipment in real time.
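
The correspondence between sensor numbers and installation-position numbers can be pictured as a small registry. The following is a minimal sketch under stated assumptions, not the patent's implementation: the class names, field names, and example identifiers are hypothetical, and treating the correspondence as strictly one-to-one is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorConfig:
    """Hypothetical record tying one sensor to one installation position."""
    first_number: str    # unique identifier of the sensor
    second_number: str   # unique identifier of the installation position
    sensor_type: str     # e.g. "vibration" or "temperature"
    sampling_hz: float   # preset time-series sampling frequency


class SensorRegistry:
    """Keeps the first-number/second-number correspondence one-to-one (assumed)."""

    def __init__(self) -> None:
        self._by_sensor: dict[str, SensorConfig] = {}
        self._by_position: dict[str, SensorConfig] = {}

    def register(self, cfg: SensorConfig) -> None:
        if cfg.first_number in self._by_sensor:
            raise ValueError(f"sensor {cfg.first_number} already registered")
        if cfg.second_number in self._by_position:
            raise ValueError(f"position {cfg.second_number} already occupied")
        if cfg.sampling_hz <= 0:
            raise ValueError("sampling frequency must be positive")
        self._by_sensor[cfg.first_number] = cfg
        self._by_position[cfg.second_number] = cfg

    def position_of(self, first_number: str) -> str:
        return self._by_sensor[first_number].second_number
```

Rejecting a duplicate number at registration time is one way to keep both identifiers unique before any sensor is started.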

In another aspect, sending the raw data to the message queue includes:

creating a message queue, serializing the raw data into a byte stream, and inserting the byte stream into the message queue iteratively, byte by byte;

stopping the iteration once the byte stream of the raw data has been completely inserted into the message queue.
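
The serialize-then-insert loop can be sketched as follows. This is an illustration only: JSON as the serialization format and Python's in-process `queue.Queue` as the message queue are assumptions, and the function names are hypothetical.

```python
import json
import queue


def send_to_queue(record, mq):
    """Serialize `record` and insert its bytes into `mq` one at a time."""
    # ASSUMPTION: JSON stands in for whatever serialization is actually used.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    for byte in payload:      # iterating over bytes yields ints in 0..255
        mq.put(byte)
    return len(payload)       # the loop ends once every byte is queued


def drain(mq):
    """Reassemble and deserialize everything currently in `mq`."""
    buf = bytearray()
    while not mq.empty():
        buf.append(mq.get())
    return json.loads(buf.decode("utf-8"))
```

Reading the bytes back in the same order recovers the original record, which is the property the byte-by-byte insertion relies on.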

In another aspect, setting up a data import module and parsing the topic of the message queue based on the data synchronization tool includes:

constructing the data import module, configuring and installing the data synchronization tool, parsing the queue identifier of the message queue with the data synchronization tool, and generating a group of row key-value pairs for the raw data;

creating and registering a consumer in the data synchronization tool, and creating a consumption record data table;

the consumer consuming the topic of the message queue and inserting the data into the raw data table A to generate a row key value.

In another aspect, obtaining the raw data and storing it in table A of the HBase database includes:

constructing, in the HBase database, a table named table A for storing raw time-series data according to standard preset time-series fields;

obtaining and parsing first raw data according to the result of the consumption;

if the first raw data are measured values at the same time point, splicing them into a single value by string concatenation, with the individual measured values separated by a special symbol;

if the first raw data are measured values at different time points, storing the data for different measurement times in different rows.
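
The row layout described above (one row per time point, with same-time values spliced into one string) can be sketched as follows. The separator character and the function name are assumptions; the text only says that a special symbol is used.

```python
from collections import defaultdict

SEPARATOR = "|"  # ASSUMPTION: the "special symbol" is not named in the text


def to_rows(samples):
    """Group (timestamp, value) samples into one row per timestamp.

    Values sharing a timestamp are joined into a single string cell;
    values at different timestamps land in different rows.
    """
    grouped = defaultdict(list)
    for ts, value in samples:
        grouped[ts].append(str(value))
    return {ts: SEPARATOR.join(vals) for ts, vals in sorted(grouped.items())}
```

For example, `to_rows([(100, 1.5), (100, 1.7), (101, 2.0)])` produces two rows, one keyed by `100` and one keyed by `101`.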

In another aspect, designing a table B corresponding to table A, and preprocessing and normalizing the first data of table A to obtain standard data, includes:

acquiring first data of a table A, and converting all the data into second data in a preset format;

selecting a preset neighbor number K, and calculating the KNN distance between any two measured values in the second data as follows:

[Formula not reproduced in this text.] In the formula, the distance term represents the distance between the i-th measured value and the j-th measured value in the second data; n represents the total number of measured values in the second data; the mean term represents the mean of all measured values in the second data; ln() represents the logarithmic function; the variance term represents the variance of all measured values in the second data; and min and max represent the minimum value and the maximum value, respectively;

based on the preset neighbor number K, selecting any measured value as an intermediate value, screening K measured values near the intermediate value to form a sample group, and obtaining the average distance of the sample group; if the distance between a measured value in the sample group and the intermediate value is greater than the average distance, judging that measured value to be an abnormal value, and otherwise judging it to be normal;

removing the abnormal values from the second data to obtain third data, and normalizing the third data to obtain the standard data.
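
The screening rule can be illustrated for a single sample group. The sketch below is not the patent's method: its distance formula is not reproduced in this text, so plain absolute difference is swapped in, and min-max scaling is assumed for the normalization step. All function names are hypothetical.

```python
def flag_outliers(values, center, k):
    """Return indices in the sample group around `center` judged abnormal.

    The sample group is the k measured values closest to values[center].
    A group member is abnormal if its distance to the intermediate value
    exceeds the group's average distance.
    """
    others = [i for i in range(len(values)) if i != center]
    # ASSUMPTION: absolute difference replaces the patent's distance formula.
    dist = {i: abs(values[i] - values[center]) for i in others}
    group = sorted(others, key=dist.__getitem__)[:k]
    if not group:
        return []
    mean_dist = sum(dist[i] for i in group) / len(group)
    return sorted(i for i in group if dist[i] > mean_dist)


def min_max_normalize(values):
    """Scale values to [0, 1]. ASSUMPTION: min-max is the normalization used."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because the threshold is the group's own average distance, the result depends strongly on the choice of K and of the intermediate value.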

In another aspect, extracting time features from the standard data to obtain time-domain features, and partitioning the standard data according to the time-domain features to obtain a partition data list, includes:

obtaining the standard data of table B, and obtaining the preset time-series sampling frequency of the standard data from its time-series field;

extracting time features from the standard data by converting its timestamp information into specific time features, and taking the specific time features as the time-domain features;

defining a time interval based on the time domain features, specifically:

[Formula not reproduced in this text.] The terms of the formula represent, respectively: the time interval; the starting time point; the time-start mapping coefficient; the time-end mapping coefficient; the time-domain feature; the mean time interval of the standard data; and the maximum time interval of the standard data; T() represents an event handling function;

cutting the time portion of the standard data into partitions according to the time interval, each time partition having a size equal to the time interval, wherein each time partition and its corresponding measured values constitute the unit data of the partition data list.
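
Cutting by a fixed time interval can be sketched as follows. The formula that derives the interval is not reproduced in this text, so the sketch simply takes the interval and the starting time point as parameters; the function name and the dictionary keys are hypothetical.

```python
def partition_by_interval(samples, start, interval):
    """Cut (timestamp, value) samples into consecutive windows of width `interval`.

    Returns one unit-data dict per non-empty window, holding the half-open
    time range [begin, end) and the measured values that fall inside it.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    buckets = {}
    for ts, value in samples:
        if ts < start:
            continue                       # before the starting time point
        index = (ts - start) // interval   # which window the sample falls in
        buckets.setdefault(index, []).append(value)
    return [
        {"begin": start + i * interval,
         "end": start + (i + 1) * interval,
         "values": vals}
        for i, vals in sorted(buckets.items())
    ]
```

Each returned dict plays the role of one unit of data in the partition data list: a time partition together with its measured values.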

In another aspect, obtaining the dimension attributes corresponding to the unit data of each partition data list, and storing the partition data list in table C according to the storage logic order, includes:

traversing the partition data list to obtain the time range of each unit data;

for each unit data in each time interval, looking up the fields of its standard data in a field name-dimension attribute mapping table to obtain the dimension attributes of all fields, these being the dimension attributes of the unit data;

creating a record for each unit data, and storing its time interval, unit data values, and dimension attributes in table C according to the storage logic order.
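
The traversal and lookup can be sketched as follows. The shape of the input dictionaries and the function name are hypothetical, and reading "storage logic order" as ascending start time is an assumption, since the text does not define the order.

```python
def build_table_c(partition_list, field_dimensions):
    """Create one record per unit of data, ordered by partition start time.

    partition_list   : list of dicts with "begin", "end", and "fields"
                       (field name to value); a hypothetical shape
    field_dimensions : field name to dimension attribute mapping table
    """
    records = []
    for unit in partition_list:
        # Look up each field of the unit data in the mapping table.
        dims = {name: field_dimensions[name]
                for name in unit["fields"] if name in field_dimensions}
        records.append({
            "time_range": (unit["begin"], unit["end"]),
            "values": dict(unit["fields"]),
            "dimensions": dims,
        })
    # ASSUMPTION: "storage logic order" is taken to mean ascending start time.
    records.sort(key=lambda r: r["time_range"][0])
    return records
```

Fields that have no entry in the mapping table are kept in the record's values but contribute no dimension attribute in this sketch.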

Compared with the prior art, the invention has the beneficial effects that:

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a method for storing and preprocessing time-series data of industrial equipment according to an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1:

As shown in fig. 1, the method for storing and preprocessing time-series data of industrial equipment provided by the embodiment of the invention includes:

In this embodiment, the sensor is a device for monitoring and collecting industrial equipment status or environmental data in real time, including types of temperature, pressure, humidity, vibration, and the like.

In this embodiment, industrial equipment refers to machinery, instruments, tools, and other equipment used in industrial processes for production, processing, inspection, or control.

In this embodiment, the raw data refers to the information collected in real time by sensors, meters, and the like during operation of the industrial equipment, before any processing or analysis.

In this embodiment, message queuing is a technique for communicating between different services by sending and receiving messages without requiring a direct synchronous connection.

In this embodiment, the data import module refers to a component that is used to obtain data from a message queue and store it in a database (e.g., HBase).

In this embodiment, the data synchronization tool is a software tool that is primarily used to synchronize data between different data sources or systems, such as Kafka.

In this embodiment, the topic refers to the class of message or data flow in the message queue.

In this embodiment, the HBase database is an open-source, distributed, columnar storage database, which is part of the Apache Hadoop ecosystem, and is designed to handle large-scale, distributed data storage requirements.

In this embodiment, table A is an HBase table that stores the raw data retrieved from the message queue.

In this embodiment, table B is an HBase table for storing standard data after preprocessing, normalization, and partitioning, and includes a column of dimension attributes and a column of time series data after preprocessing.

In this embodiment, the coprocessor function of HBase places the data preprocessing logic on the server side, so that large amounts of data need not be pulled to the client for processing, where they would occupy excessive memory. The distributed storage and computing capability of HBase yields higher preprocessing efficiency and keeps data storage and preprocessing synchronized. Subsequent data mining only needs to query the preprocessed table, without repeating steps such as data cleaning, deduplication, and normalization.

In this embodiment, table A and table B are processed synchronously: the HBase coprocessor is mounted on the raw data table A, and each time a new row of data is inserted into table A, the coprocessor is triggered to run and writes the preprocessed data into table B.
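
The trigger-on-insert behavior can be mimicked in a few lines. Real HBase coprocessors run inside the region servers and are typically written in Java, so the Python sketch below uses no HBase API: it is an in-memory stand-in, and every class and function name in it is hypothetical.

```python
class ObservedTable:
    """In-memory stand-in for a table with post-insert hooks."""

    def __init__(self):
        self.rows = {}
        self._hooks = []

    def on_put(self, hook):
        """Register a callable invoked as hook(row_key, value) after each put."""
        self._hooks.append(hook)

    def put(self, row_key, value):
        self.rows[row_key] = value
        for hook in self._hooks:   # runs on every insert, like a trigger
            hook(row_key, value)


def attach_preprocessor(table_a, table_b, preprocess):
    """Write preprocess(value) into table_b whenever table_a receives a row."""
    table_a.on_put(lambda key, value: table_b.put(key, preprocess(value)))
```

With this arrangement the caller only ever writes to the first table; the second table is filled as a side effect of each insert, which is the synchronization the paragraph above describes.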

In this embodiment, preprocessing and normalization convert the raw data into a standard form suitable for subsequent analysis.

In this embodiment, the standard data refers to data after preprocessing and standardization, and has a uniform format and structure.

In this embodiment, the time-domain features refer to time-related data features that are extracted from timestamps or time fields and reflect properties of the data such as its timing, periodicity, and trend.

In this embodiment, partitioning refers to dividing data into different blocks according to certain specific characteristics during data storage, querying, and processing.

In this embodiment, the partition data list refers to a series of data units obtained by partitioning standard data according to the extracted time domain features (such as year, month, day, hour, etc.) in step 4.

In this embodiment, the unit data refers to the smallest data unit obtained after the data in table A has been preprocessed, normalized, and partitioned by time feature.

In this embodiment, dimension attributes refer to various feature fields that can be used to describe or classify data in data processing and storage.

In this embodiment, the storage logic order refers to a manner of storing the data in the partition data list into the target table (table C) according to a certain rule.

In this embodiment, table C is a table for storing dimension attribute data corresponding to unit data of each partition data list.

The working principle and beneficial effects of this technical solution are as follows: industrial equipment data is collected and processed in real time and stored in HBase by means of the message queue and the data synchronization tool; preprocessing, feature extraction, and partitioned storage improve data storage efficiency and the accuracy of subsequent analysis, thereby supporting efficient data processing and management.

Example 2:

On the basis of the above embodiment 1, configuring the sensor to collect raw data of the industrial equipment in real time includes:

In this embodiment, the working environment refers to the physical and environmental conditions in which the device is in actual operation, including temperature, humidity, pressure, vibration, gas composition, electromagnetic interference, and many other factors.

In this embodiment, the monitoring requirements refer to the requirements for real-time monitoring and data acquisition of the operating state, environmental conditions and equipment performance of the industrial equipment.

In this embodiment, the first number is a unique identifier for identifying each sensor.

In this embodiment, the original design drawing refers to a detailed technical drawing drawn by an engineer during the design and construction stages of an industrial plant or system.

In this embodiment, the mounting location refers to a specific location or area where the sensor is actually placed in the device or work environment.

In this embodiment, the second number is a unique identifier for identifying each sensor location.

In this embodiment, the preset time sequence sampling frequency refers to the frequency of data acquisition of the industrial equipment by the sensor in a specified time interval.

The beneficial effects of this technical solution are as follows: the sensor type is selected and a unique number is configured in light of the equipment's working environment and monitoring requirements, and the installation position is determined and numbered, so that sensors are installed accurately and data is collected in real time; this optimizes equipment monitoring efficiency and supports accurate and reliable data acquisition.

Example 3:

On the basis of the above embodiment 2, sending the original data to the message queue includes:

In this embodiment, serialization refers to the process of converting the state of a data structure into a format that can be stored or transmitted.

In this embodiment, byte stream refers to a way in which data is processed and transferred in units of bytes in a computer system, in binary representation.

In this embodiment, iterative insertion refers to inserting the bytes into the message queue one after another until all bytes have been inserted.

The beneficial effects of this technical solution are as follows: the raw data is serialized and inserted into the message queue iteratively, and the insertion process is controlled by the queue identifier and a pointer value, which preserves the insertion order, avoids data collisions and duplication, and improves the reliability and efficiency of data transmission.

Example 4:

On the basis of embodiment 3 above, setting up a data import module and parsing the topic of the message queue based on a data synchronization tool includes:

In this embodiment, parsing is the process of converting a byte stream into the original data.

In this embodiment, each row key (Rowkey) in the group of row key-value pairs corresponds to multiple columns; each column stores either the value of a dimension attribute or the spliced time-series value, and preprocessing is applied only to the time-series columns.

In this embodiment, consumer refers to a component that processes data or consumes data.

In this embodiment, the consumption record data table is a table storing time series data after consumption (i.e., data processing).

In this embodiment, traversing refers to analyzing each data record one by one as the original time series data is processed.

The beneficial effects of this technical solution are as follows: the data synchronization tool parses the message queue and generates the group of row key-value pairs; the consumer traverses the row key values and checks whether each has already been consumed, which ensures that data is consumed only once; and the consumption state is recorded, which improves the accuracy and efficiency of data processing and avoids repeated consumption.
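
The consume-once check can be sketched as follows. This is an in-memory illustration only: a Python set stands in for the consumption record data table, a dict stands in for table A, and the class name and row key format are hypothetical.

```python
class Consumer:
    """Consumes (row_key, payload) messages at most once per row key."""

    def __init__(self):
        self.consumption_record = set()  # stand-in for the consumption record table
        self.table_a = {}                # stand-in for table A

    def consume(self, messages):
        """Insert unseen messages into table A; return how many were inserted."""
        inserted = 0
        for row_key, payload in messages:
            if row_key in self.consumption_record:
                continue                 # already consumed: skip the repeat
            self.table_a[row_key] = payload
            self.consumption_record.add(row_key)
            inserted += 1
        return inserted
```

Replaying the same batch a second time inserts nothing, because every row key is already in the consumption record.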

Example 5:

On the basis of embodiment 4 above, obtaining the raw data and storing it in table A of the HBase database includes:

In this embodiment, the standard preset timing field refers to a basic field for defining and identifying time series data, such as a time stamp, a device identification, a data type, and the like.

In this embodiment, the HBase database is an open-source, distributed, columnar-store NoSQL database system for handling large-scale data sets, particularly suited for storing and managing non-relational data.

In this embodiment, raw time series data refers to data, acquired in time order, that represents a certain physical or virtual phenomenon.

In this embodiment, the first raw data refers to raw data that is initially acquired during the time series data acquisition process.

In this embodiment, the measured value refers to data representing a certain physical quantity or state, such as temperature, humidity, voltage, air pressure, speed, flow rate, etc., collected by a sensor, device or system.

In this embodiment, the string concatenation means that a plurality of measured values are connected together through specific symbols to form a complete string.

The working principle and beneficial effects of this technical solution are as follows: time-series data is stored in HBase table A and organized by measurement time, with measured values at the same time point spliced into one value and data at different time points stored in separate rows; this optimizes storage and querying of the time-series data and improves the flexibility and efficiency of data processing.

Example 6:

On the basis of embodiment 5 above, designing a table B corresponding to table A, and preprocessing and normalizing the first data of table A to obtain standard data, includes:

In this embodiment, the second data is the data obtained by converting the first data from table A into the preset format.

In this embodiment, the preset number of neighbors refers to the number of neighbors selected when calculating the KNN (K-nearest neighbor) distance.

In this embodiment, KNN distance is a core concept in the K-Nearest Neighbor (K-Nearest Neighbor) algorithm, measuring the distance between two data points.

In this embodiment, the intermediate value refers to the measured value chosen as the reference point around which the K nearest-neighbor data points are screened during the KNN calculation.

In this embodiment, the sample set refers to K adjacent measured values screened from around the selected intermediate value based on a preset number of neighbors (K) according to KNN algorithm.

In this embodiment, outliers refer to measurements in the dataset that deviate significantly from the overall data trend.

In this embodiment, the third data is the data set obtained by removing the outliers from the second data; normalizing it yields the standard data.

In this embodiment, the standard data is data after normalization processing.

The beneficial effects of this technical solution are as follows: the distances between measured values are calculated with the KNN algorithm, abnormal values are screened out and removed, and standard data is generated by normalization. This improves the accuracy and quality of the data and makes the results of data analysis more reliable and stable.

Example 7:

On the basis of embodiment 1 above, extracting time features from the standard data to obtain time-domain features, and partitioning the standard data according to the time-domain features to obtain a partition data list, includes:

defining a time interval based on the time domain features, specifically:

In this embodiment, the timing field refers to a data field related to time, and is used to indicate a point in time when the data recording occurs.

In this embodiment, the time stamp information refers to a specific time point at which each piece of data is recorded, and exists in the form of a time stamp.

In this embodiment, the time rule matching degree is a degree of matching between the extracted time feature and the preset time sequence sampling frequency.

In this embodiment, the specific time feature refers to a specific data attribute extracted from the time stamp information in the standard data, for example, year, month, day, minute, hour, etc., which can accurately describe the time dimension.
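
Converting timestamp information into specific time features can be sketched with the standard library. The sketch assumes Unix timestamps in seconds and interprets them in UTC; neither is stated in the text, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def time_features(epoch_seconds):
    """Split a Unix timestamp into calendar features (UTC is an assumption)."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {"year": t.year, "month": t.month, "day": t.day,
            "hour": t.hour, "minute": t.minute}
```

Any of the returned fields, or a combination of them, could then serve as the key by which records are grouped into time partitions.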

The working principle and beneficial effects of this technical solution are as follows: time features are extracted from the standard data and compared with the preset sampling frequency, the best-matching time-domain feature is selected, and a time interval is defined for cutting the data into partitions. This optimizes the time processing and partitioning of the data and improves the efficiency of time-series analysis and processing.

Example 8:

On the basis of the above embodiment 1, acquiring the dimension attribute corresponding to the unit data of each partition data list, and storing the partition data list in the table C according to the storage logic order, including:

traversing the partition data list to obtain the time range of each unit data;

In this embodiment, the field name-dimension attribute mapping table is a mapping structure that associates data fields with their corresponding dimension attributes.

In this embodiment, the dimension attribute refers to descriptive information associated with a data field, such as time, region, product, or sales.

The beneficial effects of this technical solution are as follows: the partition data list is traversed, the time intervals are combined with the dimension attribute mapping table, and a record is created for each unit data and stored in table C according to the storage logic order. This improves the efficiency of structured data storage and facilitates subsequent data query and analysis.

It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims

1. A method for storing and preprocessing time series data of industrial equipment, characterized by comprising:

Step 1: Configure sensors to collect raw data from industrial equipment in real time and send the raw data to the message queue;

Step 2: Set a data import module, parse the topic of the message queue based on the data synchronization tool, obtain the first data and store it in Table A of the HBase database;

Step 3: Design table B corresponding to table A, and pre-process and standardize the first data of table A to obtain standard data;

Step 4: Extract the time features of the standard data to obtain the time domain features, partition the standard data according to the time domain features, and obtain a partition data list;

Step 5: Obtain the dimension attributes corresponding to the unit data of each partition data list, and store the partition data list in table C according to the storage logic order.

2. The method for storing and preprocessing time series data of industrial equipment according to claim 1, characterized in that configuring the sensors to collect the raw data of the industrial equipment in real time comprises:

Obtain the working environment and monitoring requirements of the industrial equipment to select the type of sensor and configure a unique first number for the sensor;

Determine the installation location of the sensor based on the original design of the industrial equipment and the surrounding environment, and assign a unique second number to the installation location;

Configure and install the sensor according to the correspondence between the first number and the second number, and initialize and start the sensor based on a preset time-series sampling frequency, wherein the sensor is used to collect the raw data of the industrial equipment in real time.

3. The method for storing and preprocessing time series data of industrial equipment according to claim 2, wherein sending the original data to the message queue comprises:

Create a message queue, serialize the raw data into a byte stream, and insert the byte stream into the message queue iteratively, byte by byte;

Stop the iteration once the byte stream of the raw data has been completely inserted into the message queue.

4. The method for storing and preprocessing time series data of industrial equipment according to claim 3, characterized in that setting a data import module and parsing the topic of the message queue based on a data synchronization tool comprises:

Build a data import module, configure and install a data synchronization tool, parse the queue ID of the message queue based on the data synchronization tool, and generate row key-value pairs of the original data;

Create and register consumers in the data synchronization tool, and create a consumption record data table;

The consumer consumes the topic of the message queue and inserts the data into the original data table A to generate a row key value.

5. The method for storing and preprocessing time series data of industrial equipment according to claim 4, characterized in that obtaining the raw data and storing it in table A of the HBase database comprises:

According to the standard preset time series fields, a table named Table A is constructed in the HBase database to store the original time series data;

According to the result of executing the consumption, first original data is obtained and analyzed;

If the first original data are measurement values at the same time point, they are concatenated into one value by string concatenation, wherein the individual measurement values are separated by special symbols;

If the first original data are measurement values taken at different time points, the data for each measurement time are stored in a separate row.
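The row layout of claim 5 can be illustrated with a short sketch. The function name `to_rows` and the delimiter `|` are assumptions; the claim only requires some special separator symbol.

```python
def to_rows(measurements, sep="|"):
    """Map (timestamp, value) pairs to one row per timestamp.

    Values sharing a timestamp are concatenated into one string;
    values with different timestamps land in different rows.
    """
    rows = {}
    for ts, value in measurements:
        rows.setdefault(ts, []).append(str(value))
    return {ts: sep.join(vals) for ts, vals in sorted(rows.items())}


rows = to_rows([(1000, 0.12), (1000, 0.15), (1001, 0.11)])
```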

6. The method for storing and preprocessing time series data of industrial equipment according to claim 5, characterized in that a table B corresponding to table A is designed, and the first data of table A is preprocessed and standardized to obtain standard data, including:

Obtain first data from table A, and convert all data into second data in a preset format;

Select a preset number of neighbors K and calculate the KNN distance between any two measured values in the second data as:

[formula not reproduced in the source text]; where the formula is defined in terms of: the distance between the i-th measurement value and the j-th measurement value in the second data; n, the total number of measurement values in the second data; the mean of all measurement values in the second data; ln( ), the logarithmic function; the variance of all measurement values in the second data; the weights of the i-th measurement value and the j-th measurement value, respectively; and min and max, the minimum and maximum values, respectively;

Based on the preset number of neighbors K, any measurement value is selected as the middle value, K measurement values near the middle value are screened to form a sample group, and the average distance of the sample group is obtained. If the distance between any measurement value in the sample group and the middle value is greater than the average distance, the corresponding measurement value is determined to be an abnormal value, otherwise, the corresponding measurement value is determined to be normal;

The outliers in the second data are removed to obtain the third data, and the third data are standardized to obtain the standard data.
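A simplified sketch of neighbor-based outlier removal followed by standardization is given below. It is not the claimed method: the plain absolute difference stands in for the claimed weighted distance, each value is scored by its mean distance to its K nearest neighbors and dropped when that score exceeds a multiple (`factor`, hypothetical) of the average score, and z-score scaling is assumed for the standardization step.

```python
from statistics import mean, pstdev


def knn_scores(values, k):
    """Mean absolute distance from each value to its k nearest neighbours."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - u) for j, u in enumerate(values) if j != i)
        scores.append(mean(dists[:k]))
    return scores


def remove_outliers(values, k=2, factor=3.0):
    """Drop values whose neighbour score exceeds factor * average score."""
    if len(values) <= 1:
        return list(values)
    scores = knn_scores(values, k)
    cutoff = factor * mean(scores)
    return [v for v, s in zip(values, scores) if s <= cutoff]


def standardize(values):
    """Z-score standardization (assumed; the claim does not name the method)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


second_data = [1.0, 1.1, 0.9, 1.2, 1.0, 9.0]
third_data = remove_outliers(second_data, k=2, factor=3.0)
standard_data = standardize(third_data)
```

In this example the value 9.0 lies far from all of its neighbors and is removed, and the remaining values are rescaled to zero mean and unit variance.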

7. The method for storing and preprocessing time series data of industrial equipment according to claim 1, characterized in that performing time feature extraction on the standard data to obtain time domain features, and partitioning the standard data according to the time domain features to obtain a partition data list, comprises:

Obtain standard data from Table B, and obtain a preset timing sampling frequency of the standard data according to a timing field of the standard data;

Extract time features from the standard data by converting the timestamp information of the standard data into specific time features, and use the specific time features as the time domain features;

Define the time interval based on the time domain characteristics, specifically:

[formula not reproduced in the source text]; where the formula is defined in terms of: the time interval; the starting time point; the time start-point mapping coefficient; the time end-point mapping coefficient; the time domain features; the mean time interval of the standard data; the maximum time interval of the standard data; and T( ), the event processing function;

The time portion of the standard data is partitioned according to the time interval, with each time partition corresponding to a segment whose size equals the time interval, wherein each time partition and its corresponding measurement values constitute one unit data of the partition data list.
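The partitioning step can be sketched as fixed-width windowing. Because the interval formula is not reproduced in this text, the interval is passed in as a parameter here; the function name `partition_by_interval` and the dictionary keys are hypothetical.

```python
def partition_by_interval(samples, start, interval):
    """Group (timestamp, value) samples into windows of width `interval`.

    Each returned element plays the role of one unit data: a time
    partition plus the measurement values that fall inside it.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    buckets = {}
    for ts, value in samples:
        if ts < start:
            continue  # samples before the starting time point are ignored
        idx = int((ts - start) // interval)
        buckets.setdefault(idx, []).append(value)
    return [
        {
            "start": start + idx * interval,
            "end": start + (idx + 1) * interval,
            "values": vals,
        }
        for idx, vals in sorted(buckets.items())
    ]


samples = [(0, 0.1), (4, 0.2), (10, 0.3), (19, 0.4), (25, 0.5)]
partitions = partition_by_interval(samples, start=0, interval=10)
```

Each window is half-open, so a sample whose timestamp equals a window boundary belongs to the later window.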

8. The method for storing and preprocessing time series data of industrial equipment according to claim 1, characterized in that the dimension attribute corresponding to the unit data of each partition data list is obtained, and the partition data list is stored in table C according to the storage logic order, including:

Traverse the partition data list and obtain the time range of each unit data;

For the unit data in each time interval, look up the fields of the corresponding standard data in the field name-dimension attribute mapping table to obtain the dimension attributes of all the fields, wherein these dimension attributes are the dimension attributes of that unit data;

Create a record for each unit data, and store its time interval, unit data value, and dimension attributes in table C in the storage logic order.
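The steps of claim 8 can be sketched as follows. The mapping contents, the record layout, and the choice of ascending start time as the storage logic order are all assumptions made for illustration; table C is modeled as a plain list rather than a database table.

```python
# Hypothetical field name -> dimension attribute mapping table.
FIELD_TO_DIMENSION = {
    "amplitude": "vibration",
    "temperature": "thermal",
}


def build_table_c(partition_list, field_to_dimension):
    """Create one record per unit data and order records by start time."""
    table_c = []
    for unit in partition_list:
        # Look up the dimension attribute of every known field of this unit.
        dims = sorted(
            {field_to_dimension[f] for f in unit["fields"] if f in field_to_dimension}
        )
        table_c.append(
            {
                "time_start": unit["start"],
                "time_end": unit["end"],
                "values": unit["values"],
                "dimensions": dims,
            }
        )
    # Assumed storage logic order: ascending start time.
    table_c.sort(key=lambda r: r["time_start"])
    return table_c


partition_list = [
    {"start": 10, "end": 20, "fields": ["temperature"], "values": [41.5]},
    {"start": 0, "end": 10, "fields": ["amplitude", "temperature"], "values": [0.1, 40.9]},
]
table_c = build_table_c(partition_list, FIELD_TO_DIMENSION)
```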