Disclosure of Invention
The invention aims to provide a water quality detection method and a water quality detection system that solve the problems identified in the background art.
To achieve this purpose, the invention provides the following technical solution. A water quality detection method comprises the following steps:
collecting historical water quality data covering water source samples from different time periods, using a water quality detection instrument to measure and record the concentrations of various pollutants in the samples, including ammonia nitrogen, total phosphorus, heavy metals and total bacterial count, and simultaneously recording the flow rate, temperature, pH value, chemical oxygen demand (COD) and biological oxygen demand (BOD) of the water source, to form a complete historical water quality data set;
preprocessing the collected historical water quality data, including data cleaning, missing value processing and abnormal value detection;
dividing the preprocessed historical water quality data into a training set and a test set for training and verifying a model;
training a water quality change prediction model by using a machine learning algorithm, wherein the water quality change prediction model is used for predicting the trend of water quality parameters over a future period according to current water quality detection data;
performing parameter tuning and model training on the selected machine learning algorithm, and iteratively optimizing the algorithm parameters and the model structure;
verifying the trained model by using the test set, and evaluating the prediction performance and generalization capability of the model;
deploying a real-time sensor at a water quality monitoring point, periodically collecting water quality data, and keeping the collected data consistent with the historical data in parameter types and measurement units;
sending the water quality data collected in real time to a data center through a data transmission system, preprocessing the real-time data in the data center, including data cleaning and format conversion, to meet the input requirements of the machine learning model, and inputting the preprocessed real-time data into the trained water quality change prediction model;
predicting, by the water quality change prediction model, the change in water quality parameters over a future period according to the historical data and the patterns learned during training; and
adjusting parameters and operation strategies of the water treatment process according to the prediction result and the real-time water quality data.
Preferably, in preprocessing the historical water quality data, the data cleaning method comprises the following steps:
identifying and removing duplicate records in the data set;
checking data consistency, and correcting or deleting records with format errors or unreasonable values;
encoding non-numeric data into a numeric form suitable for analysis.
Preferably, in the preprocessing of the historical water quality data, the missing value processing method includes:
filling missing values of numerical variables by mean filling, median filling or mode filling;
filling missing values of categorical variables with the most frequently occurring category;
applying an interpolation method, including linear interpolation and polynomial interpolation, to estimate and fill missing values in time series data.
Preferably, in preprocessing the historical water quality data, the abnormal value detection method includes:
calculating a Z-score for each data point by using the Z-score method, and identifying the data point as an abnormal value if the absolute value of its Z-score is larger than a preset threshold, wherein the formula is as follows:
$$Z = \frac{x - \mu}{\sigma}$$
where x is the raw data point, μ is the mean of the data, σ is the standard deviation of the data, and Z is the resulting Z-score.
Preferably, a long short-term memory (LSTM) network is selected to construct the water quality prediction model, and the specific method comprises: designing a multi-layer LSTM neural network structure comprising an input layer, one or more LSTM hidden layers and an output layer; configuring the input layer to receive and process training data spanning a plurality of time steps; setting an appropriate number of LSTM units in each LSTM hidden layer to learn and memorize the time dependence in the data and capture the water quality change trend; and configuring the output layer to generate a prediction result for future water quality parameters.
Preferably, initializing the parameters of the water quality prediction model comprises: S1, initializing the weight parameters and bias parameters of the LSTM neural network; S2, setting a learning rate, which controls the step size of network weight updates, to ensure training stability and convergence speed; S3, determining the batch size, namely the number of samples used in each training iteration, to balance training speed and memory use; S4, setting the number of training epochs, namely the number of passes over the whole training data set; S5, selecting adaptive moment estimation (Adam) as the optimizer for adjusting the network weights during training; and S6, defining the mean squared error (MSE) as the loss function for quantifying the difference between the model prediction and the actual change.
Preferably, the specific method for training the water quality prediction model is as follows:
a. iteratively training the LSTM neural network on the training data set, wherein in each iteration the training data is used as input and the corresponding future change values are used as target output;
b. computing the network output by forward propagation, and calculating the error between the predicted value and the actual value by using the defined loss function;
c. applying the back propagation algorithm and the optimizer to update the weight parameters and bias parameters of the network according to the calculated error, so as to minimize the prediction error;
d. during training, periodically evaluating the performance of the model on a validation data set, judging whether overfitting occurs by monitoring the loss function value and the validation error, and adjusting the model structure or terminating training early accordingly.
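As a non-limiting illustration of step d, the early-termination decision might be sketched in Python as follows. The function name `early_stopping_epoch`, the `patience` parameter and the loss values are hypothetical and are not specified by the method; the sketch only assumes that one validation loss is available after each epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training is stopped once the validation loss has failed to improve
    for `patience` consecutive epochs (an assumed stopping rule).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: they fall, then rise as overfitting sets in.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In such a scheme the model parameters saved at `best_epoch` would typically be the ones retained.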
Preferably, a water quality detection system comprises:
a data acquisition layer, which comprises a real-time sensor module and is used for periodically acquiring water quality data at water quality monitoring points, wherein the water quality data comprises the concentrations of various pollutants, flow rate, temperature, pH value, chemical oxygen demand (COD) and biological oxygen demand (BOD) of water source samples in different time periods;
a data preprocessing layer, which comprises a data cleaning module, a missing value processing module and an abnormal value detection module, and is used for preprocessing the collected water quality data to form a complete historical water quality data set, and for preprocessing the real-time data, including data cleaning and format conversion, to meet the input requirements of the machine learning model;
a model training and verification layer, which comprises a machine learning algorithm module, a parameter tuning module and a model training module, and is used for training the water quality change prediction model with a machine learning algorithm, performing parameter tuning and model training on the selected algorithm, iteratively optimizing the algorithm parameters and the model structure, verifying the trained model on a test set, evaluating the prediction performance and generalization capability of the model, and selecting the best-performing model parameters as the final water quality change prediction model parameters;
a prediction and application layer, which comprises a real-time data input module, a water quality change prediction module and a result analysis module, and is used for inputting the preprocessed real-time data into the trained water quality change prediction model, predicting the change in water quality parameters over a future period, receiving and analyzing the prediction result output by the machine learning model, identifying potential high pollution loads or harmful events, and adjusting the parameters and operation strategies of the water treatment process according to the prediction result and the real-time water quality data.
Compared with the prior art, the invention has the beneficial effects that:
By deploying real-time sensors, the invention periodically collects water quality data and sends it to the data center for analysis, thereby realizing real-time monitoring of water quality changes. Combined with a machine learning prediction model, it can predict the trend of water quality parameters over a future period and provide timely early warning of potential high pollution loads or harmful events, so that sudden pollution events can be handled effectively.
Traditional methods rely on manual sampling and laboratory analysis, which are procedurally complex and time-consuming. The invention uses a machine learning algorithm to analyze the water quality data in depth, improving the accuracy of water quality detection. In a sudden pollution event, the invention can provide early warning information rapidly, which helps to start an emergency response mechanism in time, control the spread of pollution, and protect the ecological environment and human health.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to FIGS. 1-2, the present invention provides a water quality detection method, which includes:
Step 1, collecting historical water quality data: water source samples from different time periods are covered to ensure the diversity and representativeness of the data. Using a water quality testing instrument, the concentrations of various contaminants in the water source samples are measured and recorded, including but not limited to ammonia nitrogen, total phosphorus, heavy metals, and total bacterial count. The flow rate, temperature, pH, chemical oxygen demand (COD), and biological oxygen demand (BOD) of the water source are recorded at the same time to form a complete historical water quality data set.
Step 2, data preprocessing: the collected historical water quality data is preprocessed, including data cleaning, missing value processing and abnormal value detection, to ensure the quality and consistency of the data.
Step 3, data set division: the preprocessed historical water quality data is divided into a training set and a test set for training and verifying the model.
Step 4, model training: a water quality change prediction model is trained using a machine learning algorithm. The model predicts the trend of water quality parameters over a future period from the current water quality detection data.
Step 5, parameter tuning: parameter tuning and model training are performed on the selected machine learning algorithm, and the prediction performance and generalization capability of the model are improved by iteratively optimizing the algorithm parameters and the model structure.
Step 6, model verification: the trained model is verified on the test set, and its prediction performance and generalization capability are evaluated. According to the verification result, the best-performing model parameters are selected as the final water quality change prediction model parameters.
Step 7, deploying a real-time sensor: a real-time sensor is deployed at a water quality monitoring point to periodically collect water quality data. The collected data must be consistent with the historical data in parameter types and units of measurement.
Step 8, data transmission and preprocessing: the water quality data acquired in real time is sent to a data center through a data transmission system. In the data center, the real-time data is preprocessed, including data cleaning and format conversion, to meet the input requirements of the machine learning model.
Step 9, real-time prediction: the preprocessed real-time data is input into the trained water quality change prediction model. The model predicts the change in water quality parameters over a future period according to the historical data and the patterns learned during training.
Step 10, result analysis and application: the prediction result output by the machine learning model is received and analyzed to identify potential high pollution loads or harmful events. The parameters and operation strategies of the water treatment process are adjusted according to the prediction result and the real-time water quality data to cope with potential water quality problems.
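As a non-limiting illustration of the data set division in Step 3, one possible approach is a chronological split, sketched below in Python. The method does not state how the split is made; keeping the test set later in time than the training set is an assumption made here because the data is a time series. The record fields (`timestamp`, `ammonia_nitrogen`) and the 80/20 ratio are likewise hypothetical.

```python
def chronological_split(records, train_fraction=0.8):
    """Sort records by timestamp and split them into (train, test).

    The earliest `train_fraction` of the records forms the training set,
    so the test set contains only later observations.
    """
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical daily records, deliberately supplied in reverse order.
records = [
    {"timestamp": f"2024-01-{day:02d}", "ammonia_nitrogen": 0.5 + 0.01 * day}
    for day in range(10, 0, -1)
]
train, test = chronological_split(records)
print(len(train), len(test), test[0]["timestamp"])
```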
The invention is further illustrated below with reference to Examples 1 to 3:
Example 1:
In preprocessing the historical water quality data, data cleaning is implemented as follows:
Identifying and removing duplicate records in the data set: a deduplication function in data processing software or in a programming language (such as Python or R) is used to compare all records in the historical water quality data set. Records are compared field by field (e.g., sampling time, sampling location, and contaminant concentration) to identify exact duplicates. The identified duplicates are deleted from the data set so that each record is unique, avoiding bias in subsequent analysis.
Checking data consistency and correcting or deleting records with format errors or unreasonable values: each record in the data set is checked for correct format and plausible values. Records with format errors, such as incorrect date formats or inconsistent number formats, are corrected to ensure the consistency and readability of the data. Records with implausible values, such as contaminant concentrations outside the normal range or negative flow rates, are verified further. If verification confirms that a record is erroneous, it is deleted from the data set.
Encoding non-numeric data into a numeric form suitable for analysis: non-numeric data in the data set, such as descriptive information on water quality category and pollution degree, is first identified.
A coding scheme is then designed according to the characteristics of the non-numeric data and the analysis requirements. For example, the water quality category may be coded by quality level, with "excellent" coded as 1, "good" coded as 2, and "poor" coded as 3.
Finally, the coding scheme is applied to convert the non-numeric data into numeric form for use in subsequent analysis and modeling.
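As a non-limiting illustration, the three cleaning links above might be sketched in Python as follows. The record fields, the category labels in `QUALITY_CODES` (in particular the top label "excellent") and the rule of dropping implausible records directly, without the separate verification described above, are simplifying assumptions.

```python
# Hypothetical coding scheme for the water quality category.
QUALITY_CODES = {"excellent": 1, "good": 2, "poor": 3}


def clean_records(records):
    """Remove exact duplicates, drop implausible records, encode the category."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # all fields must match to be a duplicate
        if key in seen:
            continue
        seen.add(key)
        # Consistency check: negative flow rates and unknown categories are rejected.
        if rec["flow_rate"] < 0 or rec["quality"] not in QUALITY_CODES:
            continue
        encoded = dict(rec)
        encoded["quality"] = QUALITY_CODES[rec["quality"]]
        cleaned.append(encoded)
    return cleaned


records = [
    {"time": "2024-01-01T08:00", "site": "S1", "flow_rate": 1.2, "quality": "good"},
    {"time": "2024-01-01T08:00", "site": "S1", "flow_rate": 1.2, "quality": "good"},
    {"time": "2024-01-01T09:00", "site": "S1", "flow_rate": -0.4, "quality": "good"},
    {"time": "2024-01-01T10:00", "site": "S1", "flow_rate": 1.1, "quality": "poor"},
]
cleaned = clean_records(records)
print(cleaned)
```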
In preprocessing the historical water quality data, missing value processing is implemented as follows:
Filling missing values of numerical variables:
First, the numerical variables in the data set are identified and checked for missing values.
Missing values of numerical variables can be filled by mean filling, median filling or mode filling. The method is chosen according to the distribution characteristics of the variable and the analysis requirements. For example, if the variable is approximately normally distributed, mean filling may be more appropriate; if the variable is markedly skewed, median or mode filling may be more appropriate. The missing values of the numerical variables are then filled using the selected method to ensure the integrity of the data set.
Filling missing values of categorical variables:
The categorical variables in the data set are identified and checked for missing values. Missing values of categorical variables are filled with the most frequently occurring category. This approach is based on the statistical mode: the most frequent category is taken as the most likely value for the missing entry. The most frequent category is determined by counting the occurrences of each category and is then used to fill the missing values of the categorical variable.
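As a non-limiting illustration, mean, median and mode filling might be sketched in Python with the standard `statistics` module as follows. Representing a missing value as `None`, the function name `fill_missing` and the sample values are assumptions made for the example.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical numerical variable with one missing reading and one high value.
readings = [6.0, 7.0, None, 8.0, 11.0]
mean_filled = fill_missing(readings, "mean")      # the gap is filled with 8.0
median_filled = fill_missing(readings, "median")  # the gap is filled with 7.5

# Hypothetical categorical variable: the most frequent category fills the gap.
categories = ["good", "good", None, "poor"]
mode_filled = fill_missing(categories, "mode")
print(mean_filled, median_filled, mode_filled)
```

In this sample the single high reading pulls the mean above the median, which is the situation in which median filling may be preferred.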
Estimating and filling missing values in time series data:
The time series data in the data set is identified and checked for missing values. Missing values in time series data can be estimated and filled by interpolation. Interpolation methods include linear interpolation and polynomial interpolation, and the choice depends on the characteristics of the time series data and the analysis requirements. Linear interpolation is a simple method which assumes that the time series changes linearly between the observation points immediately before and after a missing value. The missing value is estimated from the slope of the line joining these two observation points. Polynomial interpolation is a more complex method that fits a polynomial function to the time series data and estimates the missing values from the fit. Polynomial interpolation can accommodate more complex patterns of change in time series data.
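As a non-limiting illustration, linear interpolation over an evenly spaced time series might be sketched in Python as follows. Even spacing of the samples, the use of `None` for missing values, and leaving leading or trailing gaps unfilled (since they lack one of the two bounding observations) are assumptions made for the example.

```python
def interpolate_linear(series):
    """Fill runs of None that lie between two observed values on a straight line."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1          # last observed index before the gap
            end = i
            while end < n and filled[end] is None:
                end += 1           # first observed index after the gap
            if start >= 0 and end < n:
                slope = (filled[end] - filled[start]) / (end - start)
                for k in range(i, end):
                    filled[k] = filled[start] + slope * (k - start)
            i = end
        else:
            i += 1
    return filled


# Hypothetical hourly COD readings with a two-sample gap and a trailing gap.
cod = [10.0, None, None, 16.0, None]
cod_filled = interpolate_linear(cod)
print(cod_filled)
```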
In preprocessing the historical water quality data, abnormal value detection is implemented as follows:
Application of the Z-score method:
First, for each numerical variable in the data set, its mean (μ) and standard deviation (σ) are calculated. These two statistics are used in the subsequent Z-score calculation.
Next, for each data point x in the data set, the formula Z = (x − μ)/σ is applied, where Z is the resulting Z-score.
This formula expresses the distance of the data point x from the data set mean μ in units of the standard deviation σ.
Identification and processing of outliers:
After obtaining the Z-score value for each data point, a preset threshold value needs to be set. This threshold is typically determined based on the nature of the data and the analysis requirements, e.g., 2.5, 3 or 3.5, etc. may be chosen as the threshold.
Then, it is checked whether the absolute value of the Z-score value of each data point is greater than a preset threshold. If so, the data point is considered an outlier.
The identified outliers may be handled in various ways. For example, they may be deleted directly to maintain data consistency, or, if they have practical meaning (e.g., they represent a particular event), they may be retained and handled separately.
In practical applications, the Z-score method may need to be optimized and tuned according to the specific circumstances and analysis requirements of the data. For example, the magnitude of the preset threshold may be adjusted to more accurately identify outliers. In addition, other abnormal value detection methods (such as a distribution-based method, a distance-based method and the like) can be combined to further improve the accuracy and reliability of abnormal value detection.
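As a non-limiting illustration, the Z-score check Z = (x − μ)/σ might be sketched in Python as follows. The use of the population standard deviation (`statistics.pstdev`), the threshold of 2.5 and the sample readings are assumptions made for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose |Z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []           # all values identical: nothing can be flagged
    return [i for i, x in enumerate(values) if abs((x - mu) / sigma) > threshold]


# Hypothetical pH readings: nine normal values and one suspicious spike.
ph = [7.0] * 9 + [12.0]
outlier_indices = zscore_outliers(ph, threshold=2.5)
print(outlier_indices)
```

Note that with only ten samples no Z-score can exceed about 2.85, so a threshold of 3 would never trigger on a sample this small; this is one reason the threshold may need adjusting to the data.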
Example 2:
When constructing the water quality prediction model, a long short-term memory (LSTM) network is selected as the core algorithm, and the specific implementation comprises the following:
Designing the network structure: a multi-layer LSTM neural network is designed that includes an input layer, one or more LSTM hidden layers, and an output layer. This structure is intended to exploit the strengths of LSTM in processing time series data and to capture the trend of water quality parameters over time.
Configuring the input layer: the input layer is configured to receive and process training data spanning a plurality of time steps. That is, the input layer receives a time-ordered series of water quality parameter data, such as contaminant concentration and water temperature at different points in time.
With data from a plurality of time steps as input, the LSTM model can better learn how the water quality parameters change over time, which provides rich information for the subsequent prediction task.
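As a non-limiting illustration, the preparation of multi-time-step inputs might be sketched in Python as a sliding window over a single parameter series. The function name `make_windows`, the window length of three time steps and the one-step prediction horizon are hypothetical choices, not values specified by the method.

```python
def make_windows(series, n_steps, horizon=1):
    """Build (inputs, targets): each input holds n_steps consecutive values and
    each target is the value `horizon` steps after the end of that window."""
    inputs, targets = [], []
    for start in range(len(series) - n_steps - horizon + 1):
        inputs.append(series[start:start + n_steps])
        targets.append(series[start + n_steps + horizon - 1])
    return inputs, targets


# Hypothetical ammonia nitrogen readings (mg/L) at successive time steps.
readings = [0.8, 0.9, 1.1, 1.0, 1.3, 1.2]
X, y = make_windows(readings, n_steps=3)
print(X)
print(y)
```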
Setting the LSTM hidden layers: an appropriate number of LSTM units is set in each LSTM hidden layer. These LSTM units are the core of the model and are responsible for learning and memorizing the time dependencies in the data.
Each LSTM unit controls the flow of information through its internal gating mechanisms (forget gate, input gate and output gate) to effectively capture the time series characteristics of the water quality parameters.
Configuring a plurality of LSTM hidden layers can further enhance the learning ability of the model and its ability to capture complex water quality change trends.
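As a non-limiting illustration of the gating mechanism, a single time step of one LSTM unit with scalar input and scalar state might be sketched in plain Python as follows, using the standard LSTM update equations. The parameter names and values are hypothetical; a practical model would use vector states and a deep learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM unit (scalar input, hidden and cell state)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate cell value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the unit output
    return h, c


# Hypothetical (untrained) parameters, chosen only to make the sketch runnable.
params = {"wf": 0.5, "uf": 0.1, "bf": 0.0, "wi": 0.4, "ui": 0.2, "bi": 0.0,
          "wo": 0.3, "uo": 0.1, "bo": 0.0, "wg": 0.6, "ug": 0.2, "bg": 0.0}

h, c = 0.0, 0.0
for x in [0.8, 0.9, 1.1]:  # readings at three successive time steps
    h, c = lstm_step(x, h, c, params)
print(h, c)
```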
Configuring the output layer: the output layer is configured to generate a prediction result for future water quality parameters. Depending on the particular prediction task, the output layer may consist of one or more neurons, each corresponding to a different water quality parameter.
The neurons of the output layer receive the output of the LSTM hidden layers, process it further, and finally generate the predicted values of the future water quality parameters.
Model training and prediction: the designed LSTM model is trained using the historical water quality data. During training, the model minimizes the prediction error by continually adjusting its internal parameters. After training is completed, the trained LSTM model can be used to predict future water quality parameters: when data for new time steps is input, the model outputs the corresponding prediction result.
Initializing the parameters of the water quality prediction model includes:
Initializing the weights and biases of the LSTM neural network: the weight and bias parameters are the core of network learning and are continually adjusted during training to minimize the prediction error. These parameters are usually initialized randomly, for example by drawing initial values from a normal or uniform distribution. More elaborate strategies, such as He or Glorot initialization, may also be employed to ensure reasonable and effective initial values.
Setting the learning rate: the learning rate is an important hyperparameter that controls the step size of network weight updates. Setting an appropriate learning rate is important for training stability and convergence speed. The learning rate usually needs to be determined experimentally, and a learning rate decay strategy can be used to adjust it dynamically during training.
Determining the batch size: the batch size is the number of samples used in each training iteration. Its choice balances training speed against memory usage. A smaller batch size reduces memory usage but requires more iterations per epoch, which may slow training; a larger batch size can speed up each epoch on suitable hardware but increases memory usage. The batch size needs to be determined based on the specific data set and hardware conditions.
Setting the number of training epochs: the number of training epochs is the number of passes over the whole training data set. Setting an appropriate number of epochs is critical to ensuring that the model learns adequately and converges. The number of epochs needs to be determined experimentally, and an early stopping strategy can be used to prevent overfitting.
Selecting the optimizer: the optimizer adjusts the network weights during training to minimize the loss function. Adaptive moment estimation (Adam) is chosen as the optimizer because it combines the advantages of the momentum method and RMSprop and adaptively adjusts the learning rate during training. The Adam optimizer is particularly effective for large-scale data and complex network structures.
Defining the loss function: the loss function quantifies the gap between the model predictions and the actual changes. The mean squared error (MSE) is used as the loss function because it measures the average squared difference between the predicted values and the actual values. MSE is a commonly used regression loss function and is suitable for continuous-value prediction tasks.
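As a non-limiting illustration of how the Adam optimizer and the MSE loss interact, the following Python sketch fits a two-parameter linear model rather than an LSTM, so that the gradients can be written by hand. The linear model, the data, and the hyperparameter values (learning rate 0.05, 500 epochs, and the commonly used Adam defaults for `beta1`, `beta2` and `eps`) are assumptions made for the example.

```python
import math


def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def adam_fit_line(xs, ys, lr=0.05, epochs=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y = w * x + b by minimizing the MSE with full-batch Adam updates."""
    params = [0.0, 0.0]  # [w, b], initialized to zero for reproducibility
    m = [0.0, 0.0]       # first-moment (mean) estimates of the gradients
    v = [0.0, 0.0]       # second-moment (uncentered variance) estimates
    n = len(xs)
    for t in range(1, epochs + 1):
        w, b = params
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the MSE with respect to w and b.
        grads = [2.0 * sum(e * x for e, x in zip(errors, xs)) / n,
                 2.0 * sum(errors) / n]
        for k in range(2):
            m[k] = beta1 * m[k] + (1 - beta1) * grads[k]
            v[k] = beta2 * v[k] + (1 - beta2) * grads[k] ** 2
            m_hat = m[k] / (1 - beta1 ** t)  # bias correction
            v_hat = v[k] / (1 - beta2 ** t)
            params[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
initial_loss = mse([0.0] * len(ys), ys)
w, b = adam_fit_line(xs, ys)
final_loss = mse([w * x + b for x in xs], ys)
print(initial_loss, final_loss)
```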
Example 3:
A water quality detection system comprises a data acquisition layer, a data preprocessing layer, a model training and verifying layer and a prediction and application layer.
For the data acquisition layer:
A real-time sensor module is designed and implemented with interfaces to various water quality monitoring sensors (e.g., contaminant concentration, flow rate, temperature, and pH sensors), ensuring that water quality data can be periodically collected from the water quality monitoring points.
The real-time sensor module is configured to cover water source samples in different time periods, so that the collected data contains key water quality parameters such as contaminant concentrations, flow rate, temperature, pH value, chemical oxygen demand (COD), and biological oxygen demand (BOD).
A scheduled data acquisition task is implemented so that the real-time sensor module automatically acquires water quality data at a preset time interval and stores the data in a local or remote database.
For the data preprocessing layer:
A data cleaning module is developed that preprocesses the collected water quality data, including removing duplicate data, correcting erroneous data, and resolving inconsistent data, to ensure the accuracy and consistency of the data.
A missing value processing module is implemented that identifies missing values in the data set and fills them using an appropriate filling strategy (e.g., mean filling, median filling, or mode filling) to ensure the integrity of the data set.
An abnormal value detection module is designed and implemented that detects outliers in the data set using the Z-score method or other statistical methods and handles them appropriately (for example by deletion or replacement), so as to ensure the accuracy of the data and a reasonable data distribution.
A data format conversion function is developed that converts the preprocessed water quality data into the input format required by the machine learning model, including steps such as feature extraction and data normalization or standardization, so that the data meets the input requirements of the machine learning model.
A historical data set construction function is implemented that organizes the preprocessed water quality data in time order to form a complete historical water quality data set, providing data support for subsequent model training and verification.
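As a non-limiting illustration of the normalization step in the data format conversion function, min-max scaling might be sketched in Python as follows. Fitting the scaling range on the historical data and reusing it for real-time data is an assumption made so that both are scaled consistently; the function names and temperature values are hypothetical.

```python
def fit_min_max(values):
    """Return the (minimum, maximum) of the historical values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Scale values so that lo maps to 0.0 and hi maps to 1.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Hypothetical historical water temperatures (degrees Celsius).
historical = [10.0, 15.0, 20.0]
lo, hi = fit_min_max(historical)
scaled_history = min_max_scale(historical, lo, hi)

# A real-time reading is scaled with the range fitted on the historical data,
# so values outside that range fall outside [0, 1].
scaled_live = min_max_scale([25.0], lo, hi)
print(scaled_history, scaled_live)
```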
The model training and verification layer comprises:
The machine learning algorithm module integrates a plurality of machine learning libraries, such as TensorFlow and PyTorch, and provides a rich choice of algorithms for water quality change prediction. A suitable machine learning algorithm, such as LSTM, is selected according to the characteristics of the water quality data and the prediction requirements, and is used for model training.
The parameter tuning module defines a parameter tuning strategy, including the parameter search space and the tuning objective (such as minimizing the prediction error). Parameter optimization methods such as grid search and random search are applied to the selected machine learning algorithm. Model performance under different parameter combinations is evaluated through iterative training, and the best parameter combination is selected as the basis for model training.
The model training module trains the selected machine learning algorithm using the preprocessed historical water quality data set as training data. During training, a cross-validation strategy, such as K-fold cross-validation, is applied to evaluate the generalization ability of the model. Key indicators of the training process, such as loss function values and accuracy, are recorded for subsequent analysis and optimization.
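As a non-limiting illustration of the K-fold cross-validation mentioned for the model training module, the index bookkeeping might be sketched in Python as follows. The function name `k_fold_indices`, the use of contiguous unshuffled folds and the sample sizes are assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds.

    Returns a list of (train_indices, validation_indices) pairs; each sample
    appears in exactly one validation fold.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

Because ordinary K-fold validation trains on samples that come after some validation samples, a forward-chaining variant that only validates on later data may be preferable for time-ordered water quality records.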
The prediction and application layer comprises:
The real-time data input module provides a real-time data interface that receives the preprocessed real-time water quality data originating from the data acquisition layer. A data caching mechanism is implemented to improve the efficiency of accessing and processing real-time data during model prediction.
The water quality change prediction module inputs the real-time data into the trained water quality change prediction model, executes model inference, and predicts the change in water quality parameters over a future period. A real-time output function is implemented so that the prediction result is transmitted promptly and accurately to the subsequent processing module or the user interface.
The result analysis module analyzes the prediction result output by the machine learning model and identifies potential high pollution loads or harmful events. An early warning mechanism is designed and implemented: when the prediction result exceeds a preset threshold, an early warning signal is triggered automatically to remind the relevant personnel to take countermeasures.
A visualization tool or report generation function is provided that presents the prediction result to the user as charts, reports and the like, helping the user to intuitively understand the water quality change trend and the prediction result.
The decision support and application module provides decision support for the water treatment process according to the prediction result and the real-time water quality data, and adjusts the process parameters and operation strategies to optimize the water treatment effect. The system can be integrated into a wider water quality management system and cooperates with the other layers (such as the data acquisition layer and the data preprocessing layer) to realize closed-loop management of water quality monitoring, prediction, early warning and treatment.
An application programming interface (API) or software development kit (SDK) is provided so that third-party systems or applications can integrate and use the prediction function of the water quality detection system.
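As a non-limiting illustration of the early warning mechanism of the result analysis module, a threshold check over predicted parameters might be sketched in Python as follows. The parameter names, the limit values and the structure of the prediction records are hypothetical; actual limits would come from the applicable water quality standard.

```python
def check_alerts(predictions, limits):
    """Return (time_step, parameter, value) for every prediction above its limit."""
    alerts = []
    for step, reading in enumerate(predictions):
        for name, value in reading.items():
            limit = limits.get(name)
            if limit is not None and value > limit:
                alerts.append((step, name, value))
    return alerts


# Hypothetical preset thresholds (mg/L) and model output for three future steps.
limits = {"ammonia_nitrogen": 1.0, "cod": 20.0}
predictions = [
    {"ammonia_nitrogen": 0.6, "cod": 18.0},
    {"ammonia_nitrogen": 1.2, "cod": 19.5},
    {"ammonia_nitrogen": 1.4, "cod": 22.0},
]
alerts = check_alerts(predictions, limits)
for step, name, value in alerts:
    print(f"warning: predicted {name} = {value} exceeds limit at step {step}")
```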
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.