CN117540277B

CN117540277B - A lost circulation warning method based on WGAN-GP-TabNet algorithm

Info

Publication number: CN117540277B
Application number: CN202311587126.9A
Authority: CN
Inventors: 许成元; 周杰; 康毅力; 郭昆; 谢军; 郝克桃
Original assignee: Southwest Petroleum University
Current assignee: Southwest Petroleum University
Priority date: 2023-11-27
Filing date: 2023-11-27
Publication date: 2024-06-21
Anticipated expiration: 2043-11-27
Also published as: CN117540277A

Abstract

The invention discloses a lost circulation early warning method based on WGAN-GP-TabNet algorithm, and belongs to the technical field of lost circulation prediction. Comprising the following steps: collecting field data; screening characteristic parameters with strong correlation with the leakage flow and deleting parameters with weak correlation in the field data so as to form initial parameters; classifying the initial parameters according to the leakage flow, and classifying each class into a majority class and a minority class according to the number of each class of parameters after classification; inputting the initial parameters into WGAN-GP model to generate minority class data; training and evaluating the lost circulation early warning model TabNet by using the initial parameters and the generated minority class data; site data is collected and the extent of leakage is predicted using a TabNet model that is trained. The invention effectively solves the problems of low recall rate and few kinds of prediction precision caused by unbalanced kinds in the deep learning leakage prevention and blocking, and has the advantages of stability, reliability, high accuracy, convenient operation, high reaction speed, strong mobility and the like.

Description

A lost circulation warning method based on WGAN-GP-TabNet algorithm

技术领域Technical Field

本发明涉及钻井过程中井漏预测技术领域，也涉及数据驱动的深度学习技术领域，具体为一种基于WGAN-GP-TabNet算法的井漏预警方法。The present invention relates to the technical field of well leakage prediction during drilling, and also to the technical field of data-driven deep learning, and specifically to a well leakage warning method based on a WGAN-GP-TabNet algorithm.

背景技术Background technique

数字信息化时代，使用计算机和自动化技术处理油气钻井中处理油气钻井中遇到的各种问题越来越成为一种趋势。面对井下渗漏问题，处理措施不当会导致封堵成功率低，钻井液持续渗漏，现场工作时间损失增加，甚至会导致弃井发生。频繁的井漏问题耗费了大量的施工时间，堵漏施工会增加钻井周期，大大增加了钻井成本，而且不能满足低成本发展的战略需要。In the digital information age, it is becoming more and more a trend to use computers and automation technology to deal with various problems encountered in oil and gas drilling. In the face of downhole leakage problems, improper treatment measures will lead to low plugging success rate, continuous leakage of drilling fluid, increased loss of on-site working time, and even well abandonment. Frequent well leakage problems consume a lot of construction time, plugging construction will increase the drilling cycle, greatly increase drilling costs, and cannot meet the strategic needs of low-cost development.

基于数据驱动的机器学习或者深度学习方法似乎是一种解决方案。基于数据驱动的机器学习模型依赖于油田钻井时的所收集的各类参数，包括钻井参数、地质参数、工程参数、钻井液参数等，通过数据处理，特征提取，模型训练，模型评价等步骤，建立漏失层位的预测模型。深度学习模型在处理数据量大的情形下比机器学习模型具有优势，因为深度学习模型能够自动学习数据的高级特征表示，这意味着它们可以从原始数据中提取有用的特征，而无需手动进行特征工程。深度学习模型通常由多个层次组成，可以处理大量的参数，这使得它们能够适应大规模数据并学习复杂的关系。机器学习模型可能会在面对大规模数据时变得不够灵活或不足以捕获数据中的模式，而且深度学习模型通过多层次的表示逐渐提取数据的抽象信息。Data-driven machine learning or deep learning methods seem to be a solution. Data-driven machine learning models rely on various parameters collected during oil field drilling, including drilling parameters, geological parameters, engineering parameters, drilling fluid parameters, etc. Through data processing, feature extraction, model training, model evaluation and other steps, a prediction model for lost formations is established. Deep learning models have advantages over machine learning models when processing large amounts of data, because deep learning models can automatically learn high-level feature representations of data, which means they can extract useful features from raw data without manual feature engineering. Deep learning models are usually composed of multiple layers and can handle a large number of parameters, which enables them to adapt to large-scale data and learn complex relationships. Machine learning models may become inflexible or insufficient to capture patterns in the data when faced with large-scale data, and deep learning models gradually extract abstract information from the data through multi-level representations.

但是在最近年深度学习模型(如RNN，LSTM、一维CNN等)在漏失预测方面的效果不如传统机器学习模型。在钻井时发生严重漏失所占的时间比漏失不发生的时间少得多，即严重漏失类别数量远小于无漏失的情况，导致深度学习模型中预测中等和严重漏失类别的准确率以及召回率不高。However, in recent years, deep learning models (such as RNN, LSTM, one-dimensional CNN, etc.) have not been as effective as traditional machine learning models in leak prediction. The time when severe leaks occur during drilling is much less than the time when no leaks occur, that is, the number of severe leak categories is much smaller than the number of no leaks, resulting in low accuracy and recall rates for predicting medium and severe leak categories in deep learning models.

发明内容Summary of the invention

为了解决上述问题中，本发明提供了一种基于WGAN-GP-TabNet算法的井漏预警方法，其采用甚多学习模型TabNet来判断漏失等级，并基于WGAN-GP模型丰富了TabNet模型进行训练、测试的少数类数据，所得的WGAN-GP-TabNet模型在预测井漏上具有较好的效果。In order to solve the above problems, the present invention provides a well leakage warning method based on the WGAN-GP-TabNet algorithm, which adopts a multi-learning model TabNet to judge the leakage level, and enriches the minority class data for training and testing the TabNet model based on the WGAN-GP model. The obtained WGAN-GP-TabNet model has a good effect in predicting well leakage.

本发明的具体方案如下：The specific scheme of the present invention is as follows:

一种基于WGAN-GP-TabNet算法的井漏预警方法，包括如下步骤：A well leakage early warning method based on WGAN-GP-TabNet algorithm includes the following steps:

步骤1、从现场收集钻井工程数据，建立防漏堵漏大数据库，并对数据进行预处理；Step 1: Collect drilling engineering data from the site, establish a large database for leak prevention and plugging, and pre-process the data;

步骤2、基于预处理后的数据中各特征参数与漏失流量的相关性筛选相关性强的特征参数，以预处理后的数据中相关性强的特征参数作为初始参数，并对初始参数进行LOESS降噪；Step 2: based on the correlation between each characteristic parameter in the preprocessed data and the leakage flow, the characteristic parameters with strong correlation are selected, the characteristic parameters with strong correlation in the preprocessed data are used as initial parameters, and LOESS denoising is performed on the initial parameters;

步骤3、根据漏失流量对LOESS降噪后的初始参数进行分级，并根据分级后各个等级中初始参数的数量对各个等级进行分类，分为多数类和少数类；Step 3: Classify the initial parameters after LOESS noise reduction according to the leakage flow, and classify each level into majority class and minority class according to the number of initial parameters in each level after classification;

步骤4、将步骤3处理后的数据输入WGAN-GP模型，生成少数类数据；Step 4: Input the data processed in step 3 into the WGAN-GP model to generate minority class data;

步骤5、利用步骤2经过LOESS降噪后的初始参数以及步骤4生成的少数类数据来训练并评估井漏预警模型TabNet，如果训练不合格则返回步骤2-4的任一步骤修改相关参数后继续训练，训练合格则确定TabNet模型并进入下一步；Step 5: Use the initial parameters after LOESS denoising in step 2 and the minority class data generated in step 4 to train and evaluate the lost circulation warning model TabNet. If the training fails, return to any step of steps 2-4 to modify the relevant parameters and continue training. If the training passes, determine the TabNet model and proceed to the next step.

步骤6、采集现场数据并通过TabNet模型预测其漏失程度，所述现场数据中包括所述相关性强的特征参数。Step 6: Collect field data and predict its loss degree through TabNet model, wherein the field data includes the characteristic parameters with strong correlation.

作为本发明的一种具体实施方式，根据分级后各个等级中初始参数的数量对各个等级进行分类具体包括：统计LOESS处理后的初始参数中各类漏失等级的数据的比例，以漏失数据比例最高的漏失等级为基准，对于所有漏失等级，如果其所占比例与比例最高的漏失等级所占比例的比值低于设定的归类阈值，则该漏失等级属于少数类，否则属于多数类。As a specific implementation of the present invention, classifying each level according to the number of initial parameters in each level after grading specifically includes: counting the proportion of data of each missing level in the initial parameters after LOESS processing, taking the missing level with the highest proportion of missing data as the benchmark, for all missing levels, if the ratio of its proportion to the proportion of the missing level with the highest proportion is lower than the set classification threshold, then the missing level belongs to the minority class, otherwise it belongs to the majority class.

进一步，所述设定的归类阈值为20％。Furthermore, the classification threshold is set to 20%.

作为本发明的一种具体实施方式，筛选相关性强的特征参数包括如下步骤：As a specific implementation of the present invention, screening characteristic parameters with strong correlation includes the following steps:

S1、分别利用斯皮尔曼相关系数、互信息法、LightGBM三种模型处理预处理后的数据，确定各模型中各特征参数与漏失流量的相关性；S1. Use the Spearman correlation coefficient, mutual information method and LightGBM models to process the preprocessed data and determine the correlation between each characteristic parameter and the loss flow in each model;

S2、根据各特征参数与漏失流量的相关性确定不同模型中各特征参数的重要程度；S2. Determine the importance of each characteristic parameter in different models according to the correlation between each characteristic parameter and the leakage flow;

S3、综合同一特征参数在不同模型中的重要程度确定该特征参数的综合重要程度，筛选综合相关性强的特征参数。S3. Determine the comprehensive importance of the feature parameter by comprehensively considering the importance of the same feature parameter in different models, and select feature parameters with strong comprehensive correlation.

进一步，步骤S2包括对于同一模型的特征参数按照相关性递增的顺序对各个特征参数进行排序，各个特征参数的分值等于其位次的值，步骤S3包括计算各个特征参数的总分，并根据总分排序确定各个特征参数的综合重要程度；Further, step S2 includes sorting the characteristic parameters of the same model in order of increasing relevance, and the score of each characteristic parameter is equal to the value of its rank. Step S3 includes calculating the total score of each characteristic parameter, and determining the comprehensive importance of each characteristic parameter according to the total score ranking;

其中，各特征参数总分的计算式如下：Among them, the calculation formula for the total score of each characteristic parameter is as follows:

式中，D_j为第j个特征参数的总分，n为表征相关性的模型的总数，f_i为第i种相关性模型的分值在总分值中的权重；D_ij为第i种相关性模型中第j特征参数的分值。In the formula, _Dj is the total score of the jth characteristic parameter, n is the total number of models representing correlation, _fi is the weight of the score of the i-th correlation model in the total score; _Dij is the score of the j-th characteristic parameter in the i-th correlation model.

作为本发明的一种具体实施方式，所述WGAN-GP模型如下：生成器具有4层隐藏层，隐藏层神经元数目为256，128，64，64；判别器具有5层隐藏层，隐藏层神经元数目为256,128,64,64,32，判别器输出层为1个神经元；在生成器和判别器隐藏层后均设置Dropout层，丢弃率为0.25，判别器输出层有一个节点，用于判断输入样本的真实性，采用Adam作为优化函数。As a specific implementation of the present invention, the WGAN-GP model is as follows: the generator has 4 hidden layers, and the number of neurons in the hidden layer is 256, 128, 64, and 64; the discriminator has 5 hidden layers, and the number of neurons in the hidden layer is 256, 128, 64, 64, and 32, and the discriminator output layer is 1 neuron; a Dropout layer is set after the hidden layer of the generator and the discriminator, and the dropout rate is 0.25. The discriminator output layer has a node for judging the authenticity of the input sample, and Adam is used as the optimization function.

作为本发明的一种具体实施方式，步骤4包括如下步骤：As a specific implementation of the present invention, step 4 includes the following steps:

在生成器输入随机噪声时加入类别标签进行类别指导，在判别器输出端判别数据所属类别，实现按类别生成数据；When the generator inputs random noise, a category label is added for category guidance, and the category to which the data belongs is determined at the output of the discriminator to generate data by category.

将真实样本和生成样本一同输入判别器，使其学习捕捉数据分布的特征，以便让生成样本更接近真实样本；在数据分布逐渐接近的同时，引入分等级的任务，进一步训练判别器以区分真实样本和生成样本，并确保生成样本在分类任务上也具有良好的性能；迭代过程中，利用Adam方法，用当前迭代时WGAN-GP网络中判别器的损失值、生成器的损失值依次更新WGAN-GP网络中判别器、生成器的参数直至生成器和判别器误差。The real samples and generated samples are input into the discriminator together, so that it can learn to capture the characteristics of data distribution, so as to make the generated samples closer to the real samples; while the data distribution is gradually approaching, hierarchical tasks are introduced to further train the discriminator to distinguish between real samples and generated samples, and ensure that the generated samples also have good performance in classification tasks; during the iterative process, the Adam method is used to update the parameters of the discriminator and generator in the WGAN-GP network in turn with the loss value of the discriminator and the loss value of the generator in the current iteration until the generator and discriminator errors.

作为本发明的一种具体实施方式，步骤4中生成的少数类数据数量满足如下条件：经过LOESS降噪后的初始参数以及步骤4生成的少数类数据合并后，数据数量最多的等级的数据数量与少数类中各个等级的数据数量与之比均为5：1。As a specific implementation of the present invention, the number of minority class data generated in step 4 satisfies the following conditions: after the initial parameters after LOESS denoising and the minority class data generated in step 4 are merged, the ratio of the number of data at the level with the largest number of data to the number of data at each level in the minority class is 5:1.

作为本发明的一种具体实施方式，所述TabNet模型由多个决策步堆叠而成，每个决策步由Feature transformer和Attentive transformer、Mask层、Split层和ReLU组成；输入的样本特征如有离散型特征，TabNet首先利用训练嵌入的方式，将离散化特征映射成连续性数值特征，然后保证每个决策步数据输入形式都是B×D矩阵，其中B代表batch size的大小，D代表井漏漏失参数的维度；每个决策步的特征由上一个决策步的Attentivetransformer输出，最后将决策步输出结果集成到整体决策中；As a specific implementation of the present invention, the TabNet model is composed of a plurality of decision steps stacked, each of which is composed of a Feature transformer and an Attentive transformer, a Mask layer, a Split layer and a ReLU; if the input sample features have discrete features, TabNet first uses a training embedding method to map the discretized features into continuous numerical features, and then ensures that the data input form of each decision step is a B×D matrix, where B represents the size of the batch size and D represents the dimension of the well leakage parameter; the features of each decision step are output by the Attentive transformer of the previous decision step, and finally the output results of the decision step are integrated into the overall decision;

与现有技术相比，具有以下优点：Compared with the existing technology, it has the following advantages:

(1)本发明提出了一种基于WGAN-GP-TabNet模型预测井漏的方法，该方法基于对钻井数据的深度学习进行预测，准确度高。(1) The present invention proposes a method for predicting well leakage based on the WGAN-GP-TabNet model. The method is based on deep learning of drilling data and has high accuracy.

(2)针对钻井数据中漏失类数据不平衡情况，本发明使用生成数据模型平衡样本分布增强了数据特征。(2) To address the imbalance of missing data in drilling data, the present invention uses a generative data model to balance sample distribution and enhance data features.

(3)本发明提出了一种基于相关系数、互信息、LightGBM的井漏特征参数组合提取方法。相关性有线性相关性和非线性相关性，而非线性关系要比线性关系复杂的多且更加难以描述。目前，单一特征筛选方法难以准确描述出变量之间的所有相关关系，并且每一种方法衡量相关性的指标各不相同，因此我们考虑利用三种包含线性和非线性特征筛选方法的进行综合选择。(3) The present invention proposes a combined extraction method for well leakage characteristic parameters based on correlation coefficient, mutual information and LightGBM. Correlation includes linear correlation and nonlinear correlation, and nonlinear relationship is much more complex and more difficult to describe than linear relationship. At present, it is difficult for a single feature screening method to accurately describe all the correlations between variables, and each method has different indicators for measuring correlation. Therefore, we consider using three methods including linear and nonlinear feature screening methods for comprehensive selection.

(4)本发明采取了多种方式(LOESS降噪，平衡漏失样本分布)增强了模型的鲁棒性，具有稳定可靠，准确率高，便于操作，反应速度快，可迁移性强等优势，为场工程人员根据漏失预警采取相关的堵漏措施，维护钻井人员的安全和提高钻井过程的效率有积极的影响。(4) The present invention adopts multiple methods (LOESS noise reduction, balanced leakage sample distribution) to enhance the robustness of the model, which has the advantages of stability, reliability, high accuracy, easy operation, fast response speed, and strong portability. It has a positive impact on enabling field engineering personnel to take relevant plugging measures based on leakage warning, maintain the safety of drilling personnel and improve the efficiency of the drilling process.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是WGAN-GP-TabNet井漏预测系统流程图；FIG1 is a flow chart of the WGAN-GP-TabNet lost circulation prediction system;

图2是特征筛选流程图；Fig. 2 is a feature screening flow chart;

图3是WGAN-GP生成器和判别器损失函数变化曲线；Figure 3 is the loss function change curve of the WGAN-GP generator and discriminator;

图4是TabNet结构流程图；Figure 4 is a flow chart of TabNet structure;

图5是不同比例生成数据WGAN-GP-TabNet性能图。Figure 5 is the performance graph of WGAN-GP-TabNet with different ratios of generated data.

具体实施方式Detailed ways

下面结合实施例及附图，对本发明作进一步地的详细说明，但本发明的实施方式不限于此。The present invention will be further described in detail below in conjunction with embodiments and drawings, but the embodiments of the present invention are not limited thereto.

实施例1Example 1

图1是WGAN-GP-TabNet井漏预测系统流程图，具体实施方式将结合本图给与说明，如下所示：FIG1 is a flow chart of the WGAN-GP-TabNet lost circulation prediction system. The specific implementation method will be described in conjunction with this figure, as shown below:

步骤1：从现场收集钻井工程数据，建立防漏堵漏大数据库，并对数据进行预处理。本实施例收集西南油气田一口井数据，其特征参数有30余种。本实施例的数据预处理即为数据清洗，其包括缺失值处理，数据插补，异常值“3σ”法则检测剔除、数据整合、数据归一化。Step 1: Collect drilling engineering data from the site, establish a leak prevention and plugging database, and pre-process the data. This embodiment collects data from a well in the Southwest Oil and Gas Field, which has more than 30 characteristic parameters. The data pre-processing of this embodiment is data cleaning, which includes missing value processing, data interpolation, outlier "3σ" rule detection and elimination, data integration, and data normalization.

具体而言，可以用python中Pandas和NumPy两个库进行数据分析与处理，缺失值处理可用isnull()函数检测和fillna()函数填补，“3σ”法则是对于正态分布N(μ,)的数据,约有99.73％的数据落在μ正负3σ范围内，3σ范围外基本是缺失值，删除；数据清洗还包括删除空白、错误异常字符，检查数据行列结构，行列标题是否有误，是否存在重复的数组；数据归一化的计算式如下：Specifically, the Pandas and NumPy libraries in Python can be used for data analysis and processing. Missing values can be detected by the isnull() function and filled by the fillna() function. The "3σ" rule is that for data with normal distribution N(μ,), about 99.73% of the data falls within the range of μ plus or minus 3σ. Data outside the 3σ range are basically missing values and are deleted. Data cleaning also includes deleting blanks, erroneous and abnormal characters, checking the data row and column structure, whether the row and column headers are correct, and whether there are repeated arrays. The calculation formula for data normalization is as follows:

式中，为第i特征第j个数据的归一化值，x_ij为第i特征第j个参数的值；X_i为第i特征所有参数的集合，x_ij∈X_i。In the formula, is the normalized value of the jth data of the ith feature, x _ij is the value of the jth parameter of the ith feature; _Xi is the set of all parameters of the ith feature, x _ij ∈X _i .

步骤2：基于预处理后的数据中各特征参数与漏失流量的相关性筛选相关性强的特征参数，以预处理后的数据中相关性强的特征参数作为初始参数，对初始参数进行LOESS降噪。Step 2: Based on the correlation between each characteristic parameter in the preprocessed data and the leakage flow, the characteristic parameters with strong correlation are selected, and the characteristic parameters with strong correlation in the preprocessed data are used as the initial parameters, and LOESS denoising is performed on the initial parameters.

预处理后的数据中，相关的特征参数有30余种，周知的，各特征参数对漏失的影响是不同的，因此，有必要筛选出相关性最强的特征参数作为后期模型中的特征参数，本实施例具体采用如下方法筛选相关性强的特征参数：There are more than 30 relevant feature parameters in the preprocessed data. As is known, each feature parameter has a different effect on the loss. Therefore, it is necessary to select the feature parameters with the strongest correlation as the feature parameters in the later model. This embodiment specifically uses the following method to select the feature parameters with strong correlation:

S2、根据各特征参数与漏失流量的相关性确定不同模型中各特征参数的重要程度，本实施例以打分法表征特征参数的重要程度，比如，对于同一模型的特征参数按照相关性递增的顺序对各个特征参数进行排序，各个特征参数的分值等于其位次的值，即第一位次的特征参数为1分，第m位次的特征参数为m分；S2. Determine the importance of each characteristic parameter in different models according to the correlation between each characteristic parameter and the loss flow. In this embodiment, the importance of the characteristic parameters is characterized by a scoring method. For example, the characteristic parameters of the same model are sorted in the order of increasing correlation. The score of each characteristic parameter is equal to the value of its rank, that is, the first-rank characteristic parameter is 1 point, and the m-th-rank characteristic parameter is m points;

S3、综合同一特征参数在不同模型中的重要程度确定该特征参数的综合重要程度，筛选综合相关性强的特征参数，比如本实施中，计算各个特征参数的总分，最后根据总分排序确定各个特征参数的综合重要程度，然后对分值进行归一化(即所有特征参数的总分之和为1)，筛选归一化后数值大于0.01的特征参数作为目标特征参数，最终获得16个特征因素，如表1所示。S3. Determine the comprehensive importance of the feature parameter by comprehensively considering the importance of the same feature parameter in different models, and select feature parameters with strong comprehensive correlation. For example, in this implementation, calculate the total score of each feature parameter, and finally determine the comprehensive importance of each feature parameter according to the total score ranking, and then normalize the score (that is, the sum of the total scores of all feature parameters is 1), select the feature parameters with normalized values greater than 0.01 as the target feature parameters, and finally obtain 16 feature factors, as shown in Table 1.

各特征参数总分的计算式如下：The calculation formula for the total score of each characteristic parameter is as follows:

式中，D_j为第j个特征参数的总分，n为表征相关性的模型的总数，本实施例为3，f_i为第i种相关性模型的分值在总分值中的权重，本实施例均为1/3；D_ij为第i种相关性模型中第j特征参数的分值。In the formula, _Dj is the total score of the j-th characteristic parameter, n is the total number of models characterizing the correlation, which is 3 in this embodiment, _fi is the weight of the score of the i-th correlation model in the total score, which is 1/3 in this embodiment; _Dij is the score of the j-th characteristic parameter in the i-th correlation model.

特征参数数值归一化的计算式如下：The calculation formula for normalizing the characteristic parameter values is as follows:

式中，为第j个特征参数的归一化分值，m为特征参数的总数。In the formula, is the normalized score of the jth feature parameter, and m is the total number of feature parameters.

表1三种模型最终结果图Table 1 Final results of three models

S4：以预处理后数据中相关性强的特征参数作为初始参数，对初始参数使用LOESS算法降噪，在每个数据点附近使用一个局部的加权回归来拟合数据，从而得到更平滑的曲线或曲面，而不受全局拟合的影响；对于每个数据点x_i，LOESS使用一个权重函数w(x_i,x)来调整附近数据点的影响。S4: The characteristic parameters with strong correlation in the preprocessed data are used as the initial parameters, and the LOESS algorithm is used to reduce the noise of the initial parameters. A local weighted regression is used to fit the data near each data point to obtain a smoother curve or surface without being affected by the global fitting. For each data point x _i , LOESS uses a weight function w(x _i ,x) to adjust the influence of nearby data points.

步骤3：根据漏失流量对漏失程度进行分级，具体分级的数量可以根据需要选择，比如本实施确定为4级，具体分级标准如表2所示，由表可知，各个等级的数据之间是高度不平衡的，大部分数据集中在0类和1类，2类和3类一共也少于2％，不采取其他措施，只采用机器学习或者深度学习模型是难以准确且全面的识别出漏失最为严重的2类和3类。因此，统计LOESS处理后的初始参数中各类漏失等级的数据的比例，以漏失数据比例最高的漏失等级为基准，对于其余的漏失级别，如果其所占比例与比例最高的漏失等级所占比例的比值低于设定的归类阈值，则该漏失等级属于少数类，否则属于多数类，此处的归类阈值可以认为规定，但在机器学习或深度学习中，一般来说少数类为多数类的20％以下即可认为存在类不平衡问题，类不平衡问题会导致模型对少数类数据学习不够，使得少数类预测精确率和查全率不高，因此，此处，将归类阈值定为20％，其分类如表2所示。Step 3: Classify the degree of leakage according to the leakage flow. The specific number of levels can be selected as needed. For example, this implementation determines it as 4 levels. The specific classification standards are shown in Table 2. It can be seen from the table that the data of each level is highly unbalanced. Most of the data is concentrated in Class 0 and Class 1, and Class 2 and Class 3 account for less than 2% in total. Without taking other measures, it is difficult to accurately and comprehensively identify Class 2 and Class 3 with the most serious leakage using only machine learning or deep learning models. Therefore, the proportion of data of various missing levels in the initial parameters after LOESS processing is statistically calculated, and the missing level with the highest proportion of missing data is taken as the benchmark. For the remaining missing levels, if the ratio of its proportion to the proportion of the missing level with the highest proportion is lower than the set classification threshold, then the missing level belongs to the minority class, otherwise it belongs to the majority class. The classification threshold here can be considered as a regulation, but in machine learning or deep learning, generally speaking, when the minority class is less than 20% of the majority class, it can be considered that there is a class imbalance problem. The class imbalance problem will cause the model to not learn enough about the minority class data, resulting in low prediction accuracy and recall rate of the minority class. Therefore, here, the classification threshold is set to 20%, and its classification is shown in Table 2.

表2某井漏失等级分级表Table 2 Classification table of leakage level in a well

步骤4：将步骤3处理后的数据输入WGAN-GP模型，WGAN-GP是带有梯度惩罚项Wasserstein距离的生成对抗网络，把少量漏失、中等漏失、严重漏失类别对应的随机噪声输入生成器，生成器输出端生成相应类别数据，在判别器输出端判别数据误差，以此生成少数类数据。Step 4: Input the data processed in step 3 into the WGAN-GP model. WGAN-GP is a generative adversarial network with a gradient penalty term Wasserstein distance. The random noise corresponding to the categories of small dropouts, medium dropouts, and severe dropouts is input into the generator. The output of the generator generates data of the corresponding category, and the data error is discriminated at the output of the discriminator to generate minority class data.

本实施例构建WGAN-GP网络如下：其生成器具有4层隐藏层，隐藏层神经元数目为256，128，64，64，生成器输入层接受输入变量，输出层输出变量；判别器具有5层隐藏层，隐藏层神经元数目为256,128,64,64,32，判别器接受变量作为输入，判别器输出层为1个神经元，用于判断真假。此外，为了避免过拟合，在生成器和判别器隐藏层后均设置Dropout层，丢弃率为0.25，判别器输出层有一个节点，用于判断输入样本的真实性，采用Adam作为优化函数。This embodiment constructs the WGAN-GP network as follows: its generator has 4 hidden layers, the number of neurons in the hidden layer is 256, 128, 64, 64, the generator input layer accepts input variables, and the output layer outputs variables; the discriminator has 5 hidden layers, the number of neurons in the hidden layer is 256, 128, 64, 64, 32, the discriminator accepts variables as input, and the discriminator output layer is 1 neuron for judging true or false. In addition, in order to avoid overfitting, a Dropout layer is set after the generator and discriminator hidden layers, with a dropout rate of 0.25. The discriminator output layer has a node for judging the authenticity of the input sample, and Adam is used as the optimization function.

利用WGAN-GP生成数据的方法如下：The method of generating data using WGAN-GP is as follows:

(2a)在生成器输入随机噪声时加入类别标签进行类别指导，在判别器输出端判别数据所属类别(多数类或少数类)，实现按类别生成数据。(2a) When the generator inputs random noise, a category label is added for category guidance. At the output of the discriminator, the category to which the data belongs (majority class or minority class) is determined, thereby generating data by category.

(2b)将真实样本和生成样本一同输入判别器，使其学习捕捉数据分布的特征，以便让生成样本更接近真实样本。在数据分布逐渐接近的同时，引入分等级的任务，进一步训练判别器以区分真实样本和生成样本，并确保生成样本在分类任务上也具有良好的性能。(2b) The real samples and the generated samples are fed into the discriminator together, so that it can learn to capture the characteristics of the data distribution, so as to make the generated samples closer to the real samples. As the data distribution gradually approaches, hierarchical tasks are introduced to further train the discriminator to distinguish between real samples and generated samples, and ensure that the generated samples also have good performance in the classification task.

(3c)利用Adam方法，用当前迭代时WGAN-GP网络中判别器的损失值、生成器的损失值依次更新WGAN-GP网络中判别器、生成器的参数(附图3)。(3c) Using the Adam method, the loss value of the discriminator and the loss value of the generator in the WGAN-GP network at the current iteration are used to update the parameters of the discriminator and the generator in the WGAN-GP network in turn (Figure 3).

由图3可以看出，在6000轮次的训练中，在5500轮左右生成器和判别器误差达到最佳，可以提前停止训练，进行下一步：As can be seen from Figure 3, in 6000 rounds of training, the errors between the generator and the discriminator reach the best around 5500 rounds, so the training can be stopped early and the next step can be taken:

L_G＝1-D(G(z)) _LG ＝1-D(G(z))

式中L_G为生成器的损失函数；G(·)为生成函数；z为输入噪声；D(·)为判别器函数；L(G,D)为判别器损失函数；G为生成器；D为判别器；E(·)为期望函数；x为归一化后的输入数据；P_t(·)为真实数据分布；λ为惩罚项系数；·_p为p范数；为梯度算子；/>为真实数据和生成数据之间的随机插值，/>ζ服从[0,1]范围的均匀分布；/>为从真实数据分布和生成数据分布抽样点之间的均匀取样。Where L _G is the loss function of the generator; G(·) is the generating function; z is the input noise; D(·) is the discriminator function; L(G,D) is the discriminator loss function; G is the generator; D is the discriminator; E(·) is the expected function; x is the normalized input data; P _t (·) is the true data distribution; λ is the penalty term coefficient; · _p is the p-norm; is the gradient operator; /> is a random interpolation between the real data and the generated data, /> ζ follows a uniform distribution in the range [0,1];/> To sample uniformly between the sampling points from the true data distribution and the generated data distribution.

步骤5：利用步骤2经过LOESS降噪后的初始参数以及步骤4生成的少数类数据来训练井漏预警模型1000轮，然后在测试集上对模型进行性能评估；如果评估合格则确定模型参数得到TabNet模型，否则，则继续训练模型。Step 5: Use the initial parameters after LOESS denoising in step 2 and the minority class data generated in step 4 to train the well leakage warning model for 1000 rounds, and then evaluate the performance of the model on the test set; if the evaluation is qualified, determine the model parameters to obtain the TabNet model, otherwise, continue to train the model.

将生成的少数类数据和经过LOESS处理后的初始参数合并作为特征加强的数据，以特征加强过后的数据样本训练TabNet井漏预警模型训练。将数据集按照8:2比例随机划分为训练集和测试集，把数据输入TabNet模型，因TabNet模型超参数对于性能有着重要影响，使用TPE算法对模型参数进行调优，建立井漏预警模型。The generated minority data and the initial parameters after LOESS processing are combined as feature-enhanced data, and the TabNet well leakage warning model is trained with the feature-enhanced data samples. The data set is randomly divided into training set and test set in an 8:2 ratio, and the data is input into the TabNet model. Since the TabNet model hyperparameters have an important impact on performance, the TPE algorithm is used to tune the model parameters and establish a well leakage warning model.

本实施例的TabNet模型如下：The TabNet model of this embodiment is as follows:

(4.1)TabNet由多个决策步堆叠而成，每个决策步由Feature transformer和Attentive transformer、Mask层、Split层和ReLU组成。输入的样本特征如有离散型特征，TabNet首先利用训练嵌入的方式，将离散化特征映射成连续性数值特征，然后保证每个决策步数据输入形式都是B×D矩阵，其中B代表batch size的大小，D代表井漏漏失参数的维度。每个决策步的特征由上一个决策步的Attentive transformer输出，最后将决策步输出结果集成到整体决策中，如附图3所示。(4.1) TabNet is composed of multiple decision steps stacked together, each of which consists of Feature transformer and Attentive transformer, Mask layer, Split layer and ReLU. If the input sample features have discrete features, TabNet first uses the training embedding method to map the discretized features into continuous numerical features, and then ensures that the data input form of each decision step is a B×D matrix, where B represents the batch size and D represents the dimension of the well leakage parameter. The features of each decision step are output by the Attentive transformer of the previous decision step, and finally the output results of the decision step are integrated into the overall decision, as shown in Figure 3.

(4.2)Feature transformer的功能是实现了决策步的特征计算。Featuretransformer由BN层、门控线性单元(GLU)层和全连接层组成，GLU的目的是在原始FC层的基础上增加一个门单元，计算如公式如下所示：(4.2) The function of Feature transformer is to realize the feature calculation of the decision step. Feature transformer consists of BN layer, gated linear unit (GLU) layer and fully connected layer. The purpose of GLU is to add a gate unit on the basis of the original FC layer. The calculation formula is as follows:

其中，h(X)为特征变换器的输出；X为输入特征；W,b分别为全连接层的权重和偏置；*表示矩阵乘法(矩阵点积)；表示元素级别的异或操作；σ是sigmoid激活函数；V,c分别为GLU层的权重和偏置。特征转换层由两部分组成。该层前半部分属于共享决策步骤，每个决策步的feature transformer的共享决策步骤的参数共享，而后半部分是独立决策步骤，参数需要在每一决策步上单独训练的。Where h(X) is the output of the feature transformer; X is the input feature; W, b are the weight and bias of the fully connected layer respectively; * represents matrix multiplication (matrix dot product); represents the element-level XOR operation; σ is the sigmoid activation function; V and c are the weight and bias of the GLU layer respectively. The feature transformation layer consists of two parts. The first half of this layer belongs to the shared decision step, and the parameters of the shared decision step of the feature transformer of each decision step are shared, while the second half is an independent decision step, and the parameters need to be trained separately at each decision step.

(4.3)Split层的作用是对Feature transformer输出的向量切割，计算如下所示：(4.3) The function of the Split layer is to cut the vector output by the Feature transformer. The calculation is as follows:

[d[i]，a[i]]＝f_i(M[i].f)[d[i]，a[i]]＝ _fi (M[i].f)

上式中d[i]表示计算模型的最终输出，a[i]表示下一个决策步的Mask层；f_i为函数，用于处理Feature Transformer输出的向量的第i个元素；M[i]为Feature Transformer模型输出的向量的第i个元素；f为Feature Transformer模型。In the above formula, d[i] represents the final output of the calculation model, a[i] represents the Mask layer of the next decision step; _fi is a function used to process the i-th element of the vector output by the Feature Transformer; M[i] is the i-th element of the vector output by the Feature Transformer model; and f is the Feature Transformer model.

(4.4)Attentive transformer根据上一个决策步的输出结果获取当前决策步的Mask层矩阵，并使Mask矩阵是稀疏且不重复的。(4.4) The Attentive transformer obtains the Mask layer matrix of the current decision step based on the output result of the previous decision step, and makes the Mask matrix sparse and non-repetitive.

此外，为了克服数据类不平衡的问题，本实施例进一步探究了少数类中各等级数据生成量对模型准确度的影响，具体而言：把少数类样本数据使用WGAN-GP算法训练，A中16个井漏特征因素按照表3分组，设计8组不同比例的数据，以TabNet作为分类模型，在测试集上使用精确率，召回率，F1值和G-mean四个评价指标，寻找最佳的少数类数据生成量，结果如图5所示(图中，横坐标编号表示表3中实验编号)，当特征加强的数据(将生成的少数类数据和经过LOESS处理后的初始参数合并)中少数类数据中各等级数据数量与基准数据(0泄漏等级)数量比值为5:1时，效果最好，有效克服类不平衡，因此，设定WGAN-GP-TabNet模型中生成少数类数据时，以最终特征加强的数据中少数类数据中各等级数据与基准等级数据量比值为5:1为准。In addition, in order to overcome the problem of data class imbalance, this embodiment further explores the influence of the amount of data generated at each level in the minority class on the accuracy of the model. Specifically, the minority class sample data is trained using the WGAN-GP algorithm, the 16 well leakage characteristic factors in A are grouped according to Table 3, 8 groups of data with different proportions are designed, TabNet is used as the classification model, and four evaluation indicators of precision, recall, F1 value and G-mean are used on the test set to find the optimal amount of minority class data generated. The results are shown in Figure 5 (in the figure, the horizontal axis number represents the experimental number in Table 3). When the ratio of the number of data of each level in the minority class data to the number of benchmark data (0 leakage level) in the feature-enhanced data (the generated minority class data and the initial parameters after LOESS processing are merged) is 5:1, the effect is best, and the class imbalance is effectively overcome. Therefore, when setting the generation of minority class data in the WGAN-GP-TabNet model, the ratio of the amount of data of each level in the minority class data to the benchmark level in the final feature-enhanced data is 5:1.

表3加入WGAN-GP生成数据后的数据集Table 3. Dataset after adding WGAN-GP generated data

步骤:6：将数据进行步骤1相同数据处理并提取16个已筛选好的井漏特征因素，输入到训练好的TabNet模型，就能够预测漏失程度等级。Step 6: Perform the same data processing as step 1 and extract 16 well-screened well leakage characteristic factors, input them into the trained TabNet model, and the leakage degree can be predicted.

本发明使用模型测试15个不同深度的样本数据，间隔取样，从数据开始记录的500m开始，750m、1000m、1250m、1500m、1750m、2000m、2250m、2500m、2750m、3000m、3250m、3500m预测结果如下表所示。The present invention uses the model to test sample data at 15 different depths, with interval sampling, starting from 500m when the data starts to be recorded, 750m, 1000m, 1250m, 1500m, 1750m, 2000m, 2250m, 2500m, 2750m, 3000m, 3250m, and 3500m. The prediction results are shown in the following table.

表4漏失预测结果表Table 4. Leakage prediction results

深度(*10m)Depth (*10m) 5050 7575 100100 125125 150150 175175 200200 225225 250250 275275 300300 325325 350350 真实值actual value 00 00 00 00 00 11 22 00 33 11 11 00 00 预测值Predictive value 00 00 00 00 00 11 22 00 22 11 11 00 00

以上所述，仅为本发明较佳的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明实施例揭露的技术范围内，可轻易想到的变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应该以权利要求的保护范围为准。The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily thought of by a person skilled in the art within the technical scope disclosed in the embodiments of the present invention should be included in the protection scope of the present invention. Therefore, the protection scope of the present invention should be based on the protection scope of the claims.

Claims

1. A lost circulation early warning method based on WGAN-GP-TabNet algorithm is characterized by comprising the following steps:

Step 1, collecting drilling engineering data from the site, establishing a leakage prevention and blocking large database, and preprocessing the data;

step 2, screening the characteristic parameters with strong correlation based on the correlation between each characteristic parameter in the preprocessed data and the leakage flow, taking the characteristic parameters with strong correlation in the preprocessed data as initial parameters, and carrying out LOESS noise reduction on the initial parameters;

step 3, classifying the initial parameters after LOESS noise reduction according to the leakage flow, classifying each class according to the number of the initial parameters in each class after classification, and classifying the classes into a majority class and a minority class;

Step 4, inputting the data processed in the step 3 into WGAN-GP model to generate minority class data;

The WGAN-GP model is as follows: the generator has 4 hidden layers with a number of hidden layer neurons of 256,128,64, 64; the arbiter has 5 hidden layers, the number of neurons in the hidden layers is 256,128,64,64,32, and the output layer of the arbiter is 1 neuron; a Dropout layer is arranged behind the generator and the discriminator hiding layer, the discarding rate is 0.25, the discriminator output layer is provided with a node for judging the authenticity of an input sample, and Adam is adopted as an optimization function;

step 5, training and evaluating the lost circulation early warning model TabNet by using the initial parameters after the LOESS noise reduction in the step 2 and the minority class data generated in the step 4;

The TabNet model is formed by stacking a plurality of decision steps, and each decision step consists of Feature transformer and ATTENTIVE TRANSFORMER, a Mask layer, a Split layer and a ReLU; the input sample features are discrete features, tabNet firstly map the discrete features into continuous numerical features by using a training embedding mode, and then ensure that the data input form of each decision step is a B X D matrix, wherein B represents the size of batch size, and D represents the dimension of lost circulation parameters; the characteristics of each decision step are output by ATTENTIVE TRANSFORMER of the last decision step, and finally, the output result of the decision step is integrated into the overall decision;

and 6, acquiring field data and predicting the leakage degree by using a TabNet model which is trained, wherein the field data comprises the characteristic parameters with strong correlation.

2. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein the initial parameters after LOESS noise reduction are classified according to the leakage flow, and specific classification standards are as follows:

if the leak rate is 0m ³/h, the leak grade is no leak, and the leak grade is 0;

If the leak rate is greater than 0m ³/h and less than or equal to 5m ³/h, the leak rating is a small amount of leak, the leak rating is 1;

If the leak rate is greater than 5m ³/h and less than or equal to 15m ³/h, the leak rating is medium leak and the leak rating is 2;

if the leak rate is greater than 15m ³/h, the leak rating is severe and the leak rating is 3.

3. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein classifying each level according to the number of initial parameters in each level after classification specifically comprises: and counting the proportion of the data of various leakage grades in the initial parameters after the LOESS processing, taking the leakage grade with the highest leakage data proportion as a reference, and if the ratio of the proportion of the data of all the leakage grades to the proportion of the leakage grade with the highest proportion is lower than a set classifying threshold value, the leakage grade belongs to a minority class, and otherwise, the leakage grade belongs to a majority class.

4. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 3, wherein the set classification threshold is 20%.

5. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein the step 2 of screening the characteristic parameters with strong correlation comprises the following steps:

S1, respectively processing the preprocessed data by using three models of a Szechwan correlation coefficient, a mutual information method and LightGBM, and determining the correlation between each characteristic parameter in each model and the leakage flow;

S2, determining the importance degree of each characteristic parameter in different models according to the correlation between each characteristic parameter and the leakage flow;

S3, determining the comprehensive importance degree of the same feature parameter in different models by integrating the importance degree of the feature parameter, and screening the feature parameter with strong comprehensive relevance.

6. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 4, wherein step S2 includes ranking the feature parameters of the same model in the order of increasing correlation, the score of each feature parameter being equal to the value of its rank, step S3 includes calculating the total score of each feature parameter, and determining the comprehensive importance of each feature parameter according to the total score ranking;

Wherein, the calculation formula of each characteristic parameter total score is as follows:

Wherein D _j is the total score of the jth feature parameter, n is the total number of models characterizing the correlation, and f _i is the weight of the score of the ith correlation model in the total score; d _ij is the score for the j-th feature parameter in the i-th correlation model.

7. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein,

Step 4 comprises the following steps:

Adding a class label to conduct class guidance when random noise is input into the generator, judging the class to which the data belongs at the output end of the discriminator, and generating the data according to the class;

inputting the real sample and the generated sample into a discriminator together, so that the characteristics of the data distribution are learned and captured, and the generated sample is more similar to the real sample; introducing hierarchical tasks while gradually approaching data distribution, further training a discriminator to distinguish real samples from generated samples, and ensuring that the generated samples have good performance on classification tasks; in the iteration process, parameters of the discriminator and the generator in the WGAN-GP network are sequentially updated by using the Adam method according to the loss value of the discriminator in the WGAN-GP network and the loss value of the generator in the current iteration until errors of the generator and the discriminator are reached.

8. The lost circulation warning method based on WGAN-GP-TabNet algorithm according to claim 1, wherein the number of minority data generated in step 4 satisfies the following conditions: after merging the initial parameters after the LOESS noise reduction and the minority class data generated in the step 4, the ratio of the data quantity of the class with the largest data quantity to the data quantity of each class in the minority class is 5:1.