Prediction of Oil–Water Two-Phase Flow Patterns Based on Bayesian Optimisation of the XGBoost Algorithm
Figures
Figure 1. Flowchart of Bayesian optimisation of the XGBoost model.
Figure 2. XGBoost training flow chart.
Figure 3. Schematic of oil–water flow patterns (left) and the corresponding photographs (right).
Figure 4. Schematic of the experimental setup: 1. simulation well; 2. well inclination regulator; 3. oil–water mixer; 4, 5. position control valves; 6, 7. flow meters; 8. water pump; 9. oil pump; 10. water tank; 11. oil tank; 12. oil–water separation tank.
Figure 5. Confusion matrix of XGBoost predictions on the training set: (a) non-normalised data; (b) normalised data.
Figure 6. Confusion matrix of XGBoost predictions on the test set: (a) non-normalised data; (b) normalised data.
Figure 7. Scatter plot of XGBoost flow-pattern predictions for the training and test sets.
Figure 8. Confusion matrix of BO-XGBoost predictions on the training set: (a) non-normalised data; (b) normalised data.
Figure 9. Confusion matrix of BO-XGBoost predictions on the test set: (a) non-normalised data; (b) normalised data.
Figure 10. Scatter plot of BO-XGBoost flow-pattern predictions for the training and test sets.
Figure 11. XGBoost ROC curves.
Figure 12. BO-XGBoost ROC curves.
Figure 13. Flow-pattern prediction accuracy statistics.
Figure 14. Feature importance plot.
Figure 15. Global feature explanation plot.
Abstract
1. Introduction
2. Algorithm Principle
2.1. XGBoost Algorithm
2.2. Bayesian Optimisation Algorithm
- Initialise the model by randomly selecting several sets of hyperparameters as observation points.
- Use a probabilistic surrogate model to estimate the objective function.
- Use the acquisition function to determine the next observation point and substitute it into the objective function to obtain its observation value.
- Add the newly obtained observation to the historical dataset and update the probabilistic surrogate model, repeating until the evaluation budget is exhausted (a minimal code sketch of this loop is given below).
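To make the loop concrete, a minimal sketch in Python is shown below. It is illustrative only and not the authors' implementation: it assumes scikit-learn's GaussianProcessRegressor as the probabilistic surrogate, an expected-improvement acquisition evaluated on a random candidate pool, and a toy one-dimensional objective.

```python
# Minimal Bayesian-optimisation loop (illustrative sketch, not the authors' code).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                              # toy objective f(x) to be maximised
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 1))             # step 1: random initial observation points
y = np.array([objective(v[0]) for v in X])

for _ in range(20):                            # repeat steps 2-4 until the budget is spent
    gp = GaussianProcessRegressor().fit(X, y)  # step 2: Gaussian-process surrogate
    cand = rng.uniform(0, 1, size=(256, 1))    # random candidate pool
    mu, sigma = gp.predict(cand, return_std=True)
    imp = mu - y.max()                         # step 3: expected-improvement acquisition
    ei = imp * norm.cdf(imp / (sigma + 1e-9)) + sigma * norm.pdf(imp / (sigma + 1e-9))
    x_next = cand[np.argmax(ei)]
    y_next = objective(x_next[0])              # evaluate the true objective
    X = np.vstack([X, x_next])                 # step 4: add the observation to the dataset
    y = np.append(y, y_next)

print("best x:", X[np.argmax(y)].round(3), "best f(x):", round(y.max(), 4))
```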
3. Method Application
3.1. Data Preprocessing
3.2. Bayesian Optimisation XGBoost
- Define the objective function: The objective function was established as the mean accuracy of 10-fold cross-validation, with the maximum number of iterations for the Bayesian optimisation algorithm set to 200.
- Initial observation point selection: Within the predefined search ranges of the XGBoost model’s hyperparameters (such as n_estimators, learning_rate, gamma, max_depth, and subsample), several sets of hyperparameters were randomly selected as initial observation points. These points were used to train the model and obtain the initial distribution of the objective function and the initial observation set D.
- Gaussian process estimation: Based on the observation set D, a Gaussian process was employed as the probabilistic surrogate model to estimate the objective function (a Gaussian process assumes that any finite set of function values follows a joint Gaussian distribution, which allows the objective function to be modelled and predicted probabilistically).
- Acquisition function calculation: The acquisition function was used to select the next observation point and to compute its corresponding observation value, i.e., the model’s cross-validated prediction accuracy.
- Update the observation set: The new observation point was added to the historical observation set D, and the Gaussian process surrogate model was updated.
- Iteration judgment: If the maximum number of iterations had not been reached, the Gaussian process estimation, acquisition function calculation, and observation set update steps were repeated; once it had been reached, the optimal hyperparameter combination and the corresponding optimal objective value were output, and the model’s performance was evaluated on the testing set (an illustrative code sketch of this procedure follows the list).
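As an illustration of the procedure above (a sketch under stated assumptions, not the authors' released code), the snippet below couples the xgboost scikit-learn wrapper with the bayesian-optimization package; the search ranges mirror Table 3, the objective is the mean accuracy of 10-fold cross-validation, and a synthetic dataset stands in for the experimental data.

```python
# Hedged sketch of the BO-XGBoost workflow; the package choice, data, and wiring are
# assumptions for illustration, with search ranges taken from Table 3.
from bayes_opt import BayesianOptimization          # pip install bayesian-optimization
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the normalised experimental features and encoded flow patterns.
X_train, y_train = make_classification(n_samples=300, n_features=4, n_informative=3,
                                        n_redundant=0, n_classes=5,
                                        n_clusters_per_class=1, random_state=0)

def cv_accuracy(n_estimators, learning_rate, max_depth, subsample,
                colsample_bytree, gamma, alpha, min_child_weight):
    model = XGBClassifier(
        n_estimators=int(n_estimators), learning_rate=learning_rate,
        max_depth=int(max_depth), subsample=subsample,
        colsample_bytree=colsample_bytree, gamma=gamma,
        reg_alpha=alpha, min_child_weight=min_child_weight,
        eval_metric="mlogloss",
    )
    # Objective: mean accuracy of 10-fold cross-validation (step 1).
    return cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy").mean()

bounds = {                                           # search ranges as listed in Table 3
    "n_estimators": (100, 500), "learning_rate": (0.01, 0.3),
    "max_depth": (3, 15), "subsample": (0.5, 1.0),
    "colsample_bytree": (0.5, 1.0), "gamma": (0, 5),
    "alpha": (0, 10), "min_child_weight": (0, 10),
}

optimiser = BayesianOptimization(f=cv_accuracy, pbounds=bounds, random_state=1)
optimiser.maximize(init_points=10, n_iter=200)       # steps 2-6, up to 200 iterations
print(optimiser.max)                                 # best hyperparameters and accuracy
```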
3.3. XGBoost
4. Experiment
4.1. Experimental Design
4.2. Prediction Results Analysis
4.3. Model Interpretability and Feature Analysis
4.4. Limitations of the Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Wu, Y.; Guo, H.; Song, H.; Deng, R. Fuzzy inference system application for oil-water flow patterns identification. Energy 2022, 239, 122359. [Google Scholar] [CrossRef]
- Ohnuki, A.; Akimoto, H. Experimental study on transition of flow pattern and phase distribution in upward air-water two-phase flow along a large vertical pipe. Int. J. Multiph. Flow 2000, 26, 367–386. [Google Scholar] [CrossRef]
- Xu, X.X. Study on oil-water two-phase flow in horizontal pipelines. J. Pet. Sci. Eng. 2007, 59, 43–58. [Google Scholar] [CrossRef]
- Bannwart, A.C.; Rodriguez, O.M.; Trevisan, F.E.; Vieira, F.F.; De Carvalho, C.H. Experimental investigation on liquid-liquid-gas flow: Flow patterns and pressure-gradient. J. Pet. Sci. Eng. 2009, 65, 1–13. [Google Scholar] [CrossRef]
- Sun, Y.; Guo, H.; Liang, H.; Li, A.; Zhang, Y.; Zhang, D. A Comparative Study of Oil-Water Two-Phase Flow Pattern Prediction Based on the GA-BP Neural Network and Random Forest Algorithm. Processes 2023, 11, 3155. [Google Scholar] [CrossRef]
- Wang, Y.; Cai, Z.; Yu, L. Prediction Model for Goodwill Impairment Based on Machine Learning. Account. Res. 2024, 3, 51–64. [Google Scholar]
- Zhang, Y.; Liu, R.; Chen, H. Financial Crisis Prediction Model Based on Particle Swarm Optimization and Kernel Extreme Learning Machine. Stat. Decis. 2019, 35, 67–71. [Google Scholar] [CrossRef]
- Zhang, X. Enterprise Financial Distress Prediction Method Based on Subspace Multi-Kernel Learning. Oper. Manag. 2021, 30, 184–191. [Google Scholar]
- Sukpancharoen, S.; Katongtung, T.; Rattanachoung, N.; Tippayawong, N. Unlocking the potential of transesterification catalysts for biodiesel production through machine learning approach. Bioresour. Technol. 2023, 378, 128961. [Google Scholar] [CrossRef] [PubMed]
- Şahin, S. Comparison of machine learning algorithms for predicting diesel/biodiesel/iso-pentanol blend engine performance and emissions. Heliyon 2023, 9, e21365. [Google Scholar] [CrossRef] [PubMed]
- Tang, Q.; Wang, T. Productivity Prediction of Fractured Horizontal Wells Based on XGBoost. China Petrochem. Stand. Qual. 2023, 43, 15–17. [Google Scholar]
- Zhao, R.; Yang, L.; Xu, X.; Ma, W.; Li, J. Lithology Identification Method and Research of Volcanic Rocks Based on XGBoost Algorithm. Adv. Geophys. 2024, 1–12. Available online: http://kns.cnki.net/kcms/detail/11.2982.P.20240611.1227.017.html (accessed on 20 June 2024).
- Wu, J.; Chen, S.; Chen, X.; Zhou, R. Model Selection and Hyperparameter Optimization Based on Reinforcement Learning. J. Univ. Electron. Sci. Technol. China 2020, 49, 255–261. [Google Scholar]
- Chai, D.; Xu, S.; Luo, C.; Lu, Y. Object Accurate Localization of Remote Sensing Image Based on Bayesian Optimization. Remote Sens. Technol. Appl. 2020, 35, 1377–1385. [Google Scholar]
- Guo, L.; Wang, Y. Research on Prediction of Stored Grain Temperature Based on XGBoost Optimization Algorithm. Cereals Oils 2022, 35, 78–82. [Google Scholar]
- Zhou, X.; Wang, R.; Dai, Y.; Zhang, J.; Sun, Y. Classified Early Warning of Coal Spontaneous Combustion Based on BO-XGBoost. Coal Eng. 2022, 54, 108–114. [Google Scholar]
- Chen, T.Q.; Guestrin, C. XGBoost: A scalable tree boosting system. arXiv 2016, arXiv:1603.02754. Available online: http://arxiv.org/abs/1603.02754.pdf (accessed on 20 June 2024).
- Mockus, J. Application of Bayesian approach to numerical methods of global and stochastic optimization. J. Global Optim. 1994, 4, 347–365. [Google Scholar] [CrossRef]
- Pelikan, M. Bayesian Optimization Algorithm: From Single Level to Hierarchy. Ph.D. Thesis, University of Illinois at Urbana-Champaign, Urbana, IL, USA, 2002. [Google Scholar]
- Cui, J.; Yang, B. Survey on Bayesian optimization methodology and applications. J. Softw. 2018, 29, 3068–3090. (In Chinese) [Google Scholar]
- Luo, Y.; Wang, C.; Ye, W. Interpretable prediction model of acute kidney injury based on XGBoost and SHAP. J. Electron. Inf. Technol. 2022, 44, 27–38. [Google Scholar]
Fluid | Density (g/cm³) | Viscosity (mPa·s) | Surface Tension (mN/m)
---|---|---|---
Oil | 0.826 | 2.92 | 30.00
Water | 0.988 | 1.16 | 72.00
Flow Pattern | Schematic Diagram | Coding
---|---|---
Bubbly flow | — | 0
Emulsion flow | — | 1
Frothy flow | — | 2
Wavy flow | — | 3
Stratified flow | — | 4
Parameter | Search Scope | Optimal Parameters | Parameter Meanings |
---|---|---|---|
colsample_bytree | [0.5, 1.0] | 0.71 | Feature random sampling ratio |
learning_rate | [0.01, 0.3] | 0.23 | Learning rate |
max_depth | [3, 15] | 12 | Maximum tree depth |
n_estimators | [100, 500] | 200 | Number of decision trees |
subsample | [0.5, 1.0] | 0.79 | Sample sampling ratio |
gamma | [0, 5] | 1.0 | Minimum loss reduction required for a node split |
alpha | [0, 10] | 3.56 | L1 regularisation coefficient |
min_child_weight | [0, 10] | 0.3 | Minimum sum of instance weights in a child node |
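For reference, an XGBoost classifier configured with the optimal values from the table would look roughly as follows. This is a sketch assuming the xgboost scikit-learn wrapper, in which the alpha of the table corresponds to the reg_alpha argument.

```python
from xgboost import XGBClassifier

# Classifier configured with the optimal hyperparameters listed in Table 3
# (illustrative; training data and preprocessing are assumed to be prepared elsewhere).
best_model = XGBClassifier(
    colsample_bytree=0.71,
    learning_rate=0.23,
    max_depth=12,
    n_estimators=200,
    subsample=0.79,
    gamma=1.0,
    reg_alpha=3.56,               # "alpha" in Table 3
    min_child_weight=0.3,
    objective="multi:softprob",   # five encoded flow-pattern classes
    eval_metric="mlogloss",
)
# best_model.fit(X_train, y_train) would then train on the normalised feature set.
```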
Algorithm Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
XGBoost | 0.750 | 0.788 | 0.791 | 0.784 |
BO-XGBoost | 0.938 | 0.967 | 0.971 | 0.966 |
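The scores in the table can be reproduced with scikit-learn; the sketch below assumes macro averaging over the five flow-pattern classes, which is why precision, recall, and F1 need not coincide with accuracy.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_test / y_pred are tiny stand-ins; in practice they are the encoded flow patterns
# of the test set and the corresponding model predictions.
y_test = [0, 1, 2, 2, 3, 4, 0, 1]
y_pred = [0, 1, 2, 1, 3, 4, 0, 1]

scores = {
    "Accuracy":  accuracy_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
    "Recall":    recall_score(y_test, y_pred, average="macro", zero_division=0),
    "F1 Score":  f1_score(y_test, y_pred, average="macro", zero_division=0),
}
print(scores)
```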
Inclination (°) | Flow Rate (m³/d) | Water Cut (%) | Actual Flow Pattern | XGBoost Prediction | BO-XGBoost Prediction | Accuracy (%)
---|---|---|---|---|---|---|
0 | 100 | 20 | bubble flow | emulsion flow | bubble flow | 93.75% |
0 | 300 | 40 | bubble flow | bubble flow | bubble flow | |
0 | 300 | 60 | bubble flow | bubble flow | bubble flow | |
0 | 600 | 80 | emulsion flow | frothy flow | emulsion flow | |
60 | 100 | 20 | bubble flow | emulsion flow | bubble flow | |
60 | 300 | 40 | emulsion flow | emulsion flow | emulsion flow | |
60 | 600 | 60 | frothy flow | frothy flow | frothy flow | |
60 | 600 | 90 | frothy flow | frothy flow | frothy flow | |
85 | 100 | 20 | wavy flow | wavy flow | wavy flow | |
85 | 300 | 40 | bubble flow | bubble flow | bubble flow | |
85 | 300 | 80 | frothy flow | bubble flow | bubble flow | |
85 | 600 | 90 | frothy flow | frothy flow | frothy flow |
90 | 100 | 20 | stratified flow | stratified flow | stratified flow | |
90 | 300 | 40 | frothy flow | frothy flow | frothy flow | |
90 | 600 | 60 | frothy flow | frothy flow | frothy flow | |
90 | 600 | 90 | frothy flow | frothy flow | frothy flow |