Fracturing Productivity Prediction Model and Optimization of the Operation Parameters of Shale Gas Well Based on Machine Learning

Based on static and dynamic data from 137 fractured wells in the WY shale gas block in Sichuan, China, this paper analyzes the factors influencing shale gas fracturing production and develops a production prediction model and a fracturing parameter optimization model. Taking the geological, engineering, fracturing operation, and production data of fractured wells in the WY block as the data set, the main control factor analysis method is used to select the factors influencing shale gas fracturing production, which form the sample set. Production prediction models based on six machine learning (ML) algorithms, namely random forest (RF), back propagation (BP) neural network, support vector regression (SVR), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and multivariable linear regression (LR), are established; the evaluation results show that the XGBoost model performs best on this sample set. A method for selecting the fracturing operation scheme set of a shale gas well is studied, and the production rate and the ratio of cost and profit (ROCP) are considered together to select the final fracturing operation scheme. The results show that the data-driven production prediction model and fracturing parameter optimization model can be used not only to predict shale gas fracturing production and optimize operation parameters but also to perform sensitivity analysis of fracturing parameters and compare the effects of fracturing operation schemes, which gives them good field application value.


Introduction
The porosity and permeability of shale gas reservoirs are extremely low, and it is generally difficult for a single well to obtain natural productivity. Horizontal wells and hydraulic fracturing create a complex fracture network of natural and hydraulic fractures, which improves the gas seepage capacity of the reservoir [1]. After fracturing, the fracture network is complex, the fractures change dynamically, pressure channeling between wells on a multiwell platform is significant, and the multiphase flow of gas, water, and fracturing fluid is complex, so many factors affect single-well production. Even when the difference in reservoir geological characteristics is small, different operation parameters often lead to large differences in production between wells. Compared with conventional reservoirs, flow in the matrix of shale gas reservoirs no longer follows Darcy's law, and the transport of gas through extremely tight shale gas formations at elevated pressure involves a number of transport mechanisms that Darcy's law cannot describe [2,3]. The factors affecting production include reservoir parameters, rock mechanics parameters, completion parameters, fracturing parameters, and the production system after fracturing. The relationship between these factors and production is not clear, and there are also complex nonlinear relationships among the factors themselves. The stimulation mechanism of shale gas wells is complicated, and fluid production is affected by spontaneous capillary imbibition, the energy utilization rate of the fracturing fluid, the closure behavior of the main and induced fractures, and so on. This makes it difficult to predict the production of shale gas wells and leaves subsequent development scheme design and adjustment without a scientific basis.
Therefore, how to efficiently and accurately evaluate and predict the production of shale gas wells is very important to improve the development effect of shale gas resources.
Scholars at home and abroad have used empirical formulas, analytical methods, and numerical methods to carry out many studies on shale gas production forecasting. Empirical formulas and analytical models struggle to account for the complex seepage characteristics of shale gas reservoirs, and different models apply under different conditions and at different production stages, resulting in a large difference between the predicted and the actual results [4,5]. Because of the extremely complex fracture system and the many "unknowns" in shale gas production, numerical models of the fracture network of a shale gas well group are computationally expensive, difficult to history match, inefficient for production prediction, and highly uncertain in their results. Numerical simulation may therefore not be the most effective method for studying shale reservoirs [6], and new methods are needed for shale gas well production prediction. Data-driven methods are better suited to multifactor, nonlinear forecasting and optimization problems and have attracted the attention of many scholars.
Gong et al. [7] and Yu et al. [8] studied the uncertainty in shale gas production forecasting. Ma et al. [9] proposed an ML-based method for nondeterministic prediction of shale gas production capacity. Sun et al. [10] proposed a shale gas production prediction method based on the long short-term memory (LSTM) algorithm. Comparison with the traditional decline curve analysis (DCA) method verified its effectiveness, showing higher accuracy and smaller calculation error; the method can also consider complex operation scenarios and reflect unconventional reservoir characteristics that the DCA method cannot capture. Han et al. [11] established an artificial neural network model based on the DCA method, which can predict the future production of shale wells under transient flow conditions. Lee et al. [12] used the LSTM network to predict shale gas production with high speed and accuracy. Liang and Zhao [13] used the RF algorithm, combined with rock physics, completion variables, and spatial information, to predict the recovery factor of shale gas. Tan et al. [14] combined the principal component analysis (PCA) and binary neural network (BNN) algorithms to establish a prediction model for the effective period of shale gas production and used the model to optimize fracturing parameters such as the proppant intensity and operation sections. Based on Marcellus shale completion and stimulation parameters, Al-Alwani et al. [15] used the partial least squares (PLS) machine learning method to establish a shale gas production prediction model, which can not only predict short-term and long-term gas production but also analyze the importance of relevant parameters, so as to optimize the future productivity of the oilfield. Wang et al. [16] established deep belief network (DBN) models to predict the production performance of unconventional wells effectively and accurately and demonstrated the effectiveness of the trained model in fracturing design optimization. Because the characteristics of shale gas reservoirs in China differ considerably from those in other countries, and the main controlling factors of their production also differ, the adaptability of the models established in the above research needs further verification.
This paper analyzes field data collected from 137 fractured wells in the WY block in Sichuan, China, and applies big data intelligent algorithms to establish a new method for predicting the fracturing production of shale gas wells. Six shale gas production prediction models are established, based on the BP neural network, RF, SVR, XGBoost, LightGBM, and multivariable LR. Their prediction performance is compared, and the model with the highest prediction accuracy is selected to predict the fracturing production of shale gas wells and optimize fracturing parameters.

Models and Methods
The main process steps of data-driven fracturing productivity prediction and operation parameter optimization of shale gas wells are shown in Figure 1.
Step 1. Build the data sample set. After data preprocessing and analysis of the main control factors, parameters that are strongly linearly correlated with one another or have little influence on fracturing productivity are eliminated, and the sample set for the prediction models is constructed.
Step 2. Divide the data set into a training set, a validation set, and a test set, which are used to train the model, adjust hyperparameters, and evaluate model prediction performance, respectively.
Step 3. Train the machine learning model on the training set, and search for the optimal hyperparameters of the model on the validation set to obtain the selected fracturing production prediction model.
Step 4. Analyze the geological parameters and the range of operation parameters of historical fracturing wells, and establish the constraint conditions for selecting the fracturing operation scheme set of the well.
Step 5. Based on the constructed fracturing operation scheme set, use the selected prediction model for sensitivity analysis to optimize and recommend fracturing operation parameters for shale gas wells.
2.1. Analysis of Production Influencing Factors. This paper compiles the geological, fracturing, and production data of 137 wells in the WY block, Sichuan, as shown in Table 1. The Pearson correlation coefficient, denoted R, measures the linear correlation between two random variables X and Y. The value of R lies between -1 and 1; the greater its absolute value, the stronger the correlation. The Pearson correlation coefficient method is used to analyze the relationship between 11 influencing factors and the production of the 137 shale gas wells in the WY block. The analysis results are shown in Figure 2. Among the 11 production influencing factors, the fracturing section length and the operation sections are strongly related to the average daily gas rate in the first year. The average gamma while drilling, TOC content, fracturing fluid volume per well, and proppant volume per well are moderately related to the average daily gas rate in the first year. The air content and brittleness index are weakly correlated with the average daily gas rate in the first year. The proppant to liquid ratio, maximum pumping flow rate, and proppant intensity are weakly correlated with the average daily gas rate in the first year.
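The Pearson coefficient described above can be computed directly from its definition. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the sample values are illustrative assumptions, not data from the WY block.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient R between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical illustrative values, NOT data from the paper.
section_length_m = [1200.0, 1350.0, 1500.0, 1650.0, 1800.0]
daily_gas_rate = [6.1, 7.0, 7.4, 8.9, 9.3]
r = pearson_r(section_length_m, daily_gas_rate)
print(f"R = {r:.3f}")
```

Applying such a function to each of the 11 candidate factors against the first-year average daily gas rate would give the kind of ranking summarized in Figure 2.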
The fracturing section length depends on the length of horizontal section of shale gas well to a certain extent. Therefore, when the geological conditions are determined and the drilling has been completed, the fracturing effect can be improved by optimizing the number of operation sections, proppant volume per well, and other parameters.
Among the 11 influencing factors, proppant to liquid ratio, maximum pumping flow rate, and proppant intensity have little influence on production. When establishing the production prediction model, these three parameters are eliminated to simplify the model.

Optimization Model of Fracturing Operation Parameters.
There are three core issues in applying the production prediction model to construct the fracturing parameter optimization model. The first is to construct a reasonable fracturing operation scheme set, which improves both the calculation speed of the model and the match between the model and the data. The second is to select the production prediction model with the best prediction performance; a high-precision production prediction model is the prerequisite for optimizing fracturing parameters and improves the optimization of the fracturing program. The third is to establish a reasonable evaluation method for fracturing parameter optimization. It is not reasonable to use maximum production as the only optimization criterion; the fracturing ROCP must also be incorporated into the parameter optimization process. Therefore, the optimization model is formulated as Equation (1). In Equation (1), $P_i$ is the $i$th influencing factor of production: $P_1$ is average gamma while drilling, $P_2$ is TOC content, $P_3$ is air content, and $P_4$ is brittleness index; these four parameters are static geological parameters of the reservoir, and R is a constant. $P_5$ is fracturing section length, $P_6$ is operation sections, $P_7$ is fracturing fluid volume per well, and $P_8$ is proppant volume per well. $Q_g$ is the objective function; ROCP, $P_5$, $P_7$, and $P_8$ are constraint variables, where $R_0$ is the maximum ROCP required by the oilfield site, $M_i$ ($i = 1, 3, 5$) is the minimum value of the corresponding existing construction parameter on the oilfield site, and $M_i$ ($i = 2, 4, 6$) is the maximum value of the corresponding existing construction parameter on the oilfield site. The last items of the constraints are auxiliary equations that represent the relationships between some of the decision variables.
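Equation (1) itself did not survive in this copy of the text. From the symbol definitions above, one plausible form is sketched below; the pairing of each bound $M_i$ with a particular variable and the functional notation $f(\cdot)$ for the prediction model are assumptions, and the auxiliary equations mentioned in the text are not reconstructed.

```latex
\begin{aligned}
\max\ \ & Q_g = f(P_1, P_2, \ldots, P_8) \\
\text{s.t.}\ \ & \mathrm{ROCP} \le R_0, \\
& M_1 \le P_5 \le M_2, \\
& M_3 \le P_7 \le M_4, \\
& M_5 \le P_8 \le M_6 .
\end{aligned}
```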
The operation process of fracturing parameter optimization model is as follows:

Production Prediction Model and Performance Evaluation.
In order to establish the optimal shale gas fracturing production prediction model, this paper first uses six data mining algorithms, including BP neural network, RF, SVR, XGBoost, LightGBM, and multivariable LR, to establish production prediction models, compares the prediction performance of these models, and selects the model with the highest prediction accuracy to further predict the fracturing production of shale gas wells and optimize fracturing parameters.

Model Comparison.
The BP neural network is a multilayer feedforward network trained with the error back-propagation algorithm. By simulating the function of human neurons, it can learn and store a large number of input-output mappings without requiring an explicit description of the relationship between the variables. Because it is built directly from input and output data, it is well suited to modeling and simulating nonlinear systems. The basic idea of RF is to first use bootstrap sampling to draw k samples from the original training set, each the same size as the original training set, then build one decision tree model on each of the k samples, and finally take the average of the k prediction results as the final prediction.
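The bootstrap-and-average idea behind RF can be sketched in a few lines. The example below is a simplified illustration, not the model used in the paper: it uses a single-feature, one-split regression tree (a "stump") as the base learner and omits the random feature selection that a full random forest performs at each split. All function names and the toy data are assumptions.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in this bootstrap sample is identical
        mean_y = sum(ys) / len(ys)
        return (float("inf"), mean_y, mean_y)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def bagged_predict(xs, ys, x_new, k=25, seed=0):
    """Average the predictions of k stumps, each trained on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(predict_stump(stump, x_new))
    return sum(predictions) / k

# Toy data (hypothetical): a step-shaped response.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
low = bagged_predict(toy_x, toy_y, 1)
high = bagged_predict(toy_x, toy_y, 8)
```

Averaging over many bootstrap-trained learners reduces the variance of any single tree, which is one reason RF tends to be stable on small data sets.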
The support vector machine (SVM) is a data mining method based on statistical learning theory. In its basic form it is a binary classification model whose purpose is to find a hyperplane that separates the samples. The strategy is to construct a linear classifier with the largest margin, bounded by $w^T x + b = \pm 1$, which is finally transformed into a convex quadratic programming problem. SVR applies the same principle to regression problems.
The basic principle of XGBoost is to combine a large number of tree models with lower accuracy into one model with higher accuracy [17]. XGBoost supports both a tree base learner (gbtree) and a linear base learner (gblinear), the latter giving linear regression or logistic regression with L1 + L2 penalties. Its loss function uses a second-order Taylor expansion. The method is accurate, resistant to overfitting, and scalable, and it can process high-dimensional sparse features in a distributed manner.
LightGBM is a gradient boosting framework that uses a learning algorithm based on decision trees [18]. This algorithm introduces two new techniques on top of the traditional gradient boosting decision tree (GBDT): gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). GOSS discards a large part of the data with very small gradients and uses only the remaining data to estimate the information gain, thus avoiding the influence of the long tail of low-gradient samples. EFB bundles mutually exclusive features to reduce the number of features.
Multiple LR is a method for studying the linear relationship between a dependent variable and multiple independent variables. It fits a linear function of the independent variables to the dependent variable, determines the parameters of the multiple linear regression model, and thereby obtains the regression equation. The trend of the dependent variable is predicted by the regression equation, and the regression analysis is conditional on the given values of the explanatory variables.
For the six machine learning algorithms above, eight influencing factors are finally selected from the 137 fractured well samples as the model inputs, namely, average gamma while drilling, TOC content, air content, brittleness index, fracturing section length, operation sections, fracturing fluid volume per well, and proppant volume per well; the average daily gas rate in the first year is used as the model output to establish the productivity prediction model. The constructed data set is divided into a training set (60%), a validation set (20%), and a test set (20%), which are used for model training, hyperparameter optimization, and evaluation of model prediction performance, respectively.
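The 60/20/20 partition can be illustrated with a short standard-library sketch. The shuffling seed, the rounding rule, and the function name `split_indices` are assumptions; the paper does not state how the 137 wells were rounded into the three subsets.

```python
import random

def split_indices(n_samples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle sample indices and split them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n_samples)
    n_val = round(val_frac * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(137)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 82 27 28
```

With this rounding rule, 137 wells give 82 training, 27 validation, and 28 test samples; a different rounding convention would shift one or two wells between subsets.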
This paper uses four indicators to evaluate the performance of the six fracturing production prediction models. The four evaluation indicators are the magnitude of relative error (MRE), the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination ($R^2$), as shown in Table 2.
MAE: the smaller the MAE, the smaller the error.
RMSE: the smaller the RMSE, the smaller the error; the larger the RMSE, the larger the error.
$R^2$: the value is between 0 and 1; the larger the value, the better the model fit.
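The four indicators can be computed as in the sketch below. The formulas in Table 2 are not reproduced in this copy, so the standard definitions are assumed; in particular, MRE is taken here to be the mean of |error| / |true value|, which is an assumption about the paper's definition.

```python
import math

def evaluate(y_true, y_pred):
    """Return MRE, MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumed definition: mean relative error (requires nonzero true values).
    mre = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MRE": mre, "MAE": mae, "RMSE": rmse, "R2": r2}

# Hypothetical values for illustration only.
metrics = evaluate([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 4.0])
print(metrics)
```

For this toy input, one prediction is off by 1, which gives MAE = 0.25, RMSE = 0.5, MRE = 0.25, and $R^2$ = 0.8.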

Lithosphere
The performance of the ML models is evaluated with the above four indicators, and the best-performing production prediction model is selected. The model performance comparison results are shown in Table 3. The XGBoost and LightGBM models give better predictions on both the training set and the test set: their $R^2$ values are larger, and their RMSE, MAE, and MRE are all smaller. Of these two models, XGBoost shows a smaller difference in prediction performance between the training set and the test set, so the model is more stable. Therefore, this paper chooses XGBoost as the final production prediction model.

XGBoost Model Parameter Setting.
Because the data set used for this model is small, the k-fold cross-validation method is used to optimize the parameters of the model. In k-fold cross-validation, the data set is divided into k parts; in each round, k − 1 parts are used as training data and the remaining part as test data, and a test accuracy is obtained. The average of the k test accuracies is taken as the final test accuracy. In this optimization, k is set to 10 in view of the small sample size. The value range of num_boost_round is set to 10000-30000 with a tuning step of 100. The value range of learning_rate is set to 0.001-0.1 with a tuning step of 0.001. The value range of max_depth is set to 3-15 with a tuning step of 1. The value range of subsample, which is used to increase the randomness of the model, is set to 0.5-1 with a tuning step of 0.1. Grid search is used to tune the parameters, and some results are shown in Table 4. After the final adjustment, the parameters of the model are num_boost_round = 20000, learning_rate = 0.005, max_depth = 5, and subsample = 0.7. Based on the XGBoost algorithm, a shale gas horizontal well fracturing production prediction model is established. The prediction results are shown in Figure 3. The prediction errors of the training and test samples are small, and the predicted production is concentrated near the diagonal, indicating that the production prediction based on the XGBoost model is reliable.
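The k-fold procedure described above can be sketched without any ML library. In the example below, `score_fold` is a hypothetical callback standing in for "train on the k − 1 parts, score on the held-out part"; the fold layout (contiguous, unshuffled) is also an assumption.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, score_fold, k=10):
    """Average the per-fold scores returned by score_fold(train, test)."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Placeholder scorer (hypothetical): returns the held-out fold size instead of
# a real model score, just to show the mechanics.
mean_fold_size = cross_validate(137, lambda train, test: len(test), k=10)
```

In a grid search, a function like `cross_validate` would be called once per candidate parameter combination, and the combination with the best averaged score would be kept.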

Results and Discussion
The WY shale gas field is a large dome anticline structure characterized by low porosity, low permeability, and heterogeneity. Its development depends on horizontal wells and hydraulic fracturing technology. Production after fracturing is affected by geological, engineering, and fracturing operation parameters. Using data from 137 wells in the WY block as samples, the influencing factors are analyzed, and the ML-based fracturing production prediction model and operation parameter optimization model are constructed to recommend fracturing operation schemes.

Operation of Single Well Fracturing Operation Scheme Set in WY Block.
The distribution of fracturing operation parameters of the wells that have been put into production in the WY block is shown in Table 5. From the upper and lower limits of the three parameters in Table 6 (operation sections, proppant volume per section, and fracturing fluid volume per section) and the step length set for each parameter, the number of values of each parameter is obtained: 21 values for operation sections, 47 values for proppant volume per section, and 15 values for fracturing fluid volume per section. This gives a scheme set containing 21 × 47 × 15 = 14805 schemes. Within the scheme set, all fracturing parameters of a single well must lie within the range of parameters used in existing operations. After this screening, 4201 fracturing parameter combination schemes are selected.
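The size of the scheme set follows from a Cartesian product of the three parameter grids. The sketch below reproduces the count 21 × 47 × 15 = 14805; the bounds and step lengths are hypothetical placeholders chosen only to give the stated number of values, since the actual limits are in Table 6.

```python
from itertools import product

def build_grid(lower, upper, step):
    """Evenly spaced candidate values from lower to upper, inclusive."""
    count = int(round((upper - lower) / step)) + 1
    return [lower + i * step for i in range(count)]

# Hypothetical bounds and steps (NOT the values from Table 6).
sections = build_grid(10, 30, 1)                 # 21 values
proppant_per_section = build_grid(50, 96, 1)     # 47 values, m^3
fluid_per_section = build_grid(1300, 2000, 50)   # 15 values, m^3

scheme_set = list(product(sections, proppant_per_section, fluid_per_section))
print(len(scheme_set))  # prints: 14805
```

The subsequent screening to 4201 schemes depends on per-well constraints that are not reproduced here; it would be a filter applied to `scheme_set`.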

Analysis of Optimization Results of Fracturing Parameters for Well A in WY Block.
The basic parameters of a target well A in the WY shale gas area are shown in Table 7.
The XGBoost production forecast model is used to calculate the cumulative gas production, fracturing cost, and the ROCP for the first year corresponding to the operation parameters (operation sections, proppant volume per section, fracturing fluid volume per section, proppant volume per well, fracturing fluid volume per well, proppant to liquid ratio, proppant intensity, fracturing length per section, and fluid intensity) in all selected schemes. The calculation results of some schemes are shown in Table 8.
With the parameter range of the block and ROCP ≤ 0.5 million RMB/10⁴ m³ of gas as the restriction conditions, 49 fracturing operation schemes were finally obtained. In the 49 schemes, the number of operation sections is 18, the total proppant volume is between 1278 and 1422 m³ (proppant volume per section between 71 and 79 m³), the total fracturing fluid volume is between 28800 and 35100 m³ (fracturing fluid volume per section between 1600 and 1950 m³), and the cumulative gas production in the first year is estimated to be between 3636.9 × 10⁴ and 3772.0 × 10⁴ m³ (equivalent to a daily gas production of 9.96-10.33 × 10⁴ m³). Figures 4-6 show the changes in the predicted production, fracturing cost, and ROCP, respectively, with the proppant volume and the fracturing fluid volume. As can be seen from Figures 4 and 5, when the proppant volume per section is above 73 m³ and the fracturing fluid volume per section is above 1800 m³, the predicted production is higher and the fracturing operation cost is lower. When designing the fracturing operation scheme, the proppant volume per section should therefore be not less than 73 m³ and the fracturing fluid volume per section not less than 1800 m³. Figure 6 shows that 73 m³ of proppant and 1800 m³ of fracturing fluid per section give the lowest ROCP, while 79 m³ of proppant and 1800 m³ of fracturing fluid per section give the highest predicted production.
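The per-section and per-well figures quoted above are related by simple arithmetic, which the following sketch checks. The only assumption is that the daily rate is the first-year cumulative production divided by 365 days; the text does not state the divisor.

```python
DAYS_PER_YEAR = 365  # assumption: divisor used to convert annual to daily production

sections = 18
proppant_per_section = (71, 79)     # m^3, lower and upper values from the text
fluid_per_section = (1600, 1950)    # m^3, lower and upper values from the text
annual_gas = (3636.9, 3772.0)       # 10^4 m^3, first-year cumulative production

# Per-well totals are per-section volumes times the number of sections.
total_proppant = tuple(sections * v for v in proppant_per_section)
total_fluid = tuple(sections * v for v in fluid_per_section)
daily_gas = tuple(round(q / DAYS_PER_YEAR, 2) for q in annual_gas)

print(total_proppant)  # prints: (1278, 1422)
print(total_fluid)     # prints: (28800, 35100)
print(daily_gas)       # prints: (9.96, 10.33)
```

The totals and daily rates match the values reported for the 49 selected schemes.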
The fracturing parameter optimization model and production prediction model established above can also be used for sensitivity analysis of fracturing parameters. Taking well A above as an example, the sensitivity analysis of fracturing parameters is carried out with the goal of maximum predicted production.
It can be seen from Figure 7 that when the proppant volume per section is above 79 m³, the fracturing fluid volume per section is above 1800 m³, and the designed number of operation sections is 18, the highest production and the lowest ROCP are obtained. Therefore, when the proppant volume per section is 79 m³ and the fracturing fluid volume per section is 1800 m³, the optimal number of operation sections is 18.

It can be seen from Figure 8 that, with 18 operation sections and a fracturing fluid volume per section of 1800 m³, the highest gas production is obtained at a proppant volume per section of 79 m³, whereas the lowest ROCP is obtained at a proppant volume per section of 73 m³; the final operation scheme can be determined according to the specific goal.
It can be seen from Figure 9 that when the designed number of operation sections is 18, the proppant volume per section is 79 m³, and the fracturing fluid volume per section is 1800 m³, the highest gas production and the lowest ROCP are obtained. Thus, with 18 operation sections and a proppant volume per section of 79 m³, the optimal fracturing fluid volume per section is 1800 m³.
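Choosing the final scheme according to the goal (highest production or lowest ROCP) can be sketched as below. The parameter combinations follow the text, but the production and ROCP numbers attached to them are hypothetical placeholders, not values from Table 8, and `pick_scheme` is an illustrative helper rather than the authors' implementation.

```python
def pick_scheme(schemes, goal):
    """Select one scheme by goal: 'max_production' or 'min_rocp'."""
    if goal == "max_production":
        return max(schemes, key=lambda s: s["production"])
    if goal == "min_rocp":
        return min(schemes, key=lambda s: s["rocp"])
    raise ValueError(f"unknown goal: {goal}")

# Hypothetical predicted values for three candidate schemes (NOT from the paper).
# production: first-year gas, 10^4 m^3; rocp: million RMB per 10^4 m^3.
candidates = [
    {"sections": 18, "proppant": 73, "fluid": 1800, "production": 3700.0, "rocp": 0.45},
    {"sections": 18, "proppant": 76, "fluid": 1800, "production": 3740.0, "rocp": 0.47},
    {"sections": 18, "proppant": 79, "fluid": 1800, "production": 3772.0, "rocp": 0.49},
]

best_production = pick_scheme(candidates, "max_production")
best_rocp = pick_scheme(candidates, "min_rocp")
```

Keeping the two goals as separate selections mirrors the observation that the production-optimal and ROCP-optimal proppant volumes can differ.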

Conclusion
(1) Aiming at the problems of fracturing engineering in the WY shale gas block, this paper evaluates and optimizes the intelligent algorithms and establishes the shale gas fracturing production prediction model based on the field data samples. The analysis shows that the production prediction model established by the XGBoost algorithm has the best performance on the training set and test set of the WY shale gas block.
(2) Based on the gas production rate and ROCP, the selection method for the shale gas well fracturing operation scheme set established in this paper ensures the rationality of the input parameters of the prediction model and avoids unscientific recommendations for the optimized operation parameters.
(3) The fracturing production prediction model and parameter optimization model established in this paper can be used not only to predict shale gas well fracturing production and optimize the operation parameters but also to perform sensitivity analysis of fracturing parameters, compare the effects of fracturing operation schemes, and so on. The research on optimization decisions for operation parameter schemes with production and ROCP as targets has good field application value.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.