## Abstract

Predicting the performance of a subsurface oil field is a large, multivariate problem. Production is controlled and influenced by a wide array of geological and engineering parameters which overlap and interact in ways that are difficult to unravel in a manner that can be predictive. Supervised machine learning is a statistical approach which uses empirical learnings from a training dataset to create models and make predictions about future outcomes. The goal of this study is to test a number of supervised machine learning methods on a dataset of oil fields from the United Kingdom continental shelf (UKCS), in order to assess (a) whether it is possible to predict future oil field performance and (b) which methods are the most effective. The study is based on a dataset of 60 fields with 5 controlling parameters (gross depositional environment, average permeability, net-to-gross, gas–oil ratio and total number of wells) and 2 outcome parameters (recovery factor and maximum field rate) for each. The choice of controlling parameters was based on a principal component analysis (PCA) of a larger dataset from a wider project database. Five different machine learning algorithms were tested: linear regression, robust linear regression, linear kernel support vector regression, cubic kernel support vector regression and boosted trees regression. Overall, 83% of the data was used as a training dataset while 17% was used to test the predictability of the algorithms. Results were compared using R-squared, Mean Square Error, Root Mean Square Error and Mean Absolute Error. Graphs of predicted responses v. true (actual) responses are also shown to give a visual illustration of model performance. Results of this analysis show that certain methods perform better than others, depending on the outcome variable in question (recovery factor or maximum field rate). The best method for both outcome variables was support vector regression, where, depending on the kernel function applied, a reliable level of predictability with low error rates was achieved. This demonstrates a strong potential for statistics-based prediction models of reservoir performance.

**Thematic collection:** This article is part of the Digitally enabled geoscience workflows: unlocking the power of our data collection available at https://www.lyellcollection.org/topic/collections/digitally-enabled-geoscience-workflows

The efficiency of the hydrocarbon extraction process is largely dependent on a host of interconnected factors, both intrinsic and imposed. The goal of this study is to investigate the ability of machine learning algorithms to produce a predictive tool which can be applied to other datasets.

This paper forms part of a larger project with a database that comprises 424 fields on the UKCS. A subset of that database was analysed using methods of feature selection, including principal component analysis (PCA) and best subset regression, to determine which variables were critical to predicting reservoir performance. In this paper, those variables have been used to condition a number of machine learning algorithms to determine which are the most effective at predicting future field performance.

Variables that control reservoir performance have been subdivided into geological, PVT (fluids and reservoir conditions) and engineering. A number of metrics that record reservoir performance were identified, and for the purpose of this study two of these response variables were selected (recovery factor and maximum field production rate). The original project database included information from 424 fields subsampled into smaller subsets for PCA and regression analysis. A further subsampling has been undertaken here for this analysis in a subset for Machine Learning. This subset of the database includes information about 5 control and 2 outcome variables from 60 fields. These fields and variables are an outcome of both the PCA and a best subsets regression testing. The data were z-score standardized and used to test five different machine learning methods.
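As an illustration of the z-score standardization mentioned above, the following is a minimal sketch rather than the procedure actually run for this study. The function name `z_score`, the sample values and the use of the population standard deviation are all assumptions made for the example; statistical packages differ in whether they divide by n or n − 1.

```python
from statistics import mean, pstdev

def z_score(values):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

# Hypothetical net-to-gross values, for illustration only
ntg = [0.35, 0.60, 0.73, 0.85, 1.00]
ntg_z = z_score(ntg)
```

After this transformation each variable is expressed in units of its own standard deviation, so that variables recorded on very different scales (for example permeability in mD and net-to-gross as a ratio) contribute comparably to model fitting.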

The study area (shown in Fig. 1) was chosen for its wealth of exploration and production data accumulated over fifty plus years as a hydrocarbon producing region. Production data were obtained from the UK Oil and Gas Authority, (www.ogauthority.co.uk) and geological parameters were compiled from a variety of published sources. A comprehensive list of references and data sources and more detailed discussions on study area, data distribution and petroleum system geology are provided in a separate publication discussing the database building process and the spatio-temporal distribution of that data.

## Study area

Sixty oil fields were selected for this study from the wider database of 424 oil, gas and condensate fields. These fields were randomly chosen from those of the appropriate fluid phase (oil) and filtered for completeness of data and to remove outliers; the resulting set is representative of the region's oil fields, spanning 4 separate basins. See Table 1 for a list of fields and Figure 1 for a map showing the location and spatial distribution of the fields used in this machine learning exercise.

The fields used in this study are from north of the Mid-North Sea High. Half are located in the Northern North Sea basin, a quarter in the Central North Sea basin and another quarter in the Moray Firth basin. 26 of these fields are strictly shallow marine, 25 are deep marine, 2 are continental and the remaining 7 contain a mix of gross depositional environments (including Chanter, Claymore, Crawford, Dunbar, Fulmar, Highlander and Maureen). In instances with multiple depositional environments the primary reservoir accounting for over 70%–80% of in place and produced volumes was used for the GDE classification. About 35 of these have further sedimentological data that was not used directly in this machine learning study (e.g. diagenetic impact, stratigraphic heterogeneity, etc.) which were recorded as having low to moderate intensity. Trapping mechanisms were mostly structural at reservoir depths between 1335 and 3980 m. The hydrocarbons were light crudes (mean of 38° API) in reservoirs with good porosities averaging >20%. Reservoirs are predominantly Jurassic in age with a few Triassic, Paleocene and other age.

This supervised learning experiment utilizes 5 predictor variables including gross depositional environment (GDE), average permeability, net-to-gross (NTG), gas–oil ratio (GOR) and total number of wells. These parameters were chosen from a wider selection of 27 predictor variables based on PCA which were then put through best-of-subset testing to assess minimum number of variables needed for prediction and suitable permutations (combinations) to achieve desired results. All variables applied here ranked among those that were found to control 83% of the correlation in the predictor variables of the PCA.

A summary of the feature selection process includes the following steps:

Classification of input data into three groups (categorical, ordinal and numerical variables), where categorical refers to descriptive or qualitative data points such as gross depositional environment; ordinal refers to ranked data whose order is meaningful but whose steps carry no fixed magnitude, such as structural complexity; and numerical refers to data that are numeric and connote a change in intensity with increasing value, such as permeability.

Division of database into subsets of differing sample sizes (38 v. 136 oil fields) with overlapping variables to determine the impact of sample size on results. Results were seen to be consistent across sample sizes.

Preparation of data for statistical analyses including label encoding of non-numerical data and standardization of numerical data.

Principal component analysis (PCA) for feature selection to determine how variables interact with each other. PCA works by projecting numerical data into lower dimensions called principal components for the purpose of finding the most succinct/effective expression of all input parameters using a limited number of principal components (Lever *et al.* 2017). The main output of the PCA is expressed as a table of eigenvalues from which to determine the suitable number of principal components based on a cut-off at either the first principal component to exceed the 80% cumulative proportion threshold (Jolliffe's rule) or principal components with eigenvalues greater than 1 (Kaiser criterion) (Jolliffe 2002). Selected variables for each principal component are determined based on the magnitude of eigenvectors.
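The two cut-off rules described in the PCA step can be sketched as follows. This is an illustrative example only: the function name `components_to_keep` and the eigenvalues are hypothetical and are not taken from the project's PCA output.

```python
def components_to_keep(eigenvalues, cumulative_cutoff=0.80):
    """Return (n_by_cumulative_rule, n_by_kaiser_rule) for PCA eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)

    # Rule 1: stop at the first component whose cumulative proportion
    # of variance exceeds the cut-off (80% by default).
    running = 0.0
    n_cumulative = len(ordered)
    for i, ev in enumerate(ordered, start=1):
        running += ev
        if running / total > cumulative_cutoff:
            n_cumulative = i
            break

    # Rule 2 (Kaiser criterion): keep components with eigenvalue > 1.
    n_kaiser = sum(1 for ev in ordered if ev > 1.0)
    return n_cumulative, n_kaiser

# Hypothetical eigenvalues for five standardized variables
eigs = [2.4, 1.3, 0.7, 0.4, 0.2]
n_cum, n_kaiser = components_to_keep(eigs)
```

The two rules need not agree: with these made-up eigenvalues the cumulative rule retains more components than the Kaiser criterion, which is why the choice of rule is stated explicitly.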

It is not the purpose of this selection of five variables to be made a definitive recommendation on what parameters should be used to predict recovery factor or any other oil production performance measures. This is simply one permutation out of many, based on a limited dataset with a decision informed by the work carried out as part of the larger project.

A brief discussion of the impact of these predictor variables on reservoir performance is as follows:

Gross Depositional Environment (GDE): the GDE of reservoir sediments imparts well understood characteristics that play a major role in controlling the level of productivity of fluids from within. The depositional environment controls the architecture and geometry of both reservoir bodies and baffles. It also affects the textural properties and the mineralogy of the reservoirs (Lorenz *et al.* 1989; Ingles and Anadon 1991; Reinson 1991; Hartmann and Beaumont 1999; Zhang *et al.* 2008; Armitage *et al.* 2010; Lai *et al.* 2015; Wang *et al.* 2018; Ärlebrand *et al.* 2021), pore-water chemistry (Shaw *et al.* 1990; Hartmann and Beaumont 1999; Toevs *et al.* 2008) and reservoir geometry/structural control (Reinson 1991; Mode *et al.* 2017; Levell 2021). This variable is categorical with three classes of Continental, Shallow Marine and Deep Marine. This variable was also selected because it ranked highly in the PCA, being one of the variables that accounted for 47% of the correlation in the predictor data.

Net-to-Gross (NTG): refers to the proportion of the gross reservoir volume that can hold and deliver hydrocarbons. As a general rule, low net-to-gross reservoirs are associated with poor recovery factors (e.g. Richards and Bowman 1998) but this is not an explicit relationship as it does not capture geometry or architecture of reservoirs or baffles. Net-to-gross is a ratio between 0 and 1. Within the current dataset net-to-gross ranged from 0.35 to 1 with an average value of 0.73. This is one of the variables that accounted for 83% of correlation in the predictor data.

Average Permeability: permeability is the capacity of the reservoir to transmit its fluid contents through the pore network and internal fractures and fissures. This is a key component of reservoir quality assessment metrics and has great effect on the performance of the reservoir (Gunter *et al.* 1997). Discounting any flaws in production strategy, permeability and its partner index (porosity) provide a fair assessment of potential production efficiency. This metric was recorded in milliDarcy (mD) and average permeabilities were between <1 and 2000 mD, with a mean value over 500 mD. This parameter ranked highly in the PCA, being one of the parameters that accounted for 55% of the correlation in the data.

Gas–Oil Ratio (GOR): this refers to the amount of gas in solution relative to a unit volume of oil at reservoir conditions. GOR as a predictive element has also played a role in previous studies of reservoir performance prediction including material balance equations. Ahmed and McKinney (2005) and Ahmed and Meehan (2012) dissect the intricacies of the topic in greater detail including equations for determining GOR as well as the relevance of GOR in predicting reservoir performance. Busahmin and Maini (2010) also discuss how GOR affects recovery factor and production rate in the context of heavy oil reservoirs, observing a decrease in oil recovery with increasing GOR. The unit of measurement applied here is standard cubic meters per standard cubic meter (m³ m⁻³). GOR values were between 15 and >500 m³ m⁻³ with a mean of >100 m³ m⁻³. This parameter ranked highly in the PCA, being one of the variables that contribute to 78% of the correlation in the data.

Total Number of Wells: here we account for all well bores (both producers and injectors) on the field. As a key element of the production process, well related parameters greatly affect overall field performance (Gurbanov *et al.* 2016). In relation to field size, this parameter factors in well spacing and well density while also being dependent on production strategy (primary, secondary or tertiary) and chosen drive mechanism. Total number of wells provides a singular compound measure of external forces of extraction (production wells) and input (injection wells) at play on the reservoir. Total number of wells for our experimental dataset lies between 1 and 77 with an average of 18 wells. This parameter was chosen because it was one of the variables that accounts for the top 24% of correlation in the control variable data.

Recovery Factor (RF): this is the percentage of in-place volumes of hydrocarbon which is producible as per implicit technicalities (including whether primary or enhanced recovery techniques are applied) or recovered as at the end of field life. Values recorded for this project were either forecasted as indicated in existing literature or are coincidental with present realized recovery at cease-of-production. Recovery factors ranged from 6% to 77% with an average value of 40%.

Maximum Field Production Rate (MFPR): this is recorded in thousands of barrels of oil per day (mbpd) and indicates the ceiling of the achievable hydrocarbon extraction rate, recorded over the production time span, through flow testing or during production. These rates are capped by a variety of physical factors and field planning decisions. As fields in this experiment are offshore fields, maximum flow rates are generally on the higher side, as profitability in offshore fields requires higher production rates (Dake 1994). Larue and Friedmann (2005) suggest that flow from a reservoir is mostly influenced by the reservoir architecture, which is related to GDE. Values for this metric are between 2 mbpd and 300 mbpd with a mean value of 58 mbpd.

## Previous work

Mustafiz and Islam (2008) suggested that there are three main types of analysis that can be used for reservoir performance prediction.

The Analogical Approach: a comparative and inferential assessment of reservoir performance hinged on similarities in characteristics between mature and early–stage zones or pools. This approach can be strictly qualitative or employ quantitative measures in the form of empirical statistics to observe correlations and approximate production; for example, as discussed in Meehan (2011) where analogues of fractured reservoirs were compared for performance.

The Experimental Approach: here PVT and other properties are measured in lab models and observed results are scaled to the level of the actual reservoirs (Manzir *et al.* 2015).

The Mathematical Approach: these methods apply mathematical equations to predict performance. Ertekin *et al.* (2001) give a comprehensive description of mathematical methods including material balance equations, decline curves, statistical and analytical methods. Okotie and Ikporo (2018) also discuss material balance for performance prediction.

Machine learning combines the analogical and the mathematical approach. Here statistical equations are produced using requisite amounts of data samples with established independent-dependent (control-response) multivariate pairings. Derived equations are then applied to control variables to predict responses. Ertekin and Sun (2019), Pandey *et al.* (2020) and Sircar *et al.* (2021) also give broad and up to date overviews on the concepts behind the application of machine learning in forward and inverse reservoir performance and reservoir quality modelling, although not specifically focusing on any singular unique case studies.

A recent case study application of machine learning in hydrocarbon reservoir performance prediction includes Niu *et al.* (2021) where data from 172 gas wells from a single producing block and a selection of 19 engineering and geological variables were compiled. Following feature selection, 8 variables were chosen and used to create an ultimate recovery prediction model based on multiple regression.

Other examples of case study applications of various artificial intelligence and machine learning techniques (such as genetic algorithms, random forest, artificial neural networks and others) over the last decade include Al-Fattah and Startzman (2001); Mirzaei-Paiaman and Salavati (2012); Amirian *et al.* (2013); Chithra Chakra *et al.* (2013*a*); Chithra Chakra *et al.* (2013*b*); Li *et al.* (2013); Ahmadi *et al.* (2015); Choubineh *et al.* (2017); Ghahfarokhi *et al.* (2018); Bhattacharya *et al.* (2019); Ghorbani *et al.* (2019); Aliyuda *et al.* (2020); Liu *et al.* (2020); Al-Jifri *et al.* (2021); Han and Kwon (2021); Bhattacharyya and Vyas (2022*a*); and Bhattacharyya and Vyas (2022*b*).

## Approach

For this project, 5 different statistical models based on 3 modelling techniques were developed using machine learning software – MATLAB R2019b Update 5 (9.7.0.1319299). This was done to assess consistency in results and check potential biases that may present from any one chosen methodology. These modelling techniques broadly include:

Linear regression

Support vector regression

Boosted trees regression

Machine learning is the development and implementation of algorithms that improve automatically with experience. A widely used description is provided by Mitchell (1997), defining it as software being able to learn from experience (E) in application to a specific set of tasks (T) and indicators of performance (P) ‘if its performance at tasks in T, as measured by P, improves with experience E’.

Various texts list a plethora of approaches to machine learning for different purposes, including supervised, unsupervised, semi-supervised, reinforcement, self, feature, sparse dictionary, anomaly detection, robot learning, association rules, etc.

For this experiment, supervised machine learning is applied. This form of machine learning model is trained with both input and output data as opposed to unsupervised learning where the model is trained to identify clusters and classes with no outcome variable provided for training (Russell and Norvig 2010).

In supervised machine learning, models are trained with the training dataset, where input and output variable pairings are complete for model fitting. Prior to model fitting, a method of model validation is selected. This effectively amounts to a portion of the data used to assess the fit of the model. Options for validation typically include k-fold cross validation, hold-out validation or bootstrap (Kohavi 1995). For our purposes, holdout validation was chosen. In this method of validation, a larger percentage of the data is used to train the model, typically 66.6%; while the remaining 33.3% is used for validation. In this case the split was 83% train and 17% validate. This method was thought best because the number of observations available (60) and the number of independent variables (5) were enough to train a single iteration, in any given instance of the model, to acceptable standards of observations per variable, applying the ‘one-to-ten rule’ (10 observations per variable). See Harrell *et al.* (1984); Harrell *et al.* (1996); Peduzzi *et al.* (1996); Laupacis *et al.* (1997) and Steyerberg *et al.* (2000) for more on this rule. Ultimately this helps to avoid overfitting. A k-fold cross validation would have split the data into two parts at least, thus creating unreliable overfitted models with each iteration.

Altogether, with 5 input variables and 50 observations, models could be trained with 10 observations left over for validation. An estimate of model accuracy was obtained by running several model iterations with random train-validate splits and assessing the consistency of the results.
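The repeated hold-out procedure described above can be sketched as follows. This is a minimal illustration and not the MATLAB workflow used in the study; the helper name `holdout_split`, the seeds and the choice of five repetitions are assumptions.

```python
import random

def holdout_split(n_samples, n_train, seed):
    """Return (train_idx, test_idx) for one random hold-out split."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])

N_FIELDS, N_PREDICTORS = 60, 5
# 'One-to-ten rule': 10 training observations per predictor variable
N_TRAIN = 10 * N_PREDICTORS

# Several random splits, so that consistency across splits can be assessed
splits = [holdout_split(N_FIELDS, N_TRAIN, seed) for seed in range(5)]
train_fraction = N_TRAIN / N_FIELDS  # 50 of 60 fields, i.e. about 83%
```

Each split leaves 10 fields for validation, and a model would be refitted on each training set in turn; the spread of the resulting performance metrics indicates how sensitive the model is to the particular split.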

## Linear regression model

A linear regression model in the context of machine learning is one which mathematically patterns the association between one or more numerical predictors and a continuous response variable, for the purpose of predicting the response variable to a reasonable degree of accuracy, when applied to a set of non-modelled covariates.

The model takes the form

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i$$

where $y_i$ is the $i$th response; $\beta_j$ is the $j$th coefficient, where $\beta_0$ is the constant term in the model; $X_{ij}$ is the $i$th observation on the $j$th predictor variable, $j = 1, \ldots, p$; and $\epsilon_i$ is the $i$th noise expression, referring to random error.

All this operating under the assumptions that:

The noise expressions, $\epsilon_i$, are uncorrelated.

The noise expressions, $\epsilon_i$, have independent and identical normal distributions with mean zero and constant variance, $\sigma^2$.

The responses $y_i$ are not correlated.

In least squares regression, the coefficients are approximated to minimize the mean squared divergence between the predicted and actual response.

The robust linear regression method is less affected by outliers than least squares regression and functions by assigning a weight to each data point using a technique called iteratively reweighted least squares (Barreto and Burrus 1994*a*, *b*; Burrus *et al.* 1994; Burrus 1998*a*, *b*). In the first iteration of this process, every data point is equally weighted, and coefficients are approximated using least squares. In subsequent iterations, weights are recalculated such that points that deviate strongly from model predictions in prior iterations are assigned lower weighting. Model coefficients are then recalculated, applying weighted least squares. This workflow is repeated until the resulting coefficients converge to within a prescribed tolerance.
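The reweighting loop described above can be sketched for the single-predictor case as follows. This is an illustrative sketch under stated assumptions and not the implementation used in the study: the Huber-type weight function, the tuning constant `k = 1.345`, the MAD-based residual scale and the function names are all choices made for the example, and other robust weight functions are in common use.

```python
from statistics import median

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x (closed form)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ym - b1 * xm, b1

def irls_line(x, y, k=1.345, tol=1e-8, max_iter=200):
    """Iteratively reweighted least squares with Huber-type weights."""
    w = [1.0] * len(x)  # first pass: every point equally weighted
    b0, b1 = weighted_line_fit(x, y, w)
    for _ in range(max_iter):
        r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = median(r)
        # Robust residual scale from the median absolute deviation
        s = max(median(abs(ri - med) for ri in r) / 0.6745, 1e-12)
        # Points far from the current fit receive smaller weights
        w = [min(1.0, k * s / abs(ri)) if ri != 0 else 1.0 for ri in r]
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1, w

# Hypothetical data: an exact line y = 2 + 3x with one gross outlier
xs = [float(i) for i in range(10)]
ys = [2.0 + 3.0 * xi for xi in xs]
ys[-1] += 50.0
ols_b0, ols_b1 = weighted_line_fit(xs, ys, [1.0] * len(xs))
rob_b0, rob_b1, weights = irls_line(xs, ys)
```

In this made-up example the ordinary least squares slope is pulled well away from 3 by the single outlier, whereas the reweighted fit assigns that point a very small weight and recovers a slope and intercept close to the underlying line.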

For a detailed examination of the modalities and technicalities of linear regression modelling see Seber (1977); Neter *et al.* (1996); Bingham and Fry (2010) and Chatterjee and Simonoff (2012).

An example of a similar work employing multiple regression modelling for reservoir performance prediction is Oladeinde *et al.* (2015). In that instance a multiple linear regression model was created to forecast total production volume based on six predictors including gas–oil ratio, number of wells and a few well performance indices. The project was limited in scope but appeared to yield positive results.

## Support vector regression (SVR) model

Support vector machine (SVM) analysis, also referred to as support vector networks, is a tool in supervised machine learning relevant to both classification and regression exercises (Gunn 1998). SVM typically refers to the use of support vectors for classification, while SVR refers to regression-specific support vectors (as used in this case). SVR here, as put forward by Vladimir Vapnik (see Vapnik 1995), uses an epsilon ($\epsilon$)-insensitive loss function, where $\epsilon$ refers to the distance of data points from the hyperplane; in SVM this would be the hyperplane of separation between groups of classes, but in SVR it operates as the line of prediction (or more precisely the midpoint of the margin of prediction). The best hyperplane is the one with the greatest margin of prediction (boundary slab) between classes, which may not necessarily create a perfect distinction between classes but separates a substantial amount of the points, presenting what could be termed a soft margin.

In this form of regression modelling the raw data is mapped onto a higher dimensional space based on the chosen Kernel Function where the projected points closest to the boundary plane are termed support vectors. It can be described as non-linear mapping of projected input variables to create a linear predictive function (See Fig. 2 for illustration).

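The resulting prediction function has the form

$$f(x) = \sum_{n=1}^{N} \left(\alpha_n - \alpha_n^{*}\right) K(x_n, x) + b$$

where $\alpha_n$ and $\alpha_n^{*}$ are Lagrange multipliers; the $x_n$ terms represent support vectors; $K$ is the kernel function; and $b$ is the bias term.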

Apart from the Kernel Function, another key optimizable parameter in SVR is the box constraint. This parameter controls the strictness of datapoint classification, and the penalization imposed on misclassification; such that the higher the box constraint, the higher the cost of misclassification – leading to the designation of fewer support vectors and stricter data separation. The Kernel scale mode can also be optimized when a radial basis function kernel is applied. In this instance only the Linear and Cubic kernel functions are applied and so the kernel scale is not relevant.
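To make the roles of the $\epsilon$-insensitive loss and the kernel function more concrete, the following sketch evaluates both for made-up values. It is not a trained model and does not solve the SVR optimization problem: the support vectors, dual coefficients (standing in for the differences of Lagrange multipliers) and bias are hypothetical, and the `1 +` offset in the cubic kernel is an assumed polynomial-kernel convention.

```python
def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the +/- eps tube around the prediction, linear outside."""
    return max(0.0, abs(y_true - y_pred) - eps)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def cubic_kernel(u, v):
    # Polynomial kernel of order 3; the '1 +' offset is an assumption
    return (1.0 + dot(u, v)) ** 3

def svr_predict(x, support_vectors, dual_coefs, bias, kernel):
    """Sum of coef_n * K(x_n, x) over the support vectors, plus the bias."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical support vectors and coefficients, for illustration only
svs = [[1.0, 0.0], [0.0, 1.0]]
coefs = [0.5, -0.25]
b = 0.1
x_new = [1.0, 2.0]
pred_linear = svr_predict(x_new, svs, coefs, b, linear_kernel)
pred_cubic = svr_predict(x_new, svs, coefs, b, cubic_kernel)
```

The same support vectors and coefficients give different predictions under the two kernels, which is the sense in which the kernel choice changes the shape of the fitted function while the form of the predictor stays the same.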

Various workers have applied SVR in similar and adjacent contexts including Saffarzadeh and Shadizadeh (2012); Al-Anazi and Gates (2010); Gholami and Moradzadeh (2011) and Gholami *et al.* (2012) where it was applied to reservoir quality predictions and Zhong *et al.* (2010) where it was applied to predict production in high water cut fields.

For a thorough exploration of the operating concepts for SVM see Steinwart and Christmann (2008).

## Boosted regression tree model

Boosted regression tree modelling is a supervised learning technique that aggregates several models into a single predictive algorithm. This method amalgamates the advantages of two processes: regression trees, which relate a dependent (response) variable to its independent (control) variables by iterative twofold splits (see Fig. 3); and boosting, which is an adaptive method for merging several uncomplicated models into one with more complexity, to improve forecasting ability (Elith *et al.* 2008).

Boosting is a stepwise process that seeks to minimize the loss function by including, at each tier, a new tree that best mitigates deviance.

Key user definable parameters for Boosted Tree Regression include:

The Minimum Leaf Size: which equates to the complexity of each individual tree, with smaller leaf sizes more prone to recording noise in the data

The Number of Trees: which equates to the number of learners to be aggregated

The Learning Rate (Shrinkage parameter): a value between 0 and 1 which refers to the contribution of each tree to the model (the rate at which the model learns). Thus, the smaller the learning rate, the greater the number of iterations required (Hastie *et al.* 2009).

Elith *et al.* (2008) and Hastie *et al.* (2009) provide in-depth explanations on this technique. In an adjacent context to this project, Subasi *et al.* (2020) discuss the application of boosted trees to reservoir quality prediction.
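The interaction of the three user-definable parameters can be illustrated with a deliberately simplified sketch of least-squares boosting. It assumes a single predictor and trees restricted to one split (stumps), and the learning rate of 0.1 is an assumed value; it is not the ensemble implementation used in the study, and the function names and data are hypothetical.

```python
def fit_stump(x, residuals, min_leaf):
    """Best single twofold split on x under squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for cut in range(min_leaf, len(x) - min_leaf + 1):
        if x[order[cut - 1]] == x[order[cut]]:
            continue  # cannot split between equal values
        left = [residuals[i] for i in order[:cut]]
        right = [residuals[i] for i in order[cut:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - lm) ** 2 for v in left)
               + sum((v - rm) ** 2 for v in right))
        if best is None or sse < best[0]:
            threshold = (x[order[cut - 1]] + x[order[cut]]) / 2
            best = (sse, threshold, lm, rm)
    if best is None:
        raise ValueError("no valid split for this minimum leaf size")
    return best[1:]

def boost(x, y, n_trees=30, learning_rate=0.1, min_leaf=8):
    """Stepwise fit: each new stump is trained on the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lm, rm = fit_stump(x, residuals, min_leaf)
        stumps.append((thr, lm, rm))
        pred = [p + learning_rate * (lm if xi <= thr else rm)
                for p, xi in zip(pred, x)]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        lm if xi <= thr else rm for thr, lm, rm in stumps)

# Hypothetical step-shaped response on 50 observations
xs = list(range(50))
ys = [0.0 if xi < 25 else 10.0 for xi in xs]
model = boost(xs, ys)
```

Because each stump contributes only a fraction (the learning rate) of its fitted correction, the ensemble approaches the step-shaped response gradually; halving the learning rate would require roughly twice as many trees to reach the same fit.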

## Results

As outlined above, 5 different models were implemented (linear regression, robust linear regression, linear kernel support vector regression, cubic kernel support vector regression and boosted trees regression) using the aforementioned software. The outputs for presentation include graphs showing validation/test results of predicted responses v. actual (true) responses. These outputs are the result of random train-test splits of the data and are consistent regardless of split, with only slight variations observed across multiple iterations (5 or 6).

Tables of model performance indicators are also shown below (Table 5). These include:

R-squared: a measure of the level of variation in the response variable explicable by the predictor variables. This percentage value is meant to indicate the predictive ability of the model. There is no universally agreed acceptable R-squared, but an R-squared value over 50% would be deemed acceptable (Bunge and Judson 2005).

Mean Square Error (MSE) and Root Mean Square Error (RMSE): MSE (also known as the Mean Squared Deviation) is the average of the squares of the differences between the predicted and true responses. The RMSE is the square root of the MSE. These values are unit-less and provide a measure of the model accuracy. Being an average, this value is sensitive to outliers. To constrain RMSE values, both response variables of maximum field production rate and recovery factor were standardized based on maximum value to rescale outputs between 0 and 1. For RF all values were divided by 100, while values were divided by 300 for maximum field production rate. Thus, RMSE values are all <1.

Mean Absolute Error (MAE): this is the average of the absolute differences between predicted and true responses.

No universal cut-offs are recommended for error readings (RMSE, MSE and MAE). These values simply give a measure of how much predicted responses differ from actual responses on average.
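The four performance indicators can be computed as in the sketch below, which uses the standard definitions. The function name `regression_metrics` and the recovery factor values are hypothetical and do not come from the study dataset; the rescaling step mirrors the division by 100 described above for recovery factor.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R-squared, MSE, RMSE and MAE for paired responses."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot  # negative when worse than predicting the mean
    return {"R2": r2, "MSE": mse, "RMSE": sqrt(mse), "MAE": mae}

# Hypothetical recovery factors (%), rescaled to 0-1 by dividing by 100
rf_true = [v / 100 for v in [20, 40, 60, 80]]
rf_pred = [v / 100 for v in [25, 35, 65, 75]]
metrics = regression_metrics(rf_true, rf_pred)

# A poor model: the same constant prediction for every field
poor = regression_metrics(rf_true, [1.0, 1.0, 1.0, 1.0])
```

R-squared computed on held-out data in this way is not bounded below by zero: a model that predicts worse than the mean of the true responses returns a negative value, as the second call illustrates.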

## Discussion

Values shown in Table 5 illustrate that there are readily observable differences in the performance of the models for the two outcome parameters.

For MFPR we observe that the least squares linear multiple regression model performs quite poorly. However, with a robust (iteratively reweighted least squares) multiple linear regression there is a spike in all measures of performance for predictability and accuracy, with over 80% R-squared and very low values of mean squared (MSE and RMSE) and absolute errors (see Table 5). A comparison of the predicted response v. true response graphs for both models (Fig. 4a, b respectively) illustrates the difference in predictive performance. The linear SVR also displays good performance (as reflected in Fig. 4c) with over 70% R-squared and RMSE less than 0.1 (Table 5). The cubic kernel SVR on the other hand performed poorly as a model for MFPR with skewed predicted response v. true response alignment (Fig. 4d). For the boosted regression model with minimum leaf size of 8 and 30 trees, model response of 65% R-squared and RMSE less than 0.1 was returned with visibly good prediction output (Fig. 4e).

For recovery factors we see that the cubic kernel SVR shows good R-squared of 65% and RMSE just under 0.1 and good correlation for predicted response v. true response (Fig. 5d). For other models, prediction of recovery factor was not as good as the Cubic SVR; RMSE values were over 0.1 and predicted response v. true response outputs (Fig. 5a–c and e) displayed more scatter.

Overall, it would appear MFPR is more easily predictable across a wider range of models than RF given the same predictor variables.

A closely similar experiment was discussed in Aliyuda *et al.* (2020). There, several predictive models were run on hydrocarbon (oil and gas) field data from the Norwegian continental shelf. Variables used in that study were mostly similar to the wider selection from this paper. However, relative to this study, no feature selection was applied there and hence 30 variables were processed through each model as compared to the 5 variables selected here based on PCA and subset testing. To compare results, three parameters were used as outcome variables in that paper, two of which match the two used here; specifically recovery factor and maximum field rates. Model performance metrics in that paper were also R-squared, RMSE and MSE.

In that study it was observed that support vector regression produced the best results for both recovery factor and maximum field rate (with maximum field rate also having a better R-squared than recovery factor) as was observed here. That paper did not explicitly state which kernel function (whether linear, cubic, quadratic, etc.) was used in the SVRs for the testing of each outcome variable.

Results from this study show that with the data used for training the model, recovery factor could be predicted up to as high as 65% R-squared using a cubic kernel support vector regression method with very low absolute error equating to within single digit percentages of recovery factor. Poorest performance in recovery factor prediction is the linear regression model at −22% R-squared. Maximum field production rate could also be predicted with a high level of certainty with models producing up to 85% R-squared and with low absolute errors (less than single standardized unit) in models with good performance (above 50% R-squared). Observing the results, a recommendation is made for the use of support vector regression in reservoir/field performance prediction, with tuning of kernel functions depending on the outcome variable being predicted.

As to why the different algorithms produce different results: simply put, it is like applying different mathematical functions to the exact same data. A basic example would be an addition function (‘+’) and a multiplication function (‘×’) applied to two numbers, e.g. 4 and 7. Multiplication gives 28 (4 × 7), while addition gives 11 (4 + 7): the same data sample, with different processes applied, yields different results. Likewise, at the more complex level of algorithmic processes, the model equations take the same basic data and transform them differently.

In this specific instance, the key difference between SVR and the other regression methods is that SVR creates a broad range of fit (captured in the boundary slab illustrated in Fig 4.2) in its predictive process, resulting in what some refer to as a ‘low bias and high variance’ model, whereas linear regression methods create a single line of best fit through the points, presenting a ‘high bias and low variance’ prediction. Boosted regression thrives because it aggregates multiple regressive processes, broadening the otherwise high-bias, low-variance situation presented by a single linear regression.
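The boundary slab can be illustrated through the loss function that SVR minimizes. The sketch below contrasts the squared loss of ordinary least squares with the epsilon-insensitive loss of the standard SVR formulation, in which residuals falling inside a band of half-width epsilon incur no penalty. The value `epsilon=0.1` and the residuals are assumptions chosen for illustration and are not the settings used in this study.

```python
def squared_loss(residual):
    """Least-squares penalty: every non-zero residual is penalized."""
    return residual ** 2

def epsilon_insensitive_loss(residual, epsilon=0.1):
    """Standard SVR penalty: zero inside the band, linear outside it.

    epsilon (the half-width of the band) is an assumed value here.
    """
    return max(0.0, abs(residual) - epsilon)

# Made-up residuals (observed minus predicted, in standardized units).
residuals = [-0.30, -0.05, 0.0, 0.08, 0.25]

for r in residuals:
    print(f"residual={r:+.2f}  squared={squared_loss(r):.4f}  "
          f"eps_insensitive={epsilon_insensitive_loss(r):.4f}")
```

Points lying within the band do not influence the SVR fit, and points outside it are penalized linearly rather than quadratically, so isolated large residuals carry less weight than they do in a least-squares line of best fit.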

The implication of this work is that, with the abundance of legacy data available in the E&P industry and academia, not only can insights into the production process be acquired, but production performance can also be reliably predicted from a small number of variables.

Another paper dealing with prediction of reservoir performance using other machine learning methodologies is Panja *et al.* (2018), where two machine learning models (least squares support vector machine and artificial neural networks) were tested against a curve-fitting model in the prediction of oil recovery. Simulated data were generated for 8 variables (permeability, initial dissolved GOR, rock compressibility, gas relative permeability, slope of GOR, initial pressure, flowing bottom hole pressure, and hydraulic fracture spacing), with 114 observations used to train and 30 observations used to test. Notably, the control variables partly overlap with the selection used in this paper, in terms of permeability and GOR. The data was used to predict recovery factor (as in this study) as well as produced gas–oil ratio. The results showed, in agreement with this study, that SVM is quite accurate for predicting recovery factor.

Many other studies achieve high coefficients of determination (R-squared) using methodologies and variable selections that partly resemble and partly differ from those used here. These include Belazreg *et al.* (2019) and Belazreg *et al.* (2021), where predictive models for recovery factor were developed based on regression and the group method of data handling (GMDH), with positive results (R-squared as high as 72%). An exhaustive rundown of all these studies and their intricacies would not be feasible. The important thing to note is that these various machine learning methods prove successful in these case studies, using few variables to predict various measures of reservoir performance.

A few other papers dealing with the subject include Mohammadi *et al.* (2014), Srivastava *et al.* (2016) and Daribayev *et al.* (2020).

In summary, numerous papers on the application of machine learning to performance prediction exist (as referenced in the literature review of this paper), each dealing with unique case studies and employing a variety of artificial intelligence methodologies. This paper adds to that ever-expanding library by detailing the application of prediction algorithms to real-world data.

## Conclusion

From observations of the results, we see that statistics-based predictive models can be used to provide accurate reservoir performance forecasting. It is also apparent that, depending on the outcome being predicted (using the exact same predictor variables), the model being applied might require adjustment of its tuning parameters, or a different model might be needed altogether.

Comparing results for the two responses (RF and MFPR) from this study with those of previously published studies, it would appear that SVR is a good modelling technique for reservoir performance prediction overall. For different response variables, a change in kernel function (linear, cubic, Gaussian, etc.) may be needed to achieve a high R-squared and low error.

Future work would involve broadening the scope of the data in terms of the number of observations, training on reservoir data from different hydrocarbon-producing regions, and assessing other response variables to identify empirically appropriate algorithms.

## Acknowledgements

Classification schema for Gross Depositional Environment is applied from the SAFARI Database Project (safaridb.com).

## Author contributions

**UO**: data curation (lead), formal analysis (lead), funding acquisition (lead), methodology (equal), validation (lead), visualization (equal), writing – original draft (lead); **JH**: conceptualization (lead), methodology (equal), supervision (lead), visualization (equal), writing – review & editing (lead)

## Funding

This work was funded by the Petroleum Technology Development Fund (PTDF/ED/OSS/PHD/OU/1188/17).

## Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

## Data availability

Data for this research were acquired from a variety of public sources. The curated database, along with a complete list of references, is available from the University of Aberdeen Library Cataloguing Service (cataloguing@abdn.ac.uk), and on request from the corresponding author.