An accurate estimation of carbon dioxide (CO2) solubility in brine is of great significance for industrial applications such as quantifying CO2 sequestration in subsurface formations, CO2 surface mixing, and CO2-based enhanced oil recovery (EOR) methods. In this research, four different data-driven/machine learning techniques, extreme gradient boosting (XGB), multilayer perceptron (MLP), K-nearest neighbor (KNN), and an in-house genetic algorithm (GA), were used to estimate CO2 solubility with pressure, temperature, and salinity as model inputs and solubility as the output. The experimental database used in this study was collected by dissolving CO2 into NaCl brines at salinities ranging from 0 to 15,000 ppm, temperatures ranging from 298 to 373 K, and pressures up to 200 atm. All data-driven models estimated solubility accurately, with coefficients of determination (R2) ranging from 0.95 to 0.99, and a precise, simple-to-use empirical solubility equation was developed using GA. The performance of the models was analyzed using appropriate model metrics (such as mean absolute error and relative error). A detailed feature importance analysis was conducted using feature importance, permutation, and Shapley values to clarify the correlation between the input and output parameters. Pressure was found to be the most impactful feature, followed by temperature and salinity. The model’s accuracy was compared to a well-established solubility model from the literature, and good agreement between the two models’ results was observed. Lastly, a sensitivity analysis revealed that the model’s estimations remained accurate when pressure and salinity were beyond the range of the original dataset.

Carbon dioxide (CO2) capture and sequestration (CCS) is a highly anticipated method for reducing CO2 concentrations in the atmosphere and mitigating global warming stemming from greenhouse gases [1]. It mostly involves the injection of high-pressure CO2 into geological formations such as mature oil and gas fields, aquifers, and the bottom of the ocean [2–7]. Ideally, the injected CO2 should be safely sequestered at the injection site permanently, with minimal chance of it being released back into the atmosphere. Among the repositories, ocean storage is not as widely accepted as mature oil and gas fields and deep saline aquifers due to its risks to the ecosystem. Sequestration in hydrocarbon fields and aquifers, on the other hand, has been widely studied and anticipated. The major advantage of aquifers over mature hydrocarbon fields is their common occurrence and considerable capacity for CO2 sequestration [8, 9].

The four main mechanisms of sequestration in subsurface formations are structural trapping (particularly for hydrocarbon reservoirs), residual-phase trapping, dissolution, and mineralization [3]. While all these mechanisms contribute to the sequestration of CO2, the structural/stratigraphic and solubility mechanisms have the most immediate impact on trapping or retaining CO2 in aquifers. The solubility mechanism allows the dissolution of CO2 into formation brine, which occurs during the migration of CO2 along its pathway in the injected formation [10]. Over time, the injected CO2 dissolves into the formation brine, increasing the brine’s density and causing it to sink within the formation. Hence, a precise estimation of the solubility of CO2 in formation brine is one of the critical factors in evaluating the efficiency of sequestration. Moreover, robust estimation of CO2 solubility can enhance efficient approximations in other CCS-related methods such as surface mixing of CO2, mineral carbonation, and CO2-enhanced hydrocarbon recovery [11–14].

The solubility of CO2 under conditions representative of sequestration in saline aquifers and mature hydrocarbon fields (i.e., high-pressure, high-temperature (HPHT) environments) has been extensively studied experimentally using a variety of techniques [14–17]. Carroll [18] and Wiebe and Gaddy [19] were among the first researchers to study the solubility of CO2 in water, at temperatures between 50 and 100°C and pressures up to 700 atm, using a PVT cell. Although most researchers measured solubility based on equation-of-state and weighting methods, more unconventional approaches such as the calorimetric method, gas chromatography, and titration methods have also been reported in the literature [20–25]. As NaCl is the major component of most brines, previous studies have mostly focused on the solubility of CO2 in NaCl brines [18, 26].

In addition to experimental studies, numerous theoretical and modeling studies focusing on the solubility of CO2 in various brines can be found in the literature. Duan and Sun [27] proposed comprehensive models that can predict the solubility of CO2 under various thermodynamic conditions. More complex theoretical studies utilize parameters such as fugacity [28] and different equations of state (EOS) to predict solubility under conditions suitable for CO2 sequestration [27]. Valderrama et al. [29] further enhanced Sechenov’s equation to predict solubilities outside the range of available experimental data. Chabab et al. also comprehensively investigated the solubilities of CO2 and O2 at pressures up to 36 atm and temperatures up to 373 K [14]. Zuo and Guo [30] used well-known equations of state such as Peng-Robinson and Patel-Teja to develop solubility models. However, in addition to their limited application domains, such models often fail to provide accurate estimations of CO2 solubility [17, 31, 32]. More recently, Sun et al. [33] developed a simple theoretical model to predict the mutual solubility of CO2 and brine/water; the model works well over a wide range of pressures, temperatures, and salinities.

In addition to the theoretical models, empirical models have been widely used in energy and petroleum engineering to reduce the complexities of theoretical models. Although some empirical equations for CO2 solubility can be found in the literature [34, 35], they were mainly developed for other purposes and are not applicable in thermodynamic conditions representative of CO2 injection/sequestration in geological formations, or CO2-based enhanced oil recovery methods. Moreover, although a wide range of salinities has been previously covered in the literature, the solubility data in the low-salinity ranges, i.e., salinities lower than 20,000 ppm (0-2 wt%), is very limited. One example of low-salinity formation brines is the one encountered in the Malay Basin, offshore Peninsular Malaysia, with an average salinity of ∼1 wt% [36]. Therefore, one purpose of this study is to develop an empirical correlation of solubility applicable to the injection of CO2 into low-salinity geological formations.

In recent years, artificial intelligence/data-driven methods have become very popular in various industries, and petroleum and energy engineering are no exception. Data-driven approaches are becoming more common for estimating various rock and fluid properties that would otherwise require complex models and sophisticated experimental apparatus. A comprehensive literature review on the application of data analytics in oil and gas was published by Mohammadpoor and Torabi [37], in which the authors discussed current trends in the utilization of advanced computational tools in various aspects of the industry. Artificial neural networks (ANN), tree-based models, and least-squares support vector machines (LSSVM) are the most common machine learning (ML)/artificial intelligence (AI) methods in the literature [38]. Various researchers have recently used them to estimate numerous parameters related to petroleum engineering and CCS such as formation water geochemistry [39], mineral classification [40], permeability prediction [41], well log classification [42], CO2 capture rate [43], minimum miscibility pressure [44], CO2 solubility in oil [17], and CO2 miscibility in various ionic liquids [45].

Despite numerous studies on various CCS aspects, applications of data-driven methods to estimate CO2 solubility in brine, specifically under conditions representative of CO2 injection into subsurface formations, have been scarce [46]. In the limited literature available on the topic, despite promising outcomes, researchers have focused mainly on showing how accurate their predictions are, with no further clarification of the dependency of solubility on its independent parameters of pressure, temperature, and salinity. Menad et al. [32] developed an artificial-neural-network solubility model using a wide range of pressure, temperature, and NaCl salinity as input parameters. Although they reported very high accuracies (R2 > 0.99), at pressures above 80 psia their model produces significant errors, with predictions far lower than the experimental and theoretical values reported in the literature. Generally, in most previous studies in which data-analytics models were applied to continuous data (regression), the complexity of the phenomena and the nonlinear correlation between the dependent variable (in this case, CO2 solubility) and the independent variables meant that these studies failed to offer a reproducible model of practical use. In other words, almost all previous studies stopped at fitting and predicting data rather than generating an empirical, practical equation from the data-analytics approach. In addition, due to the black-box nature of the models used in previous works, the proposed models often suffer from bias and accuracy overestimation.

In the current study, the solubility of CO2 in brine (output) was estimated as a function of pressure, temperature, and salinity (input parameters) using different AI approaches (XGB, KNN, GA, and MLP). Moreover, as most previous data-driven studies are of a black-box nature, an explainable data-driven approach was used here, in which the effect of each feature on the outcome (solubility of CO2 in brine) is qualitatively and quantitatively scrutinized. In addition, a straightforward empirical correlation was developed using a genetic algorithm; its accuracy was tested by comparing its performance with the experimental data and with the outcomes of well-established theoretical models from previous studies. The main goal was to estimate CO2 solubility in the low-salinity range (<20,000 ppm); however, to test the accuracy of the model, its performance over a broader range of pressure, temperature, and salinity was studied as well.

It is well established in the literature that the solubility of CO2 in ionic solutions is a function of pressure, temperature, and salinity [18, 27, 47, 48], i.e., in general form, $X_{\mathrm{CO_2}} = f(P, T, S)$.

To estimate solubility (mol/kg), pressure (in atm), temperature (in K), and salinity (in ppm of NaCl) were used as independent input variables in all the models used in this study. Therefore, the solubility of CO2 in brine can be treated as a supervised-learning regression problem with one output (dependent variable) and several inputs (independent variables). The database used in this study was experimentally measured in the author’s previous work [22]. The database of 164 data points was fitted to the models, and the output was obtained (Supplementary Data 1 and 2). A statistical description of the data is shown in Table 1.

In the current study, four different data-driven methods were used: extreme gradient boosting (XGB), the K-nearest neighbor (KNN) regressor, the multilayer perceptron (MLP), and the genetic algorithm (GA). The general workflow of the ML methods (except GA) used in this study is shown in Figure 1. The procedure used for XGB, KNN, and MLP can be categorized into three phases: data collection, model development, and optimization. In the first phase, the data is studied and cleaned (if necessary), and outliers are identified and removed.

In the next phase (model development), the data is split into training and testing datasets. This study used a 5-fold data split, with 80% of the data used as the training set and 20% as the test set. The optimum number of folds and the data-split proportions were chosen based on cross-validation. It should be noted that cross-validation reduces the probability of overfitting and ensures that the model is not biased and that its performance is not random [49]. Moreover, depending on the type of model, the data might require scaling or normalization. In this study, data normalization was needed only for MLP; the other models did not require changes to the scale of the data. The training data was then fitted to the model, and predictions were made on the testing set.
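The split-and-validate step described above can be sketched with scikit-learn. The data below is synthetic (the 164-point experimental database is not reproduced here), and the variable names and the stand-in target function are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the 164-point (P, T, S) -> solubility database
rng = np.random.default_rng(0)
X = rng.uniform([1.0, 298.0, 0.0], [200.0, 373.0, 15000.0], size=(164, 3))
y = 0.01 * X[:, 0] - 0.001 * (X[:, 1] - 298.0) - 1e-5 * X[:, 2]

# 80/20 train/test split, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training portion guards against
# overfitting and a "lucky" split
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5),
                         X_train, y_train, cv=cv, scoring="r2")
print(len(X_train), len(X_test), scores.shape)
```

With 164 points, the 80/20 split yields 131 training and 33 test samples, matching the test-set size discussed later in the paper.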

In the last phase, the model parameters were tuned to ensure that the model performed optimally (hyperparameter tuning), which substantially improves model performance. The outputs were then recorded from the model and analyzed using model metrics.
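A minimal sketch of this tuning phase using scikit-learn's GridSearchCV; the KNN parameter grid and the synthetic data below are illustrative, not the settings used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the experimental database
rng = np.random.default_rng(1)
X = rng.uniform(size=(164, 3))
y = X @ np.array([2.0, -1.0, 0.5])

# Exhaustive search over a small hyperparameter grid, with each
# candidate scored by cross-validated R2
param_grid = {"n_neighbors": [2, 3, 5, 8], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_)
```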

2.1. XGB Model

XGB is an enhanced gradient-boosted decision tree (GBDT) method introduced by Chen and Guestrin [50]. It is mainly designed to improve speed and performance in supervised-learning problems in which training data with many features exist. Due to its excellent efficiency and wide range of applications, XGB has become an established method in industry and for various AI/ML problems [51].

XGB operates by creating many weak evaluators for the data and then summing their modeling results. Because the work within each boosting step can be parallelized, it achieves optimized performance for regression problems compared with a purely serial use of the model [51]. The objective function has two components designed to reduce the chances of overfitting [50]: the first calculates the difference between the model-predicted values and the actual ones, and the second is a regularization term that controls model complexity. The accuracy is then determined using the deviation and variance of the model. For a given dataset $D = \{(x_i, y_i)\}$ containing $n$ samples and $m$ features, the predictor is an additive model composed of $K$ base models, and its prediction at step $t$ can be expressed as

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i),$$

where $x_i$ is the input vector, $f_t(x_i)$ is the learner added at step $t$, and $\hat{y}_i^{(t)}$ and $\hat{y}_i^{(t-1)}$ are the predictions at steps $t$ and $t-1$, respectively. As a result, the objective function combines a traditional loss term with a model-complexity term:

$$\mathcal{L} = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),$$

where $l$ is the loss function, $n$ is the number of observations, and $\Omega$ is the regularization term. Detailed information on the XGB model is given in Chen and Guestrin [50].
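The additive scheme above can be illustrated with scikit-learn's GradientBoostingRegressor, which shares the fit/predict interface of xgboost's XGBRegressor and serves here as a stand-in; the data and hyperparameters are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2

# n_estimators is the number of weak learners f_k in the additive sum;
# learning_rate shrinks each f_t before it is added at step t
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  max_depth=3, random_state=0)
model.fit(X[:160], y[:160])
r2_test = r2_score(y[160:], model.predict(X[160:]))
print(round(r2_test, 3))
```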

2.2. K-Nearest Neighbor (KNN)

KNN regression is based on the k-nearest-neighbors algorithm, a distance-based method with both supervised and unsupervised variants [52]. KNN finds a specified number of training samples with the lowest distance to a new point and predicts its label from theirs [52, 53]. For a training dataset $D$ and a test instance $z = (x', y')$, KNN computes the distance between $z$ and all training objects $(x, y) \in D$ to define the nearest-neighbor list $D_z$. Here, $x$ represents the data of a training sample and $y$ its label; likewise, $x'$ and $y'$ represent the test object's data and label, respectively. The label for $x'$ is then

$$y' = \arg\max_{\nu} \sum_{(x_i, y_i) \in D_z} I(\nu = y_i),$$

where $\nu$ is the class category, $y_i$ is the label of the $i$th closest neighbor, and $I$ is an indicator function ($I = 1$ when its condition is met and $I = 0$ otherwise). Further description of this algorithm can be found in Naghibi and Moradi Dashtpagerdi [53]. The KNN regressor used in this research was implemented with the scikit-learn package in Python (Pedregosa et al.).
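A brief sketch of KNN regression on synthetic (P, T, S)-like data; in the regression setting, scikit-learn averages the targets of the k nearest neighbors rather than taking the majority vote shown in the classification formula above. The data and target function are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform([1.0, 298.0, 0.0], [200.0, 373.0, 15000.0], size=(164, 3))
y = 0.03 * np.sqrt(X[:, 0]) - 0.002 * (X[:, 1] - 298.0) - 2e-5 * X[:, 2]

# Distance-weighted averaging of the 5 nearest training samples
knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X[:131], y[:131])
pred = knn.predict(X[131:])
print(pred.shape)
```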

2.3. Multilayer Perceptron (MLP)

Artificial neural networks (ANN) are a widely used data-driven method for various regression problems due to their robustness in identifying and recognizing relationships between input and output parameters particularly in complex systems [46]. ANN mimics the human brain in learning and processing information [54].

Multilayer perceptron (MLP) is among the most frequently used types of ANN in a variety of modeling problems [32]. Figure 2 illustrates the simplified structure of the MLP model used in this study. Layers and neurons are the two main components of an MLP [32, 54, 55]. The neurons are distributed among at least three types of layers: the input, hidden, and output layers. The input layer is where the model inputs (pressure, temperature, and salinity) are fed to the model; the output layer is where the model outcome (solubility of CO2) is returned.

For any neuron $j$ in layer $i$ that is not located in the input layer, the input $x_{ij}$ comprises a linearly weighted sum of the outputs of the neurons in the immediately previous layer, $y_{i-1,k}$, plus a bias term, $b_{ij}$. The output of the neuron, $y_{ij}$, is then obtained by applying an activation function, $f_i$ (which can vary for each layer), to this input:

$$x_{ij} = \sum_{k} w_{ijk}\, y_{i-1,k} + b_{ij}, \qquad y_{ij} = f_i(x_{ij}).$$

An MLP model contains at least one hidden layer. The number of hidden layers varies depending on the complexity of the problem at hand: simple to moderately complex systems often use one hidden layer, whereas more than one layer is often considered for more complex systems. The main purpose of the hidden layers is to apply activation functions that transform the inputs into higher-level features. The activation functions are mostly logistic or hyperbolic transfer functions, as follows:

Logistic function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
The hyperbolic tangent function:
$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The transfer function for the output layer (if needed) is purelin (linear):
$$f(x) = x$$

The optimum number of neurons in each layer, the number of hidden layers, and the proper activation and transfer functions are mainly obtained via trial and error.
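The trial-and-error search described above can start from a sketch like the following, which mirrors the 3-input / one-hidden-layer (74 perceptrons) / 1-output architecture reported later in the paper; the data is synthetic, and, since MLP was the only model here that needed normalization, the inputs are scaled first.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform([1.0, 298.0, 0.0], [200.0, 373.0, 15000.0], size=(164, 3))
y = 0.03 * np.sqrt(X[:, 0]) - 0.002 * (X[:, 1] - 298.0) - 2e-5 * X[:, 2]

# One hidden layer with 74 perceptrons and a hyperbolic-tangent
# activation; inputs are normalized before reaching the network
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(74,), activation="tanh",
                 max_iter=5000, random_state=0))
mlp.fit(X[:131], y[:131])
r2_test = mlp.score(X[131:], y[131:])
print(round(r2_test, 3))
```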

2.4. Genetic Algorithm

The genetic algorithm (GA) is a metaheuristic algorithm inspired by the process of natural selection, and it is commonly applied to optimization problems. Figure 3 illustrates the process of the GA used in the current study. GA is initiated with a population of random solutions (called chromosomes) within a search space. The chromosomes are evaluated by a fitness function and graded based on their accuracy. Depending upon the selection method (rank, steady-state, or roulette-wheel selection), a number of chromosomes are selected according to their fitness values, and through the GA operators of crossover and mutation, the next generation of candidate solutions is produced [56]. This process is repeated until a terminal condition is satisfied, defined as (a) an acceptable solution being achieved or (b) the time limit being exceeded. The GA model used in the current study is very similar to that of the author’s previous works, in which the mathematical background of the model is comprehensively discussed [57].
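The loop in Figure 3 can be sketched in a few dozen lines. The toy below fits the two coefficients of y = a·x + b by rank selection, arithmetic crossover, and Gaussian mutation; the in-house GA in this study optimizes the five coefficients of the solubility correlation instead, so everything here is a simplified illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_true = 3.0 * x + 1.5                       # target: a = 3.0, b = 1.5

def fitness(chrom):                          # lower is better
    a, b = chrom
    return np.mean((a * x + b - y_true) ** 2)

pop = rng.uniform(-10, 10, size=(50, 2))     # initial random chromosomes
for generation in range(200):
    scores = np.array([fitness(c) for c in pop])
    order = np.argsort(scores)
    parents = pop[order[:10]]                # rank selection (elitist)
    if scores[order[0]] < 1e-6:              # terminal condition (a)
        break
    children = []
    for _ in range(len(pop) - len(parents)):
        p1, p2 = parents[rng.integers(10, size=2)]
        alpha = rng.uniform()                # arithmetic crossover
        child = alpha * p1 + (1 - alpha) * p2
        child += rng.normal(0.0, 0.1, size=2)  # Gaussian mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmin([fitness(c) for c in pop])]
print(np.round(best, 2))
```

Because the ten best chromosomes survive unchanged each generation, the best fitness improves monotonically and the population converges toward the true coefficients.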

2.5. Evaluation Criteria (Model Metrics)

To evaluate the performance of the models used in this study, three model metrics were used: the coefficient of determination (R2), relative error (RE), and mean absolute error (MAE), defined as

$$R^2 = 1 - \frac{\sum_{i} \left(x_i^{\mathrm{act}} - x_i^{\mathrm{pred}}\right)^2}{\sum_{i} \left(x_i^{\mathrm{act}} - \bar{x}^{\mathrm{act}}\right)^2}, \qquad \mathrm{RE}_i = \frac{\left|x_i^{\mathrm{act}} - x_i^{\mathrm{pred}}\right|}{x_i^{\mathrm{act}}}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|x_i^{\mathrm{act}} - x_i^{\mathrm{pred}}\right|,$$

where $x_i^{\mathrm{act}}$ and $x_i^{\mathrm{pred}}$ are the actual (measured) and model-predicted values and $\bar{x}^{\mathrm{act}}$ is the mean value of the measurements. It should be noted that in this study the actual values are the experimentally measured solubilities from the original database, whereas the predicted values are the outputs of each model. In addition to the above metrics, MRE, RE_MAX, and RE_MIN, denoting the mean, maximum, and minimum of the relative error for each model, were also used to analyze the model outputs.
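Written out with NumPy, the three metrics (standard definitions, consistent with those above) and the derived MRE/RE_MAX/RE_MIN are:

```python
import numpy as np

def r2(x_act, x_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((x_act - x_pred) ** 2)
    ss_tot = np.sum((x_act - np.mean(x_act)) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(x_act, x_pred):
    # Mean absolute error
    return np.mean(np.abs(x_act - x_pred))

def relative_error(x_act, x_pred):
    # Per-point relative error; its mean/max/min give MRE, RE_MAX, RE_MIN
    return np.abs(x_act - x_pred) / np.abs(x_act)

# Illustrative values, not points from the paper's database
x_act = np.array([0.20, 0.50, 1.00, 1.20])
x_pred = np.array([0.21, 0.48, 1.05, 1.18])
re = relative_error(x_act, x_pred)
print(round(r2(x_act, x_pred), 4), round(mae(x_act, x_pred), 4),
      round(re.mean(), 4), round(re.max(), 4), round(re.min(), 4))
```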

To estimate the solubility of CO2 in brine, pressure, temperature, and salinity (in ppm of NaCl) were used as independent input variables for all models in this study. In this section, the modeling results from the different AI algorithms are presented in terms of solubility accuracy and error analysis (mean relative error). Moreover, the importance of the input parameters and their respective effects on solubility are discussed. Table 2 summarizes the MAE, R2, MRE, RE_MIN, and RE_MAX values for each model. All metrics in the table are calculated by comparing the solubility points estimated by each model with the measured solubility points (see Appendix 1) at the same conditions of pressure, temperature, and salinity.

To ensure there was no bias in the model predictions and to reduce the probability of overfitting, cross-validation was performed for all models. Cross-validation did not show a significant change in model performance (an average R2 variation of ±1.5% was observed). Hence, it can be concluded that the performance of the models used in this research was neither random nor due to data overfitting.

3.1. XGB and KNN

Although KNN and XGB use different algorithms, both achieved very high prediction accuracy, with XGB performing slightly better. XGB predicted solubility values with near-perfect accuracy (R2 of 99.9%), compared to an R2 of 97.2% for KNN. KNN performed worse specifically where data points were distant from one another (i.e., in the pressure range between 1 and 70 atm), as the original dataset had fewer points in that range. Nevertheless, both methods performed quite similarly in terms of MAE, MRE, RE_MIN, and RE_MAX, and their solubility predictions over a wide range of pressures were acceptable. KNN initially performed somewhat worse than XGB (its initial R2 was around 90%); however, after hyperparameter tuning, its R2 and MAE improved by roughly 10% and 8%, respectively. The model parameters for MLP, XGB, and KNN are shown in Table 3.

In Figure 4, the vertical error bars depict a 5% error in solubility. It can be seen that all predicted values from both methods exhibited errors of less than 5%; in other words, the predicted values were very close to the actual (measured) values. Figures 5(a) and 5(b) illustrate R2 values between the actual data and those predicted using XGB and KNN. As can be observed, both models predicted solubility near perfectly.

3.2. MLP

MLP is a feedforward artificial neural network model with at least three types of layers. Generally, the challenge in tuning artificial neural network models is to find the optimum number of layers and the optimal number of perceptrons for the hidden layer(s). Although MLP performed on par with XGB and KNN in terms of R2 and MRE, the model is more challenging to tune and initialize than the other models used in this research, and it often requires trial and error and hyperparameter optimization to obtain optimal results. The MLP model used in this study consisted of an input layer (with three inputs), one hidden layer (74 perceptrons), and an output layer with a single output. The model obtained an R2 of 0.9962, indicating that the solubility values were predicted with an accuracy of nearly 100%. In terms of RE and MAE, MLP’s performance was comparable to those of XGB and KNN, and it performed slightly better than the other models in terms of MRE. Figure 6 illustrates solubility versus pressure for the actual and MLP-predicted values in the test dataset.

The error bars show a 5% deviation from the measured solubility values. It can be observed that, across the data’s pressure range, the predicted values were quite close to the actual ones. Figure 7 illustrates the correlation between the actual values and those estimated by MLP.

The model performances in Sections 3.1–3.3 were based only on the (randomly selected) test dataset rather than on the training dataset or a combination of training and test datasets. Analyzing metrics based on training datasets, or on a combination of training and test datasets, overestimates model accuracy due to overfitting.

In many previous works [32, 45, 54, 58–60], to demonstrate the accuracy of the developed models, researchers showed model performance (R2, MAE, and relative error) based on training datasets as well. The main problem with the latter is that, regardless of the algorithm used, the high R2 and low errors on the training sets are due not to excellent model performance but rather to data overfitting. In other words, the model is fed an input dataset together with the output values and is then evaluated on the same data; hence, no matter how the algorithm works, it always performs close to perfectly. To remove such bias and overfitting, and to reflect true model performance, the model metrics here were measured only on testing datasets randomly selected from the whole dataset.

3.3. Genetic Algorithm

To estimate the solubility of CO2 in brine, an equation with 5 coefficients was considered as shown below.
where X is the solubility of CO2 in brine in mol/kg, P is pressure in atm, T is temperature in K, and S is the salinity of the brine in ppm. With the help of GA, equation (10) was optimized over the employed datasets, and the coefficients a1 to a5 were obtained.

The main advantage of GA over the other methods used in this study, and over most previous data-driven approaches in similar fields, is that GA can generate an explicit mathematical correlation between the input and output parameter(s). This means that once the data is fitted, with some basic knowledge of the physics of the process, GA can create a correlation between the parameters.

Solubility values versus pressure for both the actual data and the GA-predicted values are shown in Figure 8. The error bars indicate a 5% variation from the measured data. It can be observed that, across the presented pressure ranges, the predicted values were generally very accurate. The correlation between the GA-predicted values and the experimental data is also shown in Figure 8; the high R2 observed in the figure is another indication of an excellent fit between the experimental and predicted data.

3.4. Relative Error (RE) Analysis

While absolute error gives the magnitude of an error, relative error quantifies the magnitude of an error relative to the correct value. In other words, RE is a measure of the uncertainty of measurements in contrast with the size of each measurement. Figure 9 illustrates the relative error of all models used in this study for all datasets (training and test) vs. a data index.

In the current research, cross-validation was used to split the data randomly; therefore, the training and testing datasets were not continuous, unlike in most previous studies [45, 61]. It can be observed from Figure 9 that the majority of the data had a very low relative error regardless of the method used. It is noteworthy that, due to differences in the mechanism of GA (Figure 9(d)), the whole dataset is shown as one series for this model (blue), as opposed to the conventional data split (blue for test and orange for training) seen in Figures 9(a)–9(c). It should also be noted that the unusually low error values for the training dataset in Figure 9 (for XGB, KNN, and MLP) do not necessarily indicate very accurate predictions; they are more likely due to overfitting. The best measure of model performance is the RE of a randomly selected test dataset (indicated by the blue points for XGB, KNN, and MLP).

It can be observed from Figures 9(a)–9(c) that, for all models, 2-4 data points tended to exhibit high RE values (RE > 0.5). All data points with abnormally high RE were investigated and traced back to a pressure of 1 atm, the starting point for all experiments in the original dataset [22]. The most probable explanation for the high RE in these predictions is experimental error during data collection: the solubility of CO2 in brine was measured using the potentiometric titration method, and at near-atmospheric pressures the solubility is close to zero and hence very difficult to detect with this method. Therefore, the data collected in this range is more likely to be erroneous than the data at higher pressures. However, as the cumulative number of such data points was negligible compared to the whole dataset (~1%), they were not removed from the data bank used in this research, even though eliminating them could further enhance model performance.

In the case of GA (Figure 9(d)), in terms of MRE, RE_MAX, and RE_MIN, the model performed considerably better than the three other methods used in this study. Most data points (>99%) showed RE values lower than 0.25. Therefore, it can be concluded that, if tuned correctly, GA is a highly accurate method for the estimation of CO2 solubility. Figures 10(a)–10(d) depict the frequency of data versus the RE ranges for each model.

It can be seen from Figure 10 that, for all models, relative errors were reasonably low and the majority of data fell into the low-error zone, indicating that the predicted values were reasonably close to the actual values. For Figure 10(d), all data points were used in the GA relative-error analysis, meaning there was no split into training and test datasets as for the other three methods. In terms of RE, KNN had the highest number of data points (3 out of 33) with RE > 0.5. Therefore, based on the frequency of points with high RE, the accuracy of the models can be ranked as GA > MLP > XGB > KNN.

3.5. Feature Importance

Feature (parameter) importance is a measure of the extent to which each parameter (input) contributes to the outcome (output). This will shed some light on how the outcome is affected by a series of inputs. In this research, different methods were used for this purpose: feature importance (directly from the model), permutation, and SHAP values. Figure 11 illustrates the parameter importance for XGB. It shows that the most significant parameter that affected the solubility of CO2 in NaCl brine was pressure, followed by temperature and salinity. It should be noted that among the models used in this study, direct calculations of feature importance (shown in Figure 11) were only possible for XGB.

The same results were found by permutation for KNN, MLP, and XGB, as shown in Table 4. The order of the parameters in Table 4 reflects their impact on the output; small positive values indicate that including the parameter in the model had very little effect on the output. For all models, the same parameter importance was obtained. To quantify how (positively or negatively) these parameters affected CO2 solubility, Shapley values (SHAP) were used, as shown in Figure 12.
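The permutation approach can be reproduced with scikit-learn's permutation_importance; the synthetic data below makes the first column ("pressure") dominant by construction, mirroring the ordering reported in Table 4, and the model is a generic gradient-boosted regressor rather than the one used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))               # columns: P, T, S (synthetic)
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] - 0.1 * X[:, 2]

model = GradientBoostingRegressor(random_state=0).fit(X[:240], y[:240])

# Each feature is shuffled in turn; the resulting drop in test R2
# measures how much the model relies on that feature
result = permutation_importance(model, X[240:], y[240:],
                                n_repeats=10, random_state=0)
for name, imp in zip(["pressure", "temperature", "salinity"],
                     result.importances_mean):
    print(name, round(imp, 3))
```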

By using SHAP values, the effect of each parameter, considering its interaction with other parameters (if any), was taken into account. Therefore, Shapley values estimated the importance of each feature by comparing what a model predicts with and without the investigated feature.

The red and blue shades in Figure 12 indicate higher and lower feature values, respectively. Similar to the feature importance shown in Figure 11, Figure 12 also shows the parameters stacked in hierarchical order of importance: pressure is the most important parameter, whereas the effect of salinity is the smallest. The red dots to the right of the 0.0 line mean that higher pressure values increase solubility, whereas lower values are most likely to result in lower solubility; in other words, there is a direct correlation between solubility and pressure. The effects of temperature and salinity were the reverse of that of pressure, with lower values (blue dots) resulting in higher solubility. Thus, based on the SHAP analysis, the solubility of CO2 in brine is inversely correlated with temperature and salinity. The correlation between the input parameters and the output is fully in line with the findings of previous researchers who experimentally discussed the dependency of solubility on salinity, pressure, and temperature [22, 25, 27, 62]. Another point that can be inferred from Figure 12 is that the testing dataset, although randomly selected, was adequately representative of the statistical population: both panels show the same SHAP values and hence identical input-output behavior.

Model-agnosticism is the main advantage of the Shapley interpretation over other feature importance methods. The Shapley interpretation uses SHAP values, rooted in cooperative game theory, to estimate the extent to which each feature contributes to a prediction. A model-agnostic approach interrogates an ML/AI model's behavior without relying on assumptions about its internal structure, which reduces potential bias in the interpretation.

3.6. Comparison of the Model Performance with the Previous Solubility Models

To investigate the accuracy of the model, the predictions from our model (equation (11)) were compared to well-established theoretical models from the literature [27, 28, 33]. Figure 13 illustrates the predictions from the different models for the solubility of CO2 in distilled water at a temperature of 323 K and pressures up to 200 atm. The error bars represent a 5% variation in the predictions of our model. It can be observed from the figure that there is very good consistency between the data obtained from the model developed in this study and those of the other theoretical models. Similar results were obtained when our model was tested at other conditions of pressure, temperature, and salinity (not shown here), and the model's predictions again showed great consistency with the aforementioned models.

Figure 14 compares the solubility of CO2 in NaCl brines estimated using our model and that of Mao et al. [28]. The solubilities were estimated at temperatures in the range of 298-373 K, salinities up to 15,000 ppm, and pressures up to 200 atm. The predictions from the model are in good agreement with those of Mao et al. (R2>0.96). It can therefore be concluded that the solubility estimates from our data-driven model, both in distilled water and in NaCl brines, agree well with those of previous theoretical models.
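The agreement metrics quoted in this comparison (R2 and relative error between two models' outputs) can be computed as below. The solubility arrays are purely illustrative placeholders, not values from Mao et al. or from equation (11).

```python
import numpy as np

def r2(y_ref, y_pred):
    """Coefficient of determination of y_pred against reference values."""
    ss_res = np.sum((y_ref - y_pred) ** 2)
    ss_tot = np.sum((y_ref - np.mean(y_ref)) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_relative_error(y_ref, y_pred):
    """Average absolute relative deviation, in percent."""
    return np.mean(np.abs(y_pred - y_ref) / np.abs(y_ref)) * 100.0

mao = np.array([0.42, 0.61, 0.78, 0.95, 1.10])   # mol/kg, hypothetical
ours = np.array([0.44, 0.59, 0.80, 0.93, 1.12])  # mol/kg, hypothetical
print(f"R2 = {r2(mao, ours):.3f}, MRE = {mean_relative_error(mao, ours):.1f}%")
```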

3.7. Sensitivity Analysis

Data-driven models are known to perform poorly beyond the range of parameters they are trained on. In this study, the scope of the model was pressures up to 200 atm, brine salinities ranging from 0 to 15000 ppm, and temperatures up to 373 K. However, to assess the predictive performance of the model (equation (11)) beyond the limits of its features, the model was tested using a unique set of data in which the parameters were beyond the scope of the model. This set of data can in fact be considered a validation set, as the model had not been trained on this range of features. The model's predictions were compared to solubility values obtained from the well-known theoretical model developed by Duan and Sun [27] at the same conditions of pressure, temperature, and salinity. The analysis was conducted by changing one feature at a time; i.e., pressure was tested at values higher than 200 atm while temperature and salinity were kept within the scope of the model. The same process was repeated for temperature and salinity, and the results were scrutinized for each parameter.
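The one-at-a-time procedure described above can be sketched as a small loop that pushes one input past its training range while holding the others at in-range defaults. Here `predict` is a hypothetical stand-in for equation (11), and the range and default values are assumptions for illustration.

```python
import numpy as np

RANGES = {"pressure": (1.0, 200.0),        # atm
          "temperature": (298.0, 373.0),   # K
          "salinity": (0.0, 15000.0)}      # ppm
DEFAULTS = {"pressure": 100.0, "temperature": 323.0, "salinity": 5000.0}

def predict(pressure, temperature, salinity):
    # Placeholder for equation (11); NOT the paper's actual correlation
    return 0.03 * pressure - 0.01 * (temperature - 298) - 2e-5 * salinity

def oat_cases(feature, values):
    """Vary one feature beyond its range; keep the others at defaults."""
    for v in values:
        inputs = dict(DEFAULTS)
        inputs[feature] = v
        yield inputs, predict(**inputs)

# Extrapolate pressure to 250-500 atm, beyond the 200 atm training limit
for inputs, sol in oat_cases("pressure", np.linspace(250, 500, 3)):
    print(inputs, round(sol, 3))
```

Each case's prediction would then be compared against the Duan and Sun model at identical conditions, as done in the text.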

Figure 15 illustrates the solubility values from the model developed in this study versus values obtained from the Duan and Sun theoretical model, at pressures beyond the range of this study, up to 500 atm. It can be observed from the figure that there is very good agreement between the data obtained from the two models (R2 values >0.99) at temperatures of 323 and 373 K. Similar results were obtained (R2 higher than 0.98) by repeating the same process at other salinities and temperatures within the scope of the model (Table 1) and at pressures between 200 and 500 atm. The average relative error between the solubility values from our model and those of Duan and Sun [27] was 7.3%. It can therefore be concluded that the model works well at pressures beyond its limits, provided the temperature and salinity remain within range.

Figure 16(a) illustrates the predictive performance of the model in contrast with that of Duan and Sun at salinities beyond the limits of our model for two pressure series, while Figure 16(b) shows the correlation between the values calculated from the two models. An excellent agreement between the solubility predictions from the two models can be observed regardless of the pressure series (R2>0.99). The same results were obtained for other combinations of pressure and temperature within the range of Table 1 and salinities beyond it. The average relative error between the two models' outcomes was only 2.4%, which indicates an excellent fit. Hence, the model developed in this study can accurately predict CO2 solubility at salinities up to 50,000 ppm.

The same procedure was repeated by keeping pressure and salinity within the ranges reported in Table 1, while the temperature was increased up to 500 K. The results are reported in Figure 17, in which the error bars represent a 5% error in each data point. It is apparent from the figure that the model's performance at temperatures beyond its limit is not very accurate, with 0.25<R2<0.65 depending on the pressure and salinity of the input data. The predictions deteriorate further as temperature increases, reaching a maximum error of nearly 85% at 500 K; the mean relative error was 65%. Similarly large errors were found when the model was tested at other combinations of pressure and salinity with temperatures higher than 373 K. Considering the normal geothermal gradient for most reservoirs (3 K/100 m) and a surface temperature of 298 K, the temperature limit of our model (373 K) corresponds to a subsurface formation at a depth of approximately 2500 meters; hence, the model is not recommended for deeper formations or formations with abnormally high temperature gradients. One possible reason for the model's poor performance is an inversion point at high temperatures (above 423 K), beyond which solubility increases with temperature [33]. Since the experimental database used to train our model did not include such high-pressure, high-temperature points, the model could not capture this phenomenon, and poor predictions in this range are therefore expected.
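As a quick arithmetic check, the quoted geothermal gradient and surface temperature imply the following depth for the 373 K limit (a back-of-envelope sketch, not taken from the paper):

```python
# Depth at which a formation reaches the model's temperature limit,
# assuming a linear geothermal gradient from the surface.
T_surface = 298.0        # K, surface temperature quoted in the text
T_limit = 373.0          # K, upper temperature limit of the model
gradient = 3.0 / 100.0   # K per metre (3 K per 100 m)

depth = (T_limit - T_surface) / gradient
print(f"{depth:.0f} m")  # 2500 m
```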

In this study, different AI/ML approaches were used to estimate the solubility of CO2 in brines saturated with different amounts of NaCl, at conditions representative of sequestration/injection into low-salinity subsurface formations. An experimental database consisting of 164 data points was split into training and test datasets, and different models, namely, XGB, KNN, MLP (an artificial neural network), and a genetic algorithm, were used to predict solubility. All models accurately estimated the solubility of CO2 in brine (0.95<R2<0.99, 0.02<MAE<0.05, and 0.32<MRE<0.55). A detailed analysis of the relative errors for each model was conducted, and it was found that solubility values at low-pressure points (measurements at 1 atm) showed the highest errors among all data points. Analysis of RE also revealed that the frequency of these erroneous predictions was less than 1% of all data points, and hence these values were not omitted from the database. Different feature importance analyses were implemented to further clarify the correlations between the output and input parameters. The qualitative and quantitative analyses proved that pressure was the most crucial parameter for solubility, with a direct correlation between pressure and solubility. Both temperature and salinity were inversely correlated with solubility and had a lower impact.

To ensure reproducibility of the results, an accurate and simple-to-use empirical model was developed in this study. The model was developed using GA and performed well in terms of MRE, MAE, and R2. The predictions from the model were compared to those of well-known theoretical models, and good consistency among the models was found. The model's performance was also tested beyond the range of its input parameters: it performed well when pressure and salinity were beyond its scope; however, it manifested significant errors when the temperature was higher than 373 K.

The database used in the current research was obtained from a low-salinity formation, and hence, the developed model worked better within its scopes of pressure, temperature, and salinity. However, to develop more comprehensive models with a broader spectrum of applications, it is recommended to use a database comprising experimental data from a wide range of pressure, temperature, and salinities.

The solubility database used to support the findings of this study is included within the supplementary information files (two pdf files).

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

This research was supported by the Science and Technology Project of Heilongjiang Province (2020ZX05A01) and China Petroleum Science and Technology Innovation Fund Project (2020D-5007-0106).

Exclusive Licensee GeoScienceWorld. Distributed under a Creative Commons Attribution License (CC BY 4.0).