## Abstract

Recent enhancements in computational capacity provide an opportunity for harnessing the enormous amount of reservoir data already acquired and extracting useful information for hydrocarbon exploration, development, and production. This article reports a three-step clustering technique to determine well groups based on subsurface geological heterogeneity using feature extraction, hierarchical ensemble clustering, and spatial mapping. The first step of the presented methodology is to group the wells into different clusters based on the formation rock composition and property features extracted from well logs using the expectation maximization algorithm. The one-dimensional (1D) stacking pattern of each well log curve is expressed through a two-dimensional (2D) transformation scheme. Thus, the clustering can capture the vertical stacking patterns of well logs, which is essential for reservoir heterogeneity characterization. This base clustering process generated a feature matrix which is further grouped through the hierarchical ensemble clustering in a latent space of well logs in the second step. Through the ensemble clustering, different clustering proposals obtained from the base clustering are integrated corroboratively to reflect a comprehensive feature of all studied logs. In the third step, the spatial clustering is performed based on the ensemble results, considering the spatial distances between well locations in the target area. The results of the 2D spatial map may provide insights into the sedimentary depositional environment in terms of the lateral geological heterogeneity features. Therefore, the proposed clustering technique can present a fast geological modeling method to integrate geological heterogeneity features presented in multiple well logs, which is not yet fully utilized in traditional geomodeling approaches. The results can also support further reservoir studies, such as petrophysical modeling, reservoir modeling, and fluid flow simulation studies.

## 1. Introduction

Borehole well log data has been widely used to measure properties of subsurface formation rock and fluids. Qualitative and quantitative feature analyses of logs provide effective reflections of geological variations of the target formation for characterizing rock compositions and properties, variations, and formation heterogeneity and identifying potential hydrocarbon reservoirs. Well log-based formation evaluation is a well-established and indispensable procedure in the development of oil and gas fields [1–4].

Recently, numerous machine learning (ML) algorithms have been developed and implemented in petroleum exploration and development as an efficient and cost saving technique to harness the vast amount of reservoir-related information embedded in the streaming data [5–7]. A number of recent publications in this area have applied data mining techniques to extract useful information from voluminous subsurface wireline log data [7–9]. While these studies attest to the value in developing new analytical tools to extract information from different log curves via ML, there is currently insufficient studies on the methods of how to harness well log data to improve decision-making for the development of oil fields and evaluation of reservoir management programs.

Clustering is a machine learning technique that involves grouping of data according to certain predefined measurement similarity [10, 11]. It is a common technique for statistical data analysis used in many disciplines. Conventional cluster analysis has a variety of algorithms. One of the most popular algorithms is the $k$-means, because it is relatively easy to understand and implement and it often generates satisfying results. Generally, with the $k$-means algorithm, data points located in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features.

Clustering has been implemented as a supervised or unsupervised machine learning technique that produces natural groups or patterns with a set of data by ensuring that the assigned observation in the group is similar [12–16]. It is significant to indicate the fact that employing a clustering process on a large data set whereby artificial features are generated can expose hidden information. For example, Asante-Okyere et al. integrated features extracted from the process of clustering well log parameters using $k$-means and Gaussian mixture models to improve lithofacies classification performance of gradient-boosted machine learning techniques [17]. A hybrid clustering method was also proposed by Esmaeilzadeh et al. to divide a reservoir into distinct compartments to reflect the spatial heterogeneity [18].

When taking a limited number of well log curves in a reservoir interval for clustering, the number of log features may be much more than the number of log curves, i.e., the samples. This type of data sample is characterized by a combination of low sample number and high dimensionality. Clustering in this scenario is not adequately addressed by common clustering algorithms. Sharma et al. developed a 2D expectation maximization (EM) method capable of clustering high-dimensional biological data sets [19]. This method involves a two-step approach of developing a feature matrix and performing clustering using the EM algorithm that accepts the feature matrix as an input variable.

In this paper, we seek to extend the application of high-dimensional data clustering method by proposing a novel three-step well clustering approach as given below:

Firstly, we conduct a base clustering according to the features of each well log curve using the EM and $k$-means algorithms. In this step, the well log that is going to be used is the obtained 1-dimensional conventional well log curves. The advanced image well log, such as NMR, is not applicable yet. The proposed 1D to 2D information transformation is critical as the 2D matrix transforming will retain the well log shape signatures for clustering that could otherwise be difficult to be included in the traditional vector-based clustering method [20]. For traditional vector-based clustering method, the well log data measured at each depth are usually treated separately. Thus, the log curve vertical stacking patterns or well log vertical trends are neglected.

## 2. Methodology and Workflow

Following traditional clustering notation, the samples are usually represented as points in a multidimensional space, where each dimension represents a distinct attribute or measurement describing the object. For simplicity, it is usually assumed that the values are present for all attributes. Thus, a set of samples is represented (at least conceptually) as an $N$ by $D$ matrix, where there are $N$ rows, one for each object, and $D$ columns, one for each attribute. This matrix has different names, e.g., pattern matrix or data matrix, depending on the particular applications.

When implementing well clustering, each well is a sample for clustering which is done first in well log data domain and then in the spatial reservoir or a target formation or a reservoir zone. For each sample, the log attributes measured along the depth are attributes in clustering notation. For each sample, it has multiple attributes as shown in Table 1, which is called “a pattern matrix.” In practice, the entire well trajectory is rescaled so that the length of all the wells is the same. Basically, the dimension is $N\xd7M\xd7D$, where $N$ is number of wells, $M$ is number of well log curves, and $D$ is number of data points along the depths.

Usually, for a well, it could have multiple well logs, and each log might have the same pattern matrix. As an example, Table 1 is the pattern matrix obtained from gamma ray log data. Similar matrices can be obtained for other well log types. In the present study, different well log types are named as feature.

The wells in a reservoir are denoted as $W1,W2,\cdots ,WN$. For each well, it has multiple log curves, which are denoted as $L1,L2,\cdots ,LM$ as shown in Figure 1. The feature dimension (well log types) is determined by the number of well log types for clustering, such as gamma ray (GR), spontaneous potential (SP), or resistivity logs. Each well log curve is observed as a vector array given certain top and bottom depth. We denoted each well log vector as $d\u21131,d\u21132,\cdots ,d\u2113D$, $D=1,2,\cdots ,M$. The whole data dimension for a reservoir under analysis is $N\xd7M\xd7D$, as shown in Figure 1. The objective is to cluster all the $N$ wells based on $M$ well log curves, and each log curve has $D$ measurements along the depth (Figure 1).

In the present study, a three-step clustering workflow is proposed, as described below:

*Base clustering*: the first step is to perform base clustering using EM and $k$-means algorithms to reduce the problem to a $N\xd7D$ matrix for each well log parameter. The clustering to the samples/wells will be done based on the analyses on those matrices.

*Clustering ensemble*: after the base clustering, user will obtain $M$ partition results from all the wells because each well log will have a different clustering result. In this step, clustering ensemble further combines the multiple base clustering results to generate a single overall partition result, so that all the clustering analyses are completed in a latent space of well log curves, without considering the spatial distance of each well pairs.

*Spatial clustering*: finally, a spatial clustering was performed on the ensemble clustering result using the spatial information of the well data. In current workflow, the spatial information refers to the latitude and longitude of each well log vertical top interval. The length of this vertical will be determined from the heterogeneity study requirement. Usually, for geomodeling purpose, it could be just one zone that is believed to be the finest heterogeneity study vertical unit. The assigned cluster type at well site will be treated as a probability of each type. An inverse distance weighting method is deployed to conduct this probability interpolation between wells based on the well data. The final type to each cell in the spatial domain will be obtained according to the maximum probability threshold.

More detailed technique introduction to the above-mentioned workflow will be given in the next sections with illustration examples.

### 2.1. Base Clustering Using EM Algorithm

The clustering to the samples of all the wells in a studied reservoir will be performed based on the analysis to those attributes along the depth. The challenge is that the sample number is much smaller than the attribute numbers (number of depths). For example, the well log value is usually measured or interpreted every 0.5 feet or 0.125 meter in target formation with possibly hundreds or even thousands of meters in total depth. Therefore, the attribute dimension is around 10,000 s, while the well number is much smaller, such as could be around 100 s, which means $D>>N$. That is the challenge of dimensionality for most machine learning techniques. The problem with high dimensionality results from the fact that a fixed number of data points become increasingly “sparse” as the dimensionality increase [25, 26].

One of the solutions is to use an unsupervised multivariate data reduction method such as principal component analysis (PCA) or singular value decomposition (SVD) to reduce the dimension of the feature data [27–29]. However, the vertical well log stacking features would be lost and could not be fully captured with the use of those data reduction methods. In the workflow presented in this study, the clustering technique 2D EM is utilized because it can characterize the shape features of well logs [19]. The 2D EM can also be used to deal with the situation of small sample size with high-dimensional data sets. There are two steps in 2D EM clustering: one is 2D dimensional transform, and the other is expectation maximization clustering.

#### 2.1.1. 2D Dimensional Transform

The matrix in Equation (3) has the advantage that it provides a way to preserve the temporal dependency. For example, the matrix $G$ contains temporal correlations because $Gi,j$ represents the relative correlation by superposition of directions with respect to vertical depth interval difference. The main diagonal $Gi,i$ is the special case when the depth equals to zero, which contains the original value or angular information. For example, assume that we have ten observations from a 1D log data. First, we normalize them to [-1,1], next calculate their angular coordinates, and then calculate the $G$ matrix. This 1D vector data will be transformed into 2D data of a $10\u221710$ matrix, and a 2D image is obtained as shown in Figure 2. This vector length is a sensitive parameter in the proposed workflow. In practice, it could be determined by experienced geologist taking considering the heterogeneity feature in the target formation. For example, it could be equal the average sand thickness in the target zone.

Based upon the 2D matrix generation, all vertical 1D well log curves are folded into its 2D matrices form or feature matrices.

#### 2.1.2. Expectation-Maximization (EM) for Base Clustering

After the 2D transformation, base clustering is conducted using a modified EM algorithm. EM is an unsupervised clustering method that discovers factors of the probability distribution with the maximum likelihood. Specifically, random values are selected by the maximum likelihood model to estimate the best fit for the well log data and well samples.

In summary, after a vector is folded into a matrix form, the maximum likelihood estimation of EM algorithm is implemented for matrices clustering. Note that the space of final clustered samples (dots in Figure 3) is a hyperdimensional latent space which has neither spatial meaning nor relationship with the well locations. The 2D plotting is simply a way of illustration. When deploying this well clustering in operation, it is not necessary to make such a 2D projection plot.

### 2.2. Ensemble Clustering Based on Consensus Function

The base clustering is performed for each well log curve separately. In order to get a final integrated cluster type for each well that have multiple curves, the cluster scheme from each single well log curve has to be integrated collaboratively. The final ensemble clustering result is obtained through a consensus function calculation [30, 31]. The general framework of cluster ensembles is shown in Figure 4.

The media partition will find the final partition that maximize the similarity of all candidate partitions obtained with each feature. In practice, it is suggested to keep the final ensemble clusters smaller than the number of base clusters which will capture the common feature from different base cluster schemes. More technique details can be found in literature [34].

### 2.3. Spatial Mapping Based on Cluster Results

In steps 1 and 2, the clustering only considers the well log data and is performed in a latent space represented by the well log data domain. But, for a reservoir or formation, as each well log has a spatial location, we need to group the wells in a way that all the wells with same or similar cluster in latent space are grouped into the same spatial group. The workflow of this step is shown in Figure 5.

From the obtained clustering results and the spacing data, the indicator transformation can be interpreted as a probability of each cluster at the well site. If the indicator value is one, it means the probability to find the specific cluster at that specific site is extremely high; on the other hand, the probability is very low, if the indicator value is zero.

## 3. Case Study

The proposed workflow can be generalized without constraints on scale, whether it is a reservoir or a geological basin. In this study, a small synthetic data set is used for illustration the quick well clustering procedure. The procedure could be used for a geomodeler who would like to quick check all the well data before detailed reservoir modeling or for a petrophysicist whom are required to group wells in an oil field which might come from different areas or reservoir compartments. A 2D spatial distribution of the wells considered for current synthetic study is presented in Figure 6.

### 3.1. Base Clustering from EM Algorithm

Since each well contains four types of log curves (GR, SP, shallow resistivity IL1S, and medium resistivity IL1M), the clustering process for each log data $N\xd7d$ is a matrix $X$. Therefore, all the wells are grouped into four clustering based on each log curve type. For sure, base clustering number is another sensitive parameters in the workflow and should be investigated seriously. However, it is suggested to be larger than the final group total cluster number. The general principle is to add 1 or 3 to the target final group number. The classification results from the base clustering are provided in Table 2 and shown in Figure 7.

### 3.2. Hierarchical Ensemble Clustering

Hierarchical ensemble clustering treats each data point as a singleton cluster and successively merges clusters until all points have been merged into a single cluster. In this step, the input will be the base clustering result. During the optimization, a final data partition ($K$) will also be an input. For this case study, three final partitions are picked considering the geological background of fluvial depositional background of target formation. Three well clusters are obtained from the hierarchical clustering process, and the results are highlighted in Figure 8 and Table 3.

### 3.3. Spatial Well Clustering Mapping

The clustering methods of feature extraction and hierarchical ensemble use the well log data information. With the spatial well clustering, the spatial information of well logs shows the underground heterogeneity spatial features. As stated in Methodology and Workflow, the first step for spatial well clustering mapping requires an indicator transformation to the clustered results at a specific well location. The schematic diagram of indicator transform is shown in Figure 9.

Following the conventional geomodeling practice, the spatial domain will be represented with different modeling cells. In this illustration example, the spatial cell number is $210\u2217210$. Each spatial cell size is $500\u2217500\u2009m2$. The results of the indicator transformation interpolation for the three clusters are shown in Figure 10. Based on the maximum probability threshold method, a specific cluster to each cell based on the probability of each cluster using Equation (13) can be obtained. The resulted maximum probability threshold map provides a 2D spatial mapping for the clustered wells, as shown in Figure 11.

### 3.4. Discussion

Well-based depositional facies is interpreted based on GR, SP, shallow resistivity IL1S, and medium resistivity IL1M logs. The GR logs measure the natural radioactivity of rocks which generally have a far greater concentration in shale than in most other sedimentary rocks and thus generally shows a close relationship to grain size, especially in a terrigenous depositional setting. The depositional facies can be interpreted based on gamma ray log shapes.

The SP curves record the electric potential between an electrode pulled up the hole and the reference electrode at the surface. They have a close relationship to permeability because of the charge in mobile fluids of permeable rocks giving negative SP in permeable rocks. Thus, they are used to recognize the grain-size trends and thus depositional environments in sandstones.

Resistivity logs are electrical well logs that record the resistivity of a formation. Resistivity logs can infer the information of the porosity, the water saturation, and the presence of hydrocarbons of a formation. In this study, two depths of resistivity log curves, shallow and medium resistivity logs (IL1S and IL1M), are used for a fast formation heterogeneity characterization.

Those four well log curves (GR, SP, IL1S, and IL1M) are input variables for a fast clustering of all the wells in the study area. The clustering results successfully identify different sedimentary environment for each well group based on the subsurface heterogeneity. One of each typical well log curve is plotted with the final clustered type map as shown in Figure 12.

Log curve shapes are usually grouped as bell shape (upwards increasing in gamma ray counts), funnel shape (upward decrease in well log value counts), box-car or cylindrical (relative consistent well log measures readings), bow shape (systematic increase and decrease of well log measures), and irregular trend (no systematic change in well log measures values) [39, 40].

It is observed distinctly that the wells in type 1 have three major sand intervals (Figure 12). The shapes are box-car or cylindrical, and the lithology transition features are sand to mud dominated, which means a heterolithic succession containing discontinuous sand bodies. Considering the geological background of fluvial/deltaic depositional environment, type 1 represents the sand that are deposited from channel environment or deltaic depocenters. It also indicates that type 1 is deposited during a constructional alluvial plain succession landward of delta or shore zone system.

For type 2, the log curve shape shows a clear bell shape which upwards increasing in gamma ray counts, which mainly represents an upward coarsening discontinuous sand bodies overlying prodeltaic mud. It is a kind of heterogeneity feature characterized by a deltaic mouth bar depositional background or digitate, laterally discontinuous sand body heterogeneity features.

While for type 3, the feature is a strong heterolithic interbedded, discontinuous facies association. Many fine and thin sand beds are found in a very thick mud/shale background, which represent a fluvial flood plain depositional background. All those heterogeneity features fit the depositional environments well.

The results can be used directly as a facies model for subsequent petrophysical property modeling, such as porosity and permeability, if the target formation top and bottom are in an acceptable thickness range. Alternatively, the proposed method could be used to define layers for each zone in the target formation for a 3D reservoir facies modeling.

## 4. Conclusion

In the present study, a novel 3-step workflow of clustering wells based on the subsurface heterogeneity was proposed. Initially, the well log data was transformed into a 2D feature matrix by grouping each well into four clusters for each well log parameters using EM algorithm. The hierarchical ensemble clustering technique is then employed to achieve three clusters by combining different base clustering outputs of specific feature matrices. Finally, a spatial grouping is conducted using well spacing and formation top and bottom depth data.

The clustering results can successfully identify the different sedimentary facies for each well group based on the subsurface heterogeneity. This fast clustering approach can help uncover trends associated with attributes of each well. Thus, the result can provide a petrophysical analysis of the wells in different areas, zones, or reservoir compartments.

The proposed workflow also provides 2D spatial mapping, which serves as a means to predict categories of formation at the reservoir scale, thereby rendering the method amenable to real-time classification of wells. Therefore, this subsurface heterogeneity characteristic revealing can also be used as a certain type of fast geomodeling approach comparing to traditional geostatistics-based geomodeling method. With its high accuracy and low cost, this novel hybrid well clustering method can be adopted as a quick geological modeling tool for areas with inadequate well and seismic data or for frontier exploration.

## Data Availability

The data in this study includes multiple well log data and spatial well location coordinate. The findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be supported by the corresponding author.

## Disclosure

This work was performed as part of the employment of the authors working for Saudi Aramco.

## Conflicts of Interest

The authors declare that they have no conflicts of interest.