Rock classification is critical in geological research and geoscience applications. Traditional methods rely heavily on manual expertise, making them dependent on individual skills and experience and therefore susceptible to human error. Although current machine learning models have mitigated some of these drawbacks by classifying rock images, their generalization and predictive performance are limited by suboptimal network structures and by the quality and quantity of image data. These models also require manual feature extraction, increasing training complexity. This article presents an explainable EfficientNet model for eight-class rock classification, pretrained on a novel dataset. Our high-resolution rock specimen images are curated to standardize the data, reduce noise, and minimize training perturbations, improving classification precision. To enhance the interpretability and reliability of the convolutional neural network, we further examine the visual interpretation maps generated by several class activation mapping methods. These maps demonstrate the model's generalization capability and its ability to capture rock textures, shapes, and colors. This approach not only reinforces the model's interpretability but also underscores its robustness in identifying key discriminative attributes within rock imagery.

Rock classification is the foundation of geological engineering exploration and plays a crucial role in fields such as geophysical exploration, mineral resource exploration, rock mechanics, geotechnical investigation [1], civil engineering [2], geotechnical engineering, and mineralogy. It provides essential guidance for mineral and petroleum resource exploration, optimization of geotechnical engineering design schemes, safety assessment, risk evaluation, and other projects. In geophysical exploration in particular, rock classification not only directly affects the identification of geological layers and the evaluation of reservoir characteristics but also guides subsurface structural modeling and property prediction. Traditional methods of rock classification, such as indirect analyses using seismic, gravity, magnetic, and electromagnetic methods, as well as geochemical analysis, manual thin-section analysis [3], infrared spectroscopy, X-ray powder diffraction, and scanning electron microscopy [4], rely on professionals operating specialized equipment and on expert experience and precision to extract meaningful features from rock samples for classification. These methods share common drawbacks: they depend heavily on the expertise of researchers, involve complex equipment operation [5], and are subjective, time-consuming, and inefficient. Traditional machine learning methods [6-8] for rock classification typically require manually designing feature extraction procedures and training classifiers on rock features to accomplish the classification task. With the advancement of artificial intelligence, deep learning has found widespread application across diverse image classification tasks.

Using traditional machine learning for rock classification requires manually extracting rock features from large training datasets, making training difficult and inefficient. Deep learning is a machine learning approach that learns abstract representations of data by simulating the structure and function of the human brain. Deep learning models usually consist of multilayer neural networks comprising an input layer, intermediate hidden layers, and an output layer [9-11]. These layers automatically learn abstract representations of the data, without manual feature design, to perform tasks such as classification, regression, and clustering. Designing rock classification models that leverage deep neural networks has therefore become a new approach in rock classification [12, 13], introducing new perspectives for stratigraphy and lithofacies division in geophysical exploration. These techniques not only enhance the automatic identification of rock features but also provide efficient and reproducible classification tools for traditional geophysical analyses, advancing applications such as subsurface stratigraphic division and formation prediction [14-16]. Cheng et al. [17] designed an automatic rock grain classification method based on convolutional neural networks (CNNs). By training a CNN on 4800 samples from the Ordos Basin, they classified rock thin-section samples under a microscope. However, their image data were obtained from rock cast thin sections captured under polarizing microscopes, making dataset creation relatively complex and the data not easily accessible. Zhang et al. [18] used the Inception-v3 network and transfer learning to establish a rock image classification model capable of identifying granite, conglomerate, and quartzite, achieving an accuracy of over 85% on test data. Bai et al. [19] developed a CNN-based rock identification model trained on 1000 rock images collected from the internet or captured in the field, reaching a classification accuracy of 63%. Bai et al. [20] also proposed a rock thin-section classification model based on the VGG architecture, which classified six types of rock thin-section images, including andesite and quartz sandstone, with a final accuracy of 82%. Imamverdiyev and Sukhostat [21] proposed a novel 1D-CNN model trained with various optimization algorithms, suitable for lithofacies classification in complex terrain. Feng et al. [22] designed a rock classification model for fresh rock cross-sections based on a dual CNN derived from AlexNet; it extracts more comprehensive and deeper features by considering both the global information of the image and the local texture of the rocks, but suffers from a large model size and relatively low classification accuracy. Hu et al. [23] used image data from geological big data and built a deep learning-based lithology classification model with an accuracy of about 90%. Liang et al. [24] used the vision transformer (ViT) architecture, evolved from the Transformer, to classify seven types of ores with an accuracy of 90%. Based on the Transformer framework, Koeshidayatullah et al. [25] proposed a new FaciesViT model for automatic lithofacies classification, which significantly outperforms CNN and hybrid CNN-ViT models without requiring feature extraction or preprocessing.
In addition to natural-scene rock images, many researchers also use microscopic images of rocks and their infrared spectral features for rock classification. Xu and Zhou [26] designed a U-Net CNN model for automatically extracting deep mineral features from microscopic images, enabling intelligent identification and classification of minerals in ore samples with a final accuracy of 90%. Iglesias et al. [27] used ResNet18 to classify polarized-light microimages of five minerals (olivine, garnet, plagioclase, biotite, and quartz), achieving a model accuracy of 89%. Xiao et al. [28] first obtained spectral images of ores using a visible-infrared reflection spectrometer and then fed the data into a dilated-convolution-based neural network for five-class iron ore classification, achieving classification of granite, schist, and other ores. Li et al. [29] constructed a rapid rock classification method based on BP neural networks, introducing engineering data from a plateau tunnel into the system for training and testing and achieving an overall correlation of 89 after training the network model. Brousset et al. [30] proposed a new rock classification method using artificial neural networks, providing a tool applicable to all types of rocks encountered in underground mines and estimating rock class with an error rate of less than 1%. Zhang et al. [31] proposed a method combining deep residual shrinkage networks and attention mechanisms to suppress the influence of redundant information and features on rock thin-section classification, achieving a macroscopic inspection accuracy of 94.08%.

The above research has achieved rock classification based on deep learning but still faces issues such as low accuracy, limited classification categories, redundant model parameters, and a lack of confidence in model predictions. The EfficientNet-B0 model uses neural architecture search (NAS) to jointly optimize network width, depth, and input image resolution, reducing the number of parameters while maintaining accuracy. It also introduces squeeze-and-excitation (SE) modules with attention mechanisms, allowing the network to dynamically assign different weights to different features during learning and overcoming the attention-dispersion problem of previous models. This enables EfficientNet to achieve an efficient model design with adjustable complexity while maintaining high performance, even with limited computing power. Therefore, this study proposes a rock image classification model based on EfficientNet to classify eight subcategories of rocks across the three major categories of sedimentary, metamorphic, and magmatic rocks. Finally, the interpretability of the model is examined to reveal its decision logic and key features, enhancing trust in and acceptance of the model. Interpretability analysis also helps identify errors and biases, uncover their root causes, and guide adjustments to the model's structure, parameters, or input data to improve performance and accuracy. Class activation mapping (CAM) methods are applied to analyze the interpretability of the model: key information is extracted from the model's gradients and intermediate feature maps, and heatmaps directly associated with the model's predictions are generated using the Grad-CAM, Ablation-CAM, and Score-CAM methods. These methods help capture the local and global features underlying model decisions, and the intuitive visualizations help researchers understand the decision-making process and evaluate the rationality of the model's predictions. The second section introduces the dataset used for training. The third section introduces the selected neural network models. The fourth section presents a comparative analysis of the performance metrics of several classic models and different versions of EfficientNet. The fifth section uses CAM methods to analyze the interpretability of the model. The sixth section presents the conclusions of this article.

There are numerous types of rocks in nature, which are classified into three main categories in geology based on their formation processes: igneous rocks, sedimentary rocks, and metamorphic rocks. Due to significant differences in their formation processes, the three major categories of rocks exhibit distinct characteristics such as density, hardness, permeability, compressive strength, and other parameters. Assessing the strength, stability, and permeability of rocks under stress is crucial for the design of structures such as tunnels, shafts, rock blasting, hydrogeological engineering, and other related projects. The identification of the three major types of rocks is also a fundamental objective in rock engineering and geological engineering. Therefore, the eight categories of rock samples selected in this study include a relatively balanced representation of the three types of metamorphic rocks, three types of igneous rocks, and two types of sedimentary rocks. To ensure better applicability of the model in practical engineering scenarios, we selected representative and commonly encountered rocks from various geological regions as classification objects. These include amygdaloidal basalt, granite, quartzite, sandstone, eclogite, gneiss, and anthracite. Additionally, we included scoria, which presents a unique challenge in classification due to its visually distinct characteristics compared to traditional rocks. This helps evaluate the model’s ability to identify complex geological materials. Furthermore, considering the balance between dataset size and model performance, we set the number of categories to eight. This configuration not only encompasses the major rock types but also ensures the feasibility of data collection, avoiding the issue of data sparsity caused by an excessive number of categories. The characteristics of the eight types of rocks are shown in Table 1.

Due to the complexity of the shooting environment, a significant portion of the photographs contain numerous interferences. These include obstructions by large plants on the rock formations, artificial features resulting from rock processing, and the pronounced color variations of mineral crystals in the rocks under different lighting conditions. Additionally, geological factors may cause an accumulation of multiple types of rocks within a small area. In engineering, the primary criteria for rock classification based on images include texture features, color characteristics, fracture contour features, and structural features of the rocks.

Therefore, to construct a standardized dataset that meets the requirements of deep learning, this study applied an adaptive segmentation algorithm based on image texture features during the data preprocessing stage. High-resolution raw images were automatically cropped in parallel on a GPU cluster, with the output standardized to a size of 224 × 224 to match the input requirements of mainstream CNN models. For macroscopic images, regions highlighting overall structures (such as stratification and fractures) were prioritized, while for microscopic images, the focus was on preserving features such as mineral grains and crystal details. To ensure data quality, a two-stage verification mechanism was implemented after automated processing. In the first stage, a sharpness assessment model automatically filtered out samples with blurred features. In the second stage, a self-developed program randomly sampled and inspected the data, eliminating invalid samples with excessive background coverage or fragmented edges. Ultimately, 3 to 12 valid subimages were extracted from each original image. The cropping results are illustrated in Figure 1(a). Once the cropping size is fixed, an original image that is too large may yield cropped images that reflect only local particle, texture, and color features of the rocks while failing to capture features such as bedding and cracks that appear at a larger scale. Conversely, an original image that is too small may yield low-clarity crops from which effective features are difficult to extract. Therefore, the size of the original images must be chosen carefully so that the processed samples retain as many category-specific features as possible. The final number of samples is shown in Figure 2. Then, image augmentation techniques such as vertical flipping, horizontal flipping, and rotation at arbitrary angles are employed to generate a large and diverse set of new samples, augmenting the dataset, enhancing the model's robustness and generalization capability, and mitigating overfitting. The data augmentation methods used in this study are illustrated in Figure 1(b).
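As an illustration of the resizing and augmentation pipeline described above, the following is a minimal sketch assuming PyTorch/torchvision; the folder layout "data/train" and the exact flip probabilities and rotation range are illustrative assumptions rather than the authors' code.

```python
# A minimal sketch of the preprocessing/augmentation pipeline described above,
# assuming PyTorch/torchvision; the folder layout and parameter values are
# illustrative assumptions.
import torch
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # standardize crops to the CNN input size
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal flip
    transforms.RandomVerticalFlip(p=0.5),     # vertical flip
    transforms.RandomRotation(degrees=180),   # rotation at an arbitrary angle
    transforms.ToTensor(),
])

# One subfolder per rock class (e.g., "granite", "gneiss", ...).
train_set = datasets.ImageFolder("data/train", transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```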

To effectively extract features such as texture, color, and local granularity from rock images, while mitigating model parameter redundancy, achieving model lightweighting, and addressing the issue of model attention dispersion, a rock classification model based on EfficientNet-B0 is proposed. Figure 3 illustrates the schematic data flow within the entire model.

3.1. EfficientNet Model

When training CNN models on image data, increasing any one of the network's depth, width, or input resolution can improve accuracy, but the gain diminishes for larger models. Experience shows that increasing network depth helps capture richer, more intricate image features and generalizes well to new tasks. However, as the number of layers grows, vanishing gradients make deep networks harder to train. Although techniques such as batch normalization (BN) [32] and skip connections [33] alleviate some of these training issues, the accuracy gain of extremely deep networks still tends to diminish. Increasing network width allows the model to capture finer-grained features and makes training easier, but wide, shallow networks often struggle to capture higher-level features. Feeding the network higher-resolution images lets it capture finer-grained, clearer features and does improve accuracy, yet for extremely high resolutions the benefit again diminishes. The key to higher accuracy therefore lies in balancing the network's depth, width, and image resolution. Previous model optimizations improved only one of the three dimensions: model width, model depth, or the resolution of the input rock images, as shown in Figures 4(b)–4(d). The most common approach is to scale the ConvNet by depth (or width); a less common but increasingly popular approach is to scale it by image resolution [34]. While it is possible to expand the model along several dimensions simultaneously, multidimensional expansion requires tedious manual tuning. EfficientNet greatly streamlines this step by using NAS [35, 36] and a compound scaling method to achieve simple and efficient model scaling, becoming one of the best-performing image classification networks on the ImageNet dataset. Unlike the traditional approach of scaling these factors arbitrarily, it uses a fixed set of scaling coefficients to uniformly scale the network width, depth, and resolution. Leveraging NAS and compound scaling, the baseline network (Figure 4(a)) is scaled along multiple dimensions, enabling synchronized adjustment of the network's width, depth, and input image resolution, as illustrated in Figure 4(e), without complex manual tuning. The core of this scaling technique is to find a set of network width, depth, and input resolution values that are best coordinated to optimize network performance. The formulas for this optimization problem are given in equations (1)–(3).

$$\mathcal{N}(d, w, r) = \bigodot_{i=1,\ldots,s} \hat{F}_i^{\,d \cdot \hat{L}_i}\!\left(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\right) \tag{1}$$
$$\max_{d,\,w,\,r}\ \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \quad \text{s.t. } \mathrm{Memory}(\mathcal{N}) \le \text{target memory},\ \mathrm{FLOPS}(\mathcal{N}) \le \text{target FLOPS} \tag{2}$$
$$d = \alpha^{\varphi},\quad w = \beta^{\varphi},\quad r = \gamma^{\varphi},\qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\ \ \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1 \tag{3}$$

Here, w, d, and r are the scaling factors for the network's width, depth, and input image resolution, respectively: d scales the depth L of the network, r scales the resolution (the height H and width W of the image), and w scales the number of channels C of the feature matrix. $\hat{F}_i^{\,d \cdot \hat{L}_i}$ represents the convolution $F_i$ repeated $d \cdot \hat{L}_i$ times, and $\mathcal{N}(w, d, r)$ represents the classification model. The method introduces a user-specified coefficient φ, which controls the magnitude of model scaling in terms of resource consumption, while α, β, and γ are constants, obtained through a small grid search, that measure the relative importance of depth, width, and resolution. First, φ is fixed at 1, indicating that the current resource budget is 1, and a grid search over α, β, and γ yields the optimal values for the base model EfficientNet-B0: α = 1.2, β = 1.1, and γ = 1.15. Then, with α, β, and γ fixed, varying the resource coefficient φ applies compound scaling to obtain EfficientNet-B1 to B7.
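As a small illustration of the compound-scaling rule in equation (3), the sketch below computes the depth, width, and resolution multipliers for a given φ, assuming the published EfficientNet constants α = 1.2, β = 1.1, γ = 1.15; it is illustrative only, not the authors' scaling code.

```python
# Illustrative sketch of the compound-scaling rule in equation (3); phi = 0
# corresponds to B0, and larger phi values yield the scaled variants.
def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    depth_mult = alpha ** phi       # multiplies the number of layers
    width_mult = beta ** phi        # multiplies the number of channels
    res_mult = gamma ** phi         # multiplies the input image resolution
    return depth_mult, width_mult, res_mult

for phi in (0, 1, 2):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```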

EfficientNet-B0 is the model with the smallest parameter size and calculation amount in the EfficientNet series. Therefore, compared to larger models, EfficientNet-B0 demands lower computational resources, making it more suitable for resource-constrained environments. The EfficientNet-B0 model is constructed by stacking 16 Mobile Inverted Bottleneck Convolution (MBConv) modules, and its specific structure is illustrated in Table 2. In the table, each MBConv is followed by a number, either 1 or 6, which represents the expansion factor n (multiplicative factor for channel expansion). This means that the first 1 × 1 convolutional layer in the MBConv expands the input feature matrix channels by n times. The two numbers after k represent the kernel size. “Input” denotes the resolution of the input convolutional layer, “Channels” indicates the number of output feature matrix channels, “Layers” shows the repetition count of the current MBConv module stack, and “Stride” represents the stride of the last repetition block, with the remaining strides defaulting to 1.

Unlike a traditional residual module, the MBConv module uses an inverted bottleneck structure: its intermediate layer has more channels than its input and output feature maps. The composition principle is illustrated in Figure 5. The MBConv module comprises a 1 × 1 convolutional layer, depthwise separable convolution, a dropout layer, and a squeeze-and-excitation (SE) attention module. The MBConv module applies BN after the convolutional layer to normalize the data flowing through the model and accelerate convergence, and it uses the Swish activation function to introduce nonlinearity. The SE module adaptively adjusts the importance of features across channels: it learns a weight for each channel through a global average pooling operation and fully connected layers and then applies these weights to the channels. The SE implementation process, shown in Figure 6, consists of three steps. (1) Compression: the input feature map of size C × H × W is globally average pooled to obtain a feature map of size 1 × 1 × C, converting each channel into a single value; this one-dimensional vector contains the global average feature value of each channel and reflects its overall importance. (2) Excitation: fully connected layers after the compression stage learn the weight of each channel, with ReLU and Sigmoid activation functions constraining the weights so that they represent the importance of each channel. (3) Scale: each channel of the original feature map is multiplied by its corresponding weight, applying a scale factor per channel. This module allocates weights according to each channel's impact on rock image classification accuracy, enhancing effective rock feature channels and suppressing weak ones.
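A minimal PyTorch sketch of the SE block just described (compression, excitation, scale) follows; the class name and reduction ratio are illustrative assumptions rather than the exact EfficientNet-B0 implementation.

```python
# Minimal sketch of a squeeze-and-excitation (SE) block: global average pooling
# (compression), two fully connected layers with ReLU/Sigmoid (excitation), and
# per-channel reweighting (scale). Reduction ratio is an illustrative choice.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # compression: C x H x W -> C x 1 x 1
        self.fc = nn.Sequential(                       # excitation: learn per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                    # global average per channel
        w = self.fc(w).view(b, c, 1, 1)                # channel weights in (0, 1)
        return x * w                                   # scale: reweight each channel
```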

3.2. The Training Steps for the Rock Classification Model Based on EfficientNet

After selecting the EfficientNet-B0 model, training is conducted, including hyperparameter tuning and the choice of optimizer and loss function. The schematic diagram of the entire training process is shown in Figure 7. (1) Data preprocessing is carried out by randomly partitioning the original images into training, validation, and test sets in a ratio of 8:1:1. (2) Data augmentation is performed, including horizontal flipping, vertical flipping, and random rotation. (3) The EfficientNet-B0 model is constructed, and a fully connected layer with input size 1280 and output size 8, followed by a Softmax classifier, is added as the classification head. (4) The cross-entropy loss function is selected, the Adam optimizer is chosen for optimization, and Dropout and L2 regularization are used to mitigate overfitting. (5) Training hyperparameters are set to a learning rate of 0.001, a batch size of 128, and 30 epochs. After these steps, the training and validation data are passed into the model to commence training, resulting in a trained model. Inputting samples from the test set then outputs the predicted probabilities of the eight rock types.
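A hedged sketch of steps (3)–(5) above follows, assuming torchvision's ImageNet-pretrained EfficientNet-B0; the weight-decay value standing in for L2 regularization and the reuse of train_loader from the earlier augmentation sketch are illustrative assumptions, not the authors' exact script.

```python
# Hedged training sketch: replace the 1280-d head with an 8-class layer,
# optimize with Adam and cross-entropy for 30 epochs.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.efficientnet_b0(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(1280, 8)   # 1280-d features -> 8 rock classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()          # cross-entropy (softmax applied internally)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

for epoch in range(30):                    # 30 epochs; batch size 128 set in the DataLoader
    model.train()
    for images, labels in train_loader:    # train_loader from the earlier sketch
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```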

To demonstrate the superior performance of EfficientNet on rock classification tasks, this study examines classification models based on two classic CNNs, each with its own strengths, as well as five EfficientNet variants including EfficientNet-B0. It further explores the trade-offs among EfficientNet variants B0 to B4 to identify the most suitable EfficientNet model for rock classification. The following briefly introduces the network models involved in the comparison: (1) ResNet-18: introduces residual blocks, which use skip connections to learn residuals, effectively addressing the vanishing gradient problem in deep neural networks [37]; by stacking layers it can learn more complex feature representations, enhancing the network's expressive capability and performance on complex tasks. (2) VGG16: adopts a relatively simple and uniform architecture consisting of 16 convolutional and fully connected layers [38]; its 3 × 3 convolutional kernels help reduce the number of parameters while enhancing the network's perception of local features. (3) EfficientNet-B1 to B4: extended variants of EfficientNet-B0, which was developed as a baseline model through NAS on large-scale datasets and a compound scaling strategy. B1 to B4 expand the depth, width, and input resolution of EfficientNet-B0 via compound scaling, ensuring that as the model scales up it uses computational resources more efficiently while improving accuracy. Model complexity and input resolution increase from B1 to B4, making the higher-numbered variants better suited to resource-rich computational environments.

For the rock classification problem, accuracy, validation loss, recall, precision, F1-score, ROC curves, and the confusion matrix are the main criteria for evaluating classifier performance. Moreover, when models perform similarly, model complexity and computational resource requirements are also important criteria for assessing their relative merit. Therefore, this study conducted comparative experiments on the performance of seven CNN models across these indicators.

4.1. Learning Curve

Precision denotes the ratio of samples correctly predicted as positive by the classifier to the total samples predicted as positive. Recall represents the ratio of truly positive samples correctly identified by the classifier to the total number of truly positive samples. F1-score is the harmonic mean of precision and recall, providing a balanced evaluation of the classifier's performance. Accuracy is defined as the percentage of samples correctly classified relative to the total number of samples. Equations (4)–(7) give the calculation formulas for these four metrics. TP denotes true positives (positive instances correctly predicted as positive), FN denotes false negatives (positive instances incorrectly predicted as negative), FP denotes false positives (negative instances incorrectly predicted as positive), TN denotes true negatives (negative instances correctly predicted as negative), and $\sum_{i=1}^{r} z_{ii}$ denotes the sum of the diagonal elements of the confusion matrix, i.e., the number of correctly classified samples. The learning curve visualizes the model's generalization and prediction capabilities.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$F1\text{-}\mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{r} z_{ii}}{N} \tag{7}$$

where N is the total number of samples.
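As a brief illustration, the macro-averaged metrics in equations (4)–(7) can be computed from test-set predictions with scikit-learn; y_true and y_pred below are hypothetical placeholder label arrays, not the study's results.

```python
# Sketch of computing macro-averaged precision, recall, F1 and accuracy.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 5, 7, 4, 6]   # placeholder ground-truth class indices (8 classes: 0-7)
y_pred = [0, 1, 2, 3, 5, 7, 4, 6]   # placeholder model predictions

precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)   # diagonal of the confusion matrix / total
print(precision, recall, f1, accuracy)
```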

The performance of each model on the same dataset is shown in Figure 8. The lines in the figure represent the prediction accuracy and validation loss of the five EfficientNet models on the validation dataset over 30 iterations. It can be observed that EfficientNet-B0 to B4 (EFF0-EFF4) all exhibit strong generalization capabilities and achieve good convergence. Table 3 lists the Macro Avg Precision, Macro Avg Recall, Macro Avg F1-Score, and Accuracy of the seven models on the rock image classification task. Evidently, EfficientNet-B0 outperforms the other six models across all metrics. Table 4 shows the complexity of the seven models, covering four key metrics: total parameters (Total params), memory consumption during the forward and backward passes (Forward/backward pass size), memory required to store model parameters (Params size), and total memory used during model operation (Estimated Total Size). The results indicate that EfficientNet-B0 has the fewest parameters and the lowest memory usage among all EfficientNet versions. EfficientNet-B0 maintains high performance while minimizing hardware requirements, demonstrating superior computational efficiency.

4.2. ROC Curves

ROC curves help intuitively demonstrate the performance of the classifier at different thresholds, illustrate the trade-off between the true-positive rate (TPR) and false-positive rate (FPR), and support a comprehensive evaluation of the classifier under different conditions, including sensitivity and specificity. Equations (8) and (9) show how TPR and FPR are calculated. TP + FN denotes the number of actual positive samples, FP + TN denotes the number of actual negative samples, TP + FP denotes the number of predicted positive samples, and FN + TN denotes the number of predicted negative samples.

$$TPR = \frac{TP}{TP + FN} \tag{8}$$
$$FPR = \frac{FP}{FP + TN} \tag{9}$$

Conventional ROC curves are often plotted in binary classification tasks. In the case of multiclass classification problems, we can obtain multiple ROC curves by decomposing the problem into several binary classification tasks. Then, by using appropriate methods to combine these curves, we can obtain the ROC curve for multiclass classification. There are two commonly used averaging methods: macro-average and micro-average [39]. When there is no significant imbalance in the classification data, the differences between the multiclass ROC curves generated by macro-average and micro-average methods are small. Since the quantities of various rock types in this study are relatively balanced, the macro-average method is chosen to normalize the ROC curves for each rock type. The area under the ROC curve, known as the area under the curve (AUC) value, intuitively reflects the classifier’s ability to distinguish between positive and negative instances. A larger AUC value indicates better classifier performance.
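The one-vs-rest, macro-averaged ROC/AUC computation described above can be sketched as follows with scikit-learn; the random labels and scores are placeholders for the test-set ground truth and the model's softmax outputs, not the study's data.

```python
# Hedged sketch of macro-averaged multiclass ROC/AUC (one-vs-rest).
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

n_classes = 8
rng = np.random.default_rng(0)
y_true = rng.integers(0, n_classes, size=200)           # placeholder true labels
y_score = rng.dirichlet(np.ones(n_classes), size=200)   # placeholder class probabilities

y_true_bin = label_binarize(y_true, classes=range(n_classes))
fpr_grid = np.linspace(0.0, 1.0, 200)                   # common FPR grid
mean_tpr = np.zeros_like(fpr_grid)
for c in range(n_classes):
    fpr, tpr, _ = roc_curve(y_true_bin[:, c], y_score[:, c])
    mean_tpr += np.interp(fpr_grid, fpr, tpr)           # interpolate each class's TPR
mean_tpr /= n_classes                                   # macro-average
print("macro-average AUC:", auc(fpr_grid, mean_tpr))
```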

As shown in Figure 9, the ROC curves for the eight-class classification were plotted for the seven models. The curve for EfficientNet-B0 is closest to the left and top axes, with the largest AUC. To aid in interpreting model performance, a black dashed line was added to the ROC plot as the reference line for a random classifier, corresponding to an AUC value of 0.5. This line serves as a benchmark indicating the performance of a classifier that makes random predictions. By comparing a model's ROC curve with this dashed line, we can assess how much better the model performs than random chance; an AUC value greater than 0.5 signifies a better-than-random ability to distinguish between classes.

4.3. Confusion Matrix

The confusion matrix is used to visualize the relationship between the predicted results of a classification model and the true labels across different classes. It helps identify any biases or tendencies of the model toward specific classes. By observing the elements on and off the main diagonal of the confusion matrix, we can understand the misclassification patterns of the model across different classes and adjust the model accordingly to improve classification accuracy.

Figure 10 displays the confusion matrices of various models, showing that the main diagonal elements of the confusion matrix of the EfficientNet-B0 model are the most concentrated.
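A brief sketch of producing such a confusion matrix follows, assuming scikit-learn and matplotlib; y_true and y_pred are the placeholder label arrays from the metric sketch earlier, and the class-name list simply mirrors the eight rock categories of this study.

```python
# Sketch of building and plotting an 8-class confusion matrix.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

class_names = ["amygdaloidal basalt", "granite", "quartzite", "sandstone",
               "eclogite", "gneiss", "anthracite", "scoria"]
cm = confusion_matrix(y_true, y_pred, labels=range(len(class_names)))
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```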

Based on the performance comparison experimental data, it is evident that among the seven CNN models, the EfficientNet-B0 model achieves the highest scores for accuracy, precision, recall, and F1-score, each reaching 0.97. Its ROC curve is closest to the left and top axes, with an AUC approaching 1, and the main diagonal elements of the confusion matrix are highly concentrated, indicating a close match between class distribution and true labels. Additionally, this model requires the lowest computational power, demonstrating high computational efficiency. Analysis across all metrics confirms that the EfficientNet-B0 model performs best in this rock classification task.

In existing research on rock classification based on deep learning, the emphasis has mainly been on the accuracy of the models, with little investigation into how deep learning models classify rock images [40]. As neural networks continue to evolve, deep learning models have achieved enhanced performance by adopting more abstract (increased network depth) and denser (end-to-end training) structures. However, this also exacerbates the complexity of the neural network “black box.” Such high accuracy yet low-interpretability classification models pose limitations on the application of deep learning in real-world geoscience domains, particularly in tasks where the scientific basis of model decisions is crucial, such as rock classification. Rock classification holds significant importance in disciplines such as geology, geotechnical engineering, mineral exploration, and civil engineering, providing fundamental support for research and practical applications in these fields. Therefore, understanding the decision-making basis of deep learning models in rock classification tasks and verifying the scientific validity and rationality of their classifications are crucial for advancing the widespread adoption of deep learning in geosciences.

In this study, after evaluating the performance of the EfficientNet-B0 model in rock classification tasks, we further examined the fitting behavior of the pretrained deep neural network in this context and conducted an in-depth analysis of the model’s interpretability using CAM methods. By generating saliency maps, CAM methods visually highlight the focus areas within an image and reveal the importance of different regions to the model’s decision-making process, thereby aiding researchers in understanding the basis of the model’s decisions. Considering the variability in analytical capabilities among different CAM methods, we selected three high-performing CAM techniques for interpretability analysis. These methods differ fundamentally in their principles and are suited to various model architectures and task scenarios. By comparing the saliency maps generated by these three methods, we were able to validate the alignment of the model’s decision-making basis with fundamental standards of rock classification from multiple perspectives. Notably, if at least one of these methods produces results consistent with human cognition, it can be demonstrated that the model’s classification basis is scientifically credible. Therefore, the use of multiple CAM methods in this study aims to ensure comprehensive and balanced evaluations of the model’s classification basis rather than to compare the superiority of different methods.

In rock classification tasks, we argue that interpretability analyses of deep learning models should primarily focus on whether the model correctly attends to rock-specific features (e.g. texture, mineral distribution). To this end, our study concentrated on analyzing saliency maps from the model’s final layer to verify whether the model reasonably focuses on rock features during classification tasks. This approach directly illustrates the regions of interest in the model’s final decisions, providing clear evidence of whether the model bases its classifications on geologically meaningful features. Although gradient-based visualizations from intermediate layers can also yield valuable insights, we believe such analyses are more suited to exploring the general working principles of deep learning models rather than the specific feature extraction process in a given task. Considering the “black box” nature of deep learning models, intermediate-layer analyses often involve more intricate internal mechanisms, whose outcomes may not be directly linked to the rationality of classification decisions in specific tasks. Therefore, this study emphasizes saliency map analysis of the final layer to more directly validate the reliability and scientific validity of the model in rock classification tasks. Additionally, the saliency maps generated by different CAM methods may exhibit substantial visual differences. Such discrepancies do not indicate inconsistency in model performance but rather reflect the inherent differences in the principles underlying these CAM methods. This variability further underscores the importance of employing multiple methods in the interpretability analysis of deep learning models to mitigate potential biases introduced by a single method and enhance the reliability of the analytical results. Through this study, the classification basis of the EfficientNet-B0 model in rock classification tasks has been validated from multiple perspectives, demonstrating that it effectively focuses on critical feature regions within rock images. This provides robust support for the reliable application of deep learning models in the geoscience domain.

5.1. Grad-CAM

The principle of Grad-CAM is illustrated in Figure 11. In the figure, A represents the output of the last convolutional layer of the network model after feature extraction from the image. The prediction score y^c for class c is then backpropagated to obtain the partial derivatives of y^c with respect to A. A′ holds the gradient of y^c with respect to each element of A, where a larger gradient indicates a greater influence of that element on y^c. The elements of A′ are then averaged within each channel to determine the degree of influence of each channel of A on y^c [41]. Finally, a weighted summation followed by ReLU activation yields the Grad-CAM heatmap. Figure 12 shows the Grad-CAM heatmaps and corresponding scores for samples classified into different categories by the model. The regions with high gradients are concentrated in the effective classification feature areas, as indicated by the red boxes, suggesting that this method accurately reflects the model's areas of interest. The specific calculation formulas are given in equations (10) and (11):

$$\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^{k}} \tag{10}$$
$$L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\!\left(\sum_{k}\alpha_k^c A^{k}\right) \tag{11}$$

Here, k indexes the channels of the feature layer A, c denotes the target class, A^k denotes the data of channel k in the feature layer A, A^k_ij denotes the value at position (i, j), α^c_k represents the weight for A^k, y^c denotes the score output by the network for predicting the input image as belonging to class c, and Z equals the width × height of the feature layer. The Grad-CAM algorithm is as follows:

Algorithm 1: Grad-CAM

Input: Model f, Image X, Layer l, Class c

A ← f_l(X), y^c ← f(X)

C ← the number of channels in A

h, w ← size(A)

for k in (0, 1, 2, …, C−1) do
    All_grad ← 0
    for i in (0, 1, 2, …, h−1) do
        for j in (0, 1, 2, …, w−1) do
            Grad ← ∂y^c / ∂A^k_ij
            All_grad += Grad
        end
    end
    α^c_k ← (1 / (h × w)) · All_grad
end

L^c_Grad-CAM ← ReLU(Σ_k α^c_k · A^k)

Output: L^c_Grad-CAM
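To make Algorithm 1 concrete, the following is a hedged PyTorch sketch of Grad-CAM for the trained EfficientNet-B0 from the training sketch; the hook target model.features[-1] (the last convolutional block in torchvision's EfficientNet-B0) and all variable names are illustrative assumptions, not the authors' implementation.

```python
# Grad-CAM sketch: capture feature maps and their gradients with hooks, average
# the gradients per channel (alpha_k^c), and form ReLU(sum_k alpha_k^c * A^k).
import torch
import torch.nn.functional as F

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["A"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["dA"] = grad_out[0].detach()

target_layer = model.features[-1]                 # assumed last conv block of EfficientNet-B0
target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

def grad_cam(image, class_idx):
    """image: 1 x 3 x 224 x 224 tensor; returns a normalized heatmap."""
    model.zero_grad()
    scores = model(image)                         # forward pass stores A
    scores[0, class_idx].backward()               # backprop y^c stores dA
    A, dA = activations["A"], gradients["dA"]
    weights = dA.mean(dim=(2, 3), keepdim=True)   # alpha_k^c: per-channel gradient average
    cam = F.relu((weights * A).sum(dim=1))        # ReLU(sum_k alpha_k^c * A^k)
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear")
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```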

5.2. Score-CAM

Unlike Grad-CAM, Score-CAM is a gradient-independent visual explanation method that derives the linear weights from the model's global confidence score for each feature map [42]. The principle of Score-CAM is shown in Figure 13: the activation feature map of layer l of the model is treated as a mask, upsampled to the size of the original image, and multiplied pixel-wise with the original image to generate a new image. The difference between the model's response to the new image and to the baseline image is the global confidence score. Finally, the feature maps and the confidence scores are linearly weighted and passed through the ReLU function to obtain the Score-CAM visual explanation map. The calculation formulas are given in equations (12)–(15):

$$M_l^k = s\big(\mathrm{upsample}(A_l^k)\big) \tag{12}$$
$$S_l^k = f^{c}\big(X \circ M_l^k\big) - f^{c}(X_b) \tag{13}$$
$$\alpha_k^c = \frac{\exp\!\big(S_l^k\big)}{\sum_{k}\exp\!\big(S_l^k\big)} \tag{14}$$
$$L_{\mathrm{Score\text{-}CAM}}^{c} = \mathrm{ReLU}\!\left(\sum_{k}\alpha_k^c A_l^{k}\right) \tag{15}$$

Here, upsample(·) refers to upsampling A^k_l to the same size as the original image X, s(·) normalizes the pixel values to (0, 1), A^k_l represents the k-th channel of the feature map of the model's l-th layer, f(·) represents the model's response to the input, S^k_l is the difference in response values (confidence increase) contributed by channel k for class c, and α^c_k is the corresponding normalized weight. The following is the algorithm for Score-CAM:

Algorithm 2: Score-CAM

Input: Model f, Image X, Baseline Image Xb, Layer l, Class c

A_l ← f_l(X), M ← [ ]

C ← the number of channels in A_l

for k in (0, 1, 2, …, C−1) do
    M^k_l ← upsample(A^k_l)
    M^k_l ← s(M^k_l)
    M.append(M^k_l ∘ X)
end

S^k_l ← f^c(M[k]) − f^c(X_b), for each k

α^c_k ← exp(S^k_l) / Σ_k exp(S^k_l)

L^c_Score-CAM ← ReLU(Σ_k α^c_k · A^k_l)

Output: L^c_Score-CAM
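Following Algorithm 2, here is a hedged Score-CAM sketch reusing the model from the training sketch; the all-zero baseline image and the min-max normalization used for s(·) are illustrative choices, not necessarily those used in the study.

```python
# Score-CAM sketch: mask the input with each upsampled channel, score the masked
# images against a baseline, softmax the score differences, and weight the maps.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(image, class_idx, activations_A):
    """image: 1 x 3 x H x W; activations_A: 1 x C x h x w feature map of layer l."""
    _, C, _, _ = activations_A.shape
    baseline = torch.zeros_like(image)                    # baseline image X_b (assumed all-zero)
    base_score = model(baseline)[0, class_idx]
    scores = []
    for k in range(C):
        mask = activations_A[:, k:k + 1]                  # A_l^k
        mask = F.interpolate(mask, size=image.shape[-2:], mode="bilinear")
        mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)   # s(): scale to (0, 1)
        masked = image * mask                             # M_l^k ∘ X
        scores.append(model(masked)[0, class_idx] - base_score)         # S_l^k
    alphas = torch.softmax(torch.stack(scores), dim=0)    # alpha_k^c
    return F.relu((alphas.view(1, C, 1, 1) * activations_A).sum(dim=1))  # L^c_Score-CAM
```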

5.3. Ablation CAM

Ablation-CAM is based on masking and reactivation of activation maps to enhance the interpretability of CAM [43]. This method utilizes ablation to analyze the importance of each channel on the feature map. The principle of Ablation-CAM is illustrated in Figure 14, where each channel of the activation feature map of an image passed through a certain layer of the model is iteratively set to zero, and then the forward propagation of the model is performed to obtain the Score for the target class c. The relative difference between this value and the score of the original image is the weight of the feature map with the zeroed channel. Finally, the feature maps are weighted and summed and then passed through the ReLU function to obtain the visual explanation map of Ablation-CAM. The specific calculation formula is as follows (equations (16) and (17)):

(16)
(17)

Here, y^c represents the score output for class c after passing the original image through the model, y^c_k represents the score output for class c after setting all elements of the k-th channel of the feature map to 0 and passing it through the remainder of the model, and k indexes the channels of the feature map. Below is the algorithm for Ablation-CAM:

Algorithm 3: Ablation-CAM

Input: Model f, Image X, Layer l, Class c

A_l ← f_l(X), y^c ← f(X), C ← the number of channels in A_l

for k in (0, 1, 2, …, C−1) do
    A^k_l ← 0
    y^c_k ← f(A_l)
    α^c_k ← (y^c − y^c_k) / y^c
end

L^c_Ablation-CAM ← ReLU(Σ_k α^c_k · A^k_l)

Output: L^c_Ablation-CAM
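Finally, a hedged Ablation-CAM sketch following Algorithm 3: it zeroes one channel of the captured feature map at a time by overwriting the target layer's output with a forward hook. The names model, target_layer, and activations_A are assumptions carried over from the earlier sketches.

```python
# Ablation-CAM sketch: ablate each channel, measure the drop in the class score
# relative to the original score, and use the relative drop as the channel weight.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ablation_cam(image, class_idx, target_layer, activations_A):
    """image: 1 x 3 x H x W; activations_A: 1 x C x h x w feature map of target_layer."""
    _, C, _, _ = activations_A.shape
    y_c = model(image)[0, class_idx]                      # y^c for the original image
    weights = []
    for k in range(C):
        ablated = activations_A.clone()
        ablated[:, k] = 0.0                               # A_l^k <- 0
        # Temporarily replace the layer's output with the ablated feature map.
        handle = target_layer.register_forward_hook(lambda m, i, o: ablated)
        y_k = model(image)[0, class_idx]                  # y_k^c
        handle.remove()
        weights.append((y_c - y_k) / (y_c + 1e-8))        # alpha_k^c = (y^c - y_k^c) / y^c
    alphas = torch.stack(weights).view(1, C, 1, 1)
    return F.relu((alphas * activations_A).sum(dim=1))    # L^c_Ablation-CAM
```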

The interpretability analysis with the three CAM methods uses visual explanation heatmaps generated from the feature maps of the last MBConv module in EfficientNet-B0, as shown in Table 5. The heatmaps use color to show the model's attention to different regions of the image and its focus range. They show that the model accurately captures the effective regions and classification features. For instance, in the Score-CAM saliency map of the first rock image (amygdaloidal basalt), the model pays little attention to the blank background and focuses mainly on the rock, demonstrating a strong ability to extract effective regions from the image. The brightest part of the heatmap within the effective region coincides with the amygdule spots on the amygdaloidal basalt, which are crucial features geologists use to identify this rock. This indicates that our model not only achieves high accuracy but also exhibits good interpretability, and that its classification mechanism for rock images conforms to the general rules humans use to judge rock categories, further illustrating the rationality of deep learning models for rock classification.

Rock classification is the foundation of geological and petrological research and plays a crucial role in ensuring the stability and safety of civil engineering projects, making it a highly significant engineering task. Owing to the lightweight parameters, high accuracy, and built-in attention mechanism of the EfficientNet-B0 model, this article proposes a rock classification model based on it. The model is pretrained on a dataset of 5509 images spanning eight rock categories to obtain the final optimized version. The following conclusions can be drawn from the analysis:

  1. The model achieves an accuracy of 97% on the test set, surpassing previous studies in both classification scope and accuracy. Additionally, the model achieves lightweight parameters, making it more widely applicable in environments with limited computing resources.

  2. Furthermore, the EfficientNet-B0 model outperforms the six comparison CNN models (ResNet-18, VGG16, and EfficientNet-B1 to B4) in metrics such as accuracy, recall, precision, F1-score, ROC curves, confusion matrix, and hardware efficiency, highlighting its superior performance in rock classification tasks.

  3. Apart from significant progress in performance, this study also employs three CAM methods to analyze the interpretability of the model. The results show that the model’s prediction process and decision-making are consistent with the judgment patterns of human experts in rock classification, enhancing the persuasiveness and rationality of the model’s conclusions and increasing its credibility in practical applications.

In conclusion, the rock classification model based on deep learning EfficientNet-B0 not only demonstrates excellent performance but also exhibits high interpretability. It can serve as an efficient and rational tool for practical rock classification tasks, providing strong support for research and experiments in related fields.

Further research indicates that changes in the number of rock images and rock categories significantly affect the model’s classification accuracy. Therefore, in future studies, we will strive to collect a greater variety and quantity of rock images and improve the diversity of features for each rock category to ensure that the model’s classification accuracy remains robust even with an increased number of categories.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

This research was supported by the National Natural Science Foundation of China (Grant Nos. 42107214 and 42477157), the National Key Research and Development Program of China (Grant No. 2023YFC3007102), and the Chongqing Natural Science Foundation (CSTB2024NSCQ-MSX0740).

The code is made open on the Github repository at: https://github.com/LiGht-BJUT/An-efficient-and-automated-classification-system-for-rocks. Due to licensing restrictions, the full image dataset used in this study is not publicly accessible. However, researchers interested in obtaining the dataset for academic purposes may contact the authors via email at [provide email address].