## Abstract

The biogenic contents of marine sediments, such as carbonate (CaCO_{3}), total organic carbon (TOC), and biogenic silica (opal), provide important information about past climatic and environmental changes. For sediment cores, such as those from the marginal seas of the western Pacific, intensive laboratory analysis takes considerable time and effort. Previous drilling and coring programs have developed nondestructive methods that require less time and labor, such as those that use visible reflectance derivative spectra measured on the surface of sediment samples to estimate downcore biogenic content. Nevertheless, these methods have been shown to be useful only for on-site estimation of downcore samples and are not considered entirely feasible for samples collected at regional or larger spatial scales. The present study presents a novel protocol of spectral decomposition using varimax-rotated principal component analysis (VPCA) for estimating the biogenic contents of sediment samples at the basin scale. Using two sediment cores from the South China Sea (SCS) separated by 200 kilometers, we evaluated the new protocol by measuring the visible reflectance spectrum and the biogenic content. Based on six VPCA components of first derivative reflectance spectra and laboratory-analyzed biogenic contents of core MD972148, a set of empirical equations for estimating CaCO_{3}, TOC, and opal contents was established. The equations were tested using data from core MD012396, and the new regression equations provided accurate estimations. Our study demonstrates that the new method achieves better estimates because the regression model is improved by a reduced number of independent variables. Further, this study circumvents the limitation of applying empirical equations to sediment cores outside of the calibration range. Our findings suggest that, with more comprehensive and systematic reflectance spectral data, the new protocol can be used in future research to estimate biogenic content at regional or larger spatial scales.

## 1. Introduction

Variations in the biogenic components of marine sediment cores provide essential information for reconstructing paleoclimatic and paleoceanographic changes over time. Several studies, for example in the western Pacific marginal seas, have found that biogenic carbonate (CaCO_{3}), total organic carbon (TOC), and biogenic silica (opal) contents are good proxies for paleoceanographic reconstructions, which are important for understanding how the Asian monsoon (AM), ocean current patterns, sea level, terrigenous sediment input, and ocean productivity have changed through time. For instance, CaCO_{3} content changes in the Pacific abyssal basin are systematically linked to glacial-interglacial cycles [1–6]. The complex interactions among biological productivity, carbonate preservation and dissolution, and interocean abyssal ventilation have been investigated, and these mechanisms may contribute to shaping atmospheric CO_{2} levels on glacial and interglacial time scales [1–3, 7, 8]. Biogenic contents in marine sediments thus provide useful information on hydrographic and productivity changes, with implications for the mechanisms that drive the global carbon cycle. Methods that measure biogenic contents in marine sediments within a reasonable timeframe and with adequate precision are therefore still much needed in marine geology.

Sediment samples must be chemically treated before their CaCO_{3}, TOC, and opal contents can be measured in the laboratory [9, 10]. Measuring CaCO_{3}, TOC, and opal in a single sample takes at least five days. Routine replicate analyses of biogenic content are therefore time-consuming, labor-intensive, and sample-destructive. Many paleoclimatic and paleoceanographic studies have attempted to determine the biogenic content of closely spaced samples from cores located at the same site using fast, cost-effective, and nondestructive techniques such as visible derivative spectroscopy. For example, reflectance spectra at 511 wavelengths ranging from 455 to 945 nm have been used to reconstruct CaCO_{3}, opal, and nonbiogenic sediment percent variations over the past 3.5 million years at Site 846 [11]. There is evidence that the positions of first derivative peaks can be used to identify specific sediment components, such as hematite, goethite, chlorite, illite, kaolinite, and montmorillonite [12–14]. As part of this technical development, the first derivative of reflectance spectral data between 250 and 850 nm from core tops in the Atlantic and Pacific has been used to develop calibration equations for estimating calcite, organic carbon, and opal content changes at ODP Site 847 [15]. In particular, the optical lightness ($L^{*}$) has been used as a proxy for assessing changes in carbonate content at ODP Hole 997A [16]. The proportion of blue reflectance (450-550 nm) was used to estimate CaCO_{3} variation in a sediment core retrieved from the southern flank of the Agulhas Ridge [17]. Further, first derivative reflectance spectra (400-700 nm) at 1 cm resolution from sediment cores collected off Baja California yield R-mode factors that are strikingly similar to the variability of carbonate and organic carbon contents that closely match Greenland ice core *δ*^{18}O over the past 52 ka [18].

In our previous study, we attempted to establish sediment biogenic calibration equations for the SCS based on the fundamentals described above [19]. All possible combinations of parameters and their transformations derived from visible derivative reflectance spectra between 400 and 700 nm were considered as input variables [19]. The resulting equations were successful only when estimating the biogenic content of the core from which the calibration data were selected. They did not apply to biogenic content estimations for other sediment cores, even those located within a limited spatial range of the SCS. The problems may arise from the high multicollinearity (variance inflation factor $VIF>5$) of the independent variables included in the regression equations ([19]; Table 1). Furthermore, this type of calibration employs a regression modeling approach that retains too many variables, resulting in overfitting and unstable model predictions [20]. Another possible caveat of the previous calibration is that R-mode factors derived from the high-dimensional spectral data of one core may differ slightly from those derived from other core datasets. Such slight differences in the R-mode factor models may cause erroneous estimations for the other cores.

Because of the problems described above, this study implemented a varimax-rotated principal component analysis (VPCA) [21–25] to evaluate the spatial stability and accuracy of biogenic content estimation for two SCS cores. In previous research, the VPCA method was shown to identify more physically meaningful mineral components, resulting in estimates that were more consistent with regional ground-truth data [23]. Building on these earlier experiments, we adapted the VPCA method to estimate the CaCO_{3}, TOC, and opal contents of downcore sediments within the SCS. Visible derivative reflectance spectra and laboratory-measured, ground-truth biogenic content data from core MD972148 (Figure 1) were used to establish calibration equations for estimating the CaCO_{3}, TOC, and opal contents of samples from nearby core MD012396 (Figure 1). We compared estimated and ground-truth data from MD012396 to assess the consistency of the two datasets and, therefore, the predictive ability of our new model. In this study, a new methodology is presented, and its capability to predict biogenic sediment contents from reflectance spectral data beyond the calibrated range is tested.

## 2. Methods

### 2.1. Materials

The studied cores MD972148 (19.36°N, 117.54°E; water depth 2830 m) and MD012396 (18.73°N, 115.85°E; water depth 3308 m) were obtained during IMAGES cruises conducted in 1997 [26] and 2001 [27], respectively (Figure 1). In the present study, the visible diffuse reflectance spectra from the northern core, MD972148, were used after several transformation steps as a training calibration dataset to estimate the biogenic contents in sediments. Core MD012396, which is located 200 kilometers from MD972148 and about 500 m deeper downslope, was used as validation data: the biogenic contents estimated with the calibration equations were tested against ground-truth measurements from the same core.

Visible diffuse reflectance spectra of core MD972148 were generated using a hand-held spectrophotometer, the Minolta CM-2600d, following Pan and Chen [19] (Figure 2). MD012396 reflectance data were obtained from shipboard measurements on wet sediment using a similar spectrophotometer, the Minolta CM-2022 [27] (Figure 2). Both datasets contain 31 variables of percent reflectance across the visible band (400-700 nm) at a wavelength interval of 10 nm. For each core depth, we calculated the center-weighted first derivative of the reflectance spectrum to minimize the effects of scattering caused by differences in the water content or grain size of the sediment, which may bias the spectral variance [23, 28].
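
The derivative transformation can be sketched as follows. This is a minimal illustration, assuming that "center-weighted" refers to a central finite difference on the 10 nm grid. The function name `first_derivative_spectra`, the one-sided treatment of the two end wavelengths (chosen here so that 31 variables are retained), and the synthetic example spectra are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Wavelength grid described in the text: 400-700 nm at 10 nm steps (31 bands).
WAVELENGTHS_NM = np.arange(400, 701, 10, dtype=float)


def first_derivative_spectra(reflectance, wavelengths=WAVELENGTHS_NM):
    """Return d(reflectance)/d(wavelength) for each sample (one row per depth).

    Interior bands use a central difference; the two end bands use a
    one-sided difference (an assumption made here to keep 31 columns).
    """
    reflectance = np.asarray(reflectance, dtype=float)
    return np.gradient(reflectance, wavelengths, axis=1)


# Hypothetical example: two synthetic spectra, in percent reflectance.
example = np.vstack([
    10.0 + 0.02 * (WAVELENGTHS_NM - 400.0),  # linear ramp, slope 0.02 %/nm
    np.full(WAVELENGTHS_NM.size, 25.0),      # flat spectrum, slope 0
])
derivative = first_derivative_spectra(example)
```

Because a derivative removes any constant offset in reflectance, two samples whose spectra differ only in overall brightness yield the same derivative spectrum, which is the motivation given above for working with derivatives.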

All visible diffuse reflectance derivative data were compared with laboratory-measured CaCO_{3}, TOC, and opal content data: those previously obtained from MD972148 [29] and 50 new “ground-truth” measurements generated from MD012396 for this study using the same laboratory procedures.

### 2.2. VPCA Statistical Treatment

The visible diffuse reflectance derivative spectral matrix of core MD972148 contains 31 variables. These variables are all highly interdependent and, consequently, highly correlated with each other due to crystal field effects during electron absorption [23]. As a result, we reduce the dimensionality of the matrix through the extraction of principal components from the first derivative spectra of both cores using varimax-rotated principal component analysis (VPCA), applied using IBM SPSS software. In comparison with R-mode factor analysis, variable-based VPCA has been shown to be more practical for decomposing a highly interdependent reflectance derivative matrix into independent components while preserving the variance within wavelengths [23]. Moreover, as with other R-mode factor analyses, the component loadings after VPCA treatment are independent of each other. The effect of this procedure is to minimize the collinearity problem when running regression analysis on what were initially highly interdependent input variables (Figure 3).
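
For readers without access to SPSS, the decomposition can be approximated with NumPy. The sketch below assumes a PCA on the correlation matrix of the derivative spectra followed by a plain varimax rotation (without the Kaiser normalization that some packages apply), and it computes component scores by least squares on the rotated loadings. The function names `varimax` and `vpca` are hypothetical, and the result is not guaranteed to reproduce the SPSS output exactly.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix using the varimax criterion."""
    n_vars, n_comp = loadings.shape
    rotation = np.eye(n_comp)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        target = rotated ** 3 - rotated * (np.sum(rotated ** 2, axis=0) / n_vars)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = float(np.sum(s))
        if criterion > 0 and new_criterion < criterion * (1.0 + tol):
            break  # the criterion has stopped improving
        criterion = new_criterion
    return loadings @ rotation


def vpca(data, n_components=6):
    """Return rotated loadings, component scores, and explained variance."""
    data = np.asarray(data, dtype=float)
    # Standardize each wavelength so the PCA acts on the correlation matrix.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = (z.T @ z) / (z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    rotated = varimax(loadings)
    # Least-squares scores; these are uncorrelated with unit variance.
    scores = z @ rotated @ np.linalg.inv(rotated.T @ rotated)
    explained = np.sum(rotated ** 2, axis=0) / data.shape[1]
    return rotated, scores, explained
```

Calling `vpca(derivative_matrix, n_components=6)` on a samples-by-31 matrix returns a 31-by-6 loading matrix and one score per component for each sample. The scores are mutually uncorrelated, which is the property the regression step relies on.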

The six extracted VPCA components in the training calibration dataset obtained from MD972148 explain approximately 90% of the total variance of the visible reflectance derivative data (Table 2(a)). A similar number of VPCA components explain approximately 90% of the variance in the test dataset from MD012396 (Table 2(b)). The components of MD972148 and MD012396 were labeled alphabetically (“A” to “F”) based on their spectral shapes to ensure that they represent the same mineral compositions and are labeled similarly (Figure 4).

The VPCA provides component scores for each extracted component, which represent the projection of the component loadings back onto the first derivative matrix of reflectance spectra [23]. Essentially, the component scores describe how strongly each component is expressed in the spectral data of each sample. Because the components are independent, they can be reliably combined in forward stepwise linear regression analysis. In regression analysis, input variables and their transformations (ratios, differences, or higher-order terms) are often used as potential independent variables during model development. We avoid higher-order terms because VPCA is based on linear combinations of the input variables. Because the components are orthogonal (independent) and centered on zero mean, ratios of component scores do not carry meaningful information. However, the difference between two component scores, calculated by subtracting one from the other, provides a meaningful measure of the distance between the two components for each sample. The component scores and their subtractive transformations (Table 3) were used as independent variables in the next step to estimate biogenic contents by regression analysis. The resulting model quality was evaluated using standard regression statistics, with an additional stopping criterion based on the variance inflation factor (VIF_{crit}). Terms with $VIF_{crit}>2$ were excluded from the forward stepwise principal component regressions.
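
The construction of the candidate predictors can be illustrated with a short sketch. It assumes six components labeled A to F and builds every pairwise difference in alphabetical order. The exact set listed in Table 3 is not reproduced here, and the function name `candidate_predictors` and the label format are hypothetical.

```python
from itertools import combinations

import numpy as np

COMPONENT_LABELS = ("A", "B", "C", "D", "E", "F")


def candidate_predictors(scores, labels=COMPONENT_LABELS):
    """Map predictor names to columns: raw scores plus pairwise differences."""
    scores = np.asarray(scores, dtype=float)
    candidates = {f"type {name}": scores[:, i] for i, name in enumerate(labels)}
    for i, j in combinations(range(len(labels)), 2):
        name = f"type {labels[i]}-type {labels[j]}"
        candidates[name] = scores[:, i] - scores[:, j]
    return candidates
```

With six components this gives 6 raw scores and 15 differences, or 21 candidates in total. Reversing the order of a subtraction only flips the sign of the fitted regression coefficient, so one ordering per pair is sufficient.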

### 2.3. Forward Stepwise Regression

Forward stepwise and best subset regression are two methods of selecting variables in multiple regression analysis [30]. A previous SCS study [19] used best subset regression to establish biogenic content equations. Best subset regression tests all possible models and provides the optimum candidates based on the adjusted coefficient of determination (adj-$R^{2}$) or Mallows’ coefficient. To produce a model with the highest adj-$R^{2}$ value, the best subset regression procedure may include unnecessary or physically meaningless variables, resulting in overfitted models and erroneous estimations.

In this study, a forward stepwise regression method with a more rigorous criterion was used to select the most appropriate predictor variables from the six extracted components of MD972148 and their subtractive transformations (Figure 5). The independent variables (Table 3) were incorporated into the model automatically by the SPSS statistical software. In this procedure, independent variables are added or removed at each step according to their statistical significance [20, 30]. The prespecified criteria were a partial F-test probability of $P<0.05$ for entry and $P>0.1$ for removal. Variables entered early in the model may no longer be significant once other appropriate variables have been included, in which case they are removed. The analysis was terminated when no statistically significant variable could be added to or removed from the model. This procedure ensures that no important variables are missed and generates a list of all available regression models (Supplementary Table 1-3).
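
The selection logic can be sketched outside SPSS as follows. This is an illustrative approximation, assuming ordinary least squares with an intercept, a partial F-test for a single variable, entry when the probability is below 0.05, and removal when it exceeds 0.1. The function names are hypothetical, the code relies on `scipy.stats.f.sf` for the F distribution, and it is not claimed to reproduce the SPSS output step for step.

```python
import numpy as np
from scipy import stats


def _rss(X, cols, y):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    design = np.column_stack([np.ones(len(y)), X[:, np.asarray(cols, dtype=int)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def partial_f_pvalue(X, y, base_cols, extra_col):
    """P-value of the partial F-test for adding extra_col to base_cols."""
    full_cols = list(base_cols) + [extra_col]
    rss_small = _rss(X, base_cols, y)
    rss_full = _rss(X, full_cols, y)
    df_resid = len(y) - len(full_cols) - 1
    f_stat = (rss_small - rss_full) / (rss_full / df_resid)
    return float(stats.f.sf(f_stat, 1, df_resid))


def stepwise_select(X, y, p_enter=0.05, p_remove=0.10, max_steps=50):
    """Return the column indices chosen by stepwise entry and removal."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []
    for _ in range(max_steps):
        changed = False
        # Entry: add the most significant remaining variable, if any qualifies.
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        p_in = {j: partial_f_pvalue(X, y, selected, j) for j in remaining}
        if p_in:
            best = min(p_in, key=p_in.get)
            if p_in[best] < p_enter:
                selected.append(best)
                changed = True
        # Removal: drop the least significant variable already in the model.
        if len(selected) > 1:
            p_out = {
                j: partial_f_pvalue(X, y, [k for k in selected if k != j], j)
                for j in selected
            }
            worst = max(p_out, key=p_out.get)
            if p_out[worst] > p_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected
```

Keeping the entry threshold stricter than the removal threshold prevents a variable from being removed immediately after it has entered, so the loop cannot alternate on a single variable.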

Further, we calculated the variance inflation factor (VIF) as well as the adj-$R^{2}$ value to determine the most efficient model among those generated by the forward stepwise regression (Supplementary Table 1-3). VIF values indicate multicollinearity, that is, the correlations among the independent variables. They start at 1 and have no upper limit. A VIF equal to 1 means no multicollinearity among the independent variables. VIF values between 1 and 5 typically indicate low to moderate correlations among the independent variables in a regression model, and VIF values $>5$ indicate a high degree of multicollinearity [30–32]. In line with these guidelines, we required the VIF of every variable entered into the model to be less than 2 [32]. This rigorous criterion yields biogenic regression models of the best feasible quality. Aside from the VIF values, the adj-$R^{2}$ of the regression was also taken into account in deciding whether a new variable should be included: a newly selected variable was not retained if the change in adj-$R^{2}$ between steps of the model was less than 0.05. Using these criteria, a final set of biogenic content regression models was built (Table 1); more detailed information regarding the coefficients, the ANOVA tests, and the model summaries for CaCO_{3}, TOC, and opal contents is presented in Supplementary Table 1-3.
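
The two screening statistics can be computed directly. The sketch below uses the standard identity that the VIF of each predictor equals the corresponding diagonal element of the inverse correlation matrix of the predictors, together with the usual definition of adjusted $R^{2}$. The helper `accept_step`, which combines the VIF limit of 2 and the minimum adj-$R^{2}$ gain of 0.05 stated above, is a hypothetical name introduced for illustration.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (samples in rows)."""
    X = np.asarray(X, dtype=float)
    if X.shape[1] == 1:
        return np.array([1.0])  # a single predictor has no collinearity
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted coefficient of determination for a fitted model."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def accept_step(vifs, adj_r2_prev, adj_r2_new, vif_limit=2.0, min_gain=0.05):
    """Keep a newly entered variable only if both criteria are met."""
    return bool(np.all(np.asarray(vifs) < vif_limit)
                and (adj_r2_new - adj_r2_prev) >= min_gain)
```

For example, `accept_step([1.1, 1.1], 0.70, 0.78)` returns `True`, whereas a step that raises any VIF above 2 or improves adj-$R^{2}$ by less than 0.05 is rejected.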

## 3. Results

### 3.1. Estimation and Validation

Based on the calibration dataset from MD972148, the forward stepwise regression models with the best statistical performance for estimating CaCO_{3}, TOC, and opal contents were developed (Table 1). The regression equations include only one or two independent variables, far fewer than the equations previously obtained by best subset regression [19]. Even though the new equations use only a few variables, the correlations between the estimated and ground-truth CaCO_{3}, TOC, and opal contents measured from core MD972148 are only slightly lower (Table 1). This indicates that these parsimonious models may achieve the desired level of explanation or prediction. Their goodness of fit can be further evaluated by examining how well the forward stepwise models fit observations that lie, in particular, outside the calibrated range.

The spectral-based estimates and the residuals are plotted against the laboratory-measured biogenic contents of MD972148 in Figure 6. The ground-truth CaCO_{3}, TOC, and opal contents are positively correlated with the estimated values. The residuals are distributed close to zero and are small compared with the scale of the ground-truth data for all three biogenic models in the calibration dataset. Further, the laboratory-measured CaCO_{3}, TOC, and opal contents and the spectral-based estimates for the calibration dataset exhibit similar, highly correlated downcore trends (Figure 7).

We then used the newly generated laboratory-measured CaCO_{3}, TOC, and opal content datasets of core MD012396 to test the performance of the equations. To a first-order approximation, the curves of the estimates based on the MD972148 calibrations match the ground-truth measurements well, which suggests the potential of our new regression models (Figure 8).

## 4. Discussion

### 4.1. New Regression Models

Compared with the previous study of Pan and Chen [19], the regression models established in this study for estimating biogenic contents are significantly improved (Table 1). First, each of the three biogenic models includes no more than two variables and a constant, with reasonable coefficients. Second, the VIF values of all models are equal or close to 1, indicating a low degree of multicollinearity among the independent variables. Further, the adj-$R^{2}$ values of the newly constructed regression models are approximately the same as those of the earlier models, which included many more variables. The results of our study indicate that the use of subtractive terms allows the models to be written in a more compact fashion, and that forward stepwise regression with rigorous criteria is appropriate for establishing biogenic equations based on diffuse reflectance spectral data.

In the CaCO_{3} regression model, the forward stepwise procedure stopped after selecting four variables (Supplementary Table 1). An extra variable, type D-type F, was incorporated in the final step. This resulted in a moderate correlation between variables ($VIF>2$). In this situation, even though the adj-$R^{2}$ is higher than in the previous steps, the newly added variable should not be considered. Although the variables in the second and third steps are statistically significant, the changes in adj-$R^{2}$ after step 1 are below 0.05, indicating a low contribution to the model’s performance.

The TOC regression model consists of five steps (Supplementary Table 2). However, the collinearity statistics indicate moderate to severe multicollinearity from the third to the fifth step, which would reduce the accuracy and/or precision of the estimation. Because the changes in adj-$R^{2}$ are greater than 0.05, the variables included in the second step were kept for the calibration of TOC. Similar to the TOC regression model, the opal regression model contains two variables (Supplementary Table 3). The second step of its regression process meets all the criteria we set for forward stepwise regression in this study.

The three calibration models have been tested using sediment core MD012396. Using the calibration core MD972148, we have estimated fluctuation patterns of CaCO_{3}, TOC, and opal content of MD012396 that are approximately consistent with the ground-truth analysis results (Figure 8). Consequently, the regression models established in this study may provide useful estimates of biogenic content variation.

In addition, the residual distributions of CaCO_{3} and TOC in Figure 9 show weak negative trends in the validation set. A possible explanation lies in the differences in the spectral shapes of the component loadings at the two sites (Figure 4). Loadings that have been categorized as the same type do not necessarily represent exactly the same composition. Since the variable type B-type C was selected in both the CaCO_{3} and TOC regression models (Table 1), we infer that the compositional differences of type B-type C between the two sites may grow or shrink with increasing concentrations of CaCO_{3} and TOC and thus lead to the negative trends in the residual distributions. We suggest that a systematic laboratory approach is needed to refine the estimates of CaCO_{3} and TOC in future investigations.
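
A trend in the validation residuals can be quantified with a simple linear fit. The sketch below assumes that a residual is defined as the estimated value minus the measured value, a sign convention that is not stated in the text, and the function name `residual_trend` and the example numbers are hypothetical.

```python
import numpy as np


def residual_trend(measured, estimated):
    """Return residuals, their linear slope against measured values, and Pearson r."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    residuals = estimated - measured  # assumed sign convention
    slope, _intercept = np.polyfit(measured, residuals, deg=1)
    r = np.corrcoef(measured, residuals)[0, 1]
    return residuals, float(slope), float(r)


# Hypothetical example: estimates that compress the measured range.
measured = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
estimated = 0.8 * measured + 0.2
residuals, slope, r = residual_trend(measured, estimated)
```

Under this convention, a negative slope means that high measured contents tend to be underestimated and low contents overestimated, which is one way a weak negative trend such as that described above could arise.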

### 4.2. Spatial Accuracy Assessment

Using 50 sediment samples from MD012396 as a test dataset, the biogenic contents of the sediments were measured, and those values were plotted against the spectral-based biogenic estimates from the regression models established in this study (Figure 8; orange curve; Figure 9). To a first approximation, the spectral-based biogenic estimates agree well with the ground-truth measurements. Because our MD972148 TOC training calibration dataset has a relatively narrow TOC content range (0.32-1.32%) (Figure 8; purple arrow), our spectral-based TOC estimates are unable to capture the absolute values (Figure 9) but show similar relative variability (Figure 8). This underestimation of TOC is within our expectations, since the ground-truth values exceed the range of the calibration dataset; under such no-analog conditions, it is common to obtain inaccurate estimated values. Nevertheless, the similarity in relative variation between the estimated and ground-truth values suggests that our regression model captures independent variables that are physically related to changes in TOC contents. For CaCO_{3} and opal, the differences between the estimated and measured values are very small compared with those for TOC (Figures 8 and 9), which suggests that our new regression models can predict accurate values for core samples that do not appear in our calibration dataset. Our small-scale experiment therefore indicates that the CaCO_{3} and opal contents estimated from color reflectance spectra are reasonable estimates of the ground-truth values. For future studies, it might be useful to expand the calibration dataset by adding more samples with a wider range of TOC contents to make the calibration more widely applicable. Overall, the results of our experiments indicate that the nondestructive color reflectance spectra method can be used to estimate the contents of biogenic material not only in core samples from the calibration site but also in samples from a nearby core that was not used for calibration.
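
Samples that fall under no-analog conditions can be flagged before the estimates are interpreted. The sketch below uses the TOC calibration range of 0.32-1.32% quoted above. The function name `outside_calibration_range` and the example TOC values are hypothetical and are not data from either core.

```python
import numpy as np

# TOC range of the MD972148 calibration dataset, in percent (from the text).
TOC_CAL_RANGE = (0.32, 1.32)


def outside_calibration_range(values, cal_min, cal_max):
    """Return a boolean mask of values that lie outside [cal_min, cal_max]."""
    values = np.asarray(values, dtype=float)
    return (values < cal_min) | (values > cal_max)


# Hypothetical measured TOC contents (percent) from a validation core.
toc_measured = np.array([0.45, 0.90, 1.32, 1.50, 1.80])
flags = outside_calibration_range(toc_measured, *TOC_CAL_RANGE)
```

Estimates for flagged samples are extrapolations of the regression equation and are better read as indicators of relative variability than as absolute values.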

## 5. Conclusions

The goal of this study was to present a new model that uses color reflectance spectra to estimate variations in the contents of biogenic CaCO_{3}, TOC, and opal in sediment core samples located 200 kilometers away from the core used to develop our calibration equations. We found that color reflectance spectra from sediment cores have great potential for producing long-term, high-resolution time series of biogenic content variations. In this paper, we describe a protocol of VPCA methods that may be used as a standard, statistically valid procedure for estimating the concentrations of biogenic CaCO_{3}, TOC, and opal. The protocol outlines a set of criteria that can be used to circumvent the issue of multicollinearity when developing regression models. Our regression equations provide useful estimates when applied to a core outside the calibration dataset, although absolute TOC values beyond the calibration range are underestimated. In conclusion, after careful statistical treatment, color reflectance spectral data can be used to estimate the biogenic content of sediment cores.

## Data Availability

The data that support the findings of this study are available from the corresponding author (Hui-Juan Pan) upon reasonable request.

## Conflicts of Interest

The authors declare that they have no conflicts of interest.

## Acknowledgments

This study was supported by research project MOST 110-2116-M-019-001 from the Ministry of Science and Technology, Taiwan.