## Abstract

High-resolution LiDAR-derived datasets from a 1.5 m digital elevation model and a detailed landslide inventory (*n* ≥ 1000) for Magoffin County, Kentucky, USA were used to develop a combined machine-learning and statistical approach to improve geomorphic-based landslide-susceptibility mapping.

An initial dataset of 36 variables was compiled to investigate the connection between slope morphology and landslide occurrence. Bagged trees, a machine-learning random-forest classifier, was used to evaluate the geomorphic variables, and 12 were identified as important: standard deviation of plan curvature, standard deviation of elevation, sum of plan curvature, minimum slope, mean plan curvature, range of elevation, sum of roughness, mean curvature, sum of curvature, mean roughness, minimum curvature and standard deviation of curvature. These variables were further evaluated using logistic regression to determine the probability of landslide occurrence and then used to create a landslide-susceptibility map.

The performance of the logistic-regression model was evaluated by the receiver operating characteristic curve, area under the curve, which was 0.83. Standard deviations from the probability mean were used to set landslide-susceptibility classifications: low (0–0.10), low–moderate (0.11–0.27), moderate (0.28–0.44), moderate–high (0.45–0.61) and high (0.62–1.0). Logistic-regression results were validated by using a separate landslide inventory for the neighbouring Prestonsburg 7.5-minute quadrangle, and running the same regression function. Results indicate that 74.9% of the landslide deposits were identified as having moderate, moderate–high or high landslide susceptibility. Combining inventory mapping with statistical modelling identified important geomorphic variables and produced a useful approach to landslide-susceptibility mapping.

**Supplementary material:** The statistical data used in the combined machine-learning functions are available at https://doi.org/10.6084/m9.figshare.c.5351313.v3

**Thematic collection:** This article is part of the Digitization and Digitalization in engineering geology and hydrogeology collection available at: https://www.lyellcollection.org/cc/digitization-and-digitalization-in-engineering-geology-and-hydrogeology

Landslides occur frequently in eastern Kentucky, an area characterized by valleys deeply incised into flat-lying Carboniferous sedimentary rocks along the western margin of the Appalachian Basin of the eastern USA; direct costs are conservatively estimated to be between $US10 million and $US20 million annually (Crawford 2014; Crawford and Bryson 2017). These costs result from damage to roads, residences and other infrastructure. Indirect costs such as road closures, utility interruption and decreased property value are also significant, but challenging to quantify. Landslides in eastern Kentucky are primarily triggered by rainfall and occur in shallow (<3 m) colluvial soils; major factors contributing to slope instability include steep slopes, weak bedrock, variable soil characteristics and development on hillslopes (Outerbridge 1987; Crawford 2014). Proactive efforts to assess landslide susceptibility are lacking, however, and needed for mitigation and risk-informed resources in affected communities.

Landslide susceptibility is defined as the relative tendency or potential for slope movement in a given area (Guzzetti *et al.* 2006; Highland and Bobrowsky 2008; Hearn and Hart 2019). A landslide-susceptibility map classifies or ranks slope stability in categories based on relationships of factors that contribute to instability, as opposed to a hazard map, which may indicate elements of time or estimated landslide extent (National Research Council 2004; Highland and Bobrowsky 2008). Susceptibility modelling typically is conducted using one of two approaches: (1) physics-based methods, which use probabilistic or deterministic models incorporating geological and geotechnical variables to derive a slope-stability assessment; and (2) geomorphic-based statistical methods, which model slope conditions that influence landslide occurrence (Formetta *et al.* 2014; Reichenbach *et al.* 2018). One limitation of physics-based approaches, particularly at a regional scale, is that they require specific knowledge of soil properties, hydrological conditions and geotechnical inputs into slope-stability models. These data are typically not available on a regional basis or at the catchment scale, as is the case in this study.

Geomorphic-based susceptibility modelling focuses on slope morphology, the quality of which is dependent on data accuracy and resolution of terrain models. The availability of aerial light-detection and ranging (LiDAR) data, used to generate terrain models, allows landslides to be identified at a level of detail not possible with lower-resolution digital elevation models (DEMs) or traditional topographic maps (Burns *et al.* 2010). Landslide features such as headscarps, flanks, toes and distinct hummocky terrain are often easily observed in LiDAR-derived data (Schulz 2007; Crawford 2012; Jaboyedoff *et al.* 2018). LiDAR-derived map datasets consist primarily of DEMs, and their derivatives, such as hillshades, slope, aspect, curvature and roughness.

The purpose of this study is to establish a reliable framework for assessing landslide susceptibility at a regional scale using a statistics and geomorphic-based approach. We have advanced this approach by using a robust landslide inventory and by combining two traditionally distinct machine-learning methods that complement each other to produce a final susceptibility map. We mapped landslides in Magoffin County, Kentucky; compiled geomorphic statistics for each slide; determined geomorphic variable importance with bagged trees, an ensemble decision-tree machine-learning technique; and then used those variables in a logistic-regression model. Logistic-regression results were validated by using a separate landslide inventory for the neighbouring Prestonsburg 7.5-minute quadrangle. We further analysed the logistic-regression results by plotting kernel-density estimations of the individual geomorphic variables and their predicted probability of influencing landslide occurrence.

A focused field-based approach is often impractical due to a scarcity of expenses and time, lack of land access and the effort needed to create, or field check, a statistically robust (*n* = 1000) landslide inventory. Our intent is not to dismiss the importance of direct field-based observations or expert knowledge in determining landslide susceptibility, but to demonstrate that this statistical, geomorphic-based approach can be applied in other environments and support expert (but often subjective) knowledge. High-resolution elevation data and their derivatives provided an opportunity to take advantage of specific geomorphic conditions, many of which are related to bedrock geology and influence landslide occurrence. These high-resolution datasets, coupled with detailed landslide-inventory mapping, allowed us to produce accurate and detailed landslide-susceptibility maps. These maps are particularly important because existing landslides are often susceptible to reactivation, which makes modelling the probability of occurrence and developing a susceptibility map with logistic regression a practical approach (Cruden and Varnes 1996; Crawford 2014). A shift toward reliable data-driven methods that convey hazard information is critical to assist those needing to avoid areas of potential slope instability.

Although determining landslide susceptibility has inherent uncertainty, several studies comparing statistical techniques (Akgun 2012; Felicisimo *et al.* 2012) indicate that logistic regression, based on receiver operating characteristic (ROC) curves, outperforms other machine-learning techniques (such as likelihood ratio models, entropy-based analysis, multi-criteria decision analysis, and classification and regression trees) and that a combination of techniques can strengthen accuracy and reliability. As opportunities to navigate GIS, machine learning and statistical analysis increase, this multifaceted approach can be reliably used by researchers and stakeholders with access to LiDAR and map-based landslide inventories. Understanding the slope-geomorphology variables that are most important to landslide occurrence can support susceptibility mapping, inform stakeholders of planning and mitigation strategies, and ultimately reduce risk.

## Geological setting

Magoffin County and the Prestonsburg quadrangle are in the Eastern Kentucky Coal Field, part of the larger central Appalachian Basin (Fig. 1). The county and quadrangle areas are 800.5 and 152.9 km^{2}, respectively. Topographic relief can be as much as *c.* 228 m and the mean slope is 24.5°. The landscape is highly dissected, characterized by narrow ridges and sinuous alluvial valleys. Deeply incised stream drainages and variable hillslope morphologies range from long and narrow to bowl-shaped tributary valleys. Bedrock comprises flat-lying complex sequences of Carboniferous (Pennsylvanian) sandstones, siltstones, shales, coals and underclays (McDowell 1986; Greb *et al.* 2009). The hillslope morphology viewed in a LiDAR-derived hillshade model is often a good indicator of underlying bedrock geology, indicating the connection between bedrock and slope characteristics. For example, the more resistant lithologies, such as sandstones and siltstones, are often associated with steeper slopes and thinner soil cover. Shale beds, coals and underclays weather easily and are known to be associated with high landslide occurrence (Crawford 2014; Chapella *et al.* 2019). Slopes are mantled with colluvium of varying thickness, and mass wasting is a dominant process, moving soil and rock downslope by creep, sheetwash, landslides and debris flows (McDowell 1986). The colluvial soil is generally fine to coarse loam, typically poorly sorted, with grain sizes that range from clay to medium-coarse boulders perhaps a metre in diameter (Blair and McPherson 1999). The Unified Soil Classification System classifies the soils in Magoffin County as fine silts and clays (CL and ML), as well as sands mixed with fines (SM) (Kentucky Division of Geographic Information 2020; kygeonet.ky.gov/kysoils). Colluvium transport downslope and its velocity range from imperceptible (creep) to rapid (catastrophic). Landslides that occur in colluvium are commonly thin (<3 m) translational slides or thicker rotational slumps, but both types have the capability of developing into damaging debris flows or debris slides, especially on steep slopes (Fleming and Johnson 1994; Turner 1996; Crawford 2014).

## Methods

### Landslide-inventory mapping

Landslide extents were primarily mapped by visual inspection of a multidirectional hillshade (Nagi 2014) derived from a 1.5 m LiDAR DEM. Secondary maps of slope, roughness, curvature, plan curvature, contour and traditional hillshade, as well as aerial photography, were used to help identify landslide features and constrain confidence in mapping deposit extents. Extents of landslides that included features such as headscarps, flanks, toe slopes and hummocky topography were digitized as GIS polygons. A tiling scheme of *c.* 1500 × 1500 m was used in the GIS to keep track of mapped areas. Initial inspection of the hillshade landscape was at a scale of 1:3000 and digitizing was zoomed to smaller scales. In order to consistently capture each landslide, the polygon included headscarps, flanks and toe slopes. For example, the upper boundary of a landslide polygon was traced across the crown of the slide, slightly above the vertical displacement of the headscarp. A total of 1054 landslides were mapped in Magoffin County and a range of sizes and shapes was documented (Fig. 2). The mean landslide area was *c.* 6397 m^{2}, with most of the landslides less than 25 000 m^{2} (Fig. 3). Generally, for translational colluvial landslides, the distance to length ratio is less than 0.1 and, for rotational slides, the depth to length ratio is 0.3 to 0.1 (Highland and Bobrowsky 2008). An inspection of the landslide inventory does reveal a consistent relation to area; however, we do not know if this relationship is a proxy for volume or magnitude. Although the small to moderate-size landslides and their shapes could infer landslide type, the age or potential behaviour was not determined. With over 1000 landslides identified, additional investigations that incorporate age, landslide behaviour and runout potential would be extremely time intensive.

We used a confidence rating system based on methods developed by the Oregon Department of Geology and Mineral Industries, which qualitatively describes a confidence that a landslide was interpreted correctly, determined by the clarity of features visible in remotely sensed data (Burns and Madin 2009). The rating number from 0 to 10 is based on ranking four landslide features (headscarps, flanks, toe slopes and hummocky terrain); zero is unidentifiable and 10 is clearly identifiable in the LiDAR hillshade (Table 1). If the secondary maps were used, then a higher or lower ranking was issued accordingly (Chapella *et al.* 2019). An overall numerical confidence ranking from 0 to 40 was assigned by the landslide mapper. Of the 1054 landslides mapped, 1.3% were considered low confidence (≤10), 44.2% as moderate confidence (11–29) and 54.4% as high confidence (≥30). A landslide width and axial length of *c.* 20 m was used as the threshold for mapping, because landslides smaller than these dimensions are typically stream-bank collapses or roadway-embankment failures. These smaller landslides, although important, often have a different mode of failure controlled by different geomorphic parameters. This added complexity would reduce overall model effectiveness; thus, we did not consider them.

### Geomorphic variables and statistics compilation

Because of the large project area and because LiDAR-derived DEMs show significant detail in the landscape, we needed to modify raster datasets because of limitations in computing power and file size. In addition, selecting the right cell size for a DEM should be guided by the scale of mapping and the original resolution of the data (Fressard *et al.* 2014). Therefore, the 1.5 m DEM was resampled to 3 m cells. Geomorphic maps of elevation, slope, terrain roughness, curvature, plan curvature and aspect were then generated from the resampled DEM using a radial smoothing window of *c.* 15 m to reduce noise (Table 2).

To obtain consistent geomorphic statistics, a circular buffer was generated around the centroid point of each mapped landslide (Fig. 4). The landslide extent may, depending on the size and shape of the landslide, fall outside the buffer polygon. A buffer polygon that represents most of the landslide extent is superior to a single point in accounting for variability in landslide characteristics, however (Timilsina *et al.* 2014; Raja *et al.* 2017). The buffers for all 1054 landslides were used to extract six statistical variables from each of the six geomorphic maps (Table 3). This process resulted in 36 individual statistical values for each landslide (maximum, minimum, range, mean, standard deviation and sum of values within each buffer for each map – e.g. slope map). The buffer created for all mapped landslides had an area of 6648 m^{2} (radius of 46 m), which is the average area of the 1054 inventoried landslides. We tested buffer radii of *c.* 15, 30, 46 and 61 m to determine which was the most effective, and the bagged-trees model accuracy tests supported the 46 m radius (see ‘Susceptibility-map variable determination’ section). Although there is some co-dependence between variables, we argue that starting with an abundant number of variables increases the probability of capturing the strongest correlations and will produce better model accuracy and a smoother, more realistic map.

Several additional variables were considered for determining landslide susceptibility, but ultimately not included. Bedrock geology was not used primarily because, although lithology varies, the mapped geology is reasonably uniform across the county. Geological formations across Magoffin County, as well as most of eastern Kentucky, are mapped as similar packages of sedimentary rocks that include thin to thick beds of variable lithologies. In general, weaker lithologies, such as shales, coal beds and underclays, influence landslide activity (Crawford 2014; Chapella *et al.* 2019). Many of these problematic lithologies are not mapped separately, however, and at the scale of landslide susceptibility mapping would not be reliable for a statistical model. In addition, for the landslides that we mapped in Magoffin County, there is no correlation between landslide deposits and mapped geological formations. The landslides that occur within each formation (some slide extents straddle multiple formations) are consistent with the percentage area that the formation covers, and the clusters of landslides generally correlate with areas of steeper slopes. The hillslope statistics compiled in this study do reflect underlying bedrock geology to some extent, but a more in-depth investigation that correlates geological formations and lithology would require a larger inventory to determine a reliable influence of one formation over another, and is beyond the scope of this study. Therefore, we treated the mapped bedrock geology as a constant across a county scale. The modelling approach presented here would work equally well in an area where bedrock geology needs to be accounted for and detailed lithology is available as a model input.

Soil erodibility was also not considered. This index classification is based on a measure of a soil's sensitivity to the effects of rainfall and land use, developed by the Natural Resources Conservation Service. The drawback to using soil erodibility is that the soils data have a broad distribution across landforms, and differences in county boundaries can lead to spurious map results. Hillslope erosional potential was also excluded. It is defined as the product between mean annual precipitation and slope angle (Mitchell and Montgomery 2006). Adding rainfall data could be useful, but we did not have confidence in hillslope erosional-potential maps because of the coarseness of the rainfall data and the challenge of using a single, averaged dataset as a proxy for stream power and discharge; it does not reflect reality and would limit the usefulness of the final map result.

To prepare data for bagged-trees and logistic-regression models, statistical data for non-landslide areas are required for the creation of a dependent variable called the indicator (1 or 0). Non-landslide areas must also have comparable buffer shapes so that contrasting feature statistics can be gathered. The same procedure (using a 46 m radius buffer) was followed to generate geomorphic statistics for the non-landslide areas as for landslide areas (Fig. 3). An equal number of landslides (1) and non-landslides (0) is required for an equal class-distribution ratio, which helps to eliminate class bias (Dai and Lee 2002; Oommen *et al.* 2010; Gupta *et al.* 2019). The buffers were inspected for overlap between non-landslide areas and landslide areas and culled accordingly. Significant overlap dictated that 123 buffers be eliminated in order to maintain an equal number of random non-landslide and landslide statistics. Table 4 is an example subset of the entire 36-variable dataset, showing slope values for landslides and non-landslides. These statistics, plus the indicator variable, make up a binary dataset used in the bagged-trees and logistic-regression methods.

### Susceptibility-map variable determination

Two machine-learning methods were used to determine model input variables for the landslide-susceptibility map: an ensemble bagged-trees classification and binary logistic regression. The bagged-trees approach was used to elucidate the variables in the logistic-regression model and is a reliable first pass at variable importance, effective for remotely sensed geophysical data. Bagged trees predicts a weighted classification using the indicator variable to return an approximation of variable importance (Breiman 2001; Meinshausen 2006; Zhu and Pierskalla 2016; Dou *et al.* 2019). The technique is called ‘bagged trees’ because it combines statistical results of many individual decision trees in order to improve model performance and reduce model overfitting (Mathworks 2019*b*; https://www.mathworks.com/help/stats/treebagger-class.html). Generally, the bagged-trees algorithm aggregates a set of decision trees to get a single decision tree to rank feature importance. Bagged trees has been previously used to successfully prepare data for landslide-susceptibility mapping, often resulting in high model accuracy and sensitivity values compared to other machine-learning methods (Cracknell and Reading 2014; Youssef *et al.* 2015; Chen *et al.* 2018; Chang *et al.* 2019; de Oliveira *et al.* 2019; Dou *et al.* 2019). Logistic regression models the probability of an event (a landslide) being a function of other variables, and quantifies probability based on statistical analysis of past landslides (Atkinson and Massari 1998; Dai and Lee 2002; Mathew *et al.* 2009; Bai *et al.* 2010; Fressard *et al.* 2014; Timilsina *et al.* 2014; Raja *et al.* 2017; Lombardo and Mai 2018; Reichenbach *et al.* 2018; Chang *et al.* 2019). It uses a logistic function to model a dependent variable: It is simply a nonlinear transformation of the linear regression. Since the geomorphic dataset contains the statistical information on presence or absence of a landslide, the results are log-odds for the value labelled ‘1’ (landslides), which is a combination of one or more of the independent variables (Dai and Lee 2002; Suzen and Doyuran 2004; Bai *et al.* 2010). The value predicted is a probability of an event ranging from 0 to 1 – i.e. an estimate of the maximum likelihood that a landslide will be influenced by the statistics of observed independent variables.

Bagged-tree classification guides the logistic-regression analysis, supporting variable selection, determining variable importance and reducing logistic-regression input variables, which avoids overcomplexity for the susceptibility map (Lu and Weng 2007; Lombardo and Mai 2018). Decision-tree techniques rank relative importance regarding influence on landslide occurrence; however, logistic regression indicates which modelled relationships are statistically significant and the nature of those relationships. Modelling the likelihood of landslide occurrence via equations in a regression analysis thus complements the initial bagged-tree classification.

For an initial insight into variable determination, the MATLAB Classification Learner application (Mathworks 2019*a*; https://www.mathworks.com/help/stats/classificationlearner-app.html) was used to train the bagged-trees and logistic-regression models, classify the data and determine an overall model accuracy, and ultimately validated the use of the *c.* 46 m buffer radius (Table 5). Classification Learner does not run the specific model functions; rather, it serves as a guide for the optimal landslide and non-landslide buffer-size determination. For each of the four variable buffer radii, separate surface-roughness smoothing windows were tested, which yielded 20 distinct possible combinations (Table 6). Terrain roughness was calculated separately because of its dependence on scale of the DEM, as well as scale of the landscape (Berti *et al.* 2013; Korzeniowska *et al.* 2018)*.* Smoothing was performed using the Focal Statistics tool in a GIS, which computes a neighbourhood operation generating an output raster where value of the cells is a function of input cells in a specified neighbourhood. The roughness smoothing window of *c.* 38.1 m performed the best, coinciding with the buffer radius of *c.* 46 m. We also included a roughness index map, which is an index of the highest return from all roughness smoothing windows for each pixel (Lindsay *et al.* 2019). Results from Classification Learner include an overall model accuracy (percent), 2D scatter plots of the landslide geomorphic variables based on indicator response (Fig. 5), a confusion matrix and assessment of model performance. The model performance was judged with the overall accuracy and confusion matrices, which measure overall quality and separateness of classes in the dataset. The plots can also guide decisions on inclusion or exclusion of geomorphic variables for modelling. The overall model accuracy of 83.3% for bagged trees was higher than or equal to other machine-learning algorithms used in the Classification Learner, such as Nearest Neighbor Classification, Discriminant Analysis, Naïve Bayes, Decision Trees and other Ensemble models.

## Ensemble bagged-trees function results

After determining that the *c.* 46 m radius for the landslide buffer and a *c.* 38.1 m smoothing window for terrain roughness were optimal for overall model accuracy, the full bagged-trees function was performed. The binary landslide data were divided into training (75%) and test (25%) datasets. The bagged-trees model estimates feature importance from the entire statistical dataset of 36 geomorphic variables. Feature importance is a prediction of relative importance based on the combination of statistical variables. During the tests using numerous variables, a threshold of feature importance of 0.8, just slightly above the average of 0.73, was a consistent mark of separation to choose the important variables (Fig. 6a). The performance of the bagged-trees function was validated using the ROC, area under the curve (AUC), which was calculated in the test dataset as 0.90 (Fig. 6b). The ROC plots the sensitivity (true positives) v. specificity (false positives) and summarizes the model performance across all decision thresholds within the binary data (Felicisimo *et al.* 2012). Based on the feature importance scores, 12 variables were selected for regression analysis (Table 7). The only variable with a high importance score that was not used in the regression analysis was range of curvature. The range-of-curvature raster map inconsistently highlights heavily modified parts of the landscape, skewing the realistic results of a landslide-susceptibility map.

## Logistic regression

The bagged-trees result of 12 important variables was used in the binary logistic-regression model. We used JMP Pro statistical software package (SAS Institute Inc.) to conduct the logistic-regression analysis. The indicator (response) variables, the 1s and 0s, are a nominal modelling type because we made no assumptions about the data and wanted a multilevel logistic response that models the binary data and fits a probability of predictor variable value of 1. Logistic-regression results derive a coefficient of responses ($\beta $ values) and determine which variables are significant (*p*-values). Low *p*-values indicate the data are unlikely to support a lack of difference; i.e. low *p-*values (<0.05) are relevant additions to the model because they are related to changes in the indicator variable, a rejection of the idea that there are no relationships in the binary data.

_{0}is the constant intercept,

*V*is the geomorphic variables and $\beta $ is the coefficient estimates of responses in the indicator variable. The coefficients express the effects of the predictor independent variables on the relative risk of being a landslide or not a landslide, which increases or decreases with each value of the independent variable

*V*– i.e. the rate of change in log-odds as

*V*changes.

*z*is total contribution of all predictor variables, a model of relative risk of features in the landscape being a landslide or not a landslide.

*P*is the cumulative estimated output probability of an event occurring (landslide occurrence or non-occurrence). The output is confined between 0 and 1. We assumed the variables were not normally distributed or did not have linear relationships (Suzen and Doyuran 2004; Nandi and Shakoor 2009). Therefore, the logistic-regression analysis worked well because the primary unknown was the relationship among the variables.

## Results and landslide-susceptibility map

Bagged-tree analysis on all 1862 records (total binary dataset minus overlapping buffers) identified 12 variables as being important (>0.8 threshold). We conducted logistic regression on the 12 variables and found that eight geomorphic variables were significant (*p*-value ≤0.05; Table 8). The four excluded variables were found to be redundant in the regression correlation, and removing them had negligible effect on AUC and overall accuracy. Table 8 also shows the LogWorth (–log_{10} (*p-*value)), which is a transformation of the *p*-value and a way to visualize the relative weight of each variable. The higher the significance, the higher the LogWorth. We evaluated the performance of the logistic-regression model with the AUC, which was 0.83 (Fig. 7).

Equation (4) shows each variable coefficient ($\beta $) multiplied by the associated geomorphic variable raster map and summed. The results from equation (4) were put into equation (3) in order to generate probability of landslide occurrence for Magoffin County. The mean probability value for the county is 0.28 ± 0.17 standard deviation, whereas the mean probability value for the mapped landslides is 0.39 ± 0.15 standard deviation (Fig. 8).

The logistic-regression results show a connection between specific landslide morphologies, which indicates a certain probability of landslide occurrence. The logistic-regression model produced a landslide-susceptibility map indicating where landslides are likely to occur based on the geomorphic conditions. The map strikes a good balance between indicating existing deposits that have a moderate to high probability of subsequent movement, as well as assessing other parts of the slope that do not necessarily show obvious slope movement but may have features related to existing landslide activity. The majority of the flat alluvial valley bottoms were not considered in the analysis. Selected landslides mapped in Magoffin County, draped on the susceptibility results, are shown in Figure 8. Five landslide-susceptibility classifications were determined manually by creating breaks of standard deviations from the mean (Table 9). The percentage of the mapped landslide deposits that were moderate, moderate–high and high are 41.2, 26.6 and 6.4%, respectively. As for the susceptibility classifications of the entire county area, 24.5% are classified as moderate, 12.9% as moderate–high and 4.6% as high, i.e. 42% of the entire county is classified as having moderate, moderate–high and high landslide susceptibility.

Overall, the map emphasizes steep hillslopes and parts of ridgetops as having moderate, moderate–high or high susceptibility. Steep slopes just below ridgetops and steep heads of catchments (often existing headscarps) are modelled as having moderate–high and high susceptibility. Steep planar slopes that are the sides of catchments or are above roads and streams are modelled as having moderate and moderate–high susceptibility. The map shows significant susceptibility differences between the western part of the county, where relief and slope angle are generally less, and the east, where slopes are steeper (Fig. 9). Figure 9b shows large planar sloped areas, as well as the heads of catchments, which are classified as having higher susceptibility compared to that shown in Figure 8a. The heads of catchment drainages in Figure 9b also indicate larger areas of moderate–high or high susceptibility compared to Figure 9a in the western part of the county. The susceptibility map does not determine landslide type, potential extent or runout, or temporal implications.

To further illustrate the logistic-regression results, we plotted kernel-density estimations of the predicted probability of landslides for each variable (Fig. 10). Kernel-density estimation is a statistical technique that finds a balance between underfitting and overfitting of data in order to better visualize the results (Mathworks 2019*c*; https://www.mathworks.com/help/stats/kernel-distribution.html). A kernel density algorithm takes a bandwidth from bins of datapoints to control the smoothness of the estimation (Bowman and Azzalini 1997). The predicted probability of a landslide for each variable and its relationship to 2D Gaussian kernel-density estimation (Fig. 10) aids in interpretation of the physical process implied by the regression results, particularly for associations and uncertainty among investigated variables. For example, as opposed to looking strictly at specific points of predicted probabilities of landslides (white dots), we can view landslide occurrence as joint-probability density within a range of values. The stretched colour ramp in Figure 10 is easier to interpret than the cluster of points; it constructs a view that accurately reflects relative likelihood instead of interpreting the random points. In general, the kernel density estimation patterns of predicted probability highlight the complexity among variables that influence slope stability. More specifically, mean roughness, minimum curvature, standard deviation of plan curvature and standard deviation of curvature have clusters of higher kernel-density estimates (lighter colours) that correlate with low predicted probabilities of landslides; as the density values expand, the predicted probability of a landslide occurrence increases. This correlation perhaps suggests a low dependency on probability, or is an indication that these clusters of landslides may correlate with low predicted probability, even though these geomorphic variables are significant to landslide occurrence in the regression model. Although minimum slope, standard deviation of elevation, range of elevation and sum of roughness do not indicate a specific cluster, the high values of kernel densities are stretched vertically along most values of predicted probability (Fig. 10). This suggests that the ranges of minimum slope, standard deviation of elevation, range of elevation and sum of roughness values that fall within the high kernel-density areas are perhaps more significant than the other variables because of the range of probabilities of occurrence.

## Model validation and limitations

In order to validate the logistic-regression methodology, we compared the results to a separate inventory dataset of 370 landslides that were mapped in the Prestonsburg 7.5-minute quadrangle. The same methodology for landslide inventory described in this paper was used to identify the landslides in the Prestonsburg quadrangle (Chapella *et al.* 2019). The resulting geomorphic variables (minimum slope, minimum curvature, standard deviation of elevation, range elevation, standard deviation of plan curvature, mean roughness, sum of roughness and standard deviation of curvature) from the same logistic-regression model (equations (3) and (4)) were used to create a landslide-susceptibility map of the Prestonsburg quadrangle. For the Prestonsburg quadrangle landslides, 74.9% of the deposits were in the moderate, moderate–high or high landslide-susceptibility classifications (Table 10, Fig. 11). Approximately 43% of the quadrangle area is classified as moderate, moderate–high and high susceptibility. The moderate–high and high classes (0.53–1.0) make up 18.5% of the quadrangle.

The success of the logistic-regression model in Magoffin County and neighbouring Prestonsburg seems to indicate that the statistical analysis using a more encompassing polygon buffer to extract the geomorphic variables is necessary, as opposed to the often-used general value generated from a point. The landslide buffer of 46 m used to extract the geomorphic statistics and generate the variables results in some artefacts in the susceptibility results, however. Because we used a circular buffer and smoothing window, roughly circular artefacts are present in some areas of the resulting map (Fig. 12). This occurs most often with heavily modified parts of the landscape, where there are sharp unnatural breaks between steep slopes and flat, modified ground.

## Discussion

Data preparation for determining geomorphic variables used in statistical landslide-susceptibility modelling varies significantly from study to study because there are no standard guidelines or procedures (Reichenbach *et al.* 2018). In statistical models that explain the probability of landslide occurrence, the selection of geomorphic variables used as model inputs is often based on subjective assumptions about the relationships between variables (Guzzetti *et al.* 1999; van Westen *et al.* 2003, 2008; Timilsina *et al.* 2014). Similarly, the weights applied to variables that are considered important in expert, knowledge-driven approaches are often assigned with bias (Thiery *et al.* 2014). Therefore, clarity regarding methodology, constraint of the uncertainties that influence landslide occurrence, and the landslide-susceptibility map output is important for the end user (Hearn and Hart 2019).

A challenging part of the machine learning, statistics-based approach is recognizing the sensitivity of model inputs to the results and what implications that may have on the ground and with landslide processes. Different methods such as using a point instead of a buffer, or adding other variables (like geology) would certainly generate different results. However, we consider the advantage of this approach is that it is easy to use and repeatable, which is important for establishing methods or a workflow in other areas. Considering buffers or a point feature that can extract different parts of the slope could be a next step in our approach. For example, a buffer that captures areas just around the headscarps, crowns or toe bulges might prove to be a strong model performance and give insight into landslide initiation.

The geomorphic variables in our models reveal a complex relationship between the slope morphology and type of landslide. Our approach takes advantage of the data-driven machine learning technique to elucidate variables that expert knowledge cannot, allowing the algorithms to shed light on relationships of individual variables. For example, we do not know specific connections between mean roughness and mean plan curvature to landslide processes on the ground, but Figure 5 shows high-quality accuracy and separateness of variables and consideration of further investigation of slope processes and causal factors. The bagged-trees results show that seven of the twelve ranked variables are a statistical category of curvature. The ranked importance of curvature indicates connections on the ground related to colluvial soil heterogeneity, colluvial thickness, variable pore-water pressure response, and shear strength variability at different slope positions. Although these connections to the ground and slope processes need further evaluation, the results indicate the importance of concavity, convexity, erosion and water flow. Geomorphic analysis on landslide occurrence has long been focused on landslides preferentially beginning in topographic hollows (concave slopes), and many do, but there are also conditions under which landslides may occur on planar or even convex slopes (Dietrich *et al.* 1986). Bagged trees ranked standard deviation of plan curvature highest, which may suggest plan curvature does an effective job of subdividing a hillslope into regions such as hollows, noses and planar, therefore, controlling convergence or divergence of landslide deposits and groundwater (Ohlmacher 2007).

Further supporting the importance of the bagged-trees results, three curvature variables remained in the logistic regression results as being statistically significant. Gathering an understanding of these relationships is one of the reasons we included the kernel-density figures. Figure 10 shows relationships with a broad spectrum of correlations, guiding further investigation that can assess correlations on the ground with landslide processes. Our logistic regression results show that minimum slope is ranked as most significant (has the lowest *P*-value) and representative of a null hypothesis that we reject. The null hypothesis here being each independent variable has no effect. Minimum slope being significant makes sense because slope is such a well-established influencing factor regarding stability and landslide activity. Landslides need to exceed some angle for failure and, fundamentally, we are rejecting the notion of minimum slope having no effect landslides.

Despite the shortcomings and limitations of statistical geomorphic-based methods, rapidly changing remote-sensing technologies (availability of LiDAR and high-resolution aerial photography), GIS software capabilities and growing landslide inventories all provide opportunities to improve landslide-susceptibility modelling. Statistical approaches to modelling landslides (e.g. tree-based analysis, logistic regression, support vector machine, weight of evidence, principal component analysis etc.), variable selection techniques and model evaluation typically generate reasonable model performance and accurate maps (Reichenbach *et al.* 2018; de Oliveira *et al.* 2019). Many of these techniques are quite complex for a binary dataset such as presented here and are typically more suitable for more difficult statistical problems related to complex remote-sensing data. This study does not necessarily reveal complex features or complex remotely sourced data; however, the relationship among these features regarding slope stability is complex, and classification decision trees coupled with logistic regression proved effective for predicting discrete variables.

Generally, using GIS datasets and binary statistical data, a combination of temporal and random spatial strategies are used to validate model performance. We conducted a spatial validation by separating training (75%) and test (25%) datasets for the bagged-trees function, as well as applied the logistic-regression function to a different area. A temporal validation was not considered because only a single generation of LiDAR data was generated for the study area. For performance evaluation, Reichenbach *et al.* (2018) found that 38.2% of studies used one metric and 32% used no metric. The most common quantitative evaluation metrics were success rate curves, ROCs, landslide-density frequency and a general confusion matrix. Oommen *et al.* (2010) indicated that ROC–AUC is a more robust and consistent measure of model performance compared to other metrics such as area under precision recall curves and F-score. Recent landslide-susceptibility studies that used logistic regression, a similar DEM resolution to generate datasets and AUC to determine model performance had a range of 0.81 to 0.91 for the AUC (Mathew *et al.* 2009; Nandi and Shakoor 2009; Raja *et al.* 2017; Lombardo and Mai 2018; Chang *et al.* 2019). The AUC for the logistic-regression results in this study is 0.83, which is less than the 0.90 from the bagged-trees method; however, the combined machine-learning approach narrows variable selection and properly uses each model's function for the creation of susceptibility maps.

The uncertainty regarding the final variable relationships with the ground surface and the implications for landslide susceptibility and risk is also a challenge. However, our map results and distribution of higher probabilities (moderate, moderate–high, high) effectively reflect the geomorphic variables that are indicative of unstable ground conditions and potential landslide activity. A robust landslide inventory, quality DEMs and relevant derivative maps allowed us to establish a reliable framework for assessing landslide susceptibility at a regional scale using a statistics and geomorphic-based approach. We advanced this approach by combining two traditionally distinct machine-learning methods that complement each other and demonstrate that the criteria needed to ensure quality model performance produced practical and accurate susceptibility maps.

## Conclusions

We combined bagged-trees and logistic-regression approaches to model landslide susceptibility for Magoffin County, Kentucky. Landslide inventory mapping for the county allowed for an improvement upon geomorphic, statistics-based models by combining two traditionally distinct machine-learning methods. This combined approach generated a more sophisticated data-driven assessment of landslide susceptibility. We mapped 1054 landslides in the county and compiled 36 geomorphic statistical variables for each slide. Variable importance was determined using the bagged-tree machine-learning algorithm. The bagged-trees results indicated that standard deviation of plan curvature, standard deviation of elevation, sum of plan curvature, minimum slope, mean plan curvature, range of elevation, sum of roughness, mean curvature, sum of curvature, mean roughness, minimum curvature and standard deviation of curvature were influential variables to determine the likelihood of landslide occurrence. The performance of the bagged-trees function was validated using the ROC–AUC, which was calculated in a test dataset as 0.90.

We used the bagged-trees results to run a logistic-regression function and model landslide susceptibility. The results of the logistic-regression model indicated eight geomorphic variables were significant: minimum slope, minimum curvature, standard deviation of elevation, range of elevation, standard deviation of plan curvature, mean roughness, sum of roughness and standard deviation of curvature. These variables were used in the regression equation. An AUC value of 0.83 suggests a strong overall model performance. The logistic-regression model produced a landslide-susceptibility map of Magoffin County that represents a realistic landscape and connects specific landslide morphologies and probability of occurrence, and has relatively minor noise and model artefacts. Landslide-susceptibility classifications for the entire county area based on standard deviations of the mean were: 14.6% low, 43.1% low–moderate, 24.5% moderate, 12.9% moderate–high and 4.6% high. Kernel-density estimations of predicted probability highlight the complexity among variables that influence slope stability, but also support the model results of significant variables. The map results strike a good balance between classifying existing deposits that have a moderate to high probability of subsequent movement and classifying other parts of the slope that do not necessarily show obvious landslide activity.

We validated the logistic-regression results with data from a separate landslide inventory for the Prestonsburg 7.5-minute quadrangle, using the same logistic-regression model. The same combination of bagged-trees and logistic-regression showed that 74.9% of the landslide deposits in the Prestonsburg quadrangle have moderate, moderate–high or high susceptibility classifications, and 43.3% of the quadrangle area also falls into moderate, moderate–high or high susceptibility. This combined approach to predicting landslide susceptibility, taking advantage of landslide inventory mapping, successfully identified existing geomorphic conditions that likely lead to a landslide, but also modelled similar slope conditions elsewhere, which are areas of focus as well. The map results classified landslide susceptibility into categories that effectively communicated the landslide hazard. This statistical, geomorphic-based approach can be applied to other environments and emphasizes reliable data-driven methods that convey landslide hazard information to those at risk.

## Acknowledgements

We thank Arnold Stromberg and Edmond Kim from the University of Kentucky Department of Statistics, Zhenming Wang and Meg Smath of the Kentucky Geological Survey for their reviews, and David Korte of the North Carolina Geological Survey and Thomas Oommen of Michigan Technological University for technical guidance regarding logistic-regression models and landslides.

## Author contributions

**MMC**: conceptualization (equal), data curation (lead), formal analysis (equal), funding acquisition (lead), investigation (equal), methodology (equal), project administration (lead), resources (supporting), software (equal), supervision (equal), validation (equal), visualization (equal), writing – original draft (lead), writing – review & editing (equal); **JMD**: formal analysis (equal), investigation (supporting), methodology (equal), visualization (equal), writing – review & editing (supporting); **HJK**: data curation (supporting), formal analysis (supporting), investigation (supporting), methodology (supporting), visualization (equal), writing – review & editing (supporting); **AAK**: data curation (supporting), investigation (supporting), methodology (supporting), visualization (supporting), writing – review & editing (supporting); **JZ**: formal analysis (supporting), investigation (supporting), methodology (supporting), software (supporting), writing – review & editing (supporting); **YZ**: formal analysis (supporting), investigation (supporting), methodology (supporting), software (supporting), visualization (supporting), writing – review & editing (supporting); **LSB**: conceptualization (supporting), investigation (supporting), resources (supporting), writing – review & editing (supporting); **WCH**: conceptualization (equal), formal analysis (supporting), investigation (supporting), methodology (supporting), project administration (supporting), resources (supporting), software (supporting), supervision (supporting), visualization (supporting), writing – review & editing (supporting)

## Funding

This work was funded by the Federal Emergency Management Agency (PDMC-PL-04-KY-2017-002).

## Conflicts of interest

The authors declare that they have no conflict of interest or competing interests.

## Data availability

The datasets generated during and/or analysed during the current study are available in the Kentucky Geological Survey repository, https://www.uky.edu/KGS/

*Scientific editing by Jonathan Smith; Jennifer Hambling*