Data-driven methods have increasingly been applied to solve geoscientific problems. Combining data-driven methods with hypothesis testing can help resolve some long-standing debates and reduce interpretation uncertainty by leveraging larger volumes of data and more objective data analytics, which in turn increases reproducibility. In this study, lithogeochemical data from regionally persistent Archean shale units were aggregated from the literature, with special reference to the Kaapvaal Craton of South Africa (namely, shales from the Barberton, Witwatersrand, Pongola, and Transvaal Supergroups) and the Belingwe and Buhwa Greenstone Belts of the Zimbabwe Craton. We examine the feasibility of using machine-learning algorithms to produce a geochemical classification and demonstrate that machine learning can accurately correlate stratigraphy at the formation, group, and supergroup levels. We also show that a data-driven approach can yield useful scientific findings, such as geological implications for the uniqueness of the sediment compositions of the Central Rand and West Rand Groups. We further demonstrate that when lithogeochemistry and machine-learning algorithms are used, only about 50 samples per geological unit are necessary to reach accuracy levels of around 80%–90% for our shale samples. Consequently, for many traditional tasks, such as rock identification and mapping, some expensive analyses and manual labor can be replaced by an abundance of cheaper data and machine learning. This approach could transform large-scale geological surveys by enabling more detailed mapping than is currently possible and by vastly increasing the coverage rate and total coverage. In addition, the aggregation of historical data facilitates data reuse and open science.
These results justify the need to bridge data- and hypothesis-driven techniques for the stratigraphic correlation and prediction of rock units, which can improve the accuracy of the inferred stratigraphic correlation and basin setting.
