The geological record is a vast archive of information that provides the only empirical data about the evolution of the Earth. In recent years, concentrated efforts have been made to compile macrostratigraphic data into the online centralized database Macrostrat. Macrostrat is a global stratigraphic database containing information regarding surface and subsurface rock units and their respective ages, lithologies, geographic extents, and various other associated metadata. However, these raw data are currently directly accessible only through the Macrostrat application programming interface, which is a barrier to potential users who are less familiar with such services. This data accessibility hurdle currently prevents full capitalization of the value offered by Macrostrat, particularly its potential to improve understanding of the geological and biological evolution of the Earth. Here, we introduce rmacrostrat, an R package that interfaces with the Macrostrat database to access and retrieve a variety of geological, paleontological, and economic data directly into the R programming environment. In this article, we describe how the package is installed and implemented and outline potential use cases. For the latter, we showcase how rmacrostrat can be used to visualize regional stratigraphic columns, produce regional geologic outcrop maps, and investigate temporal trends in macrostratigraphic units. We hope that this package will make geological data more readily accessible and in turn will facilitate new research utilizing Earth system data.

Earth's geologic record provides a unique spatiotemporal archive of the evolutionary history of the planet (Ernst and Youbi, 2017; Tetley et al., 2019; Cao et al., 2019; Scotese et al., 2021). Historically, to understand macroscale Earth system trends through geological time, researchers were required to synthesize local or regional quantitative studies, predominantly from data gathered in the form of regional geological maps, sections, and individual sedimentary logs or boreholes (e.g., Ronov et al., 1980; Seslavinskiy, 1991; Bosscher and Schlager, 1993; Miall, 2022). However, the introduction of large online open-access databases, in which a variety of complementary data sets are already digitized and synthesized, has facilitated the development of macroscale analyses through both time and space. One such database is Macrostrat (https://macrostrat.org/), a relational geospatial database that aims to aggregate and synthesize field-derived geological data from geological maps and regional geological columns into a data set that describes the spatial distribution of geological units within the Earth's upper crust (Peters et al., 2018). Macrostrat contains information regarding individual rock “units,” linked by unique identification numbers to associated lithological, environmental, paleontological, and economic attributes, alongside information regarding their respective chronostratigraphic context. These units are organized spatially into “columns,” representing a cross section of the upper crust within particular geological basins, and temporally by Macrostrat's internal chronostratigraphic age model (Fig. 1). Sequentially deposited units bounded by unconformities form geological “sections,” which also have their own unique identification numbers. 
Additionally, Macrostrat units are linked by unique identification numbers to geological mapping data amalgamated from a variety of sources as well as data from other large geoscience databases such as the Paleobiology Database (PBDB; https://paleobiodb.org/) (Peters and McClennen, 2016; Uhen et al., 2023).

Since its initial compilation in 2005 from the American Association of Petroleum Geologists Correlation of Stratigraphic Units of North America (COSUNA) charts (Peters, 2006), Macrostrat has grown into a comprehensive and well-established database containing over 35,000 units and 1500 geologic columns, all of which are publicly accessible. Macrostrat aims to provide such data on a global scale, and while the abundance and resolution of available data are currently geographically variable, improving spatial coverage is one of the major aims of the project moving forward (Quinn et al., 2024). Data hosted by Macrostrat have been used for a wide variety of applications in scientific research as well as science communication and education. The broad temporal and spatial scale of data hosted by Macrostrat has facilitated a diverse array of research related to Earth systems through time, including in the fields of sedimentology (Peters and Husson, 2017), stratigraphy (Tasistro-Hart and Macdonald, 2023), igneous petrology (Peters et al., 2021), geochemistry (Husson and Coogan, 2023), and paleobiology (Peters and Heim, 2010, 2011a, 2011b; Heim and Peters, 2011; Rook et al., 2013; Nelsen et al., 2016; Peters et al., 2017; Balseiro and Powell, 2020, 2023; Ye and Peters, 2023; Segessenman and Peters, 2024). Macrostrat has also collaborated with the Extending Ocean Drilling Pursuits (eODP) project (https://eodp.github.io/) to integrate existing drill core data from sources such as the International Ocean Discovery Program (IODP) into the database (Sessa et al., 2023). Geologic map data held within Macrostrat are also displayed by a variety of web (e.g., Sift; https://macrostrat.org/sift/) and mobile applications (e.g., Rockd; https://rockd.org/) that aim to enable usage of geologic information by the wider scientific community, the general public, and university education platforms (Cohen et al., 2018). 
Macrostrat is also planning to expand and integrate community-led validation of sections, ingestion of stratigraphic column data, and development of new software to facilitate data collaboration in the near future (Quinn et al., 2024). As such, Macrostrat is a vital resource for Earth scientists investigating a variety of issues related to both the geological history of our planet and the impacts of geological processes today.

Despite the apparent opportunities offered by Macrostrat, its hosted data can currently be directly accessed only via the database's application programming interface (API). Although a powerful resource, this single direct data access avenue means that familiarity with both the structure of the database and how to interact with APIs is necessary in order to use the database. Those able to overcome this data accessibility hurdle are still required to develop their own custom protocols to integrate Macrostrat data into coding-based scientific workflows; this can inherently lead to researchers “reinventing the wheel” and producing code that is case specific and difficult to repurpose, inhibiting the reproducibility of research conducted using Macrostrat data. Such processes are commonly carried out in the programming language R, which the Earth science community has broadly adopted to access, prepare, analyze, and plot data (e.g., Bell and Lloyd, 2015; Varela et al., 2015; Ortiz and Jaramillo, 2018; Barido-Sottani et al., 2019; Kocsis et al., 2019; Jones et al., 2023; Gearty, 2024). In particular, several R packages have been developed to interface with databases relevant to the geosciences through API services, supporting the generation of readable, reusable, and reproducible workflows (Varela et al., 2015; Gearty and Jones, 2023; Vidaña and Goring, 2023). However, until now, no such package has been available for interacting with the Macrostrat database.

Here, we present rmacrostrat, a dedicated R package for interfacing with the geological database Macrostrat. The package provides streamlined functionality for querying the database via its API service and retrieving various geological data (e.g., lithostratigraphic units), definitions, or metadata associated with the hosted data (e.g., lithological terms). First, we provide instructions for installing the package and details on its implementation. We then demonstrate the functionality available in rmacrostrat and provide typical usage examples. Finally, we provide details about the resources we have made available to support rmacrostrat users. By providing a programmatic solution to accessing the data hosted by Macrostrat, we endeavor to facilitate new research across the Earth sciences that is conducted in a streamlined, readable, reusable, and reproducible manner.

The rmacrostrat package can be installed from the Comprehensive R Archive Network (CRAN) using the install.packages() function in R (R Core Team, 2024):
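A minimal sketch of this step is given below. The repos argument is an addition made here so that the call also runs in non-interactive sessions; it is not required by the package and can be omitted when a CRAN mirror is already configured.

```r
# Install the released version of rmacrostrat from CRAN.
# The explicit mirror URL is an illustrative choice, not a requirement.
install.packages("rmacrostrat", repos = "https://cloud.r-project.org")
```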

If preferred, the development version of rmacrostrat can be installed from GitHub via the R package remotes (Csárdi et al., 2023):
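The call below sketches this route. It assumes that the package repository is named rmacrostrat within the palaeoverse GitHub organization, and it installs remotes first if that package is not yet available.

```r
# Install remotes if needed (mirror URL is an illustrative choice)
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes", repos = "https://cloud.r-project.org")
}

# Install the development version from GitHub.
# The "palaeoverse/rmacrostrat" repository path is an assumption based on
# the GitHub organization named in this article.
remotes::install_github("palaeoverse/rmacrostrat")
```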

Following installation, rmacrostrat can be loaded via the library() function in R:
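For completeness, the loading step amounts to a single call:

```r
# Attach rmacrostrat so that its def_* and get_* functions are available
library(rmacrostrat)
```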

Dependencies

The current version of rmacrostrat (ver. 1.0.0) was developed to fetch data from version 2 of Macrostrat's API. The package depends on R (≥4.0) (R Core Team, 2024) and imports functions from the R packages curl (Ooms, 2024, preprint), geojsonsf (Cooley, 2022), httr (Wickham, 2023), jsonlite (Ooms, 2014), and sf (Pebesma, 2018; Pebesma and Bivand, 2023). The package was developed with the support of the R packages devtools (Wickham et al., 2022), testthat (Wickham, 2011), and roxygen2 (Wickham et al., 2024).

Functions are broadly grouped into two categories in rmacrostrat: (1) def_*, and (2) get_*. The def_* suite of functions provides access to the definitions (or metadata) associated with data stored in Macrostrat, such as lithologies [def_lithologies()], measurements [def_measurements()], or Macrostrat columns [def_columns()]. A summary of this suite of functions is provided in Table 1. The get_* suite of functions is for retrieving data from Macrostrat, such as Macrostrat columns [get_columns()], Macrostrat units [get_units()], or geological map outcrop objects [get_map_outcrop()]. Detailed descriptions of these functions are provided in Table 2.

Definition Functions

Definitions (or metadata) of the various data stored in Macrostrat are retrieved from the Macrostrat API service via the def_* suite of functions (Table 1). The coverage of each function is indicated by its name (e.g., def_lithologies() returns definitions of the lithologies used in Macrostrat). Data returned by the def_* suite of functions contain both categorical (and commonly hierarchical) information about data attributes of interest (e.g., def_lithologies() returns individual lithologies [“sandstone”] together with the type [“siliciclastic”] and class [“sedimentary”] of each lithology) and unique identification numbers for individual attributes, which can be used to query Macrostrat. Without user-specified arguments, all def_* functions return a data.frame object containing the entire data set of definitions associated with that function:
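As an illustration, the sketch below calls def_lithologies() without arguments. It requires an internet connection because the definitions are fetched from the Macrostrat API at run time; the object name lithologies is chosen here for illustration.

```r
library(rmacrostrat)

# Retrieve every lithology definition held by Macrostrat
lithologies <- def_lithologies()

# Inspect the structure of the returned data.frame
str(lithologies)
```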

Alternatively, users can search for definitions of specific entities or groups of entities using the specific arguments for each def_* function. This can generally be achieved using specific unique identification numbers (integers) for those definitions or via a name (character strings):
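A hedged sketch of both search modes follows. The argument names lithology and lithology_id and the returned column lith_id are assumptions based on the package's naming conventions and the Macrostrat API; they should be checked against ?def_lithologies before use.

```r
library(rmacrostrat)

# Search by name (character string); argument name assumed
sandstone <- def_lithologies(lithology = "sandstone")

# Search by unique identification number (integer), reusing the ID returned
# above rather than hard-coding one; lith_id is an assumed column name
sandstone_by_id <- def_lithologies(lithology_id = sandstone$lith_id[1])
```

Reusing an identification number returned by an earlier query avoids depending on a hard-coded value that might change in the database.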

For convenience, we have also provided a wrapper around all def_* functions via the catalog() function. This function takes the suffix of an individual def_* function as its argument and returns the complete set of definitions for that function:
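For example, assuming that "lithologies" (the suffix of def_lithologies()) is an accepted value and that the argument can be passed positionally, the wrapper might be used as follows:

```r
library(rmacrostrat)

# Intended to return the same definitions as def_lithologies() called
# without arguments; the suffix is passed positionally (assumed usage)
lithologies <- catalog("lithologies")
```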

We strongly recommend using the def_* suite of functions prior to retrieving data from Macrostrat to better understand both the structure of the database and the utility offered by the functions available in rmacrostrat. Due to the wide variety of data available in Macrostrat, individual get_* functions include a large array of potential arguments that can differ substantially between functions (see Data Retrieval Functions section). By using the specific def_* functions related to potentially useful search criteria, users can efficiently identify arguments and parameters with which to query the database via the get_* suite of functions. Examples of the utility of the def_* functions are provided in the Application section below as well as in the available vignettes, which provide tutorials on how to use the package.

Data Retrieval Functions

Data can be retrieved from the Macrostrat database API directly into the R environment using the get_* suite of functions (Table 2). These functions return data related to specified Macrostrat entities (e.g., Macrostrat columns, units, sections, and age definitions), geologic map elements, or external data related to Macrostrat entities (e.g., PBDB collections, eODP data, paleogeographies). These data can be returned either as a standard data.frame or as a spatial simple features (i.e., sf) object, which provides the associated spatial geometries. In some instances where multiple values exist for a variable (e.g., proportions of lithologies within a unit), a hierarchical data.frame structure is employed (i.e., a data.frame within a data.frame). As with the def_* suite of functions, the purpose of each get_* function is intended to be easily identifiable from its suffix (e.g., get_columns() retrieves data for Macrostrat columns).

Unlike the def_* functions, the get_* functions require at least one supplied argument for a valid database query. Although the array of possible arguments differs substantially between get_* functions, users can generally retrieve data based on several categories of search criteria. First, users can search by unique identification number, using either the identification number of the data type being retrieved or that of another Macrostrat entity:
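The sketch below illustrates both cases using the San Juan Basin column discussed later in this article. The column_name argument of def_columns(), the column_id argument of the get_* functions, and the col_id column of the returned data are assumptions based on the package's naming conventions.

```r
library(rmacrostrat)

# Look up the identification number of a column by name (assumed argument)
san_juan <- def_columns(column_name = "San Juan Basin")

# Search by the ID of the data type being retrieved: a column by column ID
column <- get_columns(column_id = san_juan$col_id)

# Search by the ID of another Macrostrat entity: units within that column
units <- get_units(column_id = san_juan$col_id)
```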

For many get_* functions in rmacrostrat [e.g., get_columns(), get_units()], data can also be retrieved with associated spatial geometries that define the geographic extent or position of the retrieved data (see Fig. 1):
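A minimal sketch of a spatial query is given below. It assumes that the geometries are requested through a logical sf argument and that def_columns() accepts a column_name argument; plotting relies on the sf package, which rmacrostrat imports.

```r
library(rmacrostrat)

# Identify the column of interest (assumed argument and column names)
san_juan <- def_columns(column_name = "San Juan Basin")

# Request the column together with its spatial geometry
column_sf <- get_columns(column_id = san_juan$col_id, sf = TRUE)

# Plot the column footprint
plot(sf::st_geometry(column_sf))
```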

Attribute information—such as lithostratigraphic name, lithology, environment, or economic source—can also be used independently, or in combination in some instances, to retrieve subsets of Macrostrat data. These attributes can be specified either using their unique identification number or by character string. Further information about each attribute to search by can be found in the respective def_* functions (e.g., lithology attribute information can be found in the def_lithologies() function):
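The following sketch shows one query by character string and one by identification number. The argument names strat_name, lithology, and lithology_id, and the lith_id column, are assumptions that should be verified with ?get_units and ?def_lithologies.

```r
library(rmacrostrat)

# Attribute given as a character string: units matching a stratigraphic name
hell_creek_units <- get_units(strat_name = "Hell Creek")

# Attribute given as an identification number: look up the ID of a
# lithology definition first, then use it to query units
coal <- def_lithologies(lithology = "coal")
coal_units <- get_units(lithology_id = coal$lith_id[1])
```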

Data can also be retrieved using temporal limits, specified as an interval name given as a character string (e.g., “Permian”), a unique interval identification number, a single numeric age (e.g., 275 Ma), or a numeric age range (e.g., 251.9–298.9 Ma). All Macrostrat entities whose chronostratigraphic range, as defined in the Macrostrat age model, overlaps with the specified parameter(s) are returned:
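The three temporal query styles might look as follows. The interval_name argument is named elsewhere in this article; age, age_top, and age_bottom are assumed argument names. Each query is global, so it may take some time to return.

```r
library(rmacrostrat)

# Units overlapping a named interval
permian_units <- get_units(interval_name = "Permian")

# Units overlapping a single age in Ma (assumed argument name)
units_275 <- get_units(age = 275)

# Units overlapping a numeric age range in Ma (assumed argument names)
range_units <- get_units(age_top = 251.9, age_bottom = 298.9)
```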

Finally, some get_* functions allow the user to query the database using geographic or spatial information. This can be achieved either by specifying coordinates in decimal latitude and longitude degrees or, if continental-scale resolution is desired, through the use of Macrostrat projects. Macrostrat data are split into regional projects, such as North America (project_id = 1) and New Zealand (project_id = 5); setting this argument returns all Macrostrat entities associated with that regional project. It should be noted that different Macrostrat projects currently have different levels of data completeness, ranging from virtually complete temporal and spatial coverage (e.g., North America, project_id = 1) to incomplete and limited coverage (e.g., Africa, project_id = 9):
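Both kinds of spatial query are sketched below. The lat and lng argument names are assumptions, and the coordinates are an arbitrary illustrative point in the north-central United States; project_id = 5 is the New Zealand project mentioned above.

```r
library(rmacrostrat)

# Columns at a point given in decimal degrees (assumed argument names)
point_columns <- get_columns(lat = 43.07, lng = -89.40)

# All columns belonging to a regional project (New Zealand)
nz_columns <- get_columns(project_id = 5)
```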

As noted above, we recommend using these arguments in tandem with the def_* suite of functions to maximize search potential and data retrieval. For instance, a user interested in retrieving units deposited in a specific paleoenvironment may want to use the def_environments() function prior to their search to see the full variety of parameters by which to search. We reiterate the importance of always exploring the data fetched from Macrostrat and ensuring that returned data are as expected. As an illustrative example, when seeking all carbonate-bearing marine units, the lithology_type argument would return many more units than the environ_type argument because higher-resolution paleoenvironment interpretations are not yet available for all units:
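The comparison can be sketched as below, assuming that "carbonate" is a valid value for both the lithology_type and environ_type arguments (the available values can be checked with def_lithologies() and def_environments()). No counts are quoted here because they change as the database is updated.

```r
library(rmacrostrat)

# Units containing a carbonate lithology
carbonate_lith <- get_units(lithology_type = "carbonate")

# Units assigned to a carbonate depositional environment
carbonate_env <- get_units(environ_type = "carbonate")

# Compare the number of units returned by each query
c(lithology_type = nrow(carbonate_lith), environ_type = nrow(carbonate_env))
```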

Herein, we provide three example applications of the rmacrostrat package. These examples are greatly expanded in step-by-step vignettes provided alongside the package, available online via the associated package website (https://rmacrostrat.palaeoverse.org/articles/) and also bundled with the package, accessible via:
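One way to list the bundled vignettes uses the vignette() function from base R (utils), as sketched here; the package must already be installed.

```r
# List the vignettes bundled with the installed package
vignettes <- vignette(package = "rmacrostrat")
vignettes$results[, c("Item", "Title")]

# In an interactive session, an index can instead be opened in the browser:
# browseVignettes("rmacrostrat")
```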

Constructing Stratigraphic Columns

An understanding of stratigraphy—that is, the relationships between adjacent geological units—is fundamental to accurately reading the geological record. This understanding enables researchers to put relative ages to lithological units and make temporal and spatial correlations with variables of interest. Using rmacrostrat, the geological data within the Macrostrat database can be easily retrieved and used to generate a stratigraphic column for a specific location and/or time interval. Below we provide an example showing how to retrieve and plot a stratigraphic column for the San Juan Basin, an asymmetric structural basin in northwestern New Mexico and southwestern Colorado (Four Corners region of the southwestern United States), containing sedimentary rocks ranging from Cambrian to Holocene in age (Fassett and Hinds, 1971). For this example, we restrict our column data to the Cretaceous, but this approach could equally be applied to any other basin or temporal interval. Columns are the most broad-scale geological entity available within Macrostrat, and by using the def_columns() function, the column associated with the San Juan Basin can be identified. The unique column identification number can then be used to get data for all appropriate units via get_units(). Given that the example focuses only on the Cretaceous, additional arguments available in get_units() can be used to further filter the queried data. With the returned data—Cretaceous lithostratigraphic units within the San Juan Basin—a stratigraphic column can be generated (Fig. 2):
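A simplified sketch of this workflow is given below; the full version is in the package vignette. The column_name argument, the col_id, t_age, b_age, and unit_name columns, and the use of ggplot2 for plotting are assumptions made for illustration. Units are drawn only by their age range, with no lithological patterns.

```r
library(rmacrostrat)
library(ggplot2)

# Identify the San Juan Basin column and fetch its Cretaceous units
san_juan <- def_columns(column_name = "San Juan Basin")
units <- get_units(column_id = san_juan$col_id, interval_name = "Cretaceous")

# Draw each unit as a box spanning its top (t_age) and bottom (b_age) ages
p <- ggplot(units) +
  geom_rect(aes(xmin = 0, xmax = 1, ymin = t_age, ymax = b_age),
            fill = "grey85", colour = "black") +
  geom_text(aes(x = 0.5, y = (t_age + b_age) / 2, label = unit_name),
            size = 3, check_overlap = TRUE) +
  scale_y_reverse(name = "Age (Ma)") +
  theme_classic() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
print(p)
```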

Plotting Geologic Outcrop Maps

A commonly required figure across a range of disciplines within the geosciences is a geographic map of the outcrop for a specific geologic formation. Such a figure can be easily generated using the get_map_outcrop() function of rmacrostrat, which retrieves geospatial data associated with lithostratigraphic units. Below we provide an example for constructing a map of outcrop for the Hell Creek Formation, a geologic formation from the latest Cretaceous and early Paleogene of North America, which is found cropping out across Montana and North and South Dakota in the United States (Johnson et al., 2002; Fastovsky and Bercovici, 2016). Given that outcrop spatial data are compiled from various map sources, the definition function def_strat_names() is first used to find the appropriate identification numbers for any stratigraphic names of formations that include the Hell Creek. This information can then be used with the get_map_outcrop() function to retrieve geospatial data for the formation as a simple features (sf) object. These data can be plotted to produce a geological map (Fig. 3):
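The steps above can be sketched as follows. The strat_name and rank arguments of def_strat_names(), the strat_name_id column and argument, and the sf argument are assumptions to be checked against the package documentation; ggplot2 is an illustrative choice for plotting.

```r
library(rmacrostrat)
library(ggplot2)

# Find stratigraphic names of formation rank that include "Hell Creek"
hell_creek_names <- def_strat_names(strat_name = "Hell Creek", rank = "Fm")

# Retrieve the mapped outcrop for those names as a simple features object
hell_creek <- get_map_outcrop(strat_name_id = hell_creek_names$strat_name_id,
                              sf = TRUE)

# Plot the outcrop polygons
p <- ggplot(hell_creek) +
  geom_sf(fill = "grey40", colour = NA) +
  theme_bw()
print(p)
```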

Examining Macrostratigraphic Temporal Trends

Initial publications using data from the Macrostrat database quantified how the counts and proportion of Macrostrat entities, as well as different lithostratigraphic unit types associated with different paleoenvironments (e.g., marine, marginal, mixed, terrestrial), varied throughout the Phanerozoic (Peters and Heim, 2010). rmacrostrat facilitates access to these types of data and allows for similar analyses to be conducted. Below we provide an example of such an analysis, in this case estimating the number of igneous, metamorphic, and sedimentary units in North America throughout the Phanerozoic.

For this example, the relevant lithostratigraphic unit data from Macrostrat are first fetched using the get_units() function from rmacrostrat. For this query, several filters are applied to retrieve the appropriate data. First, the lithology_class argument is used to separate out the major rock classes. Second, the interval_name argument is used to filter to units only from the Phanerozoic. Finally, the project_id argument is used to filter results to units from the North American geological record:
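A sketch of these queries follows, using the three arguments named above. It assumes that "igneous", "metamorphic", and "sedimentary" are the accepted lithology class values; the list structure is an illustrative choice.

```r
library(rmacrostrat)

# One query per major rock class, restricted to Phanerozoic units
# from the North American project (project_id = 1)
classes <- c("igneous", "metamorphic", "sedimentary")
units_by_class <- lapply(classes, function(cls) {
  get_units(lithology_class = cls,
            interval_name = "Phanerozoic",
            project_id = 1)
})
names(units_by_class) <- classes

# Number of units returned for each class
vapply(units_by_class, nrow, integer(1))
```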

With these data, the number of units for each rock type can be calculated for every international geological stage (i.e., time bin) through time. Functionality available in the palaeoverse R package can be used to retrieve relevant information about geological stages (Jones et al., 2023):
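The counting step can be sketched as below. A unit is counted in every stage that its age range overlaps. The helper count_units() and the data frame example_units are hypothetical: the latter is a placeholder standing in for a get_units() result and assumes that units carry t_age (top) and b_age (bottom) columns in Ma.

```r
library(palaeoverse)

# Stage-level bins; max_ma and min_ma hold the older and younger limits
stages <- time_bins(interval = "Phanerozoic", rank = "stage")

# Count the units whose age range overlaps each bin (hypothetical helper)
count_units <- function(units, bins) {
  vapply(seq_len(nrow(bins)), function(i) {
    sum(units$b_age > bins$min_ma[i] & units$t_age < bins$max_ma[i])
  }, integer(1))
}

# Placeholder input with illustrative ages only; in practice, pass the
# data.frame returned by get_units()
example_units <- data.frame(t_age = c(66, 150, 300),
                            b_age = c(100.5, 160, 310))
stages$n_units <- count_units(example_units, stages)
```

Counting by overlap means that a long-ranging unit contributes to several stages, so stage totals should not be summed to obtain the total number of units.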

Using additional R packages for visualization, such as ggplot2 (Wickham, 2016) and deeptime (Gearty, 2024), Phanerozoic stage-level counts of North American lithostratigraphic units can be plotted by rock type (Fig. 4):
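A self-contained sketch of such a plot is given below. It repeats the retrieval and counting steps so that it can be run on its own, and it assumes the t_age and b_age columns as well as the lithology class values used above. The styling is illustrative and is not intended to reproduce Figure 4 exactly.

```r
library(rmacrostrat)
library(palaeoverse)
library(ggplot2)
library(deeptime)

classes <- c("igneous", "metamorphic", "sedimentary")
stages <- time_bins(interval = "Phanerozoic", rank = "stage")

# For each rock class, fetch the units and count those overlapping each stage
counts <- do.call(rbind, lapply(classes, function(cls) {
  units <- get_units(lithology_class = cls,
                     interval_name = "Phanerozoic",
                     project_id = 1)
  data.frame(
    mid_ma = stages$mid_ma,
    rock_type = cls,
    n_units = vapply(seq_len(nrow(stages)), function(i) {
      sum(units$b_age > stages$min_ma[i] & units$t_age < stages$max_ma[i])
    }, integer(1))
  )
}))

# Plot counts through time with a geological time scale on the x-axis
p <- ggplot(counts, aes(x = mid_ma, y = n_units, colour = rock_type)) +
  geom_line() +
  scale_x_reverse(name = "Time (Ma)") +
  ylab("Number of units") +
  coord_geo() +
  theme_classic()
print(p)
```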

We have made several resources available for our users. First, we have built a package website (https://rmacrostrat.palaeoverse.org) that provides information on how to use and contribute to rmacrostrat, how to report issues and bugs, and a contributor code of conduct. We have also made available three vignettes (i.e., tutorials) for the package, which provide user-friendly usage guides (https://rmacrostrat.palaeoverse.org/articles). Through rmacrostrat, we hope to further foster collaboration and the sharing of resources within the Earth science community. With this goal in mind, we warmly welcome the community to join and follow our community spaces, such as our GitHub organization page (https://github.com/palaeoverse) and Google Group (https://groups.google.com/g/palaeoverse), where users can share ideas and resources, advertise opportunities, and network with colleagues.

The development of rmacrostrat expands upon the suite of software toolkits available within the Palaeoverse (https://palaeoverse.org/) “universe” (Jones, 2022; Gearty and Jones, 2023; Jones et al., 2023). The current version of rmacrostrat uses version 2 of Macrostrat's API to retrieve data, and it is our intention to track future versions of the API as updates become available, including the planned integration of eODP data into Macrostrat's data entities (e.g., columns). Through rmacrostrat, we hope to improve accessibility to the vast pool of geological data available within the Macrostrat database and facilitate new research across the Earth sciences. The rmacrostrat R package offers researchers the opportunity to streamline their research by providing a bridge between Macrostrat and the R environment as well as supporting the capacity to generate fully reproducible pipelines. We hope that these benefits will encourage the community to further capitalize on the value offered by Macrostrat and may ultimately lead to higher data quality through peer review. As we have demonstrated with our example applications, rmacrostrat can be used to support the efficient plotting of stratigraphic columns, mapping of geological outcrop, and quantification of temporal dynamics in available macrostratigraphic units. However, we envision that rmacrostrat can also be used to support a wide range of additional analyses across the Earth sciences, such as economic resource exploration, comparisons between deep-time diversity dynamics and environmental change, and hazard mapping.

Science Editor: David E. Fastovsky
Associate Editor: Lawrence Tanner

We would like to thank Shanan Peters, Daniel Segessenman, Daven Quinn, Amy Fromandi, Michael McClennen, Andrew Zaffos, and all those who develop and maintain the Macrostrat database. We would also like to thank the Swiss National Science Foundation for supplying a Scientific Exchanges grant to Allen, which facilitated the construction of this package and associated materials (IZSEZ0_224089). Jones was supported by a U.K. Natural Environment Research Council (NERC) Independent Research Fellowship, a Juan de la Cierva–Formación 2021 Postdoctoral Fellowship (FJC2021-046695-I/MCIN/AEI/10.13039/501100011033) from the European Union “NextGenerationEU”/PRTR, and a Norman Newell Early Career Grant from the Paleontological Society. Gearty was supported by the Lerner-Gray Postdoctoral Research Fellowship from the Richard Gilder Graduate School at the American Museum of Natural History and a Norman Newell Early Career Grant from the Paleontological Society. Dean was supported by a Royal Society (UK) grant (RF_ERE_210013, awarded to Philip D. Mannion). Allen was supported by ETH Zurich. This is Paleobiology Database publication number 504.