Museum collections provide primary data for paleontologists, and recent advances in information technology have revolutionized how museums collect and share this information. However, many natural history museums have huge collections and small budgets, so museum scientists are challenged to keep these critical data current and available to the public. We suggest that establishing an open collaboration through the Internet is one possible solution to this challenge. To achieve this solution, we have implemented a Web-based collections catalog to encourage collaborative maintenance of collections data as a shared resource. Anyone can search the catalog via a simple interface designed for any standard Web browser, and Web users can also be authorized to add information or update records as stratigraphic and taxonomic concepts change. The goal is to establish two-way communication between our catalog and the scientific community wherein the museum shares its collections and related data, and in return the community contributes new data acquired through use of the collections. The catalog also provides a basic function for building links with online publications and other data sources. As data exchange standards become accepted, these links can be used to create metadatabases that could lead to global networks of collections, taxonomic, stratigraphic, and bibliographic information. By providing an efficient mechanism to locate and synthesize large volumes of disparate information, such loosely integrated systems have resulted in rapid progress in disciplines of the life and physical sciences, and they represent one way forward into a data-rich future for paleontology.
Fossil specimens are the best record of the occurrence of a particular organism at a specific time and place (Allmon and Poulton, 2000), so collections are the raw data of paleontology. Collections are required for subsequent researchers to check and reinterpret previous work, and they are an important source of new information that can be released by the arrival of new technologies and new research questions. For example, collections have been used in studies based on morphometric analysis, molecular methods including DNA sequencing, and various geochemical techniques (Suarez and Tsutsui, 2004; Allmon, 2005). Collections held by museums become especially important in cases where original exposures are no longer available for collecting, as is commonly the case for man-made exposures produced during road building, quarrying, or construction. However, collections of fossils are only useful if they are accessible to potential users. Traditional use of paleontology collections required researchers to visit museums and work with material onsite or resort to secondary sources in the published literature. In reality, much of the information about the contents of paleontology collections is passed along by word of mouth, as a kind of folklore: for example, Heinz Lowenstam was a professor at the California Institute of Technology, so his collections might be held by an institution in Southern California. Obviously, this is not the most efficient method to advertise the availability of important research collections. For at least a decade, it has been clear that the World Wide Web is an ideal forum to publish collections catalogs. Besides widespread availability and ease of access, the Internet offers the additional benefit of allowing databases to be integrated into new networks of bioinformatics and geoinformatics (Graham et al., 2004). 
Such networks enable researchers to address questions regarding the large-scale history of regional or global diversity in response to global environmental change (e.g., Jackson and Johnson, 2000; Alroy et al., 2001), and are an inevitable part of the future of paleontology.
Most natural history collections belong to public or nonprofit institutions that hold their collections in the public trust (American Association of Museums, 2005). However, many of these institutions have recently been subject to budget shortfalls (Dalton, 2003; Suarez and Tsutsui, 2004) that have reduced support for collections. At the same time, changing organizational priorities has resulted in the transfer of collections to a smaller number of institutions (Gropp, 2003). For example, the Department of Invertebrate Paleontology at the Natural History Museum of Los Angeles County (LACMIP) currently contains collections that formerly belonged to the University of Southern California, the University of California at Los Angeles, the California Institute of Technology, and California State University, Northridge. The consequence of these transfers is that relatively small staffs are caring for many large and important collections that are critical to the future of paleontology. Besides limitations in manpower, there is an increasing shortage of expertise. With reduced staff, most institutions do not have in-house experts that can serve as taxonomic authorities in the entire spectrum of fossil groups represented in their enormous combined collections. Without this expert knowledge in-house, it is difficult to adequately maintain and improve collections without enlisting the support of experts in the broader paleontological community. This outside assistance must come from the researchers using museum collections to address questions in their own specialized fields, whether Cambrian trilobites of the Great Basin or Pleistocene mollusks of western North America. Collections managers provide free access to specimens and data, but sharing must become a two-way street. The research community using these resources must contribute its expertise to ensure continued access to high-quality information. 
Other fields of research within bioinformatics are reaching the same conclusion (Eiden, 2004; Wilson, 2005). To help achieve this, we have developed a Web-based collections catalog that can be jointly managed by the museum collections staff and research community as a shared resource.
The LACMIP holds more than five million specimens, primarily from the western United States, including the world's largest collections of Cretaceous and Neogene mollusks from western North America. Our collections have been built over the past 90 yr and include the important university collections mentioned above that were transferred to the museum as local universities decided to eliminate their research collections. The department is currently housed in an off-site facility about a half mile from the main museum. This site contains collections storage as well as laboratories and staff offices. Within the collections space, the fossils are stored in 674 steel cabinets. Specimens collected from the same locality are stored together, and the entire main collections are arranged first according to geologic age (Cambrian to Quaternary) and then by geographic place (country, state, county) within each age. Each steel cabinet has a set of drawers containing specimen lots. These are groups of specimens from a single collecting locality that have been identified as belonging to the same taxon. Over the years, each lot may have accumulated a group of paper labels that contains information regarding the fossil collecting locality and sometimes multiple taxonomic determinations made by different researchers who have studied the material. For example, the gastropod illustrated in Figure 1 has four different hand-written and typed labels that contain such data. One of our challenges is to capture these data and make them available to the public.
Cataloging of the collection was started in the 1960s with the development of a card-based locality register. In this system, each locality was given a unique number and a card with essential geographic and stratigraphic information. These numbers were attached to specimens and became the primary identification of specimen lots in the collection. A similar card file system was developed for type and figured specimens. Each type specimen was associated with a unique number and was cross-referenced with specimen identification and bibliographic information. During the late 1980s these data were entered by hand into a custom collections management system developed in Borland Paradox. Nontype specimens that had never been figured in publications were not cataloged. However, the card system continued to be maintained in parallel with the computer database and was considered the standard. In 2002 we extracted the data from the legacy database and reformed it into a new system.
THE LACMIP COLLECTIONS CATALOG
Our goal was to build an electronic catalog that could meet the following objectives: (1) The catalog must allow the rapid acquisition of basic taxonomic, stratigraphic, geographic, and bibliographic information. The majority of these data need to be entered manually by part-time staff with little training, mainly volunteers and work-study students, so we have developed entry forms with pick lists to minimize typing of long and unfamiliar scientific names; (2) The catalog must be accessible from any computer connected to the Internet. To achieve this, we decided to take advantage of a Web architecture approach and the existing mature technology developed for e-commerce sites on the World Wide Web. This decision was made both to streamline the development process and to allow access for the broad community of research scientists contributing to the site as well as museum staff working in other locations; (3) The system must be able to share information with outside data networks in geoinformatics and other systems in our own institution. Therefore, we used a multitiered application architecture to facilitate this sharing; (4) The system must allow links to be made directly from online taxonomic publications to the type and figured specimens in our collections. These are among the most important materials in our collections, and we strive to maximize their exposure for convenient use by the research community; and (5) Images of specimens, collecting localities, and digital copies of field notes, maps, and other resources must also be available for remote use.
With these objectives in mind, we have developed a flexible, modular system that can be adapted to changing technology because individual components can be added, modified, or removed as necessary. This will allow the system to be improved incrementally as new technologies become available. For example, the current system does not include collections management functions so it cannot be used to track loans, insurance values, or the physical location of specimen lots. Our institutional Office of the Registrar performs many of these tasks, and we are building automated links from their registration system to our collection catalog. Our system also does not include a sophisticated geographic information system to allow mapping or geospatial analysis, nor have we attempted to track complex synonymies and changes in taxonomic practice. Instead we plan to take advantage of other tools developed especially for these purposes. For example, we would likely cede responsibility for maintaining taxonomic information to other systems when distributed taxonomic dictionaries become available for fossils. The LACMIP electronic catalog has been designed to publish information regarding our collections only.
The new LACMIP system has been developed as a Web-based, client-server database with multiple interfaces (Fig. 2). The data are stored in a relational database as a backend, using the PostgreSQL database system (PostgreSQL Global Development Group, 2005). Some of the basic business logic is implemented on this server, including checks for referential integrity and triggers that enforce data updates. At the moment there are four interfaces to the data. The simplest is an interface that communicates via the SQL database programming language (Wikipedia, 2005) and is used for administration and maintenance. The other three interfaces are written in the PHP scripting language (PHP Group, 2005) and run via an Apache Web server (Apache Software Foundation, 2005). Two of these interfaces are Web forms that allow input, searching, and browsing of the data using standard Web browsers on any machine connected to the Internet. One is composed of simple read-only forms accessible to the public, and the second interface includes data input forms and accepts user authentication using secure protocols. The third interface is a set of basic Web services built under a Web architecture (Jacobs, 2004) or “REST-like” philosophy (Fielding, 2000) that allows integration with other systems.
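To illustrate what database-level checks for referential integrity and update triggers can look like, the following Python sketch uses the standard-library sqlite3 module as a stand-in for PostgreSQL. It is not the LACMIP schema: the table names, column names, and the trigger itself are assumptions chosen only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one foreign key and one trigger.

    All table and column names here are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE locality (
            locality_id INTEGER PRIMARY KEY,
            description TEXT NOT NULL,
            updated_at  TEXT
        );
        CREATE TABLE lot (
            lot_id         TEXT PRIMARY KEY,
            locality_id    INTEGER NOT NULL REFERENCES locality(locality_id),
            specimen_count INTEGER NOT NULL
        );
        -- Trigger: adding a lot stamps the parent locality as updated.
        CREATE TRIGGER lot_touch_locality AFTER INSERT ON lot
        BEGIN
            UPDATE locality SET updated_at = datetime('now')
            WHERE locality_id = NEW.locality_id;
        END;
        """
    )
    return conn


def add_lot(conn, lot_id, locality_id, specimen_count):
    """Insert a lot; returns False if the locality does not exist."""
    try:
        conn.execute(
            "INSERT INTO lot VALUES (?, ?, ?)",
            (lot_id, locality_id, specimen_count),
        )
        return True
    except sqlite3.IntegrityError:
        # Referential integrity check failed: no such locality.
        return False
```

Placing such rules in the database rather than in each interface means that every interface that writes data is subject to the same checks.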
Data models for systematics collections have been described in detail elsewhere (Association of Systematics Collections Committee on Computerization and Networking, 1992; Morris, 2000; Pullan et al., 2000; Raguenaud et al., 2002), and further analysis is not warranted here. Our underlying database structure is loosely based on these other models. The goal was to keep the schema relatively simple but to capture as much useful information as possible. The subject areas include localities, taxonomy, lots, people, images, and a bibliography. One critical difference between our model and many other systems is that we track multiple interpretations for most data fields. That is, data are never deleted as new information is added. This is in keeping with the fundamental paradigm of collections data as tools for online collaboration. All additions are time stamped and marked with the name of the person that made the contribution. This allows researchers to know who added the information and when it was added. Therefore, anyone who is interested can track changes in the system.
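The idea of tracking multiple interpretations, with nothing deleted and every addition stamped with a contributor and a time, can be sketched as an append-only history per data field. The Python below is a minimal illustration under our own assumptions; the class and attribute names are hypothetical and do not reflect the actual LACMIP tables.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Interpretation:
    """One contributed value for a field (hypothetical structure)."""
    value: str
    contributor: str
    added_at: datetime


@dataclass
class TrackedField:
    """A data field whose earlier interpretations are never deleted."""
    name: str
    history: list = field(default_factory=list)

    def add(self, value, contributor, added_at=None):
        # Append only: earlier interpretations stay in the history.
        stamp = added_at or datetime.now(timezone.utc)
        self.history.append(Interpretation(value, contributor, stamp))

    def current(self):
        # The most recently added interpretation, or None if empty.
        if not self.history:
            return None
        return max(self.history, key=lambda item: item.added_at)


# Usage with placeholder contributors and default (current) time stamps.
unit = TrackedField("stratigraphic_unit")
unit.add("Unit name as first recorded", "contributor A")
unit.add("Revised unit name", "contributor B")
```

Because the history is only ever appended to, a reader can reconstruct who supplied each interpretation and when, which is the property the text relies on for tracking changes.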
Locality associated information includes geographic, stratigraphic, and collection data. Our use of locality is similar to the concept of collecting event used in the ASC model (Association of Systematics Collections Committee on Computerization and Networking, 1992). In theory it would be possible to make multiple collections from the same geographic and stratigraphic context, but in practice many repeated collections are not from precisely the same context. Therefore, we consider each new collection as a new locality in our system. The collector, field number, and date of collection are associated with the collecting locality in the LACMIP data model. Geographic data are categorized as political place names (city, county, state or province, country) and supplemented by detailed written descriptions provided by collectors. Geospatial data are included where available and provided by the collector (usually in the form of United States township/range system or latitude/longitude), but standardized georeferencing remains to be completed. Stratigraphic information is limited to stratigraphic units (member, formation, group) and associated age range. The chronostratigraphic units used in the system are the internationally accepted standard stage names (Geological Society of America, 1999). Additional information on stratigraphy and age can be included in the text description for each locality.
A specimen lot is a group of specimens from a collecting locality that has been sorted out and identified as belonging to a particular species or higher taxon. In theory all specimens identified as the same species from a single locality would be contained in one lot, but in practice there might be more than one lot of this species because of specimen abundance, limitations in container size, or special use of individual specimens from a lot (illustration, geochemical analysis, etc.). Information associated with specimen lots includes taxonomic determinations, the number of specimens in the lot, and whether the specimen has been cited in a published work. Digital images of specimens are provided for some specimen lots.
Managing taxonomic data is a complex problem, and data models have been developed to track synonymies, changes in rank, splitting, and the detailed consequences of changing taxonomic concepts and practice (Taxonomic Databases Working Group, 2004; Shattuck, 2005). The LACMIP catalog records updates to determinations of specimen lots and allows users to search for lots using supraspecific classification. We use a combination of our legacy database and data from the United States Department of Agriculture Integrated Taxonomic Information System (ITIS) (ITIS, 2005) as the starting point for mollusks and corals, and we could easily integrate other taxonomic dictionaries as they become available for fossil groups. Multiple determinations can be included for each specimen lot, and we have implemented a basic system for tracking synonyms to aid in the consistent application of taxon names.
Although collecting localities and specimen lots are the basic units of information in our catalog, we also maintain information regarding associated personnel, digital images, and a bibliography relevant to the LACMIP collections. These supplementary modules have been kept simple. People associated with the collections include collectors, collections maintenance staff, authorized users of the catalog, and specialists who have contributed data to the system. A basic bibliographic table that allows publications to be associated with localities and specimen lots is also maintained. Most collection localities are associated with maps, and these are referenced as publications in the bibliography. In our current catalog, images are maintained as digital files on a fileserver at two resolutions. Thumbnails are small compressed files with widths of 150 pixels for photographs and 300 pixels for field maps or other scanned images. High-resolution images are also available to the public at widths of 450 pixels for specimens and 800 pixels for scanned materials. Image file data are maintained in a basic image database associated with our catalog so that they can be published over the World Wide Web.
A WEB INTERFACE TO THE COLLECTIONS CATALOG
There are both public and restricted Web interfaces to the LACMIP collections catalog (Fig. 3A–D). The public interfaces allow researchers to browse the catalog over the World Wide Web (Johnson et al., 2005a; Fig. 3B). Note that we track multiple interpretations for most data fields. The name of the person who made each entry and the date of entry are indicated in parentheses. Researchers can browse through a set of localities or specimen lots or can download the information for local use. Data can be downloaded as delimited text files that include only the most up-to-date information, because the full information associated with any particular locality cannot be represented in a simple two-dimensional table if multiple interpretations are present for any piece of information. For printing specimen lot labels or hard copies of locality information, portable document format (PDF) files can be downloaded. A thumbnail is shown if images are available, and higher-resolution images can be viewed by clicking on the thumbnail. The restricted Web forms can be accessed using our secure Web server. Museum collections staff and researchers interested in contributing to the system are assigned user names and passwords that are required to access this part of the site. Restricted forms for searching and browsing the catalog are similar to the public pages except they allow input of additional data.
The initial entry of locality and lot records into the catalog can only be performed by museum collections staff. There are data entry forms for each of the main subject areas (Fig. 4), written as standard hypertext markup language (HTML) Web forms. An online data entry guide is provided to ensure consistent data input, and pick lists have been implemented where possible to minimize typographical errors. For example, when a determination is made, there are several steps to selecting a taxon name (Fig. 5A–D). Also, modern Web browsers have autocomplete functions that may reduce typographic errors. There is a simple mechanism to increase the consistent use of taxonomic names via tracking synonyms. Junior synonyms can be associated with senior synonyms so that when a junior synonym is requested as a taxonomic determination, both that name and the senior synonym are returned as determinations. In general, this interface has been designed to minimize potential data-entry errors because much information is hand keyed into the catalog by assistants who may have limited geological or taxonomic expertise. However, information is not proofed and all data entered into the system are immediately available to the research community.
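The synonym mechanism described above can be sketched as a lookup from junior to senior names. The Python below is a minimal illustration; the function name and the mapping are hypothetical, and the taxon names are placeholders rather than real synonymies.

```python
def resolve_determination(name, senior_of):
    """Return the determinations to record for a requested name.

    `senior_of` maps a junior synonym to its senior synonym. If the
    requested name is a junior synonym, both names are returned;
    otherwise only the requested name is returned.
    """
    senior = senior_of.get(name)
    if senior is None or senior == name:
        return [name]
    return [name, senior]


# Placeholder names only; not a statement about any real taxon.
SENIOR_OF = {"Genus junior": "Genus senior"}

print(resolve_determination("Genus junior", SENIOR_OF))
print(resolve_determination("Genus senior", SENIOR_OF))
```

Returning both names keeps the name as entered by the contributor while still making the lot findable under the senior synonym.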
Both the public and authorized researchers can search and browse the data using the provided set of Web forms. For example, to find all localities in Shasta County from the Redding Formation (Fig. 3A), a researcher needs to fill in the appropriate fields on the locality search form. In this case, a total of 88 localities is returned, and users may browse through them one by one (Fig. 3B). Alternately, a researcher could return to the search form and limit or refine the search (using the Modify Search control), or the researcher could download the entire data set either as a text file or a PDF-formatted file that is ready to be printed. Authorized researchers see a slightly different view (Fig. 3C), because they are able to add information. The labels associated with each line of data are now controls that may be used to access additional forms for data entry. For example, to add new information regarding the stratigraphic unit of a locality, a contributor would click on the button marked Unit to use the appropriate form (Fig. 3D).
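A search form of this kind amounts to combining whichever fields the researcher fills in into a single query. The Python sketch below shows one way to do that safely with parameterized SQL, using the standard-library sqlite3 module as a stand-in for the production database. The table, its columns, and the demonstration rows are assumptions made for illustration and are not catalog data.

```python
import sqlite3

# Fields a user may search on (hypothetical column names).
SEARCHABLE = ("state", "county", "formation")


def make_demo_catalog():
    """Build a tiny in-memory locality table with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE locality ("
        "locality_id INTEGER PRIMARY KEY, state TEXT, "
        "county TEXT, formation TEXT)"
    )
    conn.executemany(
        "INSERT INTO locality VALUES (?, ?, ?, ?)",
        [
            (1, "California", "Shasta", "Redding Formation"),
            (2, "California", "Shasta", "Redding Formation"),
            (3, "California", "Shasta", "Placeholder Formation"),
        ],
    )
    return conn


def search_localities(conn, **criteria):
    """Return localities matching every filled-in search field."""
    clauses, params = [], []
    for column, value in criteria.items():
        if column not in SEARCHABLE:
            raise ValueError(f"unsupported search field: {column}")
        clauses.append(f"{column} = ?")  # column name is whitelisted
        params.append(value)             # value is bound, not interpolated
    sql = "SELECT locality_id, county, formation FROM locality"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY locality_id"
    return conn.execute(sql, params).fetchall()
```

Refining a search, as with the Modify Search control, corresponds to calling the same function again with additional or changed criteria.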
Searching and browsing for specimen lots is similar to working with locality data and can be performed using a similar set of Web forms (Fig. 6A–C). A search can be performed for both lot information and the locality from which the lots were collected. For example, a search for the gastropod genus Paosia from the Redding Formation (Fig. 6A) returns eight lots from a selection of localities (Fig. 6B). Data for this list of lots can then be downloaded as a text file by selecting the Download Lot List, or labels for specimen trays can be produced by selecting Create Labels. Information for one of the lots (lot LACMIP 10726-2) is shown in Figure 6C. However, the downloaded data will not include all of the information associated with this lot because this information cannot be organized into a simple two-dimensional table. This lot has been identified several times, first as Oonia? californica (Gabb, 1864), later as Paosia colusaensis (Anderson, 1958), and most recently as Paosia californica (Gabb, 1864). In addition, the specimen lot has been cited in two publications (Jones et al., 1978; Squires and Saul, 2004) as type specimen LACMIP 10810. Several images are also available that can be downloaded in high resolution. Information about locality LACMIP 10726 is at the bottom of the lot page, including a map. This series of Web forms provides the primary interface for the catalog. Similar forms exist to search, browse, and add bibliographic and biographic information. A comprehensive user guide that will assist researchers with use of the system, including standards for data entry, is available through a link on all of the forms.
As of May 2005, our entire locality register of 27,970 collections has been included in the catalog. To date 28,197 specimen lots have been cataloged, comprising 601,409 individual specimens. We estimate that this includes ∼20% of our complete collection, but we do not have precise estimates for the total size of the collection. In fact, during the cataloging process we are finding that previous estimates of collection size are probably 25%–30% lower than the true figure. A similar result may be obtained during cataloging of other large paleontological collections. The majority of these records are derived from our extensive holdings of Neogene Mollusca from Southern California. Cataloging of this material was determined to be a priority because of its potential use in studies of the impact of regional environmental change on shallow marine communities. In addition, our complete set of type and figured specimen lots has been incorporated, including 10,429 specimens. These are the most important components of the collection, so they were a priority for cataloging.
There are several problems with the type of Web forms interface outlined above. Most serious is the requirement for human intervention to locate a particular piece of information regarding a particular locality or specimen lot. This means that it is difficult to generate direct links to information, for example to link from another Web site to one particular locality. Secondly, the “Web spider” programs used by standard Web search engines to index Web pages cannot access Web forms easily. To overcome these limitations, we have designed a simple Web interface to the LACMIP catalog that allows direct linking to individual locality, specimen lot, type specimen, and digital image records. We have followed a REST-like architecture (Fielding, 2000) that takes advantage of existing Web protocols to allow access to our data from outside systems. Each data resource is represented by a Web address or uniform resource locator (URL). These addresses are static and easy to construct if the user knows what he or she is looking for. For example, the URL http://ip.nhm.org/ipdatabase/locality/17575 will return whatever information we have regarding locality 17575, and the URL http://ip.nhm.org/ipdatabase/lot/10762–2 will link directly to information about specimen lot 10762–2. Similar links exist for type specimens and images of specimen lots. For example, the URL http://ip.nhm.org/ipdatabase/type/9786 links directly to type specimen LACMIP 9786. The returned pages are not static Web pages but are generated by the Web server at each request so they are always up to date. As standard schemas for the publication of paleontological specimen data become available, we will be able to publish extensible markup language (XML)–formatted information using this mechanism.
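Because the addresses follow a fixed pattern of base address, resource type, and identifier, an outside system can construct them without any lookup. The Python helper below is a sketch of that construction; the function name is our own, only the three resource types whose addresses are quoted above are supported, and no assumption is made about the address pattern for images.

```python
from urllib.parse import quote

# Base address as quoted in the text.
BASE_URL = "http://ip.nhm.org/ipdatabase"

# Resource types whose URL pattern is given in the text.
RESOURCE_TYPES = ("locality", "lot", "type")


def record_url(resource, identifier, base=BASE_URL):
    """Build the direct link for one catalog record.

    `record_url` is a hypothetical helper, not part of the catalog.
    """
    if resource not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource}")
    # Percent-encode the identifier so it is safe as a path segment.
    return f"{base}/{resource}/{quote(str(identifier), safe='')}"


print(record_url("locality", 17575))
print(record_url("type", 9786))
```

A publisher or compilation database could store only the resource type and identifier for each cited record and generate the link when a page is rendered.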
As a test of this Web services interface, we developed a system that allows joint queries across both the Holocene and fossil mollusk collections at the Natural History Museum of Los Angeles County (LACM; Johnson et al., 2005b). In our institution, most departments use different systems that are appropriate for the needs of each department. For example, the LACM Holocene malacology database requires no treatment of stratigraphy, and the invertebrate paleontology database has no way to track water depth. In the joint search tool, searches are performed on a subset of fields from the LACM malacology and LACMIP databases, and the results include links back into the original databases so users can access more complete information that might not be contained in both systems (Fig. 7A–C).
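The joint search can be thought of as querying only the fields that both databases share and returning a pointer back to each source record. The Python below is a minimal sketch of that idea; the shared field list, record layouts, and source names are assumptions for illustration, and the records are placeholders rather than museum data.

```python
# Fields assumed to exist in both databases (hypothetical choice).
SHARED_FIELDS = ("genus", "species", "county")


def joint_search(sources, **criteria):
    """Search several record sets on their shared fields only.

    `sources` maps a source name to a list of record dicts. Each hit
    carries the shared fields plus the source name and record id, so
    a caller can link back to the full record in the original system.
    """
    unknown = set(criteria) - set(SHARED_FIELDS)
    if unknown:
        raise ValueError(f"not a shared field: {sorted(unknown)}")
    hits = []
    for source_name, records in sources.items():
        for record in records:
            if all(record.get(k) == v for k, v in criteria.items()):
                hit = {name: record.get(name) for name in SHARED_FIELDS}
                hit["source"] = source_name
                hit["record_id"] = record["id"]
                hits.append(hit)
    return hits


# Placeholder records: each source keeps fields the other lacks.
SOURCES = {
    "malacology": [
        {"id": "M1", "genus": "Genus", "species": "alpha",
         "county": "County A", "depth_m": 12},
    ],
    "paleontology": [
        {"id": "P1", "genus": "Genus", "species": "alpha",
         "county": "County B", "formation": "Formation X"},
    ],
}

print(joint_search(SOURCES, genus="Genus"))
```

Source-specific fields such as water depth or formation are deliberately absent from the hits; they are reached by following the source and record identifier back to the original database.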
Developing any information system requires compromise. Our priority has been to publish as much collections-related information as possible with limited resources. The quality of these data varies, but even imperfect data can be useful (Lieberman and Kaesler, 2000). Furthermore, we acknowledge that as museum curators we will never have the resources to fully verify the immense volume of information held in our catalog. Instead, the paleontological community must help with this never-ending task. The LACMIP catalog is a living document that is constantly being improved by museum staff and database users, and the information published in it should not be used uncritically in large compilations. Although effort is made to publish only accurate information, there is large variation in the quality of the information included in the catalog. Stratigraphic and taxonomic concepts change with time, and these updates are not always included in the system. Indeed, we hope that users will help improve the data, and we ask that the authors publishing studies based on information in the LACMIP catalog help update the catalog with new information and interpretations resulting from their research. We also expect authors to include citations to the LACMIP catalog if they have used it as a data source.
Besides sharing information with our community of researchers, we encourage links to the LACMIP catalog from online versions of publications that make use of our collections. Such links should enhance greatly the utility of research collections catalogs (National Research Council, 2002). For example, papers published in the online version of the Journal of Paleontology or Geosphere could contain direct links from specimen or locality citations to the LACMIP catalog. Such links allow readers rapid access to the most up-to-date information available. Changes in the interpretation of stratigraphy, environment, or taxonomic classification cannot be tracked in a static document, but the static document can provide links back to systems that can be changed. Similar links could be incorporated into compilations of paleontological occurrences based on published records, thus allowing users direct access to the underlying data and allowing database administrators to automatically track revisions in data associated with museum collections. In the current implementation we have adopted a Web architecture approach rather than a more complex Web services approach. The benefit of this type of interface is that it can be implemented right now—the protocols exist, and they are simple to use. The only software required to view the catalog is a standard Web browser. Furthermore, as new data standards and messaging protocols develop we will be able to accommodate them into new versions of the LACMIP collections catalog.
Probably the main obstacle to the widespread adoption of community-based catalogs is encouraging qualified researchers to contribute hard-earned data to a collaborative system. There are several potential models to rectify this problem, some of which offer a “carrot,” and others that threaten a “stick.” For example, we could require some level of contribution as a condition for providing loans of specimens or access to the collections catalog, but so far we reject this approach because it might result in reduced collections use. An alternative is to provide a mechanism by which contributors could receive some form of professional credit in the form of measures that could be added to curricula vitae or management reports used in professional performance reviews. To achieve this, we plan to implement an electronic recorder or scoreboard that lists the number and type of data contributed by each member of the community using the catalog. As links develop into the catalog from online journals or other publications, this scoreboard could be used to track usage of particular types of information, and the resulting track record could be used to weight contributions from individual researchers in the same way that publications are weighted based on the number of times that they are cited in works of other authors. However, in the end, researchers and other users of information in the LACMIP catalog must take on part of the responsibility for maintaining this shared resource. As a community, we all require high-quality information to place fossils in the proper taxonomic, stratigraphic, and geologic context because the scientific value of paleontological collections lies as much in this context as in the fossils themselves. Unfortunately, with the funding levels currently available for the support of collections, museum staff will never be able to maintain and update all of the information for researchers and other users of the catalog. 
The resulting bottleneck will impede progress by limiting the availability of up-to-date information. To avoid this, the community of paleontologists must perform as much of the required maintenance and updating as their collections use dictates. These data must become a shared resource maintained by all as we move together into a data-rich future for paleontology.
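The contribution scoreboard proposed above, listing the number and type of data contributed by each member of the community, could be derived directly from the time-stamped, attributed additions already stored in the catalog. The Python below is a sketch of such a tally; the function name, the input layout, and the contribution categories are hypothetical, and no scoring or weighting scheme is implied.

```python
from collections import Counter


def build_scoreboard(contributions):
    """Count contributions per contributor and per type.

    `contributions` is an iterable of (contributor, kind) pairs, for
    example as read from the catalog's attribution records.
    """
    board = {}
    for contributor, kind in contributions:
        board.setdefault(contributor, Counter())[kind] += 1
    return board


# Placeholder log entries, not real catalog activity.
log = [
    ("contributor A", "determination"),
    ("contributor A", "determination"),
    ("contributor A", "stratigraphic unit"),
    ("contributor B", "image"),
]

for name, counts in build_scoreboard(log).items():
    print(name, dict(counts))
```

Usage counts from inbound links could later be joined to the same tally if a weighting of contributions were desired.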
Current Address: Department of Palaeontology, Natural History Museum, Cromwell Road, London SW7 5BD, UK
Much of the data in the LACMIP collections catalog was entered by our predecessors including J.M. Alderson, L.T. Groves, G. Kennedy, P.G. Owen, and E.C. Wilson. Our team of work study students, research associates, and volunteers includes M. Alonso, J. Cline, S. Cowles, A. Fu, B. Gillies, L. Moore, H. Murdock, L.R. Saul, J. Severe, R.J. Stanton Jr., and J. Wiggins. We thank C.M. Kelly for producing many of the photographs. W. Allmon, W. Kiessling, D. Pentcheff, A. Valdés, and R. Wetzer provided useful suggestions for improving this contribution. We gratefully acknowledge the support of the United States National Science Foundation (grant DBI-0237337).