Many valuable earth science data are not available in a digital format. Manual entry of such information into databases is time consuming, unrewarding, and prone to introducing errors. Taxonomic descriptions of fossils are a good example of valuable data that are overwhelmingly available only in printed volumes and journals, some of which are increasingly rare and inaccessible. The highly structured nature of taxonomic procedures and nomenclature means that many previously published data remain equally valid to the present day, and contain information that is currently not available on the World Wide Web; these data would be of great use to a wide variety of scientists and other end users in government, industry, academia and the general public. This paper describes an XML (extensible markup language) parsing technique that allows taxonomic descriptions to be fully digitized much more rapidly than would be possible by manual entry of the data into a database. The technique exploits the high degree of structure in taxonomic descriptions, which are written in a standardized format, to automate the process of tagging separate sections of the text. Once tagged using XML, the data can be subjected to complex searches using queries written in any of the XML query standards. The XML-tagged data can potentially be imported into existing databases, in effect removing the necessity to manually enter the information, and hence overcoming the main bottleneck in generating digital data from printed material. Individual parsers can be tailored precisely to the nature of the text being analyzed, and once the underlying concepts and procedures are understood, those interested in acquiring and using digital data will be able to generate XML parsers dedicated to text with different styles of standardized formatting.
Many scientific disciplines are ideally placed to make great use of rapidly developing computer technology, as many of the raw data are available in a digital format. Indeed, the demands for huge computing power to store, manipulate, distribute, and analyze such enormous data sets as are generated by individual experiments in subjects such as particle physics and astronomy, for example, constitute an important driving force in the recent developments in computing science. In particular the necessity of establishing global collaborations in key areas of sciences that have, and increasingly generate, extensive digital data has led to the development of the new field “e-Science,” and its enabling infrastructure (commonly referred to as the “Grid”) (Hey and Trefethen, 2002). While the Grid initially focused on providing access to high-end computational facilities, the concept has been enlarged to include situations in which the main challenge is to effectively coordinate access and analysis of dispersed digital resources such as databases (Jones, 2007). Such a facility would have universal applications across the sciences.
However, in earth sciences, as in a number of other sciences, digital data are only patchily available. There are a number of important areas for which there are impressive and rapidly growing collections of digital data (satellite imagery, for example), but there are also many branches of the subject for which few or no digital data exist. Most of the accumulated knowledge in the earth sciences resides in published monographs and journals. While the development of electronic (e) journals and e-books will ensure that published research data increasingly become available in a digital format, there remains the major difficulty of how to digitize existing published information. Some components of the published knowledge base are undoubtedly outdated and it would clearly be counterproductive to expend time and effort to digitize this information. However, there are very significant amounts of published information that represent core information for many disciplines, and there is an urgent need to digitize such information. The urgency is particularly acute because this published information has been peer reviewed, and hence has been academically validated (which is not always the case for data directly published on the World Wide Web [WWW]).
The major problem is how to digitize such data. It is not sufficient to simply generate a digital version of printed text (such as a portable document format, pdf, or a text document), which can, as discussed below, be readily achieved even for publications of considerable antiquity using modern scanning and optical character recognition (OCR) software. Instead, full exploitation of the data requires that the text be entered into labeled fields (for example, in a database). Currently this full digitization requires manual entry of data into a database, but the process is very slow and unrewarding for the operators. Among the numerous difficulties inherent in manually digitizing data is the fact that rekeying and recoding text may introduce errors, and that different operators may adopt different or inconsistent protocols for entering information and hence will generate additional artifacts. The importance of achieving a high degree of fidelity in digital data means that creating even small databases can be a major task. Even when databases have been prepared, there is a need to constantly maintain and update them as new information accumulates, and the entire process is therefore unrewarding and uneconomic. In addition, computers are ideally suited to conducting repetitive tasks extremely fast and with great fidelity, so it is not sensible to depend on human operators for this task when their time would be much better spent concentrating on analysis and interpretation (for which tasks humans are much better equipped than computers).
Manual entry of data into a database is universally recognized as a major bottleneck in the development of major digital data resources (Curry et al., 2001; Godfray, 2002; Rodman and Cody, 2003; Wheeler, 2003; Wheeler et al., 2004; Curry and Connor, 2007). Automating the process would help overcome this bottleneck, and this paper describes how this can be achieved for a type of earth science data (fossil descriptions) that is particularly amenable to this form of processing. Species descriptions are particularly suitable for retrospective digitization because they have a well-defined format, and because within each group there are readily identifiable key monographs that warrant digitization. These monographs contain data that are still valid, often using terminology still in constant use (Curry and Connor, 2007). They contain vast amounts of valuable information not generally available on the WWW, as they contain full and diagnostic details of an organism's morphology, as well as detailed stratigraphic and paleogeographic information and expert discussion, while most taxonomic databases usually list little more than species names and basic taxonomic, stratigraphic, and distributional information. A further advantage is that many of these key monographs are out of copyright.
Furthermore, the rigorous protocols of taxonomy, and generally high standards of editing, mean that the data they contain remain useful to users: even though an organism's name may have changed over time, the routine inclusion of synonym lists in formal taxonomic descriptions guarantees that the taxon can be identified unambiguously. As this paper is solely concerned with the digitization of existing seminal publications using the author's original and unaltered text, there are no issues concerning compliance with the International Commission on Zoological Nomenclature (ICZN) regulations governing the publication of formal taxonomic descriptions (ICZN, 1999). There has been considerable debate about the potential of the WWW for the publication of formal descriptions of new species (Godfray, 2002), but the monographs targeted in this publication are all previously published in full compliance with ICZN regulations. Indeed, in many ways the Treatise on Invertebrate Paleontology is an ideal test subject, not only covering an entire phylum in a consistent fashion, but also representing the agreed views of the international community of brachiopod experts. In effect, taxonomic descriptions published in the treatise have been doubly validated—initially when published in a recognized systematics journal, and secondly by the author, or group of authors, of the treatise sections who have reviewed the taxonomic validity of a taxon from their expert position. The enduring value of these descriptions is therefore effectively guaranteed by a consensus of cumulative expertise extending back throughout the history of paleontological research.
The recent announcement of the Encyclopedia of Life Project (http://www.eol.org/) makes it particularly timely to develop such a digitizing procedure for the huge amount of fossil description data that currently exist only in printed form. The Encyclopedia of Life is concentrating on living organisms, but mentions that fossil data will eventually be incorporated: the technique described in this paper will allow the earth science community to immediately make rapid progress on this important task.
Apart from the Encyclopedia of Life, many other major international projects are addressing the complex issues involved in making biodiversity information available on the WWW. The Global Biodiversity Information Facility (GBIF), for example, was established to meet demands from governments and industry for biodiversity information that was essential for scientific research, environmental research, economic development, and sustainability (Lane and Edwards, 2007). However, such initiatives, and many others, are focused almost entirely on living organisms, and do not have immediate plans to incorporate fossil data. It is important to realize that megascience facilities such as the GBIF (Lane and Edwards, 2007) take the form of a distributed network, in which disparate data nodes are linked to provide a global resource. Thus the GBIF is working toward the computational infrastructure that will allow digital biodiversity data in any format to be accessed using a single interface regardless of the nature of the data or the platform on which the data are stored. There will still be a requirement, therefore, for constituent data sets to be digitized by individuals or groups, and this is clearly true for a considerable proportion of paleontological information, which remains unavailable in a digital format. Once digitized it will be possible to make use of the systems interoperability facility of organizations such as the GBIF to have this data distributed freely irrespective of computing platform (Lane and Edwards, 2007).
FORMAL DESCRIPTIONS IN TAXONOMY
The key feature of taxonomic descriptions that renders them suitable for automated digitization is that they are written in a very structured way, and this format is followed rigorously and in a consistent order within a particular monograph. The standard components of a formal taxonomic description are the following.
Genus and species, original author, date of publication, and page numbers are all presented in a standard format throughout the publication, for example with the binomial name in bold, followed by the original author's name in plain text, with the date of the publication and page number of the species description separated by commas, and with a full stop and line return to end.
Previous names applied to this species are commonly shown as italics and in a list, each separated by a line return. Some synonym lists can run to several pages where a taxon has been subjected to numerous revisions.
Diagnosis includes a brief summary of the key diagnostic features that distinguish this species from others that are closely related.
The features are described in a standard order, for example, starting with the shape, size, external ornamentation, and then moving to describe the internal features. The description is generally written in very abbreviated style, with high-information content sentences or phrases separated by semi-colons.
Geographic and stratigraphic information are presented in a factual, standardized format.
Details of the location of the specimens used to prepare the description are given, including in most cases the catalogue numbers from the museum or repository where they are stored.
The discussion is the least standardized part of the formal taxonomic description, but it will normally contain some justification for the erection of this new species by outlining how it differs from other species assigned to this genus.
A very similar style is largely applicable to formal taxonomic descriptions at any level of classification, i.e., not just species, but also for formal descriptions of genera, families, superfamilies, etc.
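Before any XML is generated, these standard components can be thought of as a simple record structure. The following Python sketch is purely illustrative; the field names are assumptions based on the components listed above, not taken from any published schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaxonDescription:
    # Standard components of a formal taxonomic description,
    # in the order in which they normally appear in a monograph.
    name: str                 # binomial or genus name
    author: str               # original author
    date: str                 # year of first publication
    page: str                 # page of the original description
    synonyms: list = field(default_factory=list)      # previous names
    diagnosis: str = ""       # key diagnostic features
    description: list = field(default_factory=list)   # phrases, split on semicolons
    stratigraphy: str = ""    # stratigraphic range
    geography: str = ""       # geographic distribution
    repository: str = ""      # museum catalogue details
    discussion: str = ""      # least standardized component

# A record populated from the Treatise example discussed later.
record = TaxonDescription(name="Glottidia", author="DALL",
                          date="1870", page="157")
```

The same hierarchy of fields maps naturally onto nested XML tags, which is how the parser described below stores its output.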
AUTOMATING THE DIGITIZATION OF TAXONOMIC DESCRIPTIONS
In simplistic terms, databases impose a high degree of structure on normal text (which is unstructured), because the text is placed in labeled fields. This structure greatly increases the ability to carry out complex searches, which is one of the main reasons for creating databases. For example, if species descriptions were fully encoded in a database, then it would be possible to search for all taxa that had all of the following eight attributes (a hypothetical query based on formal taxonomic descriptions of fossil nautiloid species).
1—date (of first description) = “1940–1950”;
2—author = “John Smith”;
3—description or diagnosis includes “lirate shell”;
4—description or diagnosis includes “cyrtoconic apex”;
5—description or diagnosis includes “siphuncular deposits”;
6—description or diagnosis includes “cyrtochoanitic septal necks”;
7—Stratigraphic range includes “Upper Devonian”;
8—Geographic range includes “Kansas.”
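Over a fully digitized data set, such a query reduces to checking labeled fields rather than matching raw strings. The following Python sketch runs the eight-attribute query over two invented records; the dictionary keys and example data are assumptions for illustration only:

```python
def matches(rec, author, date_range, phrases, period, place):
    """Return True when a parsed record satisfies every query attribute."""
    lo, hi = date_range
    if not (lo <= int(rec["date"]) <= hi):
        return False
    if rec["author"] != author:
        return False
    text = " ".join(rec["description"])
    if not all(p in text for p in phrases):
        return False
    return period in rec["stratigraphic"] and place in rec["geographic"]

# Two invented records standing in for parsed nautiloid descriptions.
records = [
    {"date": "1947", "author": "John Smith",
     "description": ["lirate shell", "cyrtoconic apex",
                     "siphuncular deposits", "cyrtochoanitic septal necks"],
     "stratigraphic": ["Upper Devonian"], "geographic": ["Kansas"]},
    {"date": "1962", "author": "John Smith",
     "description": ["smooth shell"],
     "stratigraphic": ["Permian"], "geographic": ["Texas"]},
]

hits = [r for r in records
        if matches(r, "John Smith", (1940, 1950),
                   ["lirate shell", "cyrtoconic apex",
                    "siphuncular deposits", "cyrtochoanitic septal necks"],
                   "Upper Devonian", "Kansas")]
```

Only the first record satisfies all eight attributes; the second fails on the date range alone, which is exactly the kind of field-aware filtering that plain string matching cannot provide.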
Attempting such a query over simple text (for example, in a word-processed document) would fail, because such queries rely on simple string matching: a string-matching search returns every case where the search terms occur anywhere in the text (not just within the formal description), rather than only the species that display all of the features. Even though simple text searching is becoming more sophisticated, there is no immediate prospect of being able to use this technique to generate meaningful results from the sort of complex query illustrated above. Until recently, such functionality has been restricted to highly structured databases. There are many earth scientists, as well as organizations such as museums, natural history societies, and government agencies, who would make great use of such information if it were available digitally; however, it is unrealistic to expect that either the funds or the expertise will be available to generate suitable databases manually.
There have recently been developments in the ability to process semi-structured text, and it is these developments that have opened up the prospect of automating the digitization of species descriptions. In computing terms, species descriptions have a much higher degree of structure than normal text, both because of the standardization of the layout, and because of the quality and rigor of pre-publication editing. In effect, species descriptions are situated at some point between unstructured text on one hand, and highly structured databases on the other.
It is possible to exploit this feature of species descriptions, and indeed any other text written with such a comparatively high degree of structure, in such a way as to automate the digitization process. This was achieved in this study by preparing software that parses the text of species descriptions, and automatically generates tags using XML (extensible markup language). The parser makes use of the high degree of structure, and the standardized layout, to recognize different components of the species description and to label them accurately. XML is widely used in major biodiversity projects such as the GBIF (http://www.gbif.org), iSpecies (http://darwin.zoology.gla.ac.uk/rpage/ispecies/), and Electronic Biologia Centrali-Americana (http://www.sil.si.edu/BCAProject/) because it provides easy searching through labeled data, and allows interoperability between different data sets and hence facilitates the linking of distributed data sources. For example, iSpecies (http://darwin.zoology.gla.ac.uk/rpage/ispecies/) provides a WWW-based facility that searches a range of different databases for information about a particular species, and compiles the results into a single WWW page. The process involves recovering appropriate XML output from a variety of online databases, and presenting them as a hypertext markup language (HTML) page using extensible stylesheet language transformations (XSLT) (as described below). Searching for information on species of interest to earth scientists using iSpecies often produces sparse results because the data are not available digitally, but the use of XML, as advocated in this paper, will clearly guarantee that digitized information on fossils becomes universally available given the large number of facilities that utilize data in such a format. The major focus on interoperability will also ensure that XML-coded data remain accessible regardless of future developments in computer hardware or operating system.
FORMAL TAXONOMIC DESCRIPTION
The following is a taxonomic description of a brachiopod genus, as published in the Treatise on Invertebrate Paleontology (Williams et al., 2000, p. 36). It demonstrates many of the conventions adopted for the formal description at any taxonomic level (e.g., species), and is in a format that is used right across the extensive series of Treatise volumes dealing with many phyla of fossil organisms.
Glottidia DALL, 1870 p. 157 [*Lingula albida HINDS, 1844, p. 71; OD]. Shell strongly elongate; ventral pseudointerarea small, with vestigial propareas and pedicle groove; ventral visceral area extending somewhat anterior to midvalve, with posterolateral margins bounded by two divergent septa, serving as places of attachment for oblique muscles and support of body wall; pedicle nerve curving around unpaired umbonal muscle scar; dorsal visceral area with median septum extending from umbonal to transmedian muscles; mantle canal system with papillae; vascula media absent. ?Cretaceous, Tertiary-Holocene: ?Antarctica, Cretaceous; Europe, Tertiary; North America, Tertiary-Holocene; South America, Holocene.–Figure 9,2a–c. *G. albida (HINDS), Holocene, Anaheim Bay, California; a, dorsal valve exterior, X1.8; b,c, ventral, dorsal valve interior, MCZ 4423, X2.8 (new).
Parsing this 115-word description transformed it into an XML document with 61 lines of XML-tagged text. Figure 1 shows the first 18 lines of this XML document, with the original text in blue, and the XML tags added by the parser in red. These 18 lines deal first with the formal name of the genus (in this case Glottidia), its author (DALL), the date of the publication in which this genus was first described (1870), and the page number within this publication on which this description occurs (p. 157). It is possible to parse such text because of its comparatively high degree of structure; consistently throughout this and other Treatise on Invertebrate Paleontology volumes, the generic name comes at the beginning of a new paragraph, it has an initial capital letter followed by lower-case letters, and it is always in bold. Attributes can be assigned to XML tags, and in this example the name attribute has been tagged as <NAME confirmed = “true”>. This attribute retains important information about the taxon, because throughout the treatise the use of a question mark indicates some doubt about the genus. In this case there is no question mark in the parsed text, and hence the name attribute is listed as “true.”
Immediately after the taxonomic name, the author's name is cited, all in capitals but not in bold, and this is followed by a comma, the date, another comma, and then the page number. The major difference from the text version printed above is the XML tags that have been automatically and accurately generated around the appropriate segments of text (<NAME> … </NAME>, <AUTHOR> … </AUTHOR>, <DATE> … </DATE>, <PAGE> … </PAGE>) by the parser, and this allows structured searches to be carried out in a way that would be impossible with simple text.
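As a rough illustration of how such a header line might be recognized, the following Python sketch uses a regular expression built from the punctuation conventions just described. The pattern, function name, and attribute formatting are hypothetical; this is not the actual parser used in the study:

```python
import re
from xml.sax.saxutils import escape

# Pattern for a Treatise-style header such as "Glottidia DALL, 1870 p. 157".
# An optional leading "?" marks a doubtful genus.
HEADER = re.compile(
    r"^(?P<doubt>\?)?(?P<name>[A-Z][a-z]+)\s+"
    r"(?P<author>[A-Z]+),\s*(?P<date>\d{4})\s+p\.\s*(?P<page>\d+)"
)

def tag_header(line):
    """Return the header wrapped in NAME/AUTHOR/DATE/PAGE tags, or None."""
    m = HEADER.match(line)
    if m is None:
        return None
    confirmed = "false" if m.group("doubt") else "true"
    return (f'<NAME confirmed = "{confirmed}">{escape(m.group("name"))}</NAME>'
            f'<AUTHOR>{escape(m.group("author"))}</AUTHOR>'
            f'<DATE>{m.group("date")}</DATE>'
            f'<PAGE>{m.group("page")}</PAGE>')

xml = tag_header("Glottidia DALL, 1870 p. 157")
```

A leading question mark in the input would flip the confirmed attribute to "false", mirroring the treatise convention discussed above.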
This becomes even more important when the remaining information in the description has been automatically parsed. Thus line 7 of the XML document contains the name, author, date, and page for the type species of this genus [OD indicates that this was the “original designation”; i.e., that the type species was explicitly applied to the single species in the first description of this new genus (Kaesler, 2000, p. xix)]. The type species data follow the page information without punctuation, but are enclosed in square brackets. Immediately after the type specimen information there is a full stop, followed by the actual description of the key morphological features of this genus. The parser recognizes these descriptive phrases, each separated by a semicolon in the printed text, and encloses them within <DESCRIPTION> … </DESCRIPTION> tags (line 8 and line 18 in Fig. 1). As each phrase in the description is separated by semi-colons, these individual phrases have been labeled as <DETAILS> … </DETAILS> within <DESCRIPTION> … </DESCRIPTION> (lines 9–17 in Fig. 1). XML tags can be nested within each other hierarchically, which is ideally suited for formal systematic descriptions that are also organized hierarchically (i.e., superfamily, family, genus, species, etc.).
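The semicolon convention makes this step straightforward to sketch. The following Python function is illustrative only, although the tag names follow those shown in Figure 1:

```python
from xml.sax.saxutils import escape

def tag_description(text):
    """Wrap each semicolon-separated phrase of a description in <DETAILS>
    tags, nested inside a single <DESCRIPTION> element."""
    phrases = [p.strip() for p in text.split(";") if p.strip()]
    details = "\n".join(f"  <DETAILS>{escape(p)}</DETAILS>" for p in phrases)
    return f"<DESCRIPTION>\n{details}\n</DESCRIPTION>"

# Opening phrases of the Glottidia description reproduced above.
sample = ("Shell strongly elongate; ventral pseudointerarea small, "
          "with vestigial propareas and pedicle groove; "
          "mantle canal system with papillae")
tagged = tag_description(sample)
```

Each high-information phrase becomes a separately searchable element, which is what allows the field-level queries described earlier.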
The first 18 lines of the XML document contain 81 words from the total of 115 in the printed Treatise on Invertebrate Paleontology description. The few remaining words contain a great deal of information, and the parser generates more than 40 lines in the XML document to extract all this information. Immediately after the full stop indicating the end of the description, a summary of the stratigraphic range is given, all in italics. For the genus Glottidia, the citation of “?Cretaceous, Tertiary-Holocene” indicates that there is a somewhat dubious record of this species in Cretaceous rocks, but that it has a well-documented record from the Tertiary to the Holocene.
The human brain can quickly absorb this information once the conventions are understood, but for computers there needs to be a great deal of tagging to ensure that the information contained in these three words is correctly digitized. To do this in the XML parser, 11 lines of XML-tagged text are generated (Fig. 2). Enclosed within the <STRATIGRAPHIC> … </STRATIGRAPHIC> tags are both <STRATIGRAPHICPERIOD> and <STRATIGRAPHICRANGE> tags. The former tags are applied to single words (e.g., if the stratigraphic range is listed just as “Permian” or as in this case “?Cretaceous”) and the latter is applied when there are two words separated by a dash without spaces (as in this case with “Tertiary-Holocene”). In some cases there will be only a single word, in others there will be only a range, and in other cases, as here, there will be both. The parser can handle any combination of stratigraphic information that is provided. Figure 2, line 20, shows another use of the XML-tag attribute to add additional information, in this case that the record of this genus in the Cretaceous is, as indicated in the text by the question mark, doubtful, and this is recorded in the XML-tag attribute as “confirmed” = “false.” Nested within the <STRATIGRAPHICRANGE> tags are tags indicating the <START> and <END> of the stratigraphic range, and again the “confirmed” = “true” tag attribute indicates that the cited periods are not considered dubious for any reason.
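A hedged sketch of this stratigraphic tagging step follows, in Python. The tag and attribute names follow the paper's examples, but the function itself is an illustration, not the parser used in the study:

```python
def tag_stratigraphy(text):
    """Tag a stratigraphic summary such as "?Cretaceous, Tertiary-Holocene".
    Single words become <STRATIGRAPHICPERIOD>; hyphenated pairs become
    <STRATIGRAPHICRANGE> with nested <START> and <END> tags.  A leading
    "?" marks the record as unconfirmed."""
    out = ["<STRATIGRAPHIC>"]
    for item in (s.strip() for s in text.split(",")):
        confirmed = "false" if item.startswith("?") else "true"
        item = item.lstrip("?")
        if "-" in item:
            start, end = item.split("-", 1)
            out.append(f'<STRATIGRAPHICRANGE confirmed = "{confirmed}">')
            out.append(f'  <START confirmed = "true">{start}</START>')
            out.append(f'  <END confirmed = "true">{end}</END>')
            out.append("</STRATIGRAPHICRANGE>")
        else:
            out.append(f'<STRATIGRAPHICPERIOD confirmed = "{confirmed}">'
                       f"{item}</STRATIGRAPHICPERIOD>")
    out.append("</STRATIGRAPHIC>")
    return "\n".join(out)

result = tag_stratigraphy("?Cretaceous, Tertiary-Holocene")
```

Run on the Glottidia summary, this produces one doubtful period element and one confirmed range element, matching the structure shown in Figure 2.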
This may all seem very complicated, but it relates back to the underlying rationale of digitization, namely to make sure that complex, multitopic searches can be successfully executed. In practical terms such complexity is insignificant for the operation of the parser, which can rapidly process many hundreds of records. With these tags in place in the XML document, it would be possible to search for all taxa that are present only in the Cretaceous (a search for “<STRATIGRAPHICPERIOD> = Cretaceous”, with the true or false attribute indicating how many of these records are in some doubt). It would also be possible to search for all taxa that first appeared in a particular period, that last appeared in a particular period, or that had a particular range of periods (searching using the <START> and <END> tags).
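Such searches can be carried out with any XML library. The following Python example uses the standard xml.etree module over a tiny invented document (the tag layout follows the paper's examples, but the taxa are fabricated for illustration):

```python
import xml.etree.ElementTree as ET

# A stand-in document; tag layout follows the paper's examples,
# but the taxa themselves are invented.
DOC = """<TAXA>
  <TAXON><NAME>Alpha</NAME>
    <STRATIGRAPHIC>
      <STRATIGRAPHICPERIOD confirmed="false">Cretaceous</STRATIGRAPHICPERIOD>
    </STRATIGRAPHIC>
  </TAXON>
  <TAXON><NAME>Beta</NAME>
    <STRATIGRAPHIC>
      <STRATIGRAPHICRANGE confirmed="true">
        <START confirmed="true">Tertiary</START>
        <END confirmed="true">Holocene</END>
      </STRATIGRAPHICRANGE>
    </STRATIGRAPHIC>
  </TAXON>
</TAXA>"""

root = ET.fromstring(DOC)

# Taxa recorded (possibly doubtfully) only in the Cretaceous.
cretaceous = [t.findtext("NAME") for t in root.findall("TAXON")
              if any(p.text == "Cretaceous"
                     for p in t.iter("STRATIGRAPHICPERIOD"))]

# Taxa whose stratigraphic range starts in the Tertiary.
tertiary_start = [t.findtext("NAME") for t in root.findall("TAXON")
                  if any(s.text == "Tertiary" for s in t.iter("START"))]
```

The same pattern extends to end-of-range queries via the <END> tags, or to filtering on the confirmed attribute to separate doubtful from reliable records.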
However, the next part of the description contains even more stratigraphic information, this time paired with a geographic breakdown of the area where the genus is found during that stratigraphic unit. These paired geographic and stratigraphic listings also require complex tagging to ensure that they are fully available for searching. One of the important attributes of XML is that the tags can be named, and in this case the rather cumbersome <PLACES_AND_POSSIBLE_PERIODS> tag has been used to clearly distinguish the information present in these paired stratigraphic-geographic descriptions. These tags can subsequently be altered very easily using global search and replace functions in XML editing software, but are used here to ensure clarity.
Within these tags (Fig. 3) there are a variable number of subsets of information tagged as <GEOSTRATSETS>, each of which represents a set of information on the stratigraphic and geographic distribution of this taxon. Thus in the formal description of Glottidia reproduced above, immediately after the stratigraphic range information discussed above (and shown tagged in Fig. 2), there is a colon, and then the first of these “geostratsets” is listed (in bold below).
?Cretaceous, Tertiary-Holocene: ?Antarctica, Cretaceous;
The parser recognizes this paired set of data, and erects the following XML tags around it:

<PLACE confirmed = “false”>
<STRATIGRAPHICPERIOD confirmed = “true”>Cretaceous</STRATIGRAPHICPERIOD>

In computing terms, it is now clear that this record indicates that there is a somewhat doubtful record of the occurrence of this taxon in Antarctica, but that if it is really this taxon, then the rocks are of Cretaceous age. There can be numerous sets of these paired geostratsets: for Glottidia there are four, and they contain a prodigious amount of information. As indicated in Figure 4, there is some doubt about the record of Glottidia in Antarctica during the Cretaceous, but there are much more reliable records of its presence in Europe from the Tertiary, from North America starting in the Tertiary and extending to the Holocene, and from South America during the Holocene. The nine original words convey an admirably succinct snapshot of the geological history of this taxon, which may have first appeared in Cretaceous rocks now in Antarctica, had definitely spread to Europe and North America during the Tertiary, and survived to the Holocene in North America and indeed had extended its range to South America (in the treatise, the Holocene includes taxa that are extant, as is the case with Glottidia, which can be collected in present-day seas of North and South America).
The parser recognized and tagged all this information (Fig. 3) because the text always follows on from the stratigraphic range but separated from it by a colon, it is printed in italics, and each set of information is separated from each other by a semi-colon. The XML tagging allows all this information to be included in complex searches; e.g., Glottidia would be returned in any search constructed to find taxa that were recorded from North America during the Tertiary and for whom the description included “median septum,” “unpaired umbonal muscle scar,” etc.
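This recognition step can likewise be sketched in Python. The splitting rules below (the leading colon already removed, semicolons between sets, a comma between place and periods) follow the conventions just described, while the function, tag spellings, and sample text are illustrative:

```python
def tag_geostratsets(text):
    """Tag paired place/period listings such as
    "?Antarctica, Cretaceous; Europe, Tertiary".  A leading "?"
    marks the geographic record as unconfirmed."""
    sets = []
    for chunk in (c.strip() for c in text.split(";") if c.strip()):
        place, _, periods = chunk.partition(",")
        confirmed = "false" if place.startswith("?") else "true"
        sets.append("<GEOSTRATSETS>"
                    f'<PLACE confirmed = "{confirmed}">{place.lstrip("?")}</PLACE>'
                    f"<PERIODS>{periods.strip()}</PERIODS>"
                    "</GEOSTRATSETS>")
    return "\n".join(sets)

tagged = tag_geostratsets("?Antarctica, Cretaceous; Europe, Tertiary; "
                          "North America, Tertiary-Holocene; "
                          "South America, Holocene")
```

For the full Glottidia listing this yields four tagged geostratsets, one per place/period pair, each of which can then participate in the combined stratigraphic-geographic searches described above.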
The final section of the description gives information about the illustrations provided in the Treatise on Invertebrate Paleontology for this genus, and includes the figure numbers, the age and geographic location of the specimen illustrated, the magnifications of the images compared to the original, and a museum catalogue number if available. In this case the parser simply includes all this information in a single XML tag <FIGUREDESC>, but it would be easy to subdivide the information given the editorial conventions being used: the figure details, name, age, and geographic location of the specimen are given in this order, immediately after the end of the paired stratigraphy-geography information, separated from it by two horizontal dashes, with the style reverting to plain text. The information on the views and magnifications of the illustrations comes next, and then the museum number (MCZ 4423 in Fig. 4), and the final piece of information is the source of the illustrations (in this case “new,” indicating that they had been prepared for this publication and had not been copied from a previous publication).
SEARCH AND QUERY
As discussed above, the main rationale behind automated XML tagging is to facilitate complex searches without having to manually enter data into a database. The output from the XML parser is a long XML document that can be displayed in a modern WWW browser such as Internet Explorer or Firefox. In this format, complex queries can be run across the XML-tagged data by executing transformations of the document, as can be achieved using XSLTs.
A simple example of an XSLT transformation is shown in Figure 5. This query extracts the name, author, and date information from the descriptions of 277 brachiopod genera, which were described in volume 2 of the revised brachiopod Treatise on Invertebrate Paleontology (Williams et al., 2000). These descriptions were subsequently parsed to generate an XML document. This XSL transformation of the XML document generated a table listing only the name, author, and date information from each genus description (Fig. 6). This table was imported into an Excel spreadsheet, and analyzed to produce Figures 7 and 8. Although applied here to a simple query and to a small subset of the total data, the technique could be applied to the entire list of genera from the phylum (∼5000) with very little additional effort. Figure 5 shows the full formatting of the XSL transformation, but a simple interface could be developed for widespread use that would just involve investigators entering query terms and the data sets to be transformed. Further programming would be required to transform the parser into a sufficiently user-friendly format to allow its use by non-computer scientists, but this is believed to be feasible. Much more significant is to develop the mindset that recognizes that developments in computing science allow for digital data to be acquired from semi-structured text on any topic (e.g., rock and mineral descriptions).
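For readers more familiar with general-purpose languages, an equivalent extraction can be sketched in Python. This is not the XSLT used in the study, and the input document is invented; it simply shows the same name/author/date tabulation performed with the standard library:

```python
import csv
import io
import xml.etree.ElementTree as ET

# An invented two-genus document in the tag style discussed above.
DOC = """<GENERA>
  <GENUS><NAME>Glottidia</NAME><AUTHOR>DALL</AUTHOR><DATE>1870</DATE></GENUS>
  <GENUS><NAME>Lingula</NAME><AUTHOR>BRUGUIERE</AUTHOR><DATE>1797</DATE></GENUS>
</GENERA>"""

root = ET.fromstring(DOC)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "author", "date"])
for g in root.findall("GENUS"):
    writer.writerow([g.findtext("NAME"), g.findtext("AUTHOR"),
                     g.findtext("DATE")])

table = buf.getvalue()   # ready for import into a spreadsheet
```

The resulting comma-separated table imports directly into a spreadsheet for the kind of analysis that produced Figures 7 and 8.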
PRACTICALITIES OF RETROSPECTIVE DIGITIZATION OF SPECIES DESCRIPTIONS
Fully digitizing taxonomic monographs first requires scanning of the printed text into a format that is amenable to XML parsing. Generating a suitable text document from printed species descriptions from an historic monograph faces some unusual problems, but none that cannot be resolved using modern scanning technology and optical character recognition (OCR) software, both of which have advanced greatly in recent years.
A good example of the sort of authoritative taxonomic volumes that XML parsing should address are the Monographs of Recent Brachiopoda published by Thomas Davidson in the Transactions of the Linnean Society of London between 1886 and 1888. Davidson was a world authority on brachiopods, living and fossil, and his original notebooks, with many exquisitely detailed and superbly executed illustrations, are one of the many treasures stored in the Natural History Museum, London. Despite having been published well over 100 yr ago, the species descriptions remain valid, use terminology that is readily recognized by present-day brachiopod taxonomists (Curry and Connor, 2007), and contain much invaluable additional information on habitats, ecology, distributions, as well as illustrations, some of which remain the best available of the anatomy of brachiopods (Davidson, 1886–1888). There is an important historical perspective to these monographs as well, as they provide an opportunity to chart the changing patterns of brachiopod diversity over the past 120–150 yr, something of great interest for the current research effort on climate change. In certain respects, these monographs increase in importance the older and rarer they become, as they contain vital information on the changing distribution of organisms over time.
However, the available printed volumes of the Davidson monographs, the only authoritative versions of his taxonomic work, display many signs of their antiquity. As the cover page shows (Fig. 9), the paper has become discolored and has acquired various blemishes and marks. The printing has faded, and the intensity of ink fluctuates markedly across the page. The shapes of the letters are variable and often imperfect, and the spacing between them is uneven. In addition, text from one side of the paper has often bled through to the other side, further obscuring sections of text. In effect, therefore, these monographs display most of the features that will complicate scanning and optical character recognition of monographs targeted for full digitization.
In practice, however, it was relatively straightforward to generate text documents from the Davidson monographs. The first step was to photocopy the monographs, something that would be essential when working with rare monographs of considerable antiquity in any event, but that proved to have considerable benefits for scanning and optical character recognition. In particular, during this study it became clear that using a reduced intensity setting on the photocopier, producing copies with fainter text, brought about a significant improvement in the accuracy of the automated OCR software. Slightly fainter photocopies reduced the intensity of most marks, blemishes, and bleed-through text to such an extent that it fell below the recognition threshold for the OCR software and hence was ignored.
The second feature that proved to be of considerable importance in digitizing fossil monographs was to use the OCR software's dictionary to “learn” taxonomic terminology. A significant number of the names and terms used in these monographs will not be present in standard dictionaries, but can readily be learned as the program presents words unrecognized by the dictionary. In the case of the Davidson manuscript the opportunity was taken to use this learn function to ensure that the OCR software was able to reproducibly and reliably process the more obscure text and characters. This proved to be particularly effective for genus and species names, and when processing some unfamiliar features of printed text in the 1880s. For example, there are numerous occurrences of ligatures involving conjoined “a” and “e” letters in Davidson's monographs, in words such as lamellæ (line 2, Fig. 10). The OCR software consistently misrecognized these ligatures (in this case as “lamellre”; see line 2, Fig. 11), but once corrected and learned, subsequent occurrences were accurately identified and rendered as “lamellae.” Similarly, diacritics such as umlauts or dieresis marks are common in the Davidson text (e.g., the two dots above the o in the surname Döderlein; line 4, Fig. 10), but these have been learned as “o” for this exercise (although it would be possible to have them rendered accurately if that were desired). Some of the issues in digitizing the Davidson monographs related to variable spacing between letters (e.g., “developed” and “irregularly” in Fig. 11), but again these can easily be learned and thereafter accurately digitized using automatic character recognition.
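The effect of the learn function can be sketched as a correction dictionary that accumulates confirmed misrecognitions and replaces them in subsequent OCR output. The sketch below is a minimal illustration of the idea, not the OCR vendor's actual mechanism; the first entry mirrors the “lamellre” error discussed above, and the second is a hypothetical example.

```python
import re

# Corrections accumulated as the operator confirms misrecognized words.
# "lamellre" is the real example from the text; "hrachiopod" is hypothetical.
learned_corrections = {
    "lamellre": "lamellae",     # æ ligature misread by the OCR software
    "hrachiopod": "brachiopod", # hypothetical b/h confusion
}

def apply_learned_corrections(ocr_text, corrections):
    """Replace each learned misrecognition wherever it occurs as a whole word."""
    for wrong, right in corrections.items():
        ocr_text = re.sub(rf"\b{re.escape(wrong)}\b", right, ocr_text)
    return ocr_text

print(apply_learned_corrections("numerous lamellre of growth", learned_corrections))
# -> numerous lamellre corrected to: numerous lamellae of growth
```

Because the dictionary is applied as whole-word substitutions, a misrecognition learned once on one page is corrected automatically on every subsequent page of the monograph.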
With so many potential problems, it was in many ways surprising how accurately automated OCR rendered the text from the Davidson monographs. Figure 10 shows a particularly challenging page for digitization; apart from the usual text, it has illustrations and separate blocks of descriptive text printed at a smaller font size.
Despite these complications, automated OCR software accurately acquired the overwhelming majority of the text, including the genus and species names, synonymy lists, and the unusual terminology used to describe morphological features. In total there are only six errors on the entire page (Fig. 11), out of more than 400 words or symbols, and most of these are easily corrected using the learn function. The resulting text document can readily be subjected to XML parsing.
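The parsing step itself exploits the standardized layout of a species description: each labelled section of the text is wrapped in a matching XML tag. The following sketch assumes a simplified layout in which the first line is the species name and subsequent sections open with a standard heading ("Description", "Distribution"); the headings and tag names are assumptions standing in for those of a real monograph.

```python
import re
import xml.etree.ElementTree as ET

def parse_description(text):
    """Wrap each labelled section of a species description in an XML tag."""
    record = ET.Element("species")
    lines = text.strip().splitlines()
    # The first line is taken to be the species name.
    ET.SubElement(record, "name").text = lines[0].strip()
    for line in lines[1:]:
        # Section headings assumed to be "Label. text..." on one line.
        m = re.match(r"(Description|Distribution)\.\s*(.*)", line.strip())
        if m:
            ET.SubElement(record, m.group(1).lower()).text = m.group(2)
    return ET.tostring(record, encoding="unicode")

sample = """Terebratulina caputserpentis
Description. Shell ovate, finely ribbed.
Distribution. North Atlantic."""

print(parse_description(sample))
```

A production parser would recognize the full set of section headings used by a particular monograph, which is why individual parsers are tailored to the formatting conventions of the text being analyzed.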
It is inevitable that there will be increasing demand for more earth science information to be made available digitally, as this opens up new research possibilities, new methods of analysis, and new possibilities for collaboration. Valuable data that are currently only available in printed form, such as taxonomic descriptions of fossils, can be digitized rapidly and reliably using the methods described in this paper. It is significant, as well, that the technique utilizes printed text, in the original author's own words, that has been subjected to extensive peer review and editorial checking. Accordingly the data are of high quality, and free of the operator-induced artifacts that are introduced into databases by manual entry of information.
Electing to store the data in the semistructured XML format, rather than in the more common relational database, is a clear compromise; there are advantages and disadvantages to the approach. The clear disadvantage is that the lack of a predefined and agreed database schema means that guarantees on the quality of both data and query are weakened, and any results extracted through automated queries should be treated with less confidence.
The advantages of our approach, however, are numerous. In purely pragmatic terms, it requires substantially less human effort to acquire the digitized data, not just in the human cost of populating the database, which is considerable, but also in the cost of designing a suitable schema. Database folklore indicates that an all-embracing and long-lived schema is one of the most difficult design aspects of any project.
The advantages do not stop there, however; the other key point is that the original form of the data, the treatise itself, is not lost. The marked-up version allows radically different views to be presented—including, for example, queries over the whole collection—but the original appearance of the data always remains one of the most important views, as this is the form to which the professional community relates most strongly. Most important, the retention of every character of the data source means that it is impossible to lose any information from the original; this is never the case when transcribing any information source into a database schema, where judgments must continually be made about which information to keep, and which to lose. Finally, should a later decision be made to create a database for a partial high-quality data resource, the structure already imposed upon the XML resource makes it a more useful place to start from than the original text.
There are other advantages in developing protocols to handle semi-structured text. A major issue in the digitization of information from text is the incompatibility between the flexibility of human language and the inflexibility of computers. For example, language has a great variety of ways of describing features; an elongate shell could be described as oval, ovate, elongately oval, ovoid, oviform, ellipsoidal, elliptical, or egg-shaped. Similarly, a specimen of considerable size could variously, and entirely justifiably, be described as big, large, great, maximum, massive, huge, enormous, king-sized, colossal, gigantic, oversized, and substantial. The richness of language allows a huge range of descriptions, each of which conveys subtly different information depending on the context.
Searching a database for oval shells, or large shells, would require searches using all the terms mentioned above, and even then some records would probably be overlooked. One possible way of dealing with this issue would be to try to restrict the range of terms used to describe shape or size, but this approach will almost certainly fail. Previous experience has demonstrated that attempts to impose such a nomenclature scheme will inevitably not include the full range of terms that taxonomic authors will want to use to convey the often subtle variations they observe. A more pragmatic approach would be to continue to let individual scientists use the full range of their preferred descriptors, but then to carry out a retrospective survey of the range of terms used so that all subsequent searches are as comprehensive as possible. This is another area in which XML tagging would be particularly useful, as it provides an extremely rapid method of compiling a list of all descriptive terminology actually used in a study to describe size, shape, or any other morphological feature, without having to enter all the data into a database.
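Such a retrospective survey reduces to collecting the distinct contents of one tag across the whole corpus. The sketch below assumes a tag named shape; the tag name and the sample terms are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative tagged corpus; <shape> is an assumed tag name.
xml_doc = """
<descriptions>
  <species><shape>ovate</shape></species>
  <species><shape>elongately oval</shape></species>
  <species><shape>ovate</shape></species>
</descriptions>
"""

def survey_terms(xml_text, tag):
    """Return the sorted set of distinct terms authors placed in a given tag."""
    root = ET.fromstring(xml_text)
    return sorted({el.text for el in root.iter(tag)})

print(survey_terms(xml_doc, "shape"))  # ['elongately oval', 'ovate']
```

The resulting vocabulary can then be used to expand any subsequent query for, say, oval shells into the full set of synonyms that authors actually used.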
We are very grateful to the publishers of the Treatise on Invertebrate Paleontology (Geological Society of America, the Paleontological Institute, and the University of Kansas Press) for granting permission to use a taxonomic description from volume 2 of the revised Brachiopod Treatise (part H), and for their comments on this paper. In addition, we are grateful for the cooperation of the Linnean Society of London. We also gratefully acknowledge receipt of a UK Biotechnology and Biological Sciences Research Council/Engineering and Physical Sciences Research Council (BBSRC/EPSRC) bioinformatics grant (BIO12052).