SAA 2000
As the potential (and the need) for electronic data publication became apparent during the mid-1990s, there was an immediate and predictable rush to produce digital data resources to be offered via the then-new "World Wide Web." NSF’s Virtual Libraries initiative was typical of these early efforts – massively funded projects designed to rapidly produce large volumes of digital data. While these efforts contributed greatly to the availability of large amounts of online data, usage has often fallen short of expectations because of the learning curve required to navigate and make use of archived data. Whereas "if you build it, they will come" seemed to dominate the philosophy behind many pioneering projects, recent trends in grant awards seem to indicate a greater emphasis on the usability of data archives. In short, there is an increasing need for the delivery of electronic information to assume the product-oriented perspectives already in use by conventional publication media. This involves targeting audiences and producing applications that speak to specific needs.
At last year’s SAA meeting one of us outlined a model of informatics in archaeology that saw developments at multiple levels contributing to an overall data infrastructure for the 21st century (McCartney 1999). In this paper, we wish to explore more closely the problem of developing applications that make archived data more accessible to the actual audiences that wish to use them.
Providing user-friendly interfaces to archaeological data requires consideration of a number of issues. One common problem is the matter of ensuring the security of site locations and/or of information that has been determined sensitive to certain special interest groups, such as Native American tribes. Administration of Internet resources where access must be controlled can complicate the design of both server systems and client applications. Technical constraints, particularly those at the client end such as outdated hardware and software, firewalls, and low bandwidth, are also major factors in deciding what should go online and how it is to be accessed. An even greater challenge is ensuring the comprehension of data. Users vary greatly in their familiarity with the spatial, temporal and topical context of archaeological data – a list of phase names might be second nature to one archaeologist but foreign to another. Users also vary in terms of their technical ability to make use of data – information made available as a downloadable ArcInfo export file might also be presented as a simple interactive map to a more GIS-challenged user.
To effectively make decisions about how to deal with these issues requires consideration of target audiences and some appreciation for the likely range of variation among those groups. For most data archives the primary audience is the archaeological research community itself. While we might expect such users to be more prepared to deal with "raw," less synthesized data, scientific queries are also likely to be fine-grained and extensive in scope. This calls for search engines that can search across many datasets and provide a query syntax that is rich enough to probe deep into a dataset once it has been located. Education represents a vast audience for archaeological data with very different sorts of needs. The complexities of raw data must be reduced and delivered via interactive applications or highly structured tutorial exercises. Effective preservation of historic resources requires that archaeological information be accessible to resource management and planning. Existing tools for conveying site and survey data are often constrained by primitive reporting tools, lack of inter-regional standardization and security limitations.
In many cases, a common interface mechanism can be adapted to serve multiple interfaces. A good example is the use of map interfaces and spatial query. A research application might use a map interface for query by overlay or tracing a boundary. An educational application might display data via a thematic map rather than in a table. For planning purposes, an interactive sensitivity map might summarize sensitive data according to coarse-grained units such as planning units, land ownership boundaries, or sections, thus providing useful management information without actually releasing any primary data.
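The coarse-grained summary used by a sensitivity map can be sketched in a few lines. The following is only an illustration under stated assumptions: the site records, the unit identifiers, and the function name `summarize_by_unit` are all hypothetical, and each site is assumed to have already been assigned to a planning unit.

```python
from collections import Counter

def summarize_by_unit(sites):
    """Count sites per coarse unit; exact coordinates are never returned."""
    counts = Counter(site["unit"] for site in sites)
    return dict(counts)

# Hypothetical records: the coordinates are sensitive, the unit ids are not.
sites = [
    {"id": "S1", "x": 512.3, "y": 88.1, "unit": "unit-12"},
    {"id": "S2", "x": 530.9, "y": 91.7, "unit": "unit-12"},
    {"id": "S3", "x": 701.2, "y": 40.5, "unit": "unit-13"},
]

summary = summarize_by_unit(sites)
print(summary)
```

Only the per-unit counts would be passed to the map display, so a planner can see where sites are concentrated without the primary locational data ever being released.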
The key to bridging data and user applications is metadata. Metadata are data that document a data resource, informing a user or an application about its contents. Those who recall the floppy disks containing data files that appeared in the mid-1980s as inserts in the back of archaeological reports will also recall that their value was often undermined by the limited guidance provided for reading the files. That missing information – the descriptions of column names, data types, measurement units – is part of what we now refer to as metadata. Metadata serve a variety of purposes. Minimally, metadata identify a dataset through descriptors such as title, creator, and publisher. These, plus other descriptors like keywords, geographic location, and temporal indicators, can also assist in the discovery of a dataset. Filenames, Internet addresses (URLs), and file format descriptors enable access to a data resource. Still greater depth of information, such as research protocols and sampling procedures, permits more extensive actions like interpretation or actual synthesis with other datasets.
The more complex the action we wish to perform with data, the more information is required from the accompanying metadata. Many of the existing content standards that have been developed for metadata are organized into modules of increasing depth that correspond more or less to these sequential levels of detail (Michener et al. 1997). Several of these standards are relevant to archaeological data. The Dublin Core standard was designed as a broadly applicable element set for describing virtually any electronic data resource available over the Internet (http://purl.org/dc). Supported by the library community, it provides basic attributes such as title, creator, and keywords to both identify and help discover a dataset. Closely related to the Dublin Core standard is the Council for the Preservation of the Anthropological Record's (CoPAR) standard for anthropological collections. This standard adds anthropologically significant information such as culture names, time periods, etc., to provide more discipline-specific search capabilities (http://archaeology.asu.edu/copar). The Federal Geographic Data Committee (FGDC) content standard for geo-spatial metadata is the most comprehensive published standard that is in widespread use (http://www.fgdc.gov). The inclusion of sections covering spatial provenience, file distribution, table and column information, quality control, and research context provides documentation that supports actual access to and use of the data. A uniquely archaeological metadata standard, specifying content for archaeological survey and site inventory data, is currently undergoing final revision by an FGDC funded committee (http://colby.uwyo.edu/fgdcncptthome.html).
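To make the identification and discovery level concrete, a minimal Dublin Core style record might be written as follows. This is a sketch only: the wrapper element `metadata`, the helper name `dublin_core_record`, and all field values are placeholders assumed for illustration. The namespace is the Dublin Core element set, in which keywords are carried by the `subject` element.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build an XML element holding one dc:* child per descriptor."""
    record = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

# Placeholder values, not taken from any real dataset.
record = dublin_core_record({
    "title": "Example survey dataset",
    "creator": "Example Project",
    "subject": "archaeological survey",
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A record at this depth is enough to identify and discover a dataset; the table, column, and quality-control sections of a fuller standard such as FGDC would be needed before an application could actually read the data.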
Applications built around these metadata standards are generally limited to providing search functions. FGDC compatible metadata can be searched via standard Z39.50 search applications using query syntax similar to that used in library catalog searches – thousands of environmental datasets are published via a national clearinghouse network. ASU is participating in a Wenner-Gren funded project to develop a similar network for anthropological collections based on the CoPAR metadata standard.
To date, there are few applications that draw upon metadata to perform more advanced tasks associated with accessing archived data. In many cases, when metadata are available, they are still in text format, meant to be read by humans rather than processed by computer programs. Applications that deliver data via a web or other interface are generally proprietary, with the necessary connection, table and column structure hard-coded into the program. Such applications are thus costly to develop and not very portable either to other datasets or to other target audiences.
A major bio-informatics project recently started at ASU seeks to address some of these issues by developing data access tools that rely heavily on extensive metadata that have been encoded in machine-readable format using eXtensible Markup Language (XML). By working in collaboration with several other national-level projects to establish standards for the content and exchange of metadata, we hope to build an infrastructure of flexible and generic data access components with which end-user interface applications can be designed. Figure 1 illustrates the basic philosophy of application design in which highly structured metadata are combined with a template set of desired application properties to produce interfaces for different target user audiences. The archaeological data project described in this paper is being carried out in close parallel with this broader, environmental project and will draw extensively from this technical infrastructure.
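The design philosophy of combining structured metadata with a template of application properties can be illustrated with a small sketch. Everything here is assumed for the sake of the example: the XML layout, the column names, the template keys, and the function `build_interface` are hypothetical and do not describe the actual ASU components.

```python
import xml.etree.ElementTree as ET

# Hypothetical machine-readable table description.
METADATA_XML = """
<table name="tracts">
  <column name="tract_id" label="Collection tract"/>
  <column name="sherd_count" label="Sherd count" units="count"/>
  <column name="land_use" label="Modern land use"/>
</table>
"""

def build_interface(metadata_xml, template):
    """Combine column metadata with a template to describe one interface."""
    table = ET.fromstring(metadata_xml)
    fields = []
    for column in table.findall("column"):
        if column.get("name") not in template["show"]:
            continue  # this audience does not see the column
        label = column.get("label")
        if template.get("show_units") and column.get("units"):
            label = f"{label} ({column.get('units')})"
        fields.append({"column": column.get("name"), "label": label})
    return {"title": template["title"], "table": table.get("name"),
            "fields": fields}

# Two templates, one metadata source: nothing about the table is hard-coded.
research = build_interface(METADATA_XML, {
    "title": "Research query",
    "show": {"tract_id", "sherd_count", "land_use"},
    "show_units": True,
})
education = build_interface(METADATA_XML, {
    "title": "Classroom map",
    "show": {"sherd_count"},
})
print(research)
print(education)
```

Because the table and column structure comes from the metadata rather than from the program, pointing the same function at a different dataset or a different audience would require only a new metadata file or a new template.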

Archiving the Teotihuacan Mapping Project Database
In the 1960s and early 70s the Teotihuacan Mapping Project (TMP), directed by René Millon, accumulated a vast amount of diverse kinds of data on the ancient city of Teotihuacan, in highland central Mexico (Millon 1973, 1981; Millon, Drewitt, and Cowgill 1973). These data include a base map of the site itself; a map of archaeological interpretations; field notes, sketches, and photos of traces of ancient structures and other features visible on the surface; notes about vegetation, land use, and other modern aspects of sites; tabulations of categories of ceramics and other materials collected; and the curated collections themselves. Collections were made and observations recorded for each of about 5500 "collection tracts" in a roughly 30 square km area that included the city and its immediate surroundings. Over 900,000 objects were collected and tabulated. Study of these materials has continued ever since, as parts of general and special-purpose analyses.
In 1999 we received an NSF grant for the purpose of getting data collected by the TMP into forms suitable for wide distribution to users and sufficiently well-documented so that researchers not intimately acquainted with Mapping Project concepts and methods could make intelligent use of them.
Three major products are being undertaken as part of the current grant. One is a data archive, organized and stored using modern GIS and relational database software. The second consists of the documentation (metadata) needed to make appropriate use of the data. The third is a set of user-interface tools that provide a means for accessing both data and metadata.
Data
The original data recording forms and field methods used by the TMP were designed without electronic methods in mind. However, a pilot project to create electronic files began in 1965, using some of the primitive technology then available (Cowgill 1968, 1974). Since then, development of electronic files has proceeded and continues alongside ongoing analyses of the collected objects and non-electronic records. By now, several specialized files have been created for various classes of objects. The three most important are "DF8," which codes a wide range of observations recorded for each collection tract and initial tabulations of artifact counts according to fairly broad categories; "REANS," which provides updated and more detailed tabulations of ceramics categories; and "MF2," which is a digitized version of the archaeological map. In these files, each "case" or "record" corresponds to one of the original collection tracts or occasionally to two or more spatially contiguous and closely-related tracts. In the final versions of these files there are slightly fewer than 5000 records. Part of our work involves identifying data errors not previously caught and reconciling occasional discrepancies among the files. Concern for accuracy of the files is indispensable, but fairly straightforward conceptually.
Metadata
More interesting and more important is documenting the data – explaining the concepts and methods by which the data in the files were arrived at. Tersely worded code-books are useful for researchers already familiar with the project, but are far too sketchy and ambiguous for other scholars, as well as for more general audiences. To make the files genuinely available to others, far more is needed than simply putting the cleaned-up files on the Internet. One must provide sufficiently rich, detailed, and exact background information, the kind of "data about data" known as metadata. It is also important to provide users, especially those who are not very sophisticated about electronic systems, with easy and relatively misuse-resistant ways of getting around in the system – locating desired kinds of data, displaying useful kinds of data summaries, and receiving warnings about some kinds of likely mistakes.
We will build a database structure that will accommodate metadata conforming to standards developed jointly with related projects at ASU, and with additional documentation appropriate to the TMP project. Some of the metadata items pertaining to specific variables will be linked to the user interface described below. Other metadata items pertain to specific cases (records), such as unusual situations or problems encountered in the survey, publications dealing with this case, and so on. This record-level metadata will probably be organized as a table within the metadata database, set up so as to be easily linked to cases in the main TMP descriptive database.
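One way such a record-level metadata table could be linked to the main descriptive table is sketched below using an in-memory SQLite database. The table names, column names, and values are hypothetical placeholders and do not reflect the actual TMP file structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracts (
    tract_id TEXT PRIMARY KEY,
    sherd_count INTEGER
);
-- Record-level metadata: zero or more notes per tract.
CREATE TABLE record_notes (
    note_id INTEGER PRIMARY KEY,
    tract_id TEXT REFERENCES tracts(tract_id),
    note TEXT
);
INSERT INTO tracts VALUES ('A1', 120), ('A2', 45);
INSERT INTO record_notes (tract_id, note)
    VALUES ('A1', 'Placeholder note about an unusual survey situation');
""")

# A LEFT JOIN keeps every tract, with or without attached notes.
rows = conn.execute("""
    SELECT t.tract_id, t.sherd_count, n.note
    FROM tracts AS t
    LEFT JOIN record_notes AS n ON n.tract_id = t.tract_id
    ORDER BY t.tract_id
""").fetchall()

for row in rows:
    print(row)
```

Keeping the notes in their own table means that a case can carry any number of annotations, or none, without altering the descriptive table itself.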
The remainder of the metadata, such as general information about the TMP and the history and lineage of the various electronic datasets will be accessible through a series of forms included as part of the main user interface.
User interface
There is a wide range of potential user audiences that we could target in our work; we have elected to begin by building interface tools designed principally to serve the academic research needs of specialist archaeologists who work in Mesoamerica. This is the user community most familiar to us, and where we expect our applications will receive the most immediate use. We recognize, however, that this is only the first step in creating a platform that will eventually serve much broader needs, including education and public outreach. Our data and metadata structures are flexible in design and can accommodate new kinds of interface without requiring serious modification. All of our tools will run on widely-available desktop computers without requiring any unusual hardware or software, and they will be designed to be relatively easy to upgrade as new machines and systems become widespread. Figure 2 illustrates a general schema for accessing the TMP data and metadata.

We have chosen to emphasize scientific visualization (http://www.mimas.ac.uk/argus/Research/WYSIWYG.html) as a significant component of our interface designs. Because of the highly spatial nature of the data, web-based mapping software will allow users to access attribute data via a "zoomable" map of the city. A smaller "orientation" map will be provided as a means of rapidly moving around the larger map. Records of interest identified on the latter map will be selectable on a single-case basis, by using one of various kinds of interactive selection tools to identify larger groups of cases, or even by digitizing more complex selection boundaries directly on the viewing map.
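Selection by a digitized boundary reduces to testing which tracts fall inside a user-drawn polygon. The sketch below uses the standard ray-casting test on tract centroids; it is an illustration only, with hypothetical tract identifiers and coordinates, and a production mapping package would normally supply this operation itself.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray running in +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def select_tracts(centroids, polygon):
    """Return ids of tracts whose centroid lies inside the polygon."""
    return sorted(tract_id for tract_id, (x, y) in centroids.items()
                  if point_in_polygon(x, y, polygon))

# Hypothetical tract centroids and a digitized rectangular boundary.
centroids = {"A1": (1.0, 1.0), "A2": (3.0, 1.0), "A3": (5.0, 5.0)}
boundary = [(0, 0), (4, 0), (4, 2), (0, 2)]

selected = select_tracts(centroids, boundary)
print(selected)
```

Testing centroids is a simplification; whether a tract that is only partly enclosed should count as selected is a design decision the interface would have to make explicit.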
In addition to identifying records of interest, users need to be able to select which variables will be included in output tables. This will be via a simple form, organized around broad categories of information such as ceramics, lithics, architecture, and so on, in which individual variables can be selected by means of check boxes. Access to basic kinds of variable-level metadata will be available directly through this form. For example, a button next to "Metepec Phase ollas" on the variable selection form will bring up a window describing basic information about that variable, including typological criteria, known problems, references to principal relevant literature, and so on.
We will provide a means for summarizing data on selected variables across records when a user selects more than one case. These will employ user-selected operators such as sums, means, medians, standard deviations, mid-spreads, and the like. It will be possible to calculate densities, as well as other kinds of indices based on ratios of different artifact types. Mechanisms will be set in place to prevent users from attempting to harvest data in ways that don’t make sense, such as trying to compute sums of ordinal scale variables or sums or means of nominal scale variables. In other cases where an operation is not strictly illogical but is dubious, a warning will be generated. Selection boxes for various summary operators can easily be built into the variable selection form. Tabular output on variables may be returned to the user in several formats, such as XML, HTML, and ASCII.
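A guard of this kind can be driven by the measurement scale recorded in the variable-level metadata. The following sketch is illustrative only: the operator names, the function `summarize`, and the sample values are hypothetical, and treating the mean of an ordinal variable as the "dubious" case is our assumption for the example rather than a rule stated above.

```python
import statistics
import warnings

OPERATORS = {
    "sum": sum,
    "mean": statistics.mean,
    "median": statistics.median,
    "stdev": statistics.stdev,
}

# Combinations that make no sense and are refused outright.
FORBIDDEN = {("ordinal", "sum"), ("nominal", "sum"), ("nominal", "mean")}
# Combinations allowed with a warning (assumed for illustration).
DUBIOUS = {("ordinal", "mean")}

def summarize(values, scale, operator):
    """Apply a summary operator after checking it against the scale."""
    if (scale, operator) in FORBIDDEN:
        raise ValueError(f"{operator} is not meaningful for {scale} data")
    if (scale, operator) in DUBIOUS:
        warnings.warn(f"{operator} of {scale} data should be read with care")
    return OPERATORS[operator](values)

# Hypothetical sherd counts for three selected tracts.
counts = [12, 7, 30]
total = summarize(counts, "ratio", "sum")
print(total)
```

Because the scale comes from metadata rather than from the interface code, adding a new variable to the archive would bring its restrictions with it automatically.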
We will eventually add more depth to our interface by adding advanced features such as interactive, linked displays of graphical and map-based spatial information that will encourage users to explore the TMP data in depth, using visualization throughout the entire research process – not just as a way of displaying results obtained by other, non-graphical means. One attractive model for this is a program called cdv (cartographic data visualizer - http://www.geog.le.ac.uk/jad7/BCS/paper.html) that provides a battery of tools for exploratory data analysis – such as box-plots, scatter plots, polygon maps, and graduated circle maps – that make it possible to discover really subtle patterns in complex geo-spatial data sets. The current version of cdv is too slow to be a practical research tool for a database as large and extensive as the TMP. Some illustrations based on cdv output are provided, however, to give a sense of what we have in mind. Figure 3 is a view of Teotihuacan, thematically mapped to allow a user to visualize simultaneous variation in two variables (in this case, measures of neighbourhood diversity for two consecutive phases of occupation). Figure 4 shows how colour differences exhibited by Figure 3 reflect variation in these two variables. Figure 4 serves as a kind of key for Figure 3, and is dynamically linked to it, so that individual cases in either display can be identified in the other.


Finally, one of the considerations that we face while designing user interfaces to the TMP data is the hope that many of its users will be Spanish speakers. Research at Teotihuacan is international in character; it is important that our efforts accommodate the interests and needs of Mexican colleagues, and eventually, the Mexican public. In addition to bilingual interface tools, metadata will be created in both Spanish and English.
We have provided an overview of how we think the design of interfaces between data archives and users is best accomplished. We have also described our current plans for implementing an interface to a significant archaeological data resource. Our challenge over the next year will be to accomplish the interface goals we have set for the TMP data by adhering as much as possible to general design guidelines, which call for extensive use of standardized metadata and open development tools. We anticipate future efforts both to build new interfaces for the TMP database and also to expand the data archives that can be accessed with the existing applications.
Cowgill, George L.
1968 Computer Analysis of Archeological Data from Teotihuacan, Mexico. In New Perspectives in Archeology, edited by S. Binford and L. Binford, pp. 143-150. Aldine, Chicago.
1974 Quantitative Studies of Urbanization at Teotihuacan. In Mesoamerican Archaeology: New Approaches, edited by N. Hammond, pp. 363-396. Duckworth, London.
McCartney, Peter H.
In Press. Long-term Management and Accessibility of Archaeological Research Data. In Proceedings of SAA Symposium on Delivering Archaeological Data over the Internet, edited by Mary Carroll and Harrison Eiteljorg II. National Center for Preservation Technology and Training, National Park Service.
Michener, W.K., J.W. Brunt, J.J. Helly, T.B. Kirchner and S.G. Stafford.
1997 Non-geospatial metadata for the ecological sciences. Ecological Applications 7(1):330-342.
Millon, René
1973 The Teotihuacan Map. Part One: Text. University of Texas Press, Austin.
1981 Teotihuacan: City, State, and Civilization. In Supplement to the Handbook of Middle American Indians, Volume One: Archaeology, edited by V.R. Bricker and J.A. Sabloff, pp. 198-243. University of Texas Press, Austin.
Millon, René, R. Bruce Drewitt, and George L. Cowgill
1973 The Teotihuacan Map. Part Two: Maps. University of Texas Press, Austin.
Presented in the April 2000 SAA session "Digital Data: Preservation and Re-Use."