Susan C. Jones
It is no longer rare for digital data to be created by archaeological projects during both data gathering and analysis. These digital data are stored as part of project archives, and as such may (and many would argue should) be shared with the research community. Indeed, today's digital data formats facilitate the process of sharing data; they can be made standard, compact and transportable. Consequently, the web is becoming an important vehicle to distribute detailed archaeological information to an interested audience.
There are two basic approaches to the dissemination of archaeological project data over the Internet. One of these is online access to query routines through a web page, an approach that minimizes the computer expertise required of researchers. They submit queries and receive the data that meet the specified criteria, but they never have access to all the data with a single query. Only that portion of the data that meets the submitted criteria is retrieved and presented with each query. While this approach allows users quick and easy data access, it requires a great deal of thought and effort on the part of the provider to organize that access.
The other approach is to provide the digital data in a series of downloadable files. In this approach, the providers have the simpler job. They need only make the data available for download in a compatible digital format. Organizing access to that data falls to individual users, since all users must work with the data on their own hardware and software. After downloading files, they must first reconstruct the database(s). Only then can they query the data and analyze the responses. Of course they can also browse the data in relatively unstructured ways since they have the complete data set on their computers.
What are the advantages of each system? The disadvantages? What must be presented in each case to make the data useful? What level of expertise is required to gain useful knowledge? Which is "best?" What cost and effort are associated with the methods?
For this comparison I have chosen one example of each type of presentation. The choices are idiosyncratic, but each is the best of its kind that I found. For an example of the online-access-to-queries approach, I have chosen the Digital Archaeological Archive of Chesapeake Slavery (DAACS) (http://www.daacs.org/); and for the downloadable-files approach the Danebury Excavations Digital Archive (http://ads.ahds.ac.uk/catalogue/projArch/danebury_var_2003/index.cfm), a part of the ADS/ARENA archive.
DAACS is an on-going project with a two-pronged goal: (1) to provide a multi-site archaeological database to facilitate research on slavery in the greater Chesapeake region, and (2) to provide a prototype of such databases to explore the potential for collaboration and data sharing among researchers who work in a specific area. DAACS is well-funded, having received a start-up grant from the Andrew W. Mellon Foundation, a $500,000 matching grant from the NEH, funding from the Thomas Jefferson Foundation (Monticello), and generous gifts from private donors.
DAACS started with sites from piedmont Virginia, specifically Monticello, Mount Vernon, Stratford Hall, and several sites in the Williamsburg area. More sites are scheduled to be added to the database in the near future, including several more dwellings at Monticello. Currently there are detailed records for over 200,000 artifacts and 4,500 excavation contexts from 11 sites [DAACS site, http://www.daacs.org/aboutDatabase/structure.html, 9 Aug 2004]. Interpretive data is presented about each site, including an historical overview, site plans, photographic images, excavation protocol, and Harris matrices. The database includes data tables on sites (site number, geographic coordinates, excavation dates, excavators), contexts (feature types, deposit types, stratigraphic groupings), and artifacts (ceramics, beads, buckles, and miscellaneous small finds).
From the DAACS home page, access to all basic functions -- methods for searching the data, the necessary explanatory information, and interpretive material -- is available at the click of a mouse. Specialists may be able to search the database without help, but most users will need the explanatory information to construct effective searches and the interpretive material to analyze the data returned by those searches. In this approach, the search mechanism is the key; it presents that portion of the underlying data that satisfies the request. Researchers do not have access to the primary data, only to subsets of data that satisfy the query specifications. Because users do not have direct access to the raw data files, there is no need for detailed explanations of the internal database structure. The database is a "black-box." The number of individual files, their structures, and their interrelationships are of no concern to the researcher. The querying software automatically uses these relationships to return intelligible information. The researcher is presented not with a response of the type "bottles, count=5, material=gl, feature=11," but of the type "5 glass bottles found in feature 11, a post hole along the east-west fence location north of building L at Monticello."
Though the DAACS database has a complex internal structure of over 200 separate tables, the researcher need never contend with deciding which tables are relevant to any particular query. On the web site, she is led through a series of menus that frame the actual query to the database; at each level, she must select one of the presented parameters before the next menu will appear. The initial menu begins with a choice of a query on faunal data, mean ceramic dates, artifacts, or contexts. Depending on the initial choice, further selections are presented. For example, the choices under artifact queries are: general inventory query (artifacts by site), inventory by context (artifacts in a given type of context at a site), artifact type by site, and artifact attribute by context/site. Once all query parameters have been filled in, the query is submitted and results are returned. Any number of queries may be made during a single browser session and the results from individual queries can be accumulated in a cache on the host computer. Whether the results of any given query are added to the cache is under the researcher's control, and she may also review and edit the contents of the cache at any time during the session. Both the actual Structured Query Language (SQL) queries and their results are cached. At any point in the session, the cache may be downloaded to the researcher's computer on request. It is cleared from the host computer only when the session is over.
A primary limitation of this type of data presentation is that the data can only be approached through the structure imposed by the presenters of the database. In the case of DAACS, this structure is by site. To compare artifacts in contexts across sites, 11 separate queries, one for each site, must be submitted with the same set of criteria, and the results cached. Obviously this is a cumbersome procedure, but the task is not insurmountable. The procedure is error-prone simply because of the sheer number of queries that must be identically constructed. The cached query results, however, are available to be downloaded. If the researcher wanted to create a database, on her own computer, of the bottles recorded in the DAACS database, she could set up a sequence of queries, cache the results of the individual queries, examine the cache to be certain that she has performed all the required queries, then download the cache. The downloadable cache is in the format of tab-delimited TXT files with field names in the first line so that files in the cache can be imported into the researcher's own database software. Note that the actual SQL-formatted queries are cached in a separate file.
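Once the cache has been downloaded, the cross-site comparison described above can be checked and assembled mechanically. The following is a minimal sketch, assuming one tab-delimited result per site with field names in the first line. The site labels, the field names (`Form`, `Material`, `Count`), and the sample values are hypothetical and are not taken from the DAACS documentation.

```python
import csv
import io

def read_tab_delimited(text):
    """Parse tab-delimited text whose first line holds the field names."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def missing_sites(expected_sites, results_by_site):
    """List the sites for which no cached result is present."""
    return sorted(set(expected_sites) - set(results_by_site))

def combine_site_results(results_by_site):
    """Merge per-site results into one list, tagging each row with its site."""
    combined = []
    for site, text in results_by_site.items():
        for row in read_tab_delimited(text):
            row["Site"] = site  # hypothetical field added for comparison
            combined.append(row)
    return combined

# Hypothetical cached results for two sites (illustrative values only).
sample = {
    "Site A": "Form\tMaterial\tCount\nBottle\tGlass\t5\n",
    "Site B": "Form\tMaterial\tCount\nBottle\tGlass\t2\nBead\tGlass\t7\n",
}

not_queried = missing_sites(["Site A", "Site B", "Site C"], sample)
rows = combine_site_results(sample)
bottles = [r for r in rows if r["Form"] == "Bottle"]
total_bottles = sum(int(r["Count"]) for r in bottles)
```

In practice the text for each site would be read from the downloaded TXT file rather than typed in. Checking for missing sites before combining addresses the main risk noted above, namely that one of the identically constructed queries was skipped.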
Documentation about the DAACS database and its contents is extensive. It includes field names and allowable entries, glossaries, definitions of color terms used, catalog protocols for each artifact category, etc. There is also documentation of relationships among the primary data tables. All of these "internal" details are irrelevant; results, not procedural details, are what the researcher gets. On the negative side, multiple records generated by a query may replicate data unnecessarily. To return to the previous example, the query for all attributes of all glass artifacts found in all feature types at building L at Monticello produced records for 5 bottles in feature F11. In each record, feature F11 is identified as "a post hole along the east-west fence line located north of building L and belonging to feature group FG03." This replication is not necessarily bad, but it does add to processing time.
From the presenters' perspective, this type of online-accessible database is expensive to build and maintain. The entire burden of organization and maintenance of data is placed upon the presenter; the level of computer and archaeological expertise required by the personnel mounting and maintaining the database is high, while that of the user can be low. This burden includes migration issues, as well as archival ones.1 It is obvious from the list of individual donors and grants that DAACS is well-funded and probably has promises of continuing support, both public and private, so the expense can be managed. I will return to cost issues below.
From the researchers' perspective, this type of online-accessible database has the advantage that no expertise in database design is necessary to gain meaningful access to the data. The site provides a professional appearance, easily understood menus, and wide access to its information. The level of archaeological knowledge that the researcher must bring to the data is not burdensome either. Meaningful information is accessible to individuals with a wide variety of backgrounds -- from secondary school students to academics within and outside the field. This last point needs to be emphasized. Data from DAACS is of interest to academic researchers outside the archaeological community, such as cultural anthropologists and historians. DAACS provides easily accessible, non-jargon-laden data. Researchers are able to interpret the data at their own levels of understanding; data about the collection and classification of artifacts is available on demand, but details (e.g., specific mean-ceramic-date types used to assign dates; exact Munsell color numbers included under the description "red, light, muted") may be ignored. Such information is not required in every analysis of the physical evidence of slavery in the area. (This is one of the stated goals of the DAACS Project. "Our goal is to help scholars from different disciplines use archaeological evidence to advance our historical understanding of the slave-based society that evolved in the Chesapeake during the colonial and antebellum periods." [http://daacs.org/aboutDAACS/, accessed 20 Aug 2004.])
The second approach to disseminating digital data is to provide the primary data as a series of downloadable files that allow researchers to reconstruct the original digital database(s). Downloads include data files, documentation about file formats and contents, relationships among files, and other analytical material used by the project. In this approach the users must construct access methods to the data on their own computers. The website provides the primary data and users provide the organized access according to their own needs and skills. This is the approach of the Danebury Project archive.
Danebury is an Iron Age Hill Fort situated in the county of Hampshire in southern England. Although the site was occupied before and after the Iron Age (7th or 6th century BC through the 1st century AD), the most extensive occupation was during this period. Excavations conducted between 1968 and 1989 by Professor Barry Cunliffe and the University of Oxford revealed extensive earthwork ramparts surrounding over 500 man-made structures and approximately 2300 pits. There was an equally massive assemblage of artifacts -- some 158,000 sherds and 241,500 animal bones. The archived Danebury Project database gives access to this data. One of the first large-scale sets of computerized excavation data, it was built in the 1980s and has undergone several migrations to allow access by current software. In its current incarnation, it is part of the AHDS/ARENA archive. Note that only a subset of the original data tables is available online because only the most important files were kept current through the hardware and software changes of the last 20 years. [Danebury Excavation Digital Archives, http://ads.ahds.ac.uk/catalogue/projArch/danebury_var_2003/index.cfm, accessed 20 Aug 2004.] Extensive documentation, outlining the meanings of particular fields, abbreviations and relationships between data tables, is provided in the downloadable files on the AHDS/ARENA web site. Hot-links are provided from the AHDS/ARENA page to the primary excavation publications, available as online PDF files from the Council for British Archaeology. These PDF files provide excavation details, illustrations, etc.
The Danebury Project digital archive comprises 14 data tables containing excavation data and a fifteenth containing descriptive details about project images available as individual JPEG files. It is an easy matter to click on individual thumbnails of these images to download them. These images may then be included as part of the user-constructed database. Excavation data include files for pottery, animal bones, pits, and daub. [Daub is a tempered clay commonly used on this site to make walls, ovens, hearths, and small, solid, round objects like spindle whorls, weights, and slingshots.] The files are comma-delimited text files with the field names in the first record. Other than a few misdirected links, I had no trouble downloading the files and building a database in Microsoft Access®. (When I contacted the ADS archive personnel about the links, the response was prompt and helpful. Data files were named and available as stated in the documentation, but the hot-link addresses had been mistyped.)
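Rebuilding a database from comma-delimited files with field names in the first record follows the same pattern in most database software. The following is a minimal sketch using Python's built-in SQLite module rather than Access. The table names, field names (`pit_id`, `depth_cm`, `sherd_id`, `fabric`), and sample records are hypothetical and do not reproduce the actual Danebury tables.

```python
import csv
import io
import sqlite3

def load_csv_table(conn, table, text):
    """Create a table from comma-delimited text; the first record names the fields.

    Every column is stored as TEXT, and the table and field names are quoted
    but otherwise trusted, so this is only suitable for files from a known source.
    """
    reader = csv.reader(io.StringIO(text))
    fields = next(reader)  # first record: field names
    columns = ", ".join(f'"{name}" TEXT' for name in fields)
    placeholders = ", ".join("?" for _ in fields)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return len(fields)

conn = sqlite3.connect(":memory:")

# Hypothetical stand-ins for two downloaded files.
pits_csv = "pit_id,depth_cm\nP1,120\nP2,95\n"
pottery_csv = "sherd_id,pit_id,fabric\nS1,P1,A\nS2,P1,B\nS3,P2,A\n"
load_csv_table(conn, "pits", pits_csv)
load_csv_table(conn, "pottery", pottery_csv)

# With the tables linked on a shared key, the user can pose any query,
# here the number of sherds recorded for each pit.
sherds_per_pit = conn.execute(
    "SELECT p.pit_id, COUNT(s.sherd_id) FROM pits p "
    "LEFT JOIN pottery s ON s.pit_id = p.pit_id "
    "GROUP BY p.pit_id ORDER BY p.pit_id"
).fetchall()
```

The join at the end is the part that the project documentation must make possible: without a description of which fields link which tables, the files can be loaded but not meaningfully related.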
With this type of presentation, access to the primary data is as complete as the files themselves. Researchers are not limited by the access paths envisioned by the data organizer/provider, but by the software that they use and their own expertise in using it. This is not necessarily a trivial limitation. Without some knowledge of browsers and common file storage conventions, I would not have been able to circumvent the missing hot links and download the necessary data files. Building an Access database was fairly easy since the downloaded files were created with it, but in one case I did need to interpret input error messages and specify input field formats to load the data.2 Constructing the database with older versions of FileMaker® required some preliminary data manipulation to incorporate the field names contained in the first record into the data tables; FileMaker version 7, however, is set up to process this type of file layout directly. These problems are minor, but could be daunting to an unsophisticated user.
The requirements for the documentation of downloadable files are extensive. Descriptions of files (field formats, numbers of records within tables, file linkages, etc.) are needed to provide verification that downloads were successful and allow the researcher to set up appropriate access paths. The Danebury Project's documentation had minor deficiencies. For example, there were no descriptions of field formats and record counts were not given, although file sizes were. However, the descriptions of field contents and relationships among the various tables were clear, so that explanations of the data codes used were easily incorporated in the on-screen presentation of query results.
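Where record counts and field formats are not documented, a user can at least profile each downloaded file before importing it. The following is a minimal sketch, assuming a comma-delimited file with field names in the first record. The sample data are hypothetical, and the 255-character figure is an assumption supplied here (the limit of a Text field in Access versions of that period); the article itself does not state it.

```python
import csv
import io

ACCESS_TEXT_LIMIT = 255  # assumed limit of an Access Text field

def profile_table(text, limit=ACCESS_TEXT_LIMIT):
    """Return the record count, the longest entry per field, and the fields over the limit."""
    reader = csv.DictReader(io.StringIO(text))
    longest = {name: 0 for name in reader.fieldnames}
    count = 0
    for row in reader:
        count += 1
        for name in longest:
            # A short record yields None for its missing fields.
            longest[name] = max(longest[name], len(row[name] or ""))
    too_long = sorted(name for name, size in longest.items() if size > limit)
    return count, longest, too_long

# Hypothetical file with one over-long free-text entry.
sample = "pit_id,notes\nP1,short note\nP2," + "x" * 300 + "\n"
record_count, longest, needs_memo = profile_table(sample)
```

A record count obtained this way can be compared with a published count, when one is given, to verify the download. The list of over-long fields shows in advance which fields would need a memo format rather than a text format.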
The costs and efforts to maintain and archive a downloadable database are less onerous than those of an online accessible one, but they do exist. The Danebury Project is a good case in point. Since it was created in the 1980s, it has seen several incarnations. The technology has changed significantly -- the widespread use of personal computers and the introduction of the web to mention two -- but data are still available. The Danebury downloadable files are currently in simple, easily migrated formats.3 Data are stored for the most part in generic formats (e.g., JPEG, comma-delimited ASCII text), and the proprietary formats used are common and easily accessible (e.g., Word DOC and Adobe PDF).
Keeping the data presentation accessible requires a long-term commitment of money and effort. Danebury has turned over this responsibility to the ADS, which counts maintaining public access among its commitments to the data in its care.
Extra burdens are placed on researchers to query the data in a downloadable presentation. Researchers must make a greater effort before they can see if the project has produced any data relevant to their own work. This preliminary hurdle will tend to restrict the community of researchers to those who have the greatest probability of finding value in the data -- in the Danebury case, archaeologists concerned with Iron Age Britain. Links to PDF files containing the Danebury Project's published excavation reports did provide a complete description of the project.
An advantage to a downloadable database is the ease with which data can be combined with data from other projects. This advantage is a corollary to the general flexibility of access. Researchers can manipulate the database design to accommodate the goals of individual projects. It is in combining the data that lack of documentation about the data becomes a glaring problem. Simple descriptive differences, such as the use of metric or English units in recording sizes, or registering quantities of pottery as simple sherd counts or bulk weights, can make comparisons invalid. Subtler differences, such as identification of a material by visual inspection in one database and by chemical analysis in another, are harder to detect, but can equally invalidate comparisons. A more insidious example is Roman copies of what are assumed to be originally Greek paintings and sculptures. Here some scholars identify them as Roman while others identify them as Greek. Searches of a database that combines descriptive data on such artifacts from two different sources could easily be misinterpreted. Of course, such issues are present whenever a researcher combines data from multiple projects, whether or not the data is digital.
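The simplest of these differences, the unit of measurement, can be handled explicitly when two data sets are merged, provided the documentation states which unit each project used. The following is a minimal sketch under that assumption. The field names (`length`, `length_mm`) and the sample records are hypothetical.

```python
MM_PER_INCH = 25.4  # exact by definition

# Conversion factors to millimetres for the units this sketch accepts.
FACTORS = {"mm": 1.0, "cm": 10.0, "in": MM_PER_INCH}

def to_millimetres(value, unit):
    """Convert a length to millimetres, refusing units that are not documented."""
    if unit not in FACTORS:
        raise ValueError(f"undocumented unit: {unit!r}")
    return value * FACTORS[unit]

def harmonize(records, unit):
    """Return copies of the records with a common 'length_mm' field added."""
    return [{**r, "length_mm": to_millimetres(r["length"], unit)} for r in records]

# Hypothetical records from two projects that recorded sizes differently.
project_a = harmonize([{"id": "A1", "length": 2.0}], "in")
project_b = harmonize([{"id": "B1", "length": 5.0}], "cm")
combined = project_a + project_b
```

Raising an error for an unknown unit, rather than guessing, keeps an undocumented measurement from entering the combined table unnoticed. The subtler differences described above, such as how a material was identified, cannot be resolved by code and still depend on the documentation.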
One other advantage to researchers deserves mention. Once the users have downloaded the database, they can be confident that the data will not disappear; such a disappearance is a very real possibility with any online resource.
I have examined two specific databases of archaeological projects among the hundreds that are available. They were chosen to be representative of the two major types of presentation that are common today -- those with online access to queries and those with downloadable files. Of course, there could be as many variations and hybrids of these presentations as there are individual archaeological projects, but these two illustrate the strengths and weaknesses of presentation methods that are independent of the actual content of the databases themselves.
The relative strengths of an online accessible database include:
The authors of an online-accessible presentation have taken a greater responsibility for providing the computer expertise to make the system behave in specific ways. In doing so they have hugely expanded the investment in time and money, and future changes in web standards or database standards will sooner or later make further large expenditures necessary.
By comparison, the strengths of presentations with a downloadable database include:
The presenters of downloadable databases have necessarily left the responsibility for using the data to the individual researcher, but have assumed the burden of providing adequate information about the data and file formats. As a result, the efforts associated with accessing the data may limit the research community which will use the data to the authors' peers. The expenses associated with maintaining access to the data through technological change (migration) are far less than with an online accessible database, but they do still exist.
This comparison leaves open the question of which presentation method is better. There is obviously not a single answer; every archaeological project is unique, and each has its own research agenda, goals, and constraints. These will determine which type of presentation is best for the project.
The "best" presentation method is dependent upon the goals, stability of the data, and resources of the project. If one of the goals is to present data to as wide an audience as possible, then the DAACS approach of online access to queries becomes the better option. This option, however, comes at the price of a major commitment of time and effort. The commitment is not only for current resources, but also for future resources to maintain and migrate data and its access as required. Conversely, the downloadable database approach appeals to a more limited group of researchers, but it takes fewer resources in both time and money to maintain access to the data files. It is also possible to find an archive, like ADS in the case of the Danebury Project, that will take over responsibility to keep the data available over the course of decades.
Careful consideration of an individual project's goals and budget must determine the best approach to provide its data to the research community.
-- Susan C. Jones
1. In the case of any web-based project, there will come a time when software on which the database resides will become obsolete. At that point, the effort to replace the elaborate software interface between the users and the database will be enormous. This migration effort will involve more than changing data formats, more than the simple version changes that every computer user has faced. It will potentially involve an effort equivalent to the original one of placing the DAACS on line.
2. The specific problem involved an Access quirk that treats text fields and memo fields differently. A field whose format is described as text has a maximum allowable number of characters; a memo field does not. Field formats were not specified in the Danebury documentation. Very few of the records had text-field entries that exceeded the maximum length, but those that did produced input errors when the files were imported into Access. Simply specifying the problem field as a memo field overcame the problem.
3. This may not always have been the case. I did not look into the cryptic comment that only tables associated with four types of artifacts survived changes in location and mainframe.
Table of Contents for the Fall, 2004 issue of the CSA Newsletter (Vol. XVII, no. 2)