CSA Newsletter
Vol. XVII, No. 3
Winter, 2005
      

Past, present and future: XML, archaeology and digital preservation

William Kilbride


The ADS was established in 1996, so as we start 2005 we begin to reflect on what has changed over the ten years we have existed. Many of the changes -- not least in terms of online delivery of data to students and researchers -- have been high profile and are easily recognisable. Perhaps the most important areas of development, however, have also been the least visible. The last decade has seen a radical transformation in digital preservation, to the point that we can now move from theory and planning into useful and significant action. That transformation also leads us to consider the shape of the next decade. XML -- the Extensible Markup Language -- provides us with a window on what will be needed. This is not simply a question of technology; it is also a challenge of how to get the right technology to the people who need it most.

Readers of the CSA Newsletter are on familiar territory with the problems of digital preservation, either through celebrated cases of what can happen when preservation is not planned, or through their own experiences. Cases like the Newham Archive (Austin et al., 2001) and the Domesday Disk (Darlington et al., 2004) are now familiar among those involved in digital preservation. Experience at the ADS shows that preservation works most easily when creators of digital resources anticipate a future audience and when inheritors of digital resources actively curate them. Planning for re-use is a simple slogan, but in the digital age that planning has to start at the point of data creation, not at the end of a project as happens with conventional archives. We are familiar with the reasons why we should preserve, and over the last ten years those reasons have, if anything, grown more compelling. Many grant agencies now either presume digital preservation in projects which they fund or, like UNESCO, have adopted digital preservation as a flagship issue. These external pressures merely confirm the discourses of the research community, in which data is handed on because that is what good researchers have always done -- either to facilitate further research, or to ensure information-based management.

Less familiar are the subtle and practical developments in digital preservation. It is fashionable to describe any website or file server as a 'digital archive', but a decade of research means we now know what a digital archive should look like: and it is not simply a file store or webserver. The Open Archival Information System (OAIS) reference model is now an ISO standard that describes the processes that need to be carried out by a digital archive. That work has been refined by the Research Libraries Group, which has outlined the characteristics of a 'trusted digital repository'. The Digital Preservation Coalition has scored some notable successes in raising the problem in political circles, while more recently the Digital Curation Centre has been set up to provide practical and effective advice.

Work at the ADS has mirrored these developments. The outward face of the ADS has been concerned with presenting standards for data creators through tools like OASIS, practical advice like the Guides to Good Practice, and a vast number of workshops. At the same time our own working practices have been the subject of continued refinement and documentation. The resulting audit trail provides a basis for quality assurance. Concomitant work continues to refine other issues that may not have been clear at the outset. A robust rights management framework has been developed, and we can begin to model more accurately the long-term costs of preservation -- which are insignificant compared to the costs of not preserving. Simultaneously we begin to understand not just the users of digital archives, but how different sorts of resources elicit different types of user behaviour. In real terms, digital preservation at the ADS has gone from theory into practice.

The last few years have seen a distinct refinement in the methods and practices of digital preservation. The next decade is likely to see the extension and implementation of these practices. If the last decade has been about research and development in preservation services, the next decade ought to be about implementation. For this to happen, it ought to be easier for archaeologists to create data in formats that are already fit for preservation. XML provides an example of such a format and how, in practical terms, the archaeological community can fit a common format to its purposes.

Documents like this one on the Web use the Hypertext Markup Language (HTML) to permit browsers to display the document. Commands -- in simple text form -- are included in the text to call for italics, boldface, a new paragraph and so on. Browsers recognize the commands and display the document accordingly. Other commands -- also in simple text form -- direct the browser to bring up images (stored in separate files) or to call up another document. In general, the HTML commands refer to matters of appearance or linkage (images or other documents).

Published as a W3C Recommendation in 1998, the Extensible Markup Language promises a paradigmatic shift in Internet-based resources. Heralded as the backbone of the semantic web, XML goes beyond the Hypertext Markup Language (HTML) by letting users define their own mark-up tags and the meanings of those tags, so that mark-up can describe content in addition to appearance. Most critically, XML allows users to define the content of the information in ways that can be processed directly by software agents. In simple HTML you might mark an important piece of text -- like an address or telephone number -- with emphasis or as a heading:
<em>0044 1904 433954</em> yields the number in emphasised (typically italic) type, and
<h3>ADS, The King's Manor, York YO1 7EP, England, UK</h3> yields the address as a heading.

In XML you could specify that the following is an address or telephone number:
<phone>0044 1904 433954</phone> or
<postal_address>The King's Manor, York YO1 7EP, England, UK</postal_address>.

Just as a simple HTML browser changes the font according to instructions, an XML transformation agent would be able to identify addresses and telephone numbers and render them according to specific instructions -- or simply gather a list of data items according to specific criteria. All sorts of tags can be defined for all sorts of purposes. This versatility implies unimagined opportunities for sharing information, but perhaps more importantly XML also promises a significant leap forward for digital preservation. Though it rapidly becomes convoluted, XML is based on human-readable code that can be read and interpreted without the need for computers. Moreover, the assumption that XML can be converted to numerous alternative formats on demand makes migration trivial, while the need for a priori explicit definitions means that XML should be largely self-documenting. Because it is based on simple ASCII or UTF-8 text, the underlying data can be rendered in diverse and ubiquitous software. XML is good news for those with a vested interest in preservation.
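As a minimal sketch of what such an agent might do, the following Python fragment gathers every telephone number and address from a small XML record and then migrates the same record to CSV. The <phone> and <postal_address> tags follow the examples above; the enclosing <contacts> and <contact> elements, the function names and the CSV layout are assumptions made purely for illustration and do not represent any published schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record: <contacts> and <contact> are invented wrappers;
# <phone> and <postal_address> follow the examples in the text.
SAMPLE = """<contacts>
  <contact>
    <phone>0044 1904 433954</phone>
    <postal_address>The King's Manor, York YO1 7EP, England, UK</postal_address>
  </contact>
</contacts>"""


def gather(xml_text, tag):
    """Return the text of every element with the given tag name."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text]


def to_csv(xml_text):
    """Migrate each <contact> element to one CSV row."""
    root = ET.fromstring(xml_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["phone", "postal_address"])
    for contact in root.iter("contact"):
        writer.writerow([
            contact.findtext("phone", default=""),
            contact.findtext("postal_address", default=""),
        ])
    return buffer.getvalue()


phones = gather(SAMPLE, "phone")
addresses = gather(SAMPLE, "postal_address")
csv_text = to_csv(SAMPLE)
```

Because the tags name the content rather than its appearance, the same few lines serve both to list data items and to convert them to another format; only the standard library is needed.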

Even so, there remains relatively little XML-related activity among archaeologists. The dearth of XML is all the more surprising in comparison with other academic groups. Fields as diverse as music, maths and chemistry have developed their own, discipline-specific flavours of XML while the Text Encoding Initiative has presented a model for literary and linguistic applications. Is there something wrong with XML which means archaeologists can't use it, or have we simply missed the show?

Before getting carried away, it would be wise to remember that XML has its critics (e.g. Pascal n.d.). From the fractious world of database management, there are those who would remind us of the 'bad old days' of hierarchical databases. The early and mid eighties saw considerable debate on the merits of relational data modelling as against other forms of modelling: the debate was more of a rout, and we have grown used to robust, elegant -- and extensible -- relational databases. The same cannot (yet) be said about XML tools like XPath and XQuery, and it could be argued that XML is a bloated format for our overweight age. If atomicity is a virtue and redundancy a vice, then to some relational data modellers, XML seems like a backward step.

These debates reveal the mistrust and insecurity of some within the computing industry -- but do not account for the experience of archaeology. Of more immediate concern is the risk of 'schema wars'. XML works well when an information community agrees how its information should be rendered and reproduced. The result of the labour to generate standards is to unite into a common schema all the dialects prevalent in a given information community. The larger and more diverse the information community, the more difficult the task of creating common standards. For archaeology, active fieldworkers must agree with finds specialists, academics, museum curators and heritage agencies on how to describe their information. That is a hard enough task, but it is further complicated by geography. Unlike in music, maths or literature, where you do archaeology changes how you do it. The risk is that we end up with multiple incompatible schemas for different information communities. This may yield staggering over-investment in a community that really should invest its scarce resources carefully (Eiteljorg, this volume). In simple terms, how do we prevent XML in archaeology from supporting ever more intricate Towers of Babel?

Help is on the way. It would be wrong to suggest that no archaeologists have embraced XML technologies, and a number of projects over the last few years have begun to show what is possible. The long running OASIS project has developed a mechanism for local and national agencies in the UK to improve and accelerate the flow of information about recent archaeological fieldwork (Kilbride and Hardman, 2004). Inter alia, it has allowed the development of an XML schema for the processing of event-based fieldwork records. The significant achievement of this project is not so much the schema as the extensive and continuing consultation to ensure that it is fit for its purpose, widely supported and actively developed. A wider context for OASIS in the UK is provided by FISH -- the Forum for Information Standards in Heritage. This membership body has commissioned a set of tools to improve information flows across the heritage sector. The resulting 'FISH Toolkit' includes a MIDAS XML standard for data exchange, a web-services based protocol by which that exchange can occur and a data validator to identify compliance against a published standards framework. Again, the impressive aspect of this is not so much the technology but the collaborative way these tools have developed. Other initiatives in the UK such as Spectrum XML, HEIRPORT and ARENA show the promise of XML for data harvesting and exchange.

OASIS and FISH are principally concerned with index-level data. Other projects have investigated the viability of XML for the full-text markup of entire archaeological research projects. The XSTAR toolkit, developed at the University of Chicago, uses an XML schema, ArchML, to represent the details of fieldwork and thus to provide a mechanism for resolving long-standing problems of publication and data sharing. It turns the disparate parts of the archaeological research process into a comprehensive research environment and thus enhances the work of all concerned. A number of research projects have also looked beyond index data and shown that XML-based tools like TEI, XQuery and XPath can be applied to complex fieldwork archives and reports (e.g. Falkingham, forthcoming and Meckseper, 2001). Perhaps the most impressive, and certainly the most extensive and thorough, implementation of a markup schema in archaeology is the Museumsprosjektet, which since 1998 has been systematically transcribing the archaeological archives of the major university museums in Norway.

Of these many initiatives, only the Museumsprosjektet and XSTAR seem to exploit the potential of XML for long-term preservation. The Museumsprosjektet is perhaps the longest running digitisation project of its kind in the world and must also be among the most extensive. It is also underpinned by the exceptional foresight of the developers involved -- not least because it began to explore markup before XML was fully developed or released. Its success is based on the generous support of the Norwegian government, and if it has one drawback, it is that it is restricted to Norway. XSTAR is a relative newcomer in comparison. Based at the University of Chicago, it has been used for a number of archaeological and philological research projects around the world. While the projects it supports are of seminal importance and the advising consultants represent many far-sighted individuals and agencies, it has not yet been adopted outside the University of Chicago. It may be some time before the tools are able to demonstrate their strengths for widespread applicability.

It is one thing for us to review the use of XML for preservation, but as an archive the ADS has made a determined effort, and has considerable authority, to recommend data formats for archaeologists. Consequently, any implied criticism in the preceding paragraphs comes home to roost. The fact that there is so little XML is partly because we have been unable to make simple recommendations, so it is worth reviewing the problems associated with making these recommendations. Putting it simply, there is little merit in recommending formats that few are able to use and fewer are able to produce. As a preservation facility with responsibility to a wide community, we may ask for XML-based formats and may even occasionally acquire them: but only once XML tools are ubiquitous and accepted will we be able to exploit XML's most important feature. This is as much of a challenge for XML as schemas and style-sheets are.

This has an implication for those involved in XML development. In simple terms, if XML is to be anything other than a passing fad, then it is important that XML-based tools move out of development and into production. We have a suite of powerful and sophisticated tools that have much to recommend them but are invisible outside the circles of IT development. Putting these tools into the hands of active researchers and providing the training to use them is the next and most critical part of the development cycle. In the short term -- and even more in the long term -- consolidation will be more useful than elaboration.

This organisational challenge is quite different from the technical and intellectual ones already resolved, but it has a further implication. It suggests that we need to be explicit about our standards sooner rather than later. If we assume that different types of archaeology will gravitate towards subtly different flavours of XML, then perhaps the most important technical part of this work will be an open and systematic declaration of the semantics of the various types of XML and of appropriate mappings between them, including explicit indications of terms in a given schema that cannot be mapped directly to terms in another.
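One way to picture such a declaration is as a simple crosswalk table. The sketch below is illustrative only: the two schemas ("fieldwork" and "museum"), every term name and the sample record are hypothetical and are not drawn from OASIS, MIDAS or any other published standard. The point is that a term with no direct equivalent is recorded explicitly rather than silently dropped.

```python
# Hypothetical crosswalk from an invented "fieldwork" schema to an
# invented "museum" schema. A value of None declares explicitly that
# the term has no direct equivalent in the target schema.
CROSSWALK = {
    "site_name": "locality",
    "context_number": None,
    "find_type": "object_name",
}


def translate(record, crosswalk):
    """Split a record into mapped terms and terms that cannot be mapped.

    Terms declared as None, and terms missing from the crosswalk
    altogether, are both reported as unmapped.
    """
    mapped, unmapped = {}, {}
    for term, value in record.items():
        target = crosswalk.get(term)
        if target is None:
            unmapped[term] = value
        else:
            mapped[target] = value
    return mapped, unmapped


# Invented sample record, for illustration only.
record = {
    "site_name": "Example Site",
    "context_number": "1042",
    "find_type": "pottery",
}
mapped, unmapped = translate(record, CROSSWALK)
```

Here `mapped` holds the values that carry over under their new names, while `unmapped` keeps the context number visible so that whoever receives the data knows what could not be translated.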

Digital preservation has come a long way in the last ten years. The next ten years will see consolidation and extension, but before that transition can be complete we need to provide tools that will allow all archaeologists to create archives that are worthy of the name. XML holds great promise for preservation; perhaps its most important benefit will only become apparent several years from now, when robust and self-documenting files are drawn out of our archives.

-- William Kilbride



Bibliography

Austin, T.; Robinson, D.J. & Westcott, K.A. 2001 "A digital future for our Excavated Past" in Z. Stancic and T. Veljanovski (eds.) Computing Archaeology for Understanding the Past: CAA 2000, BAR International Series 931, Archaeopress, Oxford, pp. 289-296.

Darlington, J.; Finney, A. & Pearce, A. 2004 "Domesday Redux: The rescue of the BBC Domesday Project Videodiscs" in Ariadne Issue 36 (online at: http://www.ariadne.ac.uk/issue36/tna/intro.html, last downloaded 19/02/05)

Eiteljorg, H. 2005 "Archiving Archaeological Data - Is There a Viable Business Model for a U.S. Repository?" in CSA Newsletter XVII.3 (online at: http://csanet.org/newsletter/winter05/nlw0501.html)

Falkingham, G. (forthcoming) "A Whiter Shade of Grey: A new approach to archaeological grey literature using the XML version of the TEI Guidelines" in Internet Archaeology 17

Kilbride, W.G. and Hardman, C.S. 2004 "It's the Small Things that Count: Digital Preservation and Small Scale Research Projects in the UK" in CSA Newsletter XVII, (online at: http://www.csanet.org/newsletter/spring04/nls0402.html, last downloaded 07/02/05)

Meckseper, C. 2001 XML and the publication of archaeological field reports, Unpublished MSc Dissertation, University of Sheffield

Pascal, F. (no date) "No Database Champion" in Dbazine.com (http://www.dbazine.com/pascal11.html, last downloaded 07/02/05)

