Digital Data: Preservation and Re-Use

SAA 2000

Integrated Access To Historic Environment Information Resources

Dr. Julian D. Richards
Archaeology Data Service
University of York
United Kingdom

About this document


This paper describes a vision for future access to information resources which cover the historic environment, including above-ground and below-ground archaeology. It is based on the situation in the United Kingdom and on experience gained by the Archaeology Data Service, but the vision is consistent with approaches being developed within an international setting, across a wide range of disciplines. It is hoped that the vision and the experience will be of interest to a global audience. Indeed, the effectiveness of future access to cross-cultural resources will depend upon a shared approach.

The vision starts from the basis that there is a complex and ever-changing information landscape. There is a large number of Historic Environment Information Resources (HEIRs) with little co-ordination. Some are local; some are regional; some are national. Some are defined by the type of information held; some are defined in terms of a theme; others are defined in terms of target audience or user group. There is considerable overlap and duplication between HEIRs; there is also no "one-stop-shop" for the user.

Creating a single integrated system is not an option. This was proposed for a time in the 1960s and 70s as archaeologists began to harness information technology, but it was rapidly realised that it could not be accomplished. It is not politically acceptable, theoretically desirable, or practically possible. The landscape is constantly changing and is also subject to external pressures outside the control of archaeologists, even at an international level.

The alternative strategic vision rests upon the concept of interoperability, in which there will be large numbers of web gateways and portals, each providing indexed searches and specialist interfaces to a much larger number of distributed resources, across which users will be able to search simultaneously. This approach is in line with that currently being adopted for the UK's Distributed National Electronic Resource (DNER), a project which embraces all information resources used by the Higher Education community for teaching, learning and research.

Under this vision, the information map comprises three types of entity or node: clusters of users, HEIRs of all shapes and sizes, and web-based gateways and portals which provide simultaneous access to information resources both held locally and distributed across the Internet, at a number of "targets". Such gateways will develop according to the political remit of the major stakeholders and in response to user-led demand. Users will be able to perform queries and search across distributed and disparate resources simultaneously. The map is non-hierarchical in that individuals and organisations can create their own nodes in the landscape. Resources are searchable across multiple gateways and can be maintained at the point of creation rather than handed over to another body. However, there is no need for data to be held more than once and the interoperable map will work to reduce duplication in the system. The system will also evolve to fill in the gaps in the information landscape in response to user demand.

Achieving this full vision requires agreement in three areas: communications protocols, metadata standards, and vocabulary control. However, it is not an all-or-nothing solution. Organisations and individuals can participate at different levels, as gateways or portals, or as targets, or may seek to have their data or just their metadata hosted on another target. Furthermore, the resources themselves can exist at various stages of computerisation. A manual card index can be represented as a single metadata record in an online database, held on a server and open for remote querying. The metadata record would be sufficiently detailed to allow the user to discover the resource, to identify its location, and to learn of any restrictions governing its use.

Information creation, storage, dissemination and re-use

Information management comprises a set of four inter-linked stages: creation, storage, dissemination and re-use. Most primary data are created locally, whilst the other stages may be at a regional, national or international scale. The stages are rarely combined in one organisation, and individual users may pose questions at local, regional, national or international levels. Information about the historic environment is dispersed around the world, inside archives, libraries, and museums. It is a daunting task to embark upon a search of historic information resources because the user may have to travel large distances just to discover what sorts of information already exist. The inaccessibility of information is seen as one of the principal barriers to effective, creative, and accurate syntheses across regions, countries and time periods.

Different users have different needs. From the viewpoint of information management there are two broad classes:

Resource Discovery

Computer networks are important organisational facilitators of effective information flows. The single most effective means by which information today may be published, accessed, and linked together on a global scale is the World Wide Web (WWW). By itself, however, the WWW does not necessarily enable effective access to information which is held in diverse and dispersed information resources. The WWW is currently evolving towards a framework of machine-driven information retrieval, wherein significant bodies of extant knowledge are offered up from existing databases and simultaneously searched by users many miles from - and potentially lacking any knowledge of - the originating system. Information on the Web varies in form, from static pages of text and imagery right through to web-accessible database engines, where the 'pages' are assembled from one or more databases in a real-time response to user queries or actions. In enabling valid resources to be located quickly and efficiently, we need to find means of representing information such that automated systems may deal with it directly. The problem is compounded because information seekers may wish to include many disparate sources of information in their searches. The situation is summarised concisely by the US Geological Survey:

Better search mechanisms are needed because of the size and diversity of information that people would like to take into account. The Internet has huge amounts of content itself, and often acts as a pointer mechanism to off-line media, but lacks basic agreements on how to tag information objects so that they can be found. Content Owners want their products to be found by all potentially interested seekers. . . The only recourse is to somehow acquire advertising space from all of the intermediaries (US Geological Survey, 1998).

Web-based search mechanisms may be categorised according to three broad headings: automated search engines using crawlers and robots, gateways providing access to classified network resources, and indexed catalogues of resources.

In practice automated search engines rarely provide an effective means of resource discovery. Designers of web crawlers and other software agents are frustrated because "the software agent can only deal with bits and pieces of Internet content that happen to be in text form and is constrained by a lack of distributed search mechanisms" (US Geological Survey, 1998). Where users have very specific requirements and can articulate them in a way which the search engine can interpret, the results may be adequate (e.g. "Find me the web site for the British Museum"). However, users with less specific demands (e.g. "I want to know more about Stonehenge") are invariably swamped by large numbers of 'hits' of variable but unknown quality. Search engines are also unable to differentiate between categories of users and cannot provide a level of assistance appropriate to the level of experience or prior knowledge of the user.

At the most basic level a web gateway may simply comprise a hot-linked list of web sites organised according to the source or type of information held (e.g. "Museums" or "Stonehenge web sites"). In this it may present little more than the list of sites that might result from an automated search (and some gateways are constructed semi-automatically) except that links may be subject to some form of quality control to weed out mis-hits or material judged to be of inferior quality. Sites may vary according to geographical coverage, the extent to which resources have been validated and described, and completeness. Gateways can be developed with specific target audiences in mind, such as 'general public', 'schools', or 'professional/HE researcher'. They can also introduce a level of quality control and 'information mitigation/interpretation' appropriate to the target audience.

In addition to the above, indexed catalogues normally provide a means of carrying out controlled searches of specified fields in an organised database. Such catalogues may permit searching of a single resource (e.g. Canmore-Web, which allows users to conduct structured queries of the National Monuments Record for Scotland), or integrated searching of distributed resources (e.g. the Arts & Humanities Data Service (AHDS) gateway, where users can search simultaneously across digital resources in archaeology, history, literature, performing and visual arts). It is worth noting that the distinction between gateways and indexed catalogues is disappearing as gateways increasingly use indexing tools and metadata to index resources (e.g. ARGE, the European archaeological gateway) and indexed catalogues provide a brokering function, including metadata catalogue records for external web sites (e.g. HUMBUL, the UK Humanities gateway).

Type                      Examples                     Characteristics
Automated search engines  Altavista; HotBot; Lycos     Automated searching
                                                       Freetext retrieval
                                                       No quality control
                                                       Many non-current links
Web gateways              CBA gateway; ARGE; ARCHNET   Provide links to external resources
                                                       Sites grouped according to single category
                                                       May include quality control
                                                       No guarantee of longevity of resources
Indexed catalogues        ADS ArchSearch; Aquarelle    Resources may be external or held locally
                                                       Multiple index fields
                                                       Quality control
                                                       Preservation of resource


A large number of organisations hold historic environment information resources. Most are held in some computer-based form and several are already available via the Internet, whilst in other cases the intention is to make them available online in the near future. Some are available as indexed catalogues. Several are used by multiple user groups, and in some cases the actual or potential users extend beyond the historic environment. Existing systems are held in a wide variety of hardware and software applications, utilising a variety of data models. It is neither realistic nor necessarily desirable to impose a single system.

It is possible to conceive of a model in which each historic environment information provider develops its own self-contained and independent online resource. Indeed, this model is basically a continuation of the current situation. Under such a model, any number of 'service providers' might set themselves up in order to deliver information. The sites would be indexed by search engines and could be linked by web gateways. The advantage of such an approach is simplicity and the lack of need for any strategic vision or control. The limitations are, however, numerous:

Technical solutions, therefore, must proceed on the basis that there will continue to be multiple providers of information about the historic environment, each operating their own systems, but that it is desirable that users are enabled to search multiple resources with single queries. The technical solutions therefore require interoperability.

By adopting an interoperating model in which any number of service providers are linked together in such a manner that data creators need only provide their data once for it to become available, these fundamental problems are largely ameliorated. Under this new model, potential users can cast their search at an appropriate geographical scope, e.g. at county, national, or British Isles level. Where a number of service providers are able to provide access to the same data users may also choose to visit individual service providers for 'value-added' services that individual providers bring to the data. One service provider might, for example, become well known for their map-based search engine, whilst another might be favoured for its intuitive interface. Users may also choose, or be directed to, interfaces appropriate to their level of prior knowledge. Some sites may be targeted at the public or schools market, for example; others might choose to focus on the academic and professional market. Interoperability is therefore seen as the key which allows diverse and dispersed heritage information resources to be made available for effective Internet access.

Interoperability exists at a basic level whenever two or more information systems can be queried simultaneously. However, unless the two systems share identical record structures and data descriptions such queries are unlikely to yield meaningful results. Full interoperability rests upon three things. Firstly, there is a need to establish communications protocols which will allow users to query distributed computer systems. Secondly, these distributed systems must employ metadata which can be quickly and systematically searched by computers and presented in an understandable form to users. Metadata can be used to summarise the content of archives, libraries, museums, and even publications, so users can scan their holdings relatively easily and either download the information directly, or at least more carefully plan the itinerary required to gather the relevant pieces of information. Thirdly, the distributed systems need to apply agreed terminology controls through data documentation and content standards in order that users can retrieve comparable returns from diffuse data resources. Such content standards include the wide range of thesauri which are employed in describing the historic environment.

Communications protocols

For an historic environment information infrastructure to succeed, whether in and of itself or in the wider context of which the historic environment is but a part, a common protocol for interoperability is required. Such a protocol allows a single user-specified query issued at a client (or commonly 'gateway') to be simultaneously passed to any number of distributed servers, or 'targets', where it is mapped into the local data structure. Such searches can be 'platform independent', in that they can work across different hardware and software applications, so long as there is an agreed 'profile', or mapping of the query to the individual data structures. The combined search results are then returned to the user and can be presented as the outcome of a single local search. Fortunately, the broader domain of information science has long recognised this requirement, and a number of avenues have been pursued in the search for a truly useful solution. The current front-runner for cross-domain interdisciplinary interoperability is the Z39.50 communications protocol (Miller 1999).
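The query flow described above can be sketched in a few lines of Python. This is an illustration of the gateway, target, and profile idea only, not an implementation of Z39.50: the class, function, field, and target names are all invented for the example, the sample records are made up, and the targets are queried one after another rather than simultaneously.

```python
from dataclasses import dataclass
from typing import Dict, List

Record = Dict[str, str]


@dataclass
class Target:
    """A remote resource that keeps its own local field names (hypothetical)."""
    name: str
    profile: Dict[str, str]   # gateway field name -> local field name
    records: List[Record]

    def search(self, query: Record) -> List[Record]:
        # A target can only answer queries whose fields its profile maps.
        if any(field not in self.profile for field in query):
            return []
        # Map the gateway query into the local data structure.
        local_query = {self.profile[f]: v.lower() for f, v in query.items()}
        hits = [
            r for r in self.records
            if all(v in r.get(f, "").lower() for f, v in local_query.items())
        ]
        # Map hits back to gateway field names so they can be merged.
        to_gateway = {local: common for common, local in self.profile.items()}
        return [
            {"source": self.name,
             **{to_gateway[k]: v for k, v in r.items() if k in to_gateway}}
            for r in hits
        ]


def gateway_search(targets: List[Target], query: Record) -> List[Record]:
    """Pass one query to every target and combine the returns."""
    combined: List[Record] = []
    for target in targets:
        combined.extend(target.search(query))
    return combined


# Two invented targets that describe similar things with different fields.
target_a = Target(
    "TargetA",
    {"subject": "monument_type", "coverage": "county"},
    [{"monument_type": "Round barrow", "county": "Cumbria"},
     {"monument_type": "Hillfort", "county": "Dorset"}],
)
target_b = Target(
    "TargetB",
    {"subject": "object_class", "coverage": "findspot"},
    [{"object_class": "Barrow urn", "findspot": "Yorkshire"}],
)

hits = gateway_search([target_a, target_b], {"subject": "barrow"})
```

The user issues one query in the gateway's vocabulary of fields. Each target translates it through its own profile, and the returns come back in a common shape that can be presented as a single result list.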


Having identified a communications protocol it is technically possible for users to search across multiple service providers. That search, however, will produce results of limited utility unless service providers have previously identified a number of key indexing fields which will cover the range of queries that users will require. The data required in order for a user to find information is here described as resource discovery metadata. Metadata is 'data about data', or the information needed to communicate sensibly about information. Metadata has three main purposes. Firstly, it allows the nature of a body of information to be assessed without having to access the data themselves. Secondly, it allows a user to locate a piece of information. Thirdly, it allows similar bodies of information to be grouped or linked together (Wise and Miller 1997).

Thus, the information about information communicated through metadata is generally:

One of the best examples of metadata is the MAchine Readable Cataloging (MARC) metadata scheme (Network Development and MARC Standards Office 1997) which has evolved into one of the most comprehensive and widely adopted of metadata schemes worldwide. By using similar cataloguing terms from a scheme like MARC across library collections, it becomes relatively easy to make computer systems search more than one catalogue in response to a user's query. The very cataloguing terms used in the search are then a form of metadata, allowing for basic description of the data, its location, and the existence of similar information to be discovered. A catalogue entry may vary in complexity from the equivalent of a library record identifying title, author, publisher, date of publication and shelving details, to a keyword-indexed abstract that enables a thorough search and assessment of the results of this search.

Metadata can also document everything the user needs to know to decide if the resource is usable. For example, charges levied and copyright restrictions are two pieces of metadata that describe a resource, and which potentially affect whether or not the user really wants to spend time physically retrieving it. The language of a resource is also suitable for recording in a metadata catalogue: a document may well be just what the user needs, but if it is in Gaelic and the user reads only English and French, it will be of little value. Metadata documentation for satellite images, for example, might include the date of collection, the type of sensors used, spatial coverage, amount of cloud cover, resolution, costs, and copyright information - in short, everything that a data user might wish or need to know in order to use the information contained within the data set.

Behind all this apparently seamless information, description and discovery lies a complex suite of technical problems which include speed, accuracy, precision, and completeness of the results. The answer to many of these technical problems is standardised metadata entries. If all pieces of information have an author or group of authors (in the sense that texts are written, photographs are taken, maps are digitised, and databases are constructed usually by an identifiable person or people), and this metadata is presented in standardised, machine-searchable ways, everyone can find the information they want better and faster.

To describe historic environment information adequately (with the goal of making it faster and easier for other people to discover it) we have to understand what people will want to know about it. What people will want to know determines what types of metadata will be important, and what people will want to know about it will change as they become more familiar with what is available. For example, future fieldwork may be intended to examine the prehistoric ritual mounds of northern England. The investigator would want a metadata index that quickly identified information about sites in other parts of the British Isles, Northwest Europe, the Americas, or anywhere which included prehistoric ritual mounds. After locating a region with information pertaining to the research interests or requirements, the user would want more detailed information. Perhaps the user would like to know what types of artifacts to expect from mounds in Cumbria, or what satellite imagery was available for Yorkshire. The metadata about this increased level of detail should still be general enough that the user does not have to search each piece of data individually to find out if it's appropriate. In other words, once they know there's a library section on 'Archaeology', a reader shouldn't have to walk to each individual book, take it off the shelf, and glance through the table of contents to see if it's relevant. The site or region's location is certainly not the only important starting point for most archaeologists. Also critical is the temporal affiliation of the site. Was it occupied in the Mesolithic, Bronze Age, or Iron Age, or from 3000 BP? Metadata entries should cover all the basic information a user needs to decide if a resource is relevant.

At this most basic level (called, confusingly, 'high-level' metadata) the following types of information are recorded about each piece of information:

This basic level of metadata is being defined by broad international groups working on what they term 'core description'. There are a large number of metadata initiatives of value to archaeologists, whether developed for a wider community or for archaeology specifically. These initiatives range from extremely detailed and specific metadata systems such as the Federal Geographic Data Committee's (FGDC) Content Guidelines for Digital Geospatial Metadata to the much simpler and more generalised Dublin Core.
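As an illustration of how simple a core description can be, the sketch below builds a resource discovery record restricted to the fifteen Dublin Core element names and applies the kind of usability check discussed earlier (can the user read the language of the resource?). The helper names `make_record` and `is_readable`, the example card index, and all of its values are invented for the example. The two-letter language codes are an assumption about how a cataloguer might fill in the language element, not something the Dublin Core itself prescribes.

```python
from typing import Dict, Set

# The fifteen element names of the Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields: str) -> Dict[str, str]:
    """Build a metadata record, rejecting fields outside the element set."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)


def is_readable(record: Dict[str, str], languages: Set[str]) -> bool:
    """True if the record states no language, or one the user can read."""
    language = record.get("language")
    return language is None or language in languages


# A manual card index described by a single record (all values invented).
card_index = make_record(
    title="Parish card index of finds",
    type="Collection",
    format="Paper card index",
    language="gd",                      # assumed code for Gaelic
    coverage="Scotland",
    rights="Consult by appointment only",
)

readable_for_user = is_readable(card_index, {"en", "fr"})
```

The record lets a user discover that the resource exists, where it is, and what restrictions apply, without any of the underlying cards being digitised.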

Terminology control

Communications protocols allow researchers to look for resources across distributed databases; metadata provides a standardised means of identifying the core attributes which researchers might use to carry out their search. However, the results of such a search will only be as good as the metadata that has been employed. Effective metadata demands the use of thesauri and terminology control. For the user to locate resources of interest they will need to know which search terms to employ, and how those terms have been applied across the collections they are searching.

The Dublin Core permits the cataloguer to use any number of standard "schemes", or thesauri, to describe each of the Dublin Core fields. However, the effectiveness of searches which cross over records employing different schemes will then depend on the extent to which those schemes can be effectively mapped to each other. A number of thesauri have been developed for use in British archaeology, although their scope tends to reflect the political boundaries of the originating body and there are difficulties in locating thesauri which have been used throughout the British Isles. There is also a need to define the metadata that needs to be recorded in order to allow re-use of digital data. The AHDS series of Guides to Good Practice is addressing this need. The ADS has already published Guides on GIS, Remote Sensing and Aerial Photography, and Excavation and Fieldwork archiving; further guides on geophysics, CAD, and VRML are in the pipeline. A detailed discussion of thesauri is beyond the scope of this paper.
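The dependence of cross-scheme searching on mappings can be made concrete with a minimal sketch. The scheme names, the terms, and the mapping table below are all invented. A real mapping would be drawn from published thesauri and would need to handle broader, narrower, and related terms, which this example ignores.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical term mappings between two invented schemes.
SCHEME_MAPPINGS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("scheme_a", "scheme_b"): {
        "round barrow": "burial mound",
        "long barrow": "burial mound",
    },
}


def translate_term(term: str, source: str, target: str) -> Optional[str]:
    """Translate a term between schemes; None means no mapping exists."""
    if source == target:
        return term
    mapping = SCHEME_MAPPINGS.get((source, target), {})
    return mapping.get(term.strip().lower())


def query_terms(term: str, source: str,
                schemes: List[str]) -> Dict[str, Optional[str]]:
    """Work out which term to send to resources indexed under each scheme."""
    return {scheme: translate_term(term, source, scheme) for scheme in schemes}


terms = query_terms("Round Barrow", "scheme_a", ["scheme_a", "scheme_b"])
```

Where `translate_term` returns `None`, a search cannot usefully be extended to resources catalogued under the other scheme. The reach of a cross-scheme search therefore depends on how completely the schemes have been mapped to each other.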

Within the UK the stated objective of FISHEN is the building of national standards and terminology, and making them available for inclusion in information systems. The work of FISHEN includes the INSCRIPTION wordlist standard (including thesauri), the MIDAS data content standard, Terminology and Concept mapping software and Thesaurus management software developed in-house for the English Heritage Data Standards Unit.

Recommendation 4 in Strategies for Digital Data (Condron et al 1999) noted that "Data creation standards are essential to facilitate the exchange of information"; Recommendation 5 continues: "National bodies should continue to encourage the use of standards for projects they fund. Similar guidelines need to be built into project briefs for developer-funded work". There is a need to persuade grant-aiders that (a) where appropriate, an information system is an integral necessity and (b) grant-aided projects must conform to certain reasonable minimum requirements of functionality and interoperability.

Dissemination options

Given agreed communications protocols, metadata indexing, and terminology control, on-line indexed catalogues can provide a number of technical options for access to HEIRs. These are capable of responding to the range of local, regional, national, and international user queries identified above. They can support map-based and period-based queries, and can respond to the needs of both public and scholarly access. The options concern the level of online availability of the metadata and underlying data, and the level of interoperability between distributed systems.

As far as availability is concerned an appropriate model might involve three levels of remote access, to:

For interoperability there are three levels of distribution of data and metadata: served, brokered, and linked. The ADS catalogue, ArchSearch, provides examples which illustrate the range of possibilities:

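Option    Metadata                         Data                         Example
          (i.e. catalogue index records)   (i.e. the collection)
Served    Held on the central server       Held on the central server   Excavation archive
Brokered  Held on the local server         Held at a remote site        NMRS Canmore
                                           (may be web-linked)
Linked    Held at a remote target          Held at a remote target      AHDS service providers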

Where a resource is served, then both the catalogue 'metadata' records and the resource - the 'data' - are held on a central server. The ADS catalogue holds metadata records which describe the digital excavation archives which are available for down-loading by registered users.

Where a resource is brokered then the metadata is held on the local server, but this indicates the existence of a distributed resource at a remote site. The resource itself may be in digital form but could equally comprise a paper or museum archive. This option may therefore be appropriate where only basic metadata is in digital form. However, where the resource is itself digital then it is possible to provide a live link from metadata to data (e.g. the link to related resource in Canmore-Web from the NMRS metadata records in the ADS catalogue).

A resource is linked where both the metadata and the resource are held at a remote 'target' site but can be queried at the server gateway. Plans to link the online Portable Antiquities Database, SCRAN, the NMRS and the ADS Catalogue via a Historic Environment Z39.50 gateway provide an example of this type of option.
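The three options differ only in where the metadata and the data are held relative to the gateway, which can be stated as a small decision function. The sketch is illustrative: the `Location` type and the function name are invented, and "local" here means held on the gateway's own server.

```python
from enum import Enum


class Location(Enum):
    LOCAL = "held on the gateway's own server"
    REMOTE = "held at a remote site or target"


def distribution_level(metadata_at: Location, data_at: Location) -> str:
    """Name the option implied by where metadata and data are held."""
    if metadata_at is Location.LOCAL and data_at is Location.LOCAL:
        return "served"
    if metadata_at is Location.LOCAL and data_at is Location.REMOTE:
        return "brokered"
    if metadata_at is Location.REMOTE and data_at is Location.REMOTE:
        return "linked"
    # Remote metadata describing locally held data is not one of the
    # three options described in the text.
    raise ValueError("combination not covered by the three options")


level = distribution_level(Location.LOCAL, Location.REMOTE)
```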

A served approach may be appropriate where the owner of the resource does not wish to provide access to the resource themselves, perhaps for political, technical or financial reasons. It is particularly appropriate where the creator/owner no longer has a direct interest in a resource which is static and unchanging, such as a completed research project. It may also be advisable where the creator/owner wishes the resource to be made publicly available but is unable or unwilling to support the continued use of the resource (e.g. through dealing with user queries) or is unable to take responsibility for its long term preservation. Copyright can be an issue but may be retained by the owner who assigns the service provider a licence to distribute the data. The owner is unlikely to use this option where the resource is seen as having the potential to generate a future income stream. For the service provider a served option is technically straightforward as both metadata and data are held on the local server, although they will need to be satisfied that the resource does not raise issues of legal liability for them.

A brokered option may be appropriate where the information owner does not wish to hand over responsibility for a resource, but where there may be advantages to them in providing an access route via a remote catalogue. This approach will be favoured, therefore, where the owner has a continued intellectual or financial interest in the resource and wishes to maximise its usage, but for technical or financial reasons does not wish to become a 'target' site. Thus an organisation with a secure web server and the ability to provide online 24-hour access might facilitate direct access to the resource on its own server, but by the provision of metadata records to a remote system can make the resource available for simultaneous searching with other resources. Thus Canmore-Web provides direct access to the National Monuments Record for Scotland (NMRS) whilst the Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS) also provides the ADS with metadata records which permit simultaneous searching of the NMRS alongside other resources such as the English Heritage National Monument Record Excavation Index, and the Radiocarbon Index of Britain and Ireland supplied by the Council for British Archaeology. Direct hot links use the unique NMRS record numbers to allow users to "drill-down" from the metadata to the data held in Edinburgh.

The data can therefore be dynamic and detailed records can be updated and expanded so long as the linking field is not invalidated. However, the option is not really appropriate for dynamic systems where the metadata is itself subject to change, such as where data are reclassified or new records are added. In these cases the metadata must be periodically re-supplied. Experience with the ADS catalogue to date has indicated that the loading and reloading of metadata records is a time-consuming process and so this option is not recommended as a long term solution for dynamic resources. The provision of metadata records can also be expensive. Experience gained by the partners in the Accessing Scotland's Past project has demonstrated that where suitable technical expertise is available and where there is a straightforward mapping between the data structure of the owner and the metadata catalogue fields of the service provider the process can be automated and large numbers of metadata records can be generated at minimal cost. However, where data structures are incompatible or where the quality of source data is uneven, then the process of amending and cleaning the source data so that metadata records can be generated can be very time-consuming.

The capital costs of this option may also be significant in that many organisations will need to set up a separate computer system. For RCAHMS this demanded the purchase of a stand-alone web server, a firewall, and high bandwidth Internet access. They do not need, however, to establish themselves as a Z39.50 target.

A linked option may be appropriate where it is cost-effective for owners to provide on-line access to resources themselves, but where they also wish to make their resources available from a number of additional access points. This option is also preferable where the metadata are themselves dynamic (as in the case of NMRs or SMRs, or museum inventories) as changes to the data are immediately reflected in what is made publicly accessible. On the other hand the costs of establishing an online Z39.50 target may outweigh the advantages for many organisations. Nor does this option overcome the expense of data cleaning and metadata creation highlighted above, which must still be met if meaningful results are to be returned. Experience with the AHDS Z39.50 gateway suggests that where the distributed resources are diverse less functionality can be provided at the gateway than is available from each of the service providers' individual catalogues. Thus, for example, searches for archaeological resources issued at the AHDS gateway are limited to text-based keyword searching, whereas the ADS catalogue can support period or map-based queries. Nonetheless, organisations which possess Z39.50 target capability can make their data available to an unlimited number of gateways, each with different strengths, at little or no extra cost to themselves. This represents the optimal approach for supporting different user groups whilst avoiding duplication of records, as different metadata fields or schemes might be selected for searches appropriate to a schools audience, for example.

In all cases where disparate resources are combined, whether on a single server, or through searches of distributed records, the results will only be as good as the level of terminological control allows. Period, for example, is recorded differently in different resources. Where a recognised standard such as MIDAS has been followed then it is technically possible to use thesauri to provide mappings between the different data sets. This is an area which requires further investigation for interoperability to become a reality. In any case it is essential that users are made aware of the limitations of the search mechanisms and are given training in making effective queries, and interpreting the results. This is a particular issue for the public and schools group where additional glossaries and mappings of professional terms will be required, but also applies to the academic and professional user group.


In conclusion, by pursuing a strategic vision of interoperability, information providers will maintain maximum flexibility and functionality in providing access to historic environment information resources. A series of web-based gateways will provide a variety of access points to underlying data which is held only once. The vision is implicitly non-hierarchical as the gateways are information nodes defined by their target audiences. They point at any number of overlapping targets, some of which may themselves be gateways. Organisations may join the network at any level, by:

In order to achieve this strategic vision at an international level it will be necessary to:


The vision for information management outlined in this paper has been developed by many players across a wide number of disciplines. Its particular application to the historic environment has been defined by members of the Archaeology Data Service, including Tony Austin, William Kilbride, Paul Miller, Julian Richards, Damian Robinson, and Alicia Wise. This vision has most recently been rehearsed as part of a consultancy report commissioned by the UK Historic Environment Information Resource Network (HEIRNET), managed by the Council for British Archaeology, and funded by a number of UK heritage bodies, led by English Heritage. The report was co-authored by David Baker, Gill Chitty, Julian Richards, and Damian Robinson (Baker et al 1999). This paper draws upon Section 5 of that report. We are particularly grateful to David and Gill and members of the HEIRNET committee, especially Mike Heyworth, for their input into this process. Outstanding problems with the clarity of expression of the information management vision remain my own fault.


Baker, D., Chitty, G., Richards, J., and Robinson, D. 1999. Mapping Information Resources.

Condron, F., Richards, J., Robinson, D., and Wise, A. 1999. Strategies for Digital Data. Findings and Recommendations from Digital Data in Archaeology: A Survey of User Needs. Archaeology Data Service, York.

Miller, P. 1999. "Z39.50 for All". Ariadne 21.

Wise, A. and Miller, P. 1997. "Why metadata matters in archaeology". Internet Archaeology 2.

Return to index of papers for April, 2000, SAA Session "Digital Data: Preservation and Re-Use"
