SAA 2000
The topic of today's session is "Digital Data: Preservation and Re-Use." The point of preservation, of course, is re-use. There is no reason to preserve digital information if the information cannot or will not be re-used. That is obvious, but what may not be so obvious is the fact that re-use of digital data requires more than passing familiarity with computer software - different programs for different kinds of data - and that the needs are more critical than has generally been recognized.
In the following I will consider three kinds of digital data in particular - databases, CAD models, and GIS data sets. I have chosen these because they represent, in my view, the most important kinds of computer data, in the sense that these data types require computers for storage and access and cannot be represented adequately on paper alone. Therefore, unlike text or images, the computer is a necessary part of the system for these kinds of data - necessary for collection and storage of the data and necessary for use of the data.
I have assumed in this discussion that archived files have been preserved in a professional manner and that the necessary checks for accuracy and completeness have been performed. The process is not automatic, as we have learned from experience, but the process of archiving is not the subject here. The re-use of properly archived files is.
I have identified four levels of skills necessary if digital data are to be re-used: 1) skills needed to use data files if the software available to the user is appropriate for the files at hand, 2) skills needed if the data files are not in the format required by the user, 3) skills needed to evaluate the quality of the data, and 4) skills needed to aggregate the data.
The first level - using files that are already in a format appropriate for the user's software - is, of course, the simplest form of re-use. Someone obtains files from an archive, loads them into his/her program, for which the files are in the appropriate format, and is then able to use the data in any way the software permits. This is simple; it requires little of the user. Or does it?
The data will have been entered by another scholar. Therefore, the information will have been perceived differently, perhaps only slightly differently, but differently. The data will also have been organized according to the needs and perceptions of the original scholar, needs and perceptions that will have affected the way the data were organized, entered, and stored. As a result, the user must know more than is first suggested. He or she will need to know a good deal about the way the original data were gathered, organized, entered, and stored. The more complex the data, the more important such knowledge will be. At the simplest level, for instance, a user must know what terms were permitted and what terms were not, and any synonyms permitted must be explicit. Otherwise, a search for olpe might seem to be sufficient when the search must really be for olpe or pitcher. Similarly, one must know whether to search for Late Minoan I A, Late Minoan IA, LM IA, or LMIA. The differences that are trivial to the reader are not to the computer.
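The point about synonyms and spelling variants can be illustrated with a small sketch in Python. Everything in it is an assumption made for illustration: the synonym table, the rule for collapsing period labels, the sample records, and the function names (normalize_period, expand_term, search) do not come from any archived data set. In practice the permitted terms and synonyms would have to be taken from the documentation that accompanies the data.

```python
import re

# Hypothetical synonym table; a real one would come from the data set's documentation.
SYNONYMS = {
    "olpe": {"olpe", "pitcher"},
    "pitcher": {"olpe", "pitcher"},
}


def normalize_period(label):
    """Collapse variants such as 'Late Minoan I A', 'LM IA', and 'LMIA' to one key."""
    text = label.strip().upper()
    text = text.replace("LATE MINOAN", "LM")
    # Remove all whitespace so that spacing differences no longer matter.
    return re.sub(r"\s+", "", text)


def expand_term(term):
    """Return every term that should match a search for `term`."""
    key = term.strip().lower()
    return SYNONYMS.get(key, {key})


def search(records, shape, period):
    """Return records matching any synonym of `shape` and any spelling of `period`."""
    shapes = expand_term(shape)
    wanted = normalize_period(period)
    return [
        r for r in records
        if r["shape"].strip().lower() in shapes
        and normalize_period(r["period"]) == wanted
    ]


# Hypothetical records showing the kinds of variation described above.
records = [
    {"id": 1, "shape": "olpe", "period": "Late Minoan I A"},
    {"id": 2, "shape": "pitcher", "period": "LMIA"},
    {"id": 3, "shape": "pitcher", "period": "LM IB"},
    {"id": 4, "shape": "cup", "period": "LM IA"},
]

matches = search(records, "olpe", "LM IA")
print([r["id"] for r in matches])
```

A literal search for the string "olpe" with the period "LM IA" would find none of these records; the sketch finds two only because the synonyms and the spelling variants have been made explicit.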
Consider this example of a CAD model. I have a computer model of the entrance to the Athenian Acropolis, the so-called older propylon, dating to the period from the middle of the sixth century B.C. to 437 B.C. It is a CAD model, created and maintained in AutoCAD's file format called DWG. Using AutoCAD's layers as data segments, I have segmented the model into more than two hundred different segments, using a variety of criteria for distinguishing parts of the material from one another. A user of that model - even one very familiar with AutoCAD - would need to spend considerable time studying the way the model has been segmented and the way the data segments have been incorporated into the CAD layer names before that user could effectively use the model. (That information is part of the documentation in text files that accompany the model.) Any user would find it all but impossible to use the model without first understanding the system used to segment the model.
That same computer model of the older propylon has an attached data table (actually three). Therefore, a user wanting all the information available from the model would also need to have those tables and know how to connect them to the model in his/her system. Furthermore, one of the data tables consists of notes concerning the entrance, and the items in the table are not connected to parts of the drawing as are the other tables (the other tables are connected to individual blocks). Instead, the comments are connected to icons that indicate the areas of the structure being discussed in the notes. The icons are on special layers (data segments) in the model, named to indicate that the data segments contain those icons. Thus, using the data tables also requires understanding the data segmentation system.
GIS data sets provide similar problems. The user must know a great deal about the scales used for any map data to know how the data can be used. Maps made at a one-to-one million scale provide different levels of information from maps made at a one-to-ten thousand scale, and those, in turn, provide very different information from maps made directly via survey data and therefore effectively at a one-to-one scale.
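The effect of scale can be put in rough numbers. The Python sketch below computes the ground distance covered by a single drawn line at the two map scales just mentioned. The 0.5 mm line width is an assumed figure chosen only for illustration, and the function name ground_distance_m is likewise hypothetical.

```python
def ground_distance_m(map_mm, scale_denominator):
    """Ground distance in metres represented by `map_mm` millimetres on the map."""
    # Multiply first, then convert millimetres to metres.
    return map_mm * scale_denominator / 1000


LINE_WIDTH_MM = 0.5  # assumed width of a drawn line; illustrative only

coarse = ground_distance_m(LINE_WIDTH_MM, 1_000_000)
fine = ground_distance_m(LINE_WIDTH_MM, 10_000)

print("1:1,000,000 ->", coarse, "m on the ground")
print("1:10,000    ->", fine, "m on the ground")
```

Under that assumption, a feature's position read from the smaller-scale map cannot be trusted to within the width of the line itself, which is a far larger distance on the ground than at the larger scale.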
Such a GIS data set will also include data tables, and all the questions that might apply to any data table apply with them. For instance, what choices were available for a given field? Were the data gathered for this purpose or used from another source? If the data came originally from another source, what is the source and what were the aims of the original data-gathering exercise?
This information should be available in documentation files, but the user must be sophisticated enough to know what information is needed and what the implications are of the various parts of the documentation. The right questions are required if useful answers are to be obtained.
Even this relatively simple process of re-using digital data therefore requires considerable computing skill on the part of the person accessing and re-using the data. How many archaeologists today have the requisite skills? How many recognize the need for such skills? How many students are being taught those skills? How many are required to have those skills?
Computer data in an archive may not be in the format required by an individual user, but changing data formats is not a particularly difficult process. Either there is software available to do the job or there is not. If not, then the translation process is beyond the skills of most computer users. Even when the data can be translated by extant software or imported by a program for which the file format is not the native one, the result is less automatic than may seem to be the case. The user must know what issues are involved in the translation process, what must therefore be checked or monitored, and what must be supplied by the user.
Data tables are a good example for this issue, because there is a single format that is widely accepted as a standard for data exchange - the DBF format popularized by dBase. This rather simple file format is so widely used as an intermediary that it is all but impossible to find a program that cannot accept data in that format. That would seem to imply that moving data from one database management system to another is very easy, via DBF files as intermediaries.
In fact, moving the data tables is easy. However, complex databases involve many tables that are related to one another in specific ways. The relationships between and among tables are not included in the DBF format. Instead, those relationships must be specified in documentation that accompanies the data tables. Using that documentation, one may connect the various tables as appropriate, but that requires more than the ability to look at data. The potential user must be able to use the software at hand well enough to make those necessary connections between and among tables. That is not trivial.
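A minimal sketch of what making those connections involves follows, in Python with the standard sqlite3 module standing in for whatever database program the user has at hand. The table names (contexts, finds), the linking field context_id, and the rows are all hypothetical; reading the DBF files themselves is not shown. The assumption is that the documentation states which field links the two tables, since the tables alone do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two tables as they might arrive after import: plain columns, no declared relationship.
conn.executescript("""
    CREATE TABLE contexts (context_id INTEGER, trench TEXT);
    CREATE TABLE finds (find_id INTEGER, context_id INTEGER, material TEXT);
""")
conn.executemany("INSERT INTO contexts VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany(
    "INSERT INTO finds VALUES (?, ?, ?)",
    [(10, 1, "bronze"), (11, 1, "ceramic"), (12, 2, "bronze"), (13, 3, "iron")],
)

# The relationship (finds.context_id refers to contexts.context_id) is assumed
# to be known only from the accompanying documentation.
linked = conn.execute("""
    SELECT f.find_id, c.trench
    FROM finds AS f
    JOIN contexts AS c ON f.context_id = c.context_id
    ORDER BY f.find_id
""").fetchall()

# Finds that point to a context missing from the contexts table.
orphans = conn.execute("""
    SELECT f.find_id
    FROM finds AS f
    LEFT JOIN contexts AS c ON f.context_id = c.context_id
    WHERE c.context_id IS NULL
""").fetchall()

print(linked)
print(orphans)
```

The second query is the kind of check a user should think to run after reconnecting tables: a record that refers to a missing context drops silently out of the joined result unless someone looks for it.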
In addition, a potential user of data tables who must import tables in DBF format will have no display forms, no printing forms, nothing for displaying the data other than simple tables. To have those necessary but more complex forms of data display - either on screen or in print - the potential user will need, once again, considerable skill and experience with the program at hand.
CAD models are similar. The AutoCAD® DWG format has become a standard; so many programs can import DWG files. However, those programs may import the files in ways that are problematic. For instance, AutoCAD has no limit on the number of drawing segments called layers in a file, but one widely used program permits no more than 63 layers. When a DWG file with more than 63 layers is imported by that program, what must be done? The user must not only recognize the problem and its potential to damage the data, he or she must also have the knowledge and skill necessary to carry out the required procedures. Similarly, many programs do not support certain AutoCAD modeling notions, using different methods to store a given shape. A potential user must know that and realize how to deal with the differences.
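One way to think about the layer-limit problem is sketched below in Python. It does not read DWG files or call any CAD program; it works only on a list of layer names. The naming convention (fields separated by hyphens, most general field first), the sample names, and the function merge_plan are assumptions introduced for illustration, and merging by name prefix is only one possible strategy.

```python
LAYER_LIMIT = 63  # the limit cited above for one widely used program


def merge_plan(layer_names, limit=LAYER_LIMIT, sep="-"):
    """Map each layer to a target layer, merging by shared leading fields only if needed."""
    if len(layer_names) <= limit:
        return {name: name for name in layer_names}
    max_fields = max(len(name.split(sep)) for name in layer_names)
    # Try progressively coarser groupings until the result fits under the limit.
    for keep in range(max_fields - 1, 0, -1):
        plan = {name: sep.join(name.split(sep)[:keep]) for name in layer_names}
        if len(set(plan.values())) <= limit:
            return plan
    raise ValueError("layers cannot be brought under the limit by prefix merging")


# Hypothetical layer names: material, phase, and a running number.
materials = ["MARBLE", "POROS", "BRONZE", "MORTAR"]
phases = ["P1", "P2", "P3", "P4", "P5"]
layers = [f"{m}-{p}-{n}" for m in materials for p in phases for n in range(10)]

plan = merge_plan(layers)
merged = sorted(set(plan.values()))
print(len(layers), "layers reduced to", len(merged))
```

Whatever strategy is chosen, distinctions carried by the original layers are lost in the merged file, so the mapping itself (the plan dictionary here) should be kept with the translated model.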
Of course, the problems cited can be reversed. Complex surfaces are represented in various programs in differing ways. If those surfaces are imported to AutoCAD or exported from AutoCAD, there will be differences - some of which are very important and others of which may be truly meaningless to the user.
The problems with CAD models are, in some ways, simpler than with databases. There are likely to be fewer absolute requirements for a user who needs to examine a CAD model. However, there are more ways for the data themselves to be corrupted in very subtle ways - by inadequate translation of layers, for instance, or by different ways of representing complex surfaces.
GIS data is in this case far better. Commercial demands have created a GIS marketplace that is beneficial for all data users. Virtually all mainstream GIS programs can use data from most other GIS programs without translation. While that is an over-simplification, it is generally the case that files from one GIS program can be used by others. Many data sets are gathered for one purpose but used by other people with other needs - and other software; therefore the software producers recognized that data must be shared by different programs with different native file structures.
Although GIS data sets are a partial exception, the problems of moving data from one format to another are serious ones. Changes in file format are not simply out-of-sight, insignificant alterations. They can affect the quality and utility of the underlying data. As a result, significant computing skills are often required by those who would use data that must be translated.
Any potential user of digital data must be able to evaluate the quality of the data. That is fairly obvious, but it may not be obvious that evaluation requires some sophistication as to the underlying data organization. That is, one must understand the way data can and should be organized in order to evaluate those data. In addition, data gathering and entry processes must be considered.
Consider this example of a database of fibulae - ancient safety pins - from Gordion. I took the information in the tables from the Gordion publication. Included in the database is a table of types according to the scholar who has published the definitive typology. However, this was a project simply to produce an example for illustrative purposes; so I did not go back to the type study itself, using instead the information supplied about the types in the Gordion volume. Were someone to use these tables, he or she would need to know that the information in the types table is second-hand. That is a problem that can exist in any kind of scholarship, but, when dealing with computers there is a tendency to assume accuracy; so there is an added burden to be sure that the basic information has been gathered according to standards that are acceptable.
Problems that are more specific to digital data are our real concern. Consider this possibility. An excavator has created a database of excavation units with summary information about the pottery from each excavation unit, the summary information being only the styles found in each unit. Unfortunately, there is a problem with doing this. There may be any number of pottery styles in a given excavation unit, since there may be imported and local wares of various kinds, all of which are contemporary. Dealing with these styles is difficult - not because there are many styles for each unit but because the number of styles is not fixed. Experience with database design is required to deal with such uncertain numbers correctly. As a result, the excavator may (depending on his/her experience with database design) have decided in advance to include a specific number of different styles of pottery for each unit. Such a choice represents poor database design but is not unusual. As a user, recognizing this design of the database, one must also recognize what it implies concerning the utility of the data. If there are four possible pottery styles, then the value and completeness of the information have been compromised, perhaps fatally. Furthermore, if there are four data locations where any specific period might be found, all searches must be constructed with that in mind, and the user must understand that simple searches for excavation units with any specific pottery period will be more difficult than expected. In this case, the way the data have been organized, though rather straightforward and simple, has severely reduced the utility of the data, possibly compromising it fatally. There are many such possibilities, and the evaluation process must include checks on these potential problems.
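The consequences of the fixed-number design can be seen in a small sketch, again in Python with the standard sqlite3 module. The table and column names, the style labels A through E, and the three excavation units are all hypothetical. The first table allows four styles per unit, as in the example above; the second stores one row per unit and style, which is one common way of handling a number of styles that is not fixed.

```python
import sqlite3

# Hypothetical observations: unit 3 produced five styles.
observations = {
    1: ["A", "B"],
    2: ["C", "A"],
    3: ["B", "C", "D", "E", "A"],
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE units_fixed (unit_id INTEGER,
        style1 TEXT, style2 TEXT, style3 TEXT, style4 TEXT);
    CREATE TABLE unit_styles (unit_id INTEGER, style TEXT);
""")

# Fixed design: pad with NULL or cut off at four styles; a fifth style is lost.
fixed_rows = [(u, *(s + [None] * 4)[:4]) for u, s in observations.items()]
conn.executemany("INSERT INTO units_fixed VALUES (?, ?, ?, ?, ?)", fixed_rows)

# One row per (unit, style): as many styles as were actually observed.
style_rows = [(u, s) for u, styles in observations.items() for s in styles]
conn.executemany("INSERT INTO unit_styles VALUES (?, ?)", style_rows)

# A search of the fixed design must name every column where the style might be.
fixed_hits = [row[0] for row in conn.execute(
    "SELECT unit_id FROM units_fixed "
    "WHERE ? IN (style1, style2, style3, style4) ORDER BY unit_id", ("A",))]

# The same search against the one-row-per-style table.
normalized_hits = [row[0] for row in conn.execute(
    "SELECT DISTINCT unit_id FROM unit_styles "
    "WHERE style = ? ORDER BY unit_id", ("A",))]

print(fixed_hits)
print(normalized_hits)
```

In this invented case the fixed design misses unit 3 entirely, because style A was the fifth style recorded there; nothing in the stored table reveals that anything was left out.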
There are also potential problems with the way data have been gathered and entered. In many cases, after all, excavation data are entered each year by different people, and those people may be more or less skilled from year to year. As a result, it is important to know what processes were used to aid data entry or to limit the potential for error. Were there prepared lists from which terms could be chosen? Were there checks on spelling or terminology after the data were entered?
I would also want to know how data corrections were made, who could make them, whether they were tracked, and so on. If I look at a card in a paper system, I can tell if there have been changes; I should be able to see that when looking at digital data. (It may not be possible to find the simple mistakes such as typos, misspellings, omissions, and the like, but a digital system is not different from a paper one in this respect, though digital data may be accorded an unwarranted higher level of authority.)
CAD models are similarly in need of careful evaluation. One of the most basic questions regards the way the model has been segmented. If that has been done well, the analytic possibilities have been dramatically increased; if not, . . .
The modeling style is also crucial. How much of the excavation or structure is represented as wire-frame objects, how much as surface-modeled ones, how much as solids? Is the difference obvious? In addition, there may be a great many short-cuts used to deal with items of extreme geometric complexity - a Corinthian capital on a column comes to mind - and those short-cuts must be understood and evaluated.
It should go without saying that the mathematical base of the model should be checked. It is not possible to check individual dimensions unless there is some other record for comparison, but it is possible to check general dimensions to be sure the model has not been scaled.
The model itself will not display the precision used in dimensions and data entry, but the documentation should, and the documentation should be carefully examined for that information. It is tempting to use the dimensions and point locations supplied by a CAD program as if they were very precise, but the true precision depends on the precision used in measuring/surveying and data entry, not on what the CAD model returns in answer to a query.
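Both checks, the check on general dimensions and the check on precision, can be sketched in a few lines of Python. All figures here are invented for illustration: the documented overall length, the lengths queried from the model, the one percent tolerance, and the two-decimal precision are assumptions, as are the function names.

```python
def scale_factor(model_length, documented_length):
    """Ratio of a length queried from the model to the same length in the documentation."""
    return model_length / documented_length


def looks_unscaled(model_length, documented_length, tolerance=0.01):
    """True if the model length is within `tolerance` (as a fraction) of the documented one."""
    return abs(scale_factor(model_length, documented_length) - 1.0) <= tolerance


def report_dimension(value, decimals):
    """Format a queried dimension to the number of decimals the documentation supports."""
    return format(value, f".{decimals}f")


# Hypothetical overall length from the documentation and two hypothetical model queries.
documented = 80.0
good_query = 80.2
odd_query = 24.384

print(looks_unscaled(good_query, documented))
print(looks_unscaled(odd_query, documented), scale_factor(odd_query, documented))

# A CAD query may return many digits; the survey supports only two decimals here.
print(report_dimension(12.3456789, 2))
```

A ratio far from one, as in the second query, suggests that the model has been scaled or that its units have been changed at some point. The last line shows the dimension reported at the documented precision instead of the apparent precision of the CAD query.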
Evaluating GIS data sets is even more complex. As mentioned above, it is common to use GIS data from a variety of sources. As a result, a scholarly data set may include individual files from many sources. Each of those will imply different questions and potential problems unique to the needs and demands of the original data-gathering requirements. Therefore, each file must be approached individually, examined and evaluated independently.
There are many issues specific to GIS that need to be examined. Perhaps the most important one is that of ground-truthing. Remote sensing images of the landscape, usually from satellites, are often used, but it is not uncommon to use them without any check on the accuracy of the data interpretation. What appears a certain color in the image is said to be a specific soil type or vegetation on the ground. It is time-consuming to check - by walking over the area in question - but if that equation has not been checked, the conclusions it spawns are suspect.
Since much of the GIS data is in the form of data tables, the kinds of checks described previously are required for GIS data sets as well.
Evaluating digital data is a complex task. Whether the data sets are relatively simple or extremely complex, there are important questions that must be asked. Many of those questions require considerable sophistication concerning computing issues if they are to be answered.
Re-using digital data from archaeological work is the goal of digital preservation. Re-using data, however, often means re-using more than one data set from more than one source and attempting to combine those data sets so as to gain new insights based on a larger sample. Indeed, that kind of re-use - aggregating data from many sources to enable larger samples to be used - is often seen as the real raison d'être of digital preservation.
The problems discussed above are all multiplied by the process of data aggregation. Every file may need to be translated from one form to another; each must be evaluated before being combined with another file. In addition, it is likely - almost to the level of certainty - that the files will require alterations in order to be used together effectively. Terms will not be the same or the data structure will be different in the case of databases. CAD models will be based on different grid systems or use unrelated data segmentation procedures.
GIS data sets are not so difficult to bring together in terms of data formats, but trying to bring together maps of different scales can cause serious interpretation difficulties. Using data tables designed for different purposes also poses serious problems with interpretive processes. In addition, much GIS data should be considered time-sensitive, but that aspect of the data is easily ignored when data sets for the past are needed but not available.
Data aggregation assumes the skills to use and evaluate data files discussed above, but significant additional skills are required to bring disparate sets of data together into a unified system. Considerable skill and experience with computing software - whether database management systems, CAD, or GIS - will be needed to bring data sets into a common system for use together.
Data archives may, in the future, be able to perform for potential users some of the critical functions I have been talking about. They will certainly be able to offer information to help with evaluation, for instance, and they may eventually offer access to data that effectively includes the necessary software. I can imagine, for instance, a Web browser with built-in database features that allow me to access a sophisticated relational database as if the data were on my own computer - but without needing to translate files or even to download them. Data aggregation is certainly an aim of archives as well - if not in the short term at least in the very long term.
Optimistically, then, one might hope that it will be possible to use data without needing to do everything oneself. I would argue, though, that the potential is there to eliminate some of the problems but certainly not all. The need to translate data from one format to another may eventually recede into the background as a major issue. It may also become relatively easy to access material over the Internet and to treat that material as if it were on your own desktop. But I do not believe it will ever be a good idea to allow others - even such well-intentioned groups as the Archaeological Data Archive Project or the Archaeology Data Service - to take on such crucial roles as data aggregation and to assume that they will be able to perform all the necessary services in precisely the way you or I would think best if we were to perform them. That is, as much as we might like to think otherwise, computers perform relatively few automatic processes. We instruct them. We provide the parameters. We provide the algorithms. That being the case, there will always be disagreements about the proper instructions, parameters, and algorithms - and we will need to be able, at the least, to evaluate them in order to evaluate the quality and utility of the resulting digital data.
Thus, the kinds of computing skills I have been talking about will remain important. We, the end users of these data files, will need to be computer savvy if we are to be able to evaluate and utilize the data preserved by archives.
We are collecting digital data. We are preserving it. We are rarely re-using it. That is the case now because the quantity of such digital data of interest to any one scholar is relatively limited and because the abilities of scholars to utilize the data in digital format are very limited. As the quantity of digital data increases, I would like to think the computer skills of scholars will also increase. But how will that happen? So far, most scholars who use computers well are self-taught or have been taught computing skills outside archaeology. Those who have learned within archaeology programs have, in general, learned those skills in response to a particular need for a particular project. Up to this point, that may have been sufficient, but the skills must be obtained by a larger portion of the body of archaeologists in the future. Nearly all must be able to use the technology well in order to access effectively the preserved data from their colleagues.
There seems to be an assumption abroad that, if some were able to learn computing without formal training, all can. The idea is that none of this is inherently difficult and little formal training is required. I think that assumption must be put to rest, and my own experience strongly attests to its folly. I have seen too many projects that prove the need for more formalized instruction, and, sadly, I usually see them after a great deal of time, energy, and money have been wasted.
Very often the correct type of software is not used - spreadsheets when a database is needed; photo editing software or illustration software when CAD is needed, CAD when GIS is appropriate and vice versa. Even when the right software has been chosen, the problems are numerous. Databases can be created easily; good ones cannot. CAD models are not hard to make; good ones are. GIS data sets may be more difficult to create; good ones are much more difficult yet. People who learn how to use the technology well will use it well; those who just learn how to use it may, if lucky, also learn to use it well; it is more likely that they will not.
I think the time has come to recognize that computing skills are required for archaeological work today and to acknowledge that those skills must be taught. We would not presume to tell a student to go teach himself or herself how to excavate in a controlled stratigraphic manner. Nor would we consider sending students out to dig without first explaining the importance of context and proper recording to preserve the information uncovered. Why do we act as if computing were different?
Paper from the April 2000 SAA session "Digital Data: Preservation and Re-Use."