Jeff Rothenberg, a senior research scientist at the Rand Corporation, has written extensively about the problems of preserving digital information. He recently wrote a report for the Council on Library and Information Resources discussing the problem and describing the solution he has long favored (Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation, January 1999). Mr. Rothenberg’s position has received considerable media attention and is therefore rather widely known.
Mr. Rothenberg rejects data migration (changing data files into new formats as necessary so that they can be used by new software) as a way to preserve data, arguing that it may be better than nothing but that it is far more difficult to accomplish effectively than proponents, such as myself, acknowledge.
Mr. Rothenberg also objects to migration because the migration process may preserve information, but it will not preserve the "look and feel" of the original data file(s). For example, a CAD file from today might have to be converted to an environment much more like a virtual reality (VR) environment to be used in a few years. That is likely because CAD and VR have so much in common, and VR techniques can be expected to force CAD programs to change the way they work. As a result, users of a CAD file (migrated to be used in a new program ten years or so from now) will probably have VR-like capabilities. Those future users will be able to see on-screen views that seem to permit them to walk about within the model, look around, see the model from any angle, change the lighting, and so on. The creator of the model, though, would have had far more limited viewing options - and a much less realistic sense of the model. Unless the program used by the creator can be emulated, there may be no way in the future to provide the more limited capacity of today’s software - and, consequently, to let a future user understand the limitations encountered by the data creator.
The hardware of the future will doubtless be much better as well; as a result, future users may be able to do many things we cannot. For instance, scholars in 20 years will probably be able to see levels of detail in images that we cannot. Consider the difference between a superb computer image and a photograph. Although there is still a huge difference today, we can expect the computer image to get closer and closer to the photograph over time.
Mr. Rothenberg proposes emulation as a preservation strategy to overcome the problems already mentioned. Emulation in this context means creating programs for computers of the future so that they can run "old" software and read "old" data. The emulators enable computers to mimic other computers, so that today’s computers can mimic old ones and computers yet to be invented can mimic today’s computers. Having mimicked the old computer, the newer one can then run the old computer’s software. Consider, for example, trying to run Electric Pencil (my 1979 word processor) on a modern computer. Electric Pencil was designed to run on machines using a computer chip now six or seven generations old, with a monochrome screen (text reading as green or white on black) under the CP/M operating system. A good emulator would make it possible to run that old operating system (CP/M) and then to run the word processor as if I were back in my 1979 office. I would then be able to access any document created with Electric Pencil. Software would translate all processes so that the hardware would react appropriately - showing the text as white on a dark screen with the cursor moving only in response to arrow keys (no mouse back then), saving files on command, printing on command, and so on.
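At its heart, an emulator is an interpreter loop: fetch one of the old machine's instructions, decode it, and reproduce its effect in software on the new host. The sketch below illustrates only that idea; the three-instruction machine, its opcodes, and its memory model are invented for this example and do not correspond to any real CP/M-era chip.

```python
# Toy fetch-decode-execute loop: the essence of a hardware emulator.
# The three-instruction "old machine" here is invented for illustration.

def emulate(program, memory):
    """Run a toy 'old machine' program on the host, one instruction at a time."""
    acc = 0          # the old machine's single accumulator register
    pc = 0           # program counter
    while pc < len(program):
        op, arg = program[pc]
        if op == "LOAD":      # copy a memory cell into the accumulator
            acc = memory[arg]
        elif op == "ADD":     # add a memory cell to the accumulator
            acc += memory[arg]
        elif op == "STORE":   # write the accumulator back to memory
            memory[arg] = acc
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1
    return memory

mem = {0: 2, 1: 3, 2: 0}
emulate([("LOAD", 0), ("ADD", 1), ("STORE", 2)], mem)
print(mem[2])  # the old program's result, computed on the new host
```

A real emulator must reproduce not just the processor but the screen, keyboard, and disk behavior the old software expects, which is what makes the "as if I were back in my 1979 office" experience possible - and what makes emulators expensive to build well.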
Emulation has been widely used in the business world to make it possible for people to continue using business data without migrating the data to new formats. Aside from keeping data in use, one of its principal advantages is its ability to maintain the "look and feel" of files, since they are running in the environment in which they were created (CP/M, Windows 3.1, or Mac System 6, for instance). Another advantage is that the emulation is simply a program running on a machine; using it does not alter the underlying data file in any way. Thus, there is no fear of changing the data accidentally. (The data would need to be moved to new media - different disks or tapes - but the data format would not change.)
Mr. Rothenberg points out that another advantage is that one emulator will provide access to large numbers of files. A good emulator for a Pentium® PC, for instance, would permit Windows 98 and all programs created for Windows 98 to run on the machine. In comparison, migrating a data file makes only that single file available to potential users. This is both a convenience and a cost saving.
Mr. Rothenberg also argues that data migration is very difficult to accomplish at certain critical moments - when software shifts abruptly so that new paradigms arise. For instance, as discussed above, CAD may well be subsumed by virtual reality programs. When that happens, how will migration then change a CAD model? Such a model, after all, would lack the texture and color information that is crucial to virtual reality systems. In fact, the model may even be a two-dimensional one, showing plan views only. How should such a model be represented in a VR environment?
Despite the advantages of emulation noted above, there are also significant problems with that approach. First, I am not as optimistic about the possibility of making emulators as Mr. Rothenberg is - not for technical reasons but for financial and practical ones. Who will make all the emulators required, and why? Who will pay for the work?
Second, as time marches on and we have more and more new machines, will new emulators run on old emulators? For instance, suppose we now were using an emulator for that old machine of mine. We might run CP/M on that emulator to operate Electric Pencil in order to see a 20-year-old document from my old computer. When the next Intel chip comes along, we will need an emulator for the Pentium II. Will the emulator for my old machine run on the Pentium II emulator, or will a new emulator for my old computer be constructed to run on the new machine - and similar new emulators be created for all other old models? Neither choice seems desirable, and I cannot imagine finding ways to fund either process of emulator construction. (1)
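The two choices in the paragraph above can be compared with back-of-envelope arithmetic: either every old machine gets a fresh emulator on every new host, or each new host emulates only its immediate predecessor and old emulators run stacked on new ones. The numbers below are illustrative assumptions, not data from the article.

```python
# Back-of-envelope comparison of the two emulator-construction strategies.
# Both counts below use invented, illustrative numbers.

old_machines = 50   # assumed: distinct old machines whose data must stay readable
generations = 10    # assumed: new host generations over the coming decades

# Option A: rebuild an emulator for every old machine on every new host.
emulators_rebuilt = old_machines * generations

# Option B: build one new emulator per generation, each running atop the last;
# old emulators are reused, but access to the oldest data passes through
# a stack of translation layers that grows with every generation.
emulators_stacked = generations
stack_depth_for_oldest = generations

print(emulators_rebuilt, emulators_stacked, stack_depth_for_oldest)
```

Option A multiplies construction costs; option B keeps construction cheap but compounds translation layers, which is the incompatibility risk footnote 1 describes.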
Third, a user of "old" data would have to own or have access to emulators for every machine used to create any data he or she intended to use - as well as the programs used to create the data. (2) Thus, any archaeologist wanting, for instance, something like the Lerna files we have created using the program called Access® would have to have a Pentium Pro emulator running Windows NT version 4.0 and a copy of Access - even if said scholar were living 100 years from now.
Fourth, there are many questions surrounding the specifics of the hardware to be emulated. Must every PC combination (specific processor, with a specific hard drive, graphics card, monitor, and so on) be emulated? If not, which data types are privileged to have their hardware emulated? And which hardware emulation is appropriate for a given set of data: the hardware on which the files were first created, the hardware on which they were last used, or some more generic hardware appropriate for the operating system?
Fifth, there is the question of what versions of each software application are required. That is, must one use the specific version of Access that was used to produce the data files? If not, what are the criteria by which one decides that version 1 is very different from version 5 and must therefore be kept available for use, while intervening versions 2, 3, and 4 need not be retained?
Sixth, in addition to needing multiple emulators, operating systems, and programs, a scholar in the future would need to know how to operate all those emulators, operating systems, and application programs - virtually an infinite number - in order to use the data provided from an unlimited number of old computers. For instance, just to be able to see database tables from the last decade of the 20th century, a scholar would have to be conversant with, at the least, UNIX, the Mac OS (various versions), DOS (various versions), Windows (3, 3.1, 95, 98, and NT 3, 3.5, and 4), Oracle®, dBASE, FoxPro®, Access, FileMaker®, and Paradox®. Similarly, someone working with CAD models would need to know the same operating systems plus AutoCAD®, Microstation®, MiniCAD®, FormZ®, ARRIS®, and . . . Of course, the user would need to know multiple versions of each of those application programs.
I see three distinct issues involved here. One is the importance of preserving the "look and feel" of data in its original context. The second is the difficulty of migrating data at those crucial times when data organization must be changed to reflect conceptual changes in software and data organization. The third is the practicality of migrating data as compared with that of making and using emulators. The first two are weaknesses of the migration approach to saving digital data. The third is the weakness of the emulation approach.
On the first point - the "look and feel" issue - there is no doubt that Mr. Rothenberg and others who object to data migration as a preservation strategy are correct. The appearance of data, the way the user interacts with the data, the speed with which processes happen, and much more will change as users move to newer, stronger, faster computers.
The importance of this issue may, however, be overstated. Is the data important or is the presentation? It seems to me that in some cases (scholarship generally) the importance should clearly be attached to the information content, not the appearance or performance of the system. In other cases (the arts especially) appearance and performance are crucial. Given the nature of the archival work of interest to the Archaeological Data Archive Project, appearance and performance issues are not critical; data content is. As a result, emulation is not necessary to preserve the "look and feel" of old data. "Look and feel" may be more important in electronic publications, but I believe that preservation of "look and feel" is unnecessary in this area as well (see "Electronic Publication for the Archaeological Data Archive Project" in this issue).
Major shifts in software that will require changing the way data are organized can surely be expected. Data migration will then be more problematic, because it will be a more complex process requiring scholarly attention, not just technical work. It would be a mistake to minimize the importance of this issue, but, on the other hand, such changes in software are likely to require re-casting data anyway, since the intellectual models of interest will naturally be related to the organizational schemes demanded by data recording and storage models. If, for instance, relational databases are replaced by another form of data storage, will we not want old data to be re-cast so that relational data can be used in the same ways newly gathered data are?
Finally, there is the question of practicality. Mr. Rothenberg points out that, for the migration strategy to work, every file must be migrated before use; therefore, the labor requirements are onerous. I do not expect the difficulties to be so severe. Once a file in the Archaeological Data Archive has been archived, the difficult problems - documentation - have been dealt with. Migrating is then a relatively labor-free act except when there is a real shift in data organization involved. Even in those more difficult cases, once the migration scheme has been determined, each new set of files will be easier to migrate. Furthermore, cooperative arrangements such as the informal one between the ADAP and the Archaeology Data Service in England should help us apply real expertise to the problems and to be able to automate much of the work.
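The point that migration automates well once the scheme is determined can be sketched simply: the hard, scholarly step is deciding the mapping from the old schema to the new one; applying that mapping to every file in the archive is then mechanical. The field names and the two "formats" below are invented for illustration, not the actual Lerna file layout.

```python
# Sketch of migration as a repeatable transformation.
# Old and new schemas here are invented, illustrative examples.

def migrate_record(old):
    """Recast one archived record from an old schema into a new one."""
    return {
        "identifier": old["id"],
        "description": old["desc"],
        # a field the old schema lacked gets an explicit, documented default
        "coordinate_system": old.get("coords", "unrecorded"),
    }

# Deciding the mapping above is the scholarly work; running it over the
# whole archive is then a mechanical batch job:
old_files = [{"id": "LER-001", "desc": "trench 1 plan"},
             {"id": "LER-002", "desc": "trench 2 plan", "coords": "local grid"}]
new_files = [migrate_record(r) for r in old_files]
print(new_files[1]["coordinate_system"])
```

This is also why cooperative arrangements help: once one archive has worked out and documented a sound mapping for a format, others can reuse it wholesale.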
The practical problems with emulation, though, will impact all who must try to use old data. Every user of old data will need to know how to use many emulators, operating systems, and application programs. That is an enormous burden, one attested to by those of us who cannot now easily use programs we used daily only a few years ago. Although this could be overcome by creating common interfaces for certain data types, that would vitiate the advantage of the emulation strategy - maintaining the "look and feel" of the original. The user might have the same on-screen appearance and speed but not the same operating environment as the creator.
Users of emulators will also be dependent on the emulators to a degree I would find unsettling. I can imagine no competition to produce emulators and, therefore, little incentive for quality control. Furthermore, those who produce emulators will be able to determine what computers are available for use by deciding what emulators they will create. Of course, they may also be able to determine what data are lost to us by declining to emulate machines on which such data have been generated.
In the final analysis, I think both data migration and emulation have serious problems. Data migration seems more desirable to me because it depends on relatively predictable operations of the marketplace (migration paths will be available because businesses will need them, i.e., pay for them) whereas emulation demands special programming that may or may not have perceived value to anyone with the means to pay for it. In addition, the need to know how to use an unlimited number of software packages makes emulation seem quite impossible to me. I must grant, however, that there are real problems with the "look and feel" issue and that data migration cannot fully deal with those problems.
-- Harrison Eiteljorg, II
(1) Running emulators on emulators is more likely, since it requires only a limited number of new emulators as time moves on. Unfortunately, though, the variety of hardware requiring emulation means that the construction of emulators on emulators can be counted on to create incompatibilities as the law of unintended consequences intervenes and as each new machine adds new complexity to the process. That, in turn, would require that there be people available who can fine-tune old emulators to mimic machines that have not existed for decades.
(2) Mr. Rothenberg rightly comments that access to data from old programs does not require a program capable of editing the data, only viewing and manipulating the data as a user, not a creator. Thus, the programs required in the future would be simpler than the programs used to create the files. Nevertheless, a user must have a program suitable for every data type and file format to be encountered.