Harrison Eiteljorg, II
These data mining discussions lead to several interesting and important issues, but they seem to me to look well beyond the urgent problems we are facing in the field today and, in so doing, I fear, to encourage scholars to focus on the wrong problem.1 If the problem we currently face were access to individual data items from archived excavation files, those discussions about data mining and the ways to make data mining possible would have been crucial. (I would be questioning some of the underlying assumptions here if we had reached that stage.) However, the problem we currently face is, in my view, much more basic. There simply are not many archived excavation files available either for downloading or for data mining -- or simply for long-term preservation. In the United States the number is vanishingly small.
While there are many reasons for ignoring the elephant in the room, I believe the critical issue can best be summarized by a single word never uttered in the meeting: money. I am aware of no archival plan in the U.S. that has found a funding system that could support an on-going archival storage program for digital archaeological data, and the problem is not technical. It is financial. As a result, neither excavators nor field surveyors can include real, proven archival procedures in plans for their data. They can, of course, prepare files and compile the kind of documentation necessary for archival storage, but I know now of only one active digital repository (down from two not long ago and three before the demise of the Archaeological Data Archive Project) in the United States that might take archaeological data, and that repository does not seek materials from outside its own academic halls.
Thus, the discipline has a very short-term problem that, from my point of view, overwhelms the long-term one of data mining: there are effectively no archival repositories in the U.S. for our data files. If the files are not saved today, the potential for mining their data tomorrow is zero. To be sure, the files can be left in a university computing center on the assumption that they will be cared for. They will be. They will be kept pristine so long as the computer center exists. But maintaining files unchanged and unharmed is not helpful. That is not what digital repositories must do. The files must be actively managed, migrated to new forms as necessary -- and migration will be necessary on a regular basis. Only that process will keep the files useful and accessible. That process, in turn, requires substantial documentation from the file creators.
The Archaeological Data Archive Project (ADAP) operated by CSA was not successful precisely because no way to support the repository financially -- what would be called a business model in another context -- had been put into operation and none could be conjured for the future. Indeed, the earliest paper that was presented as part of the work of the ADAP, in London in 1994, still seems to have had the critical technical issues for archiving well in hand (not including the issues for data mining which I take to be a follow-on issue), but the business model was unclear then and is no more clear now.
The only attempt to revisit the need for digital archives for archaeology of which I am aware takes as a given the assumption that any successful data archive will be self-sustaining. Yet the only example of such a self-sustaining digital archive that I am aware of in the U.S., the Inter-university Consortium for Political and Social Research (ICPSR) at http://www.icpsr.umich.edu/org/index.html, is for social science data, and there are more than 500 members of the consortium. This archive is sustained by a very large consortium of academic users, far larger than the archaeological community could put together. Otherwise, digital archives are supported by individual or foundation donors, universities, governments, corporations, or museums but not users. They may one day be supported by depositors (as are some repositories for archaeological finds), but that day will arrive only if and when funding agencies both require and fund depositing the data. In the meantime, the problem becomes more urgent with each passing day. There are no repositories for our precious data. Debating how we will use the data in 10 or 20 years -- and how to archive the data to make data mining possible -- seems to me to be grossly premature. The critical immediate need is to get the files into good digital archival repositories now, before it is too late. Once the files have been archived -- and archiving has become standard -- the next steps can be considered.
Words such as the foregoing are nothing but an empty exhortation to go forth and archive unless I can go to the next step and make some serious suggestions. So I will attempt to do that, to make some recommendations. To do so, I begin with my perceptions of the needs of an archaeological data archive.
1. Tasks: I believe there are four general tasks: data acquisition, data protection, data migration, and data serving. (This definition does not include provisions for data mining, which I consider, at best, to be a task for the relatively distant future.)
Data acquisition includes checking data documentation, examining documents that accompany data files, determining data file integrity, approving file formats, logging information about incoming data files, and storing the files on the chosen media.
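The logging step of data acquisition can be illustrated with a minimal sketch. It is not a description of any ADAP procedure: the function names, the record fields, and the list of approved formats are all hypothetical, and a real repository would set its own policies.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical list of approved formats; a real repository would define its own.
APPROVED_FORMATS = {".csv", ".txt", ".xml", ".tif", ".pdf"}


def sha256_of(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def accession_record(path, depositor, has_documentation):
    """Build a log entry for one incoming data file (illustrative fields only)."""
    path = Path(path)
    return {
        "file": path.name,
        "depositor": depositor,
        "size_bytes": path.stat().st_size,
        # The checksum gives later integrity checks something to compare against.
        "sha256": sha256_of(path),
        "format_approved": path.suffix.lower() in APPROVED_FORMATS,
        "documentation_received": has_documentation,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```

A record of this kind covers only the mechanical parts of acquisition. Judging whether the accompanying documentation is adequate remains a task for a person.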
Data protection is an awkward term for simply checking the files on a regular basis to be sure there has been no damage to the magnetic or optical signals that encode the data. (The ADAP procedure was to make three copies of files on CDs, store them in separate places, and compare them every six months, assuming that damage would not happen to all three copies in the same way on the same schedule. Finding differences between files would permit correcting the erroneous ones and writing correct files to new media.)
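The three-copy comparison can be sketched as a simple majority vote. The sketch below assumes that copies are compared by checksum and that their contents are already in memory; both are assumptions made for illustration, as are the function names and labels, and none of this is a record of how the ADAP actually compared its CDs.

```python
import hashlib
from collections import Counter


def checksum(data):
    """Return the SHA-256 digest of a copy's contents."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies):
    """Compare copies of one file and identify those that disagree with the majority.

    copies maps a label (for example, a storage location) to the bytes read
    from that copy. Returns (good_label, damaged_labels). If no version is
    held by a strict majority, good_label is None and every copy is suspect.
    """
    sums = {label: checksum(data) for label, data in copies.items()}
    winner, votes = Counter(sums.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        return None, sorted(sums)
    good = next(label for label in sorted(sums) if sums[label] == winner)
    damaged = sorted(label for label in sums if sums[label] != winner)
    return good, damaged
```

With three copies, two agreeing copies are enough to identify the third as damaged, and the good copy can then be written to new media. The scheme rests on the same assumption stated above: that damage will not strike the copies in the same way at the same time.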
Data migration, of course, is the process of moving data from one file format to another. Except in very unusual circumstances, it should be possible to migrate files with automatic procedures. However, migration processes will produce new files in new formats, and the content must be checked. Archaeologists who understand the data in its scholarly context will need to check the data at the conclusion of migration processes to be sure that the data have not been damaged, less by alterations to the data items than by organizing or presenting the data in subtly different ways.
Data serving is the process of providing access to the files for downloading. It assumes that the repository also provides finding aids to permit resource discovery via the Internet. For the archive I conceive to be necessary today, it does not assume provisions for data mining.
2. Personnel: For the early years of such an archive, when little data migration will be undertaken, the personnel can probably consist of a single full-time person who is more salesman than scholar or archivist. Some personnel will be needed to help with incoming data; the needs will depend on the quantity of data arriving. Serving the data files, at least until there are many files in the archives, should require relatively little personnel time. More personnel time will be needed to make sure standards for resource discovery data are properly followed.
Eventually cultural area specialists (probably not full-time but employed for specific projects) will be required for data migration work. They will not be responsible for the migration process itself but for checking to be sure the migration has been properly accomplished without harm to the data organization and/or presentation. Technical personnel will be required to manage the deposit of data files and the preservation of the data files, but the term technical should not be taken to imply very highly specialized computer experts.
3. Expenses: Funds will be needed for computer systems and personnel. The computer systems will include servers for a Web site and the necessary data storage systems to hold files safely. Personnel will include those mentioned above.
4. Income: I do not believe users2 will pay for access to data -- or that access payments could cover a significant portion of the costs of operating a repository, probably not even enough to justify the extra costs involved in billing users. If the repository is to be independent, that leaves only data depositors to provide income for a self-sufficient repository. It is my view that they should pay for the services of the repository in the following ways. First, they should pay for the direct costs of depositing data files -- the costs of checking files, checking documentation, and so on. Indeed, they should directly perform as much of that work as possible. (One good reason for projects to pay such costs is to encourage them to document their data files as they go along. Doing so will greatly reduce the costs of preparing the files for deposit.) Second, they should pay the costs of serving the data to users, that is, a share of the costs of the computers and personnel required to provide the files to users.3 Third, depositors will need to pay the costs of preservation and long-term maintenance of the files -- data checking and data migration. Some estimate of the costs of data serving and those of data checking and migration must be made in advance, before files have been deposited, and I think the only way to cover those costs is for the data depositor to provide a kind of endowment, the interest from which will provide funds for these services. While the costs of the general computing services required can be rather accurately predicted, the costs of data checking and migration will be harder to estimate. Yet both these estimates will be crucial to the repository's future. If the money received is inadequate to generate funds for on-going work, the repository could fail, with disastrous consequences.
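The arithmetic behind the endowment idea can be made concrete with a short sketch. It assumes the simplest possible model, in which the principal is never spent and a fixed annual rate of return pays all recurring costs; inflation and rising migration costs are ignored. The function names and all figures are hypothetical and are not estimates of real repository costs.

```python
def required_endowment(annual_cost, annual_return):
    """Principal whose yearly interest covers annual_cost at the given rate."""
    if annual_return <= 0:
        raise ValueError("annual_return must be positive")
    return annual_cost / annual_return


def annual_shortfall(endowment, annual_cost, annual_return):
    """Yearly gap between actual costs and the interest earned (0 if covered)."""
    return max(0.0, annual_cost - endowment * annual_return)
```

For example, a deposit expected to cost 500 a year to serve, check, and migrate would need an endowment of about 12,500 at a 4 percent return. If the true cost turned out to be 800 a year, the same endowment would fall short by about 300 every year, which is the danger of underestimating described above.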
Given the foregoing, what kind of a business model can be imagined? First, it seems to me that no successful model of a completely independent organization can be designed at a bearable cost. The costs for computer hardware and personnel are too great during the years before the repository has a critical mass of data (a period I estimated to be a decade nearly ten years ago, and it has not grown shorter), even if most of the hardware and storage capacity is obtained via a service provider. Access to some existing computer center, on the other hand, can provide many services with virtually no incremental costs to the computer center. Thus, I believe that a successful business model begins with a repository that can piggy-back on a computer center somewhere, presumably a university.
Second, personnel costs can be very small until data files are numerous. Although a director will be needed from the beginning, and the salary of such a person will be relatively high, other personnel will be part-time employees (and can probably be people already at work in the computer center, hired for specific jobs).
Third, the time between start-up and something approaching normal operation (whatever that may mean) cannot be predicted. As noted above, I once thought that the Archaeological Data Archive would take a decade to get going; after nearly a decade, it did not seem that we were closer. In that decade, however, there have been important changes. There are more files; there is more awareness. A decade may now be a more reasonable estimate, but it is still just that, an estimate.
Fourth, the repository must make a much more vigorous effort to acquire data than did the ADAP. I initially operated the ADAP as if the job were to make people aware of the problems and of the ADAP as a solution, refining the arguments and the message but not working hard enough at finding/helping/encouraging potential data depositors. I needed to do much more -- to help scholars find funding and to build an archive that had some utility to other scholars. "Nothing succeeds like success" is an oft-used aphorism for good reason.
Fifth, the repository should have initial support from an institution or institutions with project data to be archived. That would provide serious opportunities for putting the theories into practice, cost experience, and a reasonable expectation of utility for the archive in the near term. (I believed at the time the ADAP was started that institutional jealousy made it preferable to have the repository separate from educational institutions so that depositing data in the archive could not be seen as disloyal. Since I still believe that is a concern, the more institutions involved from the beginning, the better.)
Sixth, significant funding for the first decade of the repository will be required to pay salaries. However, the total required should be within a range that is defensible for the importance of the long-term goals.
Seventh, every project that produces data for deposit will need to find funds to pay for depositing the data. Near-term, that is the most problematic item. It is unrealistic to expect the same funding agency or agencies that support the repository directly to do so indirectly by paying archival fees. In many cases, some additional funding will be available from forward-looking granting agencies that see the archiving as the final step in a project. In other cases, funding will be more difficult, sometimes impossible. In the long run, I think the stick rather than the carrot will be at work; data archiving will eventually be required by those government agencies that grant work permits. That, in turn, will require that budget plans include the costs of archiving.
Putting the foregoing together, I see a possibility for a successful business model, but not a certainty. I believe that the following are required. 1. A university -- and as many partners as possible -- with a strong archaeology program, a large computer center (not necessarily a large computer science program), data ready for archiving (and funds to cover the costs), and a willingness to commit to long-term support of the repository on the assumption that the best possible planning for self-sufficiency will be undertaken. 2. A sponsor -- whether the institution(s) or an outside funding source -- to bear the costs of the first ten years of operation. 3. Additional funding to cover the costs of archiving data from projects outside the supporting institutions.4
Such a repository will come about only if a number of people from the institutions and funding agencies involved are willing to commit themselves to a course that is ethically and professionally demanded but that involves clear risks. If they succeed, they will be praised as visionaries. If not, we can always continue debating the possibilities for mining nonexistent data.
-- Harrison Eiteljorg, II
1. As if to emphasize the difficulties with data mining, or at least the problems with very complex data access, the NY Times reported the following on the front page of the January 14, 2005, edition: "The Federal Bureau of Investigation is on the verge of scrapping a $170 million computer overhaul that is considered critical to the campaign against terrorism but has been riddled with technical and planning problems, F.B.I. officials said on Thursday." That article was followed by Nicholas G. Carr's Op-Ed piece, "Does Not Compute," on Saturday, January 22. Mr. Carr argued that large and innovative software projects more often end in disaster than success.
2. Users here means individual scholars, though they may obtain access via their institutions. By asserting that users will not pay for access, I am indicating that I believe neither the users themselves nor their institutions will pay for access to archaeological data archives.
3. I can understand objections here. It is customary for projects to bear parts of the costs of publication, but buyers do purchase project publications. They share the burden. In the case of electronic files, however, I believe that usage will be both too hard to predict and too uneven to make it a reliable source of income.
4. Note that I have not included any financial support from professional archaeological organizations. While I believe such groups can and will provide encouragement and will assist with the ethical arguments, they do not have excess funds with which to help on the financial side.
Table of Contents for the Winter, 2005 issue of the CSA Newsletter (Vol. XVII, no. 3)