A record is a record, whether it's a sheet of paper, an e-mail, an electronic document or a digital image.
"It's the content that drives retention, not the media it's written on," says Adam Jansen, a digital archivist for the state of Washington. And recent federal regulations are requiring more companies to save more content for longer periods of time. While content may be king in theory, in practice, the media on which it's stored and the software that stores it present problems. As digital tapes and optical discs pile higher and higher in the cavernous rooms of off-site archive providers, businesses are finding them increasingly expensive to maintain.
The software that created the data has limited backward compatibility, so newer versions of a program may not be able to read data stored under older versions.
Moreover, the media on which the data is stored degrade relatively quickly. "Ten years is pushing it as far as media permanence goes," says Jansen.
Varied Approaches
Today, the only safe path to long-term archiving is repeated data migration from one medium and application to another throughout the data's life span, experts say.
But the storage industry is working on the problems from various angles.
One solution to the backward-compatibility problem is to convert data to common plain-text formats, such as ASCII or Unicode, which support all characters across all platforms, languages and programs. Using plain-text formats to store data enables virtually any software to read the files, but it can cause the loss of data structure and rich features such as graphics.
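As a rough illustration (not from the article), a conversion step along the following lines, sketched in Python with a made-up record structure, keeps only the character content of a document and writes it out as Unicode text; styling, layout and embedded graphics simply aren't carried over.

    # Minimal sketch: reduce a hypothetical structured record to plain Unicode text.
    # Only the character content survives; formatting and images are dropped.
    def to_plain_text(record: dict) -> str:
        parts = [record.get("title", "")]
        parts.extend(record.get("paragraphs", []))
        return "\n\n".join(p for p in parts if p)

    record = {
        "title": "Retention schedule",
        "paragraphs": ["Keep fiscal records for six years.",
                       "Destroy drafts after one year."],
        "images": ["org_chart.png"],   # lost in the conversion
    }

    with open("retention_schedule.txt", "w", encoding="utf-8") as f:
        f.write(to_plain_text(record))  # UTF-8 text: readable by virtually any software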
Another approach is to use PDF files to store long-term data. There can be backward-compatibility problems with PDFs, but the file format's developer, Adobe Systems Inc., has created an archival version of the format, called PDF/A, that addresses them.
To date, the most promising standard data-storage technologies are emerging in new XML-based formats, according to analysts and studies. XML is a self-describing, text-based markup language that is independent of any particular hardware or operating system.
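"Self-describing" means the format carries its own field names, so a generic parser can make sense of a record without the application that produced it. The snippet below is an invented example, not an actual archival schema; it simply shows a small XML record being read with Python's standard-library parser.

    import xml.etree.ElementTree as ET

    # Hypothetical archival record: the tags describe the data they contain,
    # so any generic XML parser can read it, independent of the original application.
    xml_record = """
    <record>
      <title>Meeting minutes</title>
      <created>2005-03-14</created>
      <creator>County Clerk</creator>
      <body>Budget approved for the next fiscal year.</body>
    </record>
    """

    root = ET.fromstring(xml_record)
    for child in root:
        print(child.tag, "=", child.text.strip())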
On the media side, the Storage Networking Industry Association (SNIA) is working to solve what it calls the "100-year archive dilemma" through a standards effort. The goal is to store data in a format that will always be readable by a generic reader.
"Degrading media is not at all the issue. Rather, the real issue is long-term readers and compatibility -- the logical problem which we intend to address," says Michael Peterson, president of Strategic Research Corp. in Santa Barbara, Calif., and program director for the SNIA Data Management Forum.
Some businesses are postponing the long-term archival problem with large farms of disk arrays, which keep data online and accessible. Jim Damoulakis, chief technology officer at Framingham, Mass.-based consultancy GlassHouse Technologies Inc., suggests that companies look into using an emerging class of inexpensive disk arrays as a storage medium. "At least you know the data is there and readable," he says. "A tape or optical media sitting in a vault can degrade."
The new disk arrays, sometimes called disk libraries, are based on relatively inexpensive ATA disks, formerly used only in PCs.
Peterson says that this is a temporary solution, however. "Long term, I am not sure that current disk interfaces won't have the same migration problem [as tape]," he says. "Whether it is tape or disk, you are going to have to migrate."
Managing Metadata
Meanwhile, users struggle on. Last October, for example, Jansen and his IT team completed a three-year project to create an open-systems-based archive management center for the state of Washington that will house records from 3,300 state and local agencies in perpetuity.
The center, in Cheney, Wash., currently stores 5TB of data and is expected to grow to 25TB by the end of the year. It cost about $1.5 million for management software and hardware, including servers, a storage-area network and tape drives. Washington spent $1 million more on a joint development project with Microsoft Corp., which is helping the state create what it hopes will become an open format.
"We want to avoid proprietary file formats to the extent it's possible," Jansen says.
He says that the most important part of any long-term archival system is centralizing the backup of data so that the storage method can be standardized. At the heart of the state's archival system is the storage of metadata, the information that describes the data.
When documents are transmitted over the WAN to a central data center, information such as who created the document, what type of document it is, where it was created, when it was created and why it was created is captured and stored in a SQL database. That way, "20 years from now, you don't have to know that particular document, but you can perform a search based on the record type," Jansen says.
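The article doesn't publish the state's schema, but a minimal sketch of the idea, using Python's built-in SQLite module and invented column names, looks something like this: the descriptive metadata is stored alongside a pointer to the document, so future searches can run against record types rather than individual files.

    import sqlite3

    conn = sqlite3.connect("archive_metadata.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id          INTEGER PRIMARY KEY,
            creator     TEXT,   -- who created the document
            record_type TEXT,   -- what type of document it is
            origin      TEXT,   -- where it was created
            created_at  TEXT,   -- when it was created
            purpose     TEXT,   -- why it was created
            file_path   TEXT    -- pointer to the archived document itself
        )
    """)

    conn.execute(
        "INSERT INTO records (creator, record_type, origin, created_at, purpose, file_path) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ("J. Smith", "meeting minutes", "Cheney, WA", "2005-03-14",
         "council session", "/archive/2005/minutes_0314.pdf"),
    )
    conn.commit()

    # Decades later, a search by record type needs no knowledge of individual documents.
    for (path,) in conn.execute(
            "SELECT file_path FROM records WHERE record_type = ?", ("meeting minutes",)):
        print(path)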
The state's system also notes which computer originated the data. "We capture the actual IP address, CPU type and Ethernet adapter. We get the digital fingerprint of that computer," says Jansen. This helps to prove the authenticity of the data. In addition, the state issues a digital certificate for every document, using the MD5 hashing algorithm to verify that the data hasn't been altered.
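The article names MD5 but gives no implementation details, so the following is only a sketch of the hashing step: a digest computed at ingest time can be recomputed later to confirm the stored bytes haven't changed. The file name and provenance values shown are placeholders.

    import hashlib

    def md5_digest(path: str) -> str:
        # Hash the file in chunks so large documents don't have to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    # Illustrative document standing in for an ingested record.
    with open("minutes_0314.txt", "w", encoding="utf-8") as f:
        f.write("Budget approved for the next fiscal year.")

    stored_digest = md5_digest("minutes_0314.txt")   # recorded at ingest time

    # Years later: recompute and compare to detect any alteration.
    assert md5_digest("minutes_0314.txt") == stored_digest
    print("document unchanged:", stored_digest)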
Most data is kept in a standard format: Word documents are turned into PDF files, and images are converted into TIFF files.
Jansen says he is considering using Microsoft's Office 12 and its new XML-based file format as a standard archiving format in the future.
And virtually everyone hopes that standard -- or another one -- will stick. Peterson sums up the 100-year dilemma this way: "There aren't what we'd call standards for long-term archiving -- only best practices."
Author: Lucas Mearian
Source: Computerworld, July 25, 2005