Creating Intelligent Electronic-Based Archives

by David J. Wilson

In This Paper:

Considerations Along the Digital Paper Trail
Sizing up the Situation
Scanner Throughput Calculation
Creating an Intelligent Archive
Storage Alternatives
Determining Storage Needs and the Right Media
About the Author

Considerations Along the Digital Paper Trail

Capturing a backfile of drawings or documents involves far more than scanning them into compressed digital raster images. Clients want, and will pay for, intelligent records stored on fast and reliable electronic media that preserve their company's greatest historical assets.


The greatest benefit of an indexed digital backfile is a significant reduction in drawing search time. A search that could take hours or even days using manual methods can now be done in seconds or minutes. This benefit alone can form the backbone of the cost justification for the move to a digital archive.

This paper explores the considerations and some of the components that service providers should understand when undertaking a backfile conversion effort.


Sizing up the Situation

Clients don't always have a handle on the quantities and size breakdowns of their legacy archives, but they can usually provide ballpark estimates to help plan throughput and storage needs. Be sure to consider the target resolution for each scanned object type, the percentage of office documents that are duplex, and the average number of pages per document.

In addition, the way average file size changes with resolution and compression factor needs to be clearly understood. These inputs can be modeled to quantify the number of pages to scan and the storage capacity needed for the given population of backfile records.
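As a rough illustration of such a model, the sketch below estimates the number of images to scan and the stored size of one raster image. The function names, the 20:1 compression ratio, and the sample sheet size are assumptions chosen for illustration; actual compression varies widely with image content and should be measured on sample scans.

```python
def images_to_scan(documents, sheets_per_document, duplex_fraction):
    """Estimate image count; a duplex sheet yields two images (front and back)."""
    return round(documents * sheets_per_document * (1 + duplex_fraction))


def raster_size_bytes(width_in, height_in, dpi, bits_per_pixel=1,
                      compression_ratio=1.0):
    """Estimate the stored size of one scanned image in bytes.

    Uncompressed size is (width * dpi) * (height * dpi) pixels at
    bits_per_pixel bits each; the compression ratio is an assumed input.
    """
    pixels = (width_in * dpi) * (height_in * dpi)
    raw_bytes = pixels * bits_per_pixel / 8
    return raw_bytes / compression_ratio


# Hypothetical inputs: a 34 x 44 inch bitonal drawing, assumed 20:1 compression.
for dpi in (200, 300, 400):
    size = raster_size_bytes(34, 44, dpi, compression_ratio=20.0)
    print(f"{dpi} dpi: about {size / 1024:.0f} KB per drawing")
```

Because pixel count grows with the square of the resolution, doubling the resolution roughly quadruples the uncompressed size, which is why the target resolution for each object type matters so much to the storage estimate.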

Scanner Throughput Calculation

Whether you are helping select a scanner for a client or sizing your internal capacity for a particular project, you need to consider efficiency factors as well as traditional parameters to determine scanning capacity. Scanner efficiency is the expected up-time for the scanner after allowing for paper jams, preventive maintenance (cleaning), and repair times. A more important and more variable factor is operator efficiency: how much of the operator's time is spent actually scanning as opposed to preparation, indexing, and other job functions? This environment is modeled in the following analyzer. The variables can be altered to help identify the impact of variations in your own situation.


In our example, a single-shift 30 PPM scanner operating at 60% operator efficiency and 90% hardware efficiency can meet our target of 7,691 pages per day. The target daily rate was calculated by dividing the anticipated document count by the target time allowed to complete the job.
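The arithmetic behind this example can be sketched as follows. The 8-hour shift length is an assumption (the paper does not state it), and the function name is illustrative only.

```python
def daily_capacity(ppm, shift_hours, operator_eff, hardware_eff, shifts=1):
    """Pages per day a scanner can deliver after both efficiency factors."""
    rated_pages = ppm * 60 * shift_hours * shifts  # rated pages with no losses
    return rated_pages * operator_eff * hardware_eff


TARGET_PAGES_PER_DAY = 7691  # target from the example above

# Assumed: one 8-hour shift per day.
capacity = daily_capacity(ppm=30, shift_hours=8,
                          operator_eff=0.60, hardware_eff=0.90)
print(f"Capacity: {capacity:.0f} pages/day")
print("Meets target" if capacity >= TARGET_PAGES_PER_DAY else "Falls short")
```

Under that assumption the scanner delivers roughly 7,776 pages per day, just above the target. Dropping operator efficiency to 50% would bring capacity down to about 6,480 pages per day, which shows how sensitive the result is to the operator factor.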

Creating an Intelligent Archive

Today's digital archival systems (EDM, CD archivers, etc.) provide numerous indexing strategies, including:

Manual Field Indexing and Searching is the most commonly used strategy. It allows users to find the desired drawing or document using Boolean logic and traditional wildcard searching. The results of the query are often presented in a "hit list". Field indexing requires moderate labor to create the intelligence desired. Some systems include a fuzzy search capability that operates on a "sounds like" basis, with the resulting hits ranked according to their closeness to the original search value.
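A ranked fuzzy search of this kind might look like the sketch below. It uses character-level string similarity from Python's standard difflib module as a stand-in; it is not a phonetic "sounds like" algorithm and is not taken from any particular archival product. The sample titles and the 0.6 cutoff are made up for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_search(query, values, cutoff=0.6):
    """Return (value, score) pairs at or above cutoff, closest match first."""
    scored = []
    for value in values:
        score = SequenceMatcher(None, query.lower(), value.lower()).ratio()
        if score >= cutoff:
            scored.append((score, value))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(value, round(score, 2)) for score, value in scored]


# Hypothetical title field values, including one keying error.
titles = ["PUMP HOUSING ASSEMBLY", "PUMP HOUSNG ASSY",
          "VALVE BODY", "PUMP BASE PLATE"]
hits = fuzzy_search("pump housing assy", titles)
for title, score in hits:
    print(score, title)
```

The benefit for a backfile is that keying errors made during manual indexing do not hide a drawing from later searches.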

Hollerith Indexing is an alternative to manually indexing your database. Aperture cards already contain information specific to a drawing in the form of punched Hollerith data. Aperture card scanners capture this information into a text file associated with the scanned image. This data can be parsed and placed into specific fields of a database, saving time and money and helping to ensure accurate capture of client records.
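Field parsing of the captured text can be as simple as slicing fixed column ranges. The column layout below is hypothetical; real aperture card layouts vary by client and standard, so the ranges would need to come from the client's card specification.

```python
# Hypothetical column layout: field name -> (start, end) character positions.
FIELD_LAYOUT = {
    "drawing_number": (0, 15),
    "sheet": (15, 18),
    "revision": (18, 20),
    "title": (20, 60),
}


def parse_hollerith(line, layout=FIELD_LAYOUT):
    """Split one line of captured card text into named, trimmed fields."""
    return {name: line[start:end].strip()
            for name, (start, end) in layout.items()}


# Sample line built to match the hypothetical layout above.
sample = "D-1045-221".ljust(15) + "001" + "C".ljust(2) + "PUMP HOUSING ASSEMBLY"
record = parse_hollerith(sample)
print(record)
```

Each resulting dictionary maps directly onto one database row for the associated image.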

Existing Data Integration uses information that is already organized. It is common for companies to have a corporate database (revision histories or document libraries). This information can be imported into an archival system and matched to the appropriate object, eliminating the need for multiple information sets and manual indexing.
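One way to picture the matching step is the sketch below, which assumes the corporate data has been exported as CSV and that each scanned file is named after its drawing number. Both assumptions, along with the column names and sample data, are illustrative only.

```python
import csv
import io
from pathlib import PurePath


def match_records(rows, image_paths, key="drawing_number"):
    """Attach each image to the database row whose key equals the file name."""
    by_key = {row[key].strip().upper(): row for row in rows}
    matched, unmatched = [], []
    for path in image_paths:
        row = by_key.get(PurePath(path).stem.upper())
        if row is None:
            unmatched.append(path)  # needs manual indexing or review
        else:
            matched.append({**row, "image": path})
    return matched, unmatched


# Hypothetical export from a corporate revision database.
export = io.StringIO(
    "drawing_number,revision,title\n"
    "D-1045-221,C,PUMP HOUSING ASSEMBLY\n"
    "D-1045-222,A,PUMP BASE PLATE\n"
)
rows = list(csv.DictReader(export))
matched, unmatched = match_records(
    rows, ["scans/d-1045-221.tif", "scans/d-9999-000.tif"])
print(len(matched), "matched;", len(unmatched), "left for manual indexing")
```

Reporting the unmatched files separately keeps the manual indexing effort limited to the exceptions.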

Zone OCR/ICR Indexing, which is found in the office imaging environment, is a proven technology that has the promise of easing the manual indexing process. This approach allows OCR (text recognition) to be conducted within specific areas of a scanned image and the results placed into a specific field.

Full-Text Indexing is the latest extension of OCR/ICR text recognition and native file parsing. Full-text indexing allows all the text found on a document or drawing to be indexed into a searchable database. This technique is currently being used to index CAD drawings, native office files and scanned office documents. Search results are presented in a ranked list of matches. More advanced tools highlight the desired text or symbol within the document itself. It has the promise of being expanded into the scanned engineering archive environment as well. With little to no up-front indexing costs as well as a full index of the file contents, this indexing method can create a win/win solution.
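The core of a full-text index is an inverted index that maps each word to the files containing it. The sketch below is a minimal illustration that ranks hits by how often the query words occur; commercial engines use more sophisticated ranking, and the file names and text are invented for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to a count of its occurrences per document."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return (doc_id, score) pairs, highest total word count first."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return scores.most_common()


# Hypothetical text recovered by OCR or native file parsing.
documents = {
    "dwg-001.tif": "Relief valve assembly, valve seat detail",
    "dwg-002.tif": "Pump housing and relief port",
    "memo-17.doc": "Meeting notes on budget",
}
index = build_index(documents)
results = search(index, "relief valve")
print(results)
```

No fields are keyed by hand here: the index is built entirely from the recognized text, which is where the low up-front indexing cost comes from.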

Storage Alternatives

Users have long dreamed of digital mass storage for their legacy archives. Various affordable storage options are now available thanks to cost reductions in on-line magnetic storage and near-line optical-based systems.

Magnetic (RAID) storage is best used as a medium for on-line, active records. CD-ROM and other optical media (DVD, WORM, MO, etc.) are efficient near-line and off-line media for the many archives in a less active stage. With a 25-year life for CD-R and a 250-year life for stamped CD-ROMs, optical media storage offers long-term viability over paper. CD-ROMs are also valuable for long-term archival needs due to the non-erasable nature of the CD and its acceptance as a legal audit trail.

DVD (Digital Versatile Disc) technologies will emerge over the next few years as even better storage alternatives to CD-R. First-generation units (2.6 gigabytes) now offer 4 to 6 times the capacity of CD-ROMs and will be moving to 4.7 GB later in 1999. Second-generation DVD-RAM technology is expected to approach 25 times the storage capacity of a CD by late 2001. That means a single DVD will hold over 50,000 engineering drawings with access times approaching those of magnetic storage. The advent of jukebox exchangers, with their ability to handle hundreds of CDs or DVDs, tremendously increases on-line storage capacity.



Determining Storage Needs and the Right Media

Many factors need to be weighed to determine the storage needs of a particular client situation. The type of media used and the size of the images or documents being placed on the media play an obvious role, but there are other factors that also need to be considered. CD-R, for example, requires space for session closes and the gaps between multiple tracks. Also, if you are embedding a viewer and the index (full text or database) onto the media to create transportable databases, less storage space remains for the actual image or native files.
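A simple way to account for these factors is to subtract the fixed overhead from each disc's capacity before dividing. In the sketch below, the 650 MB disc capacity and the 50 MB overhead allowance (viewer, index, session and track overhead) are assumed figures for illustration, not measured values.

```python
import math


def discs_needed(image_count, avg_image_bytes, disc_capacity_bytes,
                 overhead_bytes=0):
    """Number of discs required once per-disc overhead is set aside."""
    usable = disc_capacity_bytes - overhead_bytes
    if usable <= 0:
        raise ValueError("overhead leaves no usable space on the disc")
    return math.ceil(image_count * avg_image_bytes / usable)


MB = 1_000_000
# Hypothetical job: 50,000 drawings averaging 300 KB each on 650 MB CD-R,
# with an assumed 50 MB per disc reserved for viewer, index, and overhead.
count = discs_needed(50_000, 300_000, 650 * MB, overhead_bytes=50 * MB)
print(count, "discs")
```

Running the same calculation with and without the overhead allowance shows how much of the media budget the transportable viewer and index consume.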

A well-indexed digital backfile of drawings and documents can provide substantial cost- and time-saving benefits to a company. Both the justification and the tools exist for clients to move towards this elusive paperless digital archive environment. It is our responsibility to provide quality images and intelligent records to help them meet their payback objectives.




About the Author

David J. Wilson (dwilson@openarchive.com) is President of Open Archive Systems, Inc., which specializes in consulting on and implementing digital archiving systems. The company has helped mid-size and major corporations and reprographics service providers move to a digital archive environment.

The company has developed an archive planning and analysis tool that is available for trial use by visiting www.openarchive.com. Mr. Wilson can be contacted through e-mail at dwilson@openarchive.com.