Rev: 01 Nov 2004.

Introduction

A resource page describing my personal experience in creating a low-cost web-friendly document management system for our home LAN/intranet. Document acquisition is by scanner. I'm interested in scanning either "textual material represented as images" (OCR susceptible) or handwriting/drawings (not susceptible to OCR, similar to scanning maps or line art).

In assessing solutions I look first at the file formats and associated browser Plug-Ins, then at corresponding software, and lastly for compatible scanners. Most of my information comes from web research on newsgroups, corporate scanning/document management vendors, file format/compression technologies, and digital library projects. I am particularly indebted to experts who have corrected errors and updated my comments.

Objectives

non-proprietary data format
data mobility across vendors and platforms
web access to documents.

Current Approach

Scanning primarily handwritten notes at 300 dpi bitonal b/w with maximal sharpening on an HP ScanJet 6390C Scanner which I've come to despise. The ADF (sheet feeder) routinely shreds paper. The PrecisionScan Pro 2.0 software is buggy and bizarre (to remove saved settings you must edit the registry). Scan quality is awful when using the HP Copy utility, quite good when Acrobat drives the scanning or if you scan to PNG documents.
Using Acrobat to acquire scans. With the ADF in place all documents in the ADF are merged into one PDF document. I edit it, add notes, then break it into subfiles as needed. I omit adding metadata. My Acrobat output folder is within my webserver's file space.
Save each image under a descriptive name with the scan date appended as YYMMDD (to reduce namespace collisions). Example: meetingNotes_990812 is a scan made on Aug 12, 1999.
When I'm done scanning, view the directory listing in a web browser. Drag and drop URLs into a text field of a FileMaker Pro (I'm still on version 3!) flat-file database (modified from Webbase).
Add metadata as I desire in the FMPro database using checkboxes, free text, etc.
As desired create links from existing documents in the webspace to the scanned images.
When I wish to view a document, my FMPro database (modified from Webbase) sends a DDE message to a browser to open the URL I saved into the database; the browser renders the image using the Acrobat Plug-In.
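
The naming and indexing steps above can be sketched in a few lines of Python. This is only an illustration, not the FileMaker Pro setup itself: the base URL, the table layout, and the function names are all assumptions made for the example, and SQLite stands in for the flat-file database.

```python
import sqlite3
from datetime import date, datetime


def scan_name(description: str, scanned_on: date, ext: str = "pdf") -> str:
    """Build 'description_YYMMDD.ext', e.g. meetingNotes_990812.pdf."""
    return f"{description}_{scanned_on.strftime('%y%m%d')}.{ext}"


def scan_date(filename: str) -> date:
    """Recover the scan date from a name built by scan_name()."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension, if any
    stamp = stem.rsplit("_", 1)[1]      # text after the last underscore
    return datetime.strptime(stamp, "%y%m%d").date()


# Hypothetical flat-file index: one row per scanned document.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scans (url TEXT PRIMARY KEY, keywords TEXT, notes TEXT)")

BASE_URL = "http://intranet.example/scans/"   # assumed web server location
name = scan_name("meetingNotes", date(1999, 8, 12))
db.execute(
    "INSERT INTO scans VALUES (?, ?, ?)",
    (BASE_URL + name, "meeting handwritten", "weekly staff meeting"),
)

# Look a document up by keyword; the stored URL is what the browser opens.
hits = [row[0] for row in
        db.execute("SELECT url FROM scans WHERE keywords LIKE ?", ("%meeting%",))]
print(hits)
```

Keeping only the URL plus free-form metadata in the index means the scanned files themselves stay ordinary files in the web server's directory, independent of whatever database is used.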

Scanning resolution

I've been trying 300 dpi B/W scans on my handwritten pen material. Pencil does not scan as well as ink in b/w; but with grayscale the images are too large. The HP PrecisionScan Pro software does better on pencil if set to maximal sharpening. This gives quite good results with PDF files.

In addition, this is what AT&T recommends for scanning prior to conversion to their DjVu format:

For best results, color images should be scanned at 300 dots per inch at 24 bits per pixels and saved in uncompressed, lossless format such as TIFF, PPM, or BMP (24 bits) ... a typical page at that resolution will occupy 25MB before compression.

... set the gamma of scanned images to 1/2.2 to get acceptable results on all platforms (Adobe PhotoShop on Windows does that by default).

GIF images can be readable when scanned @ 150 dpi

File Formats

Scanners produce TIFF images, which can bundle multiple scanned pages into a single file. These have some minimal lossless compression, but at text-scanning resolutions they're huge -- typically 20MB for an 8x10" 24bit color 300 dpi page. Displaying such images taxes system RAM, storage sucks up drive space, and sending images strains bandwidth.
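
The size figures are easy to check. The following is a rough sketch that counts raw pixel data only (no headers, no compression, no row padding); the function name is made up for the example.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Raw pixel data size for a scanned page, ignoring headers and padding."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


color_8x10 = uncompressed_bytes(8, 10, 300, 24)      # 24-bit color
color_letter = uncompressed_bytes(8.5, 11, 300, 24)  # US letter, 24-bit color
bitonal_8x10 = uncompressed_bytes(8, 10, 300, 1)     # 1 bit per pixel

for label, size in [("8x10 color", color_8x10),
                    ("letter color", color_letter),
                    ("8x10 bitonal", bitonal_8x10)]:
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

An 8x10" color page comes to about 21.6 MB and a letter-size page to about 25 MB, in line with the figures quoted above, while the same 8x10" page scanned bitonal is under 1 MB before any compression.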

Various lossy compression techniques are used to make these images more manageable. The best modern approaches use different techniques for textual material (edges, repeating letters, little color) and for pictorial material (blurry, lots of colors). It turns out that there are a lot of compromises involved in handling textual materials displayed as images.

Wavelet compression is 'hot' for images (like maps), but textual materials displayed as images do better with CCITT-4 or JBIG2/JB2. Much of what I'm interested in scanning is handwritten -- I wonder if the correct image format is the one used for maps (wavelet compression, as at the Library of Congress)?
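
To see why bitonal text compresses so well, here is a toy sketch of run-length encoding on one scan line. This is not CCITT-4 or JBIG2, just the simpler idea that fax-style codecs build on: a line of black-and-white pixels is mostly long runs of a single value. The sample row is invented for the example.

```python
from itertools import groupby


def run_lengths(row):
    """Collapse a row of 0/1 pixels into (pixel value, run length) pairs."""
    return [(pixel, len(list(group))) for pixel, group in groupby(row)]


def expand(runs):
    """Inverse of run_lengths(): rebuild the original row of pixels."""
    return [pixel for pixel, count in runs for _ in range(count)]


# One made-up scan line: white margin, two thin pen strokes, white margin.
row = [0] * 40 + [1] * 3 + [0] * 10 + [1] * 3 + [0] * 44

runs = run_lengths(row)
print(len(row), "pixels ->", len(runs), "runs:", runs)
```

A photograph has few long runs of identical pixels, which is one reason pictorial material needs a different technique.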

The ideal format does not exist. It would be in the public domain and would be built into browsers. Next best are formats with inexpensive licenses and well done Plug-Ins.

There are many other image formats besides the ones listed here, but I don't want to have to handle each page individually; I want to be able to 'collate' pages. That narrows the field to formats that are optimized for scanning (or faxing -- same difference).

Current Alternatives

format	advantages	disadvantages	source	notes
TIFF	public domain	huge files if 24bit color		Might work with bitonal images at 300 dpi.
XIFF	inexpensive to use, good compression, nice Plug-In, reasonably fast compression	uncertain future, only one vendor	Pagis Pro	XIFF (eXtended TIFF) uses standard compression schemes: CCITT-4 for the linework (text), JPEG for the contone (pictures). Handwriting/line drawings are handled via JPEG. DigiPaper - Cornell/Xerox project, apparently uses XIFF as file format.
DjVu	best compression, best images	computationally intensive compression, uncertain future	AT&T	DjVu uses new compression schemes: JB2 for the linework and another called IW44 (wavelet) for the "contone" (pictures). Serious commercial implementations are very expensive.
PDF	established standard, many persons have Plug-Ins or viewers, cross platform, published specs, excellent OCR integration	JPEG only compression, single channel, text images fuzzy	Adobe	See Adobe Acrobat 4.0.
JBIG2*	emerging standard	not done yet, text oriented, less useful for line art or handwritten?		Replaces CCITT-4? William Rucklidge: "XIFF includes a bi-level compression format which was part of Xerox's proposal for JBIG2; DjVu includes a bi-level compression format which was part of AT&T's proposal for JBIG2. You can think of the final JBIG2 as the merger of the best ideas in those two proposals, with a number of other components added (for instance, special modes for coding halftones), plus the most popular parts of JBIG1 and G4/MMR."
JPEG2000*	emerging standard, specification finalized 12/1999.			See EETimes story. Wavelet compression and support for mixing compression types to eliminate the JPEG fuzzy-text problem.

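* JBIG2 and JPEG2000 come from closely related standards efforts, but as far as I can tell JBIG2 targets bi-level (black and white) images while JPEG2000 is a wavelet scheme for continuous-tone images. The 'official site' is members only.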

Digital Library Choices

Digital Library projects, typically managed by university libraries (U of Michigan, Berkeley) or the Library of Congress, are very web focused and have to manage a great deal of material. Their choices are interesting.

U of Michigan: Making of America

Uses PDF (1 page per PDF), or 3 GIF levels (100%, 75%, 50%) or OCR text for display. TIFF for archival storage.

Library of Congress: Digital Formats for Content Reproductions

An interesting range of choices, depending on the medium. They've had the common problem with "standard" TIFF headers. Archival images are done in TIFF at 300 dpi and 24bit color. Display images are JPEG/GIF for pictures and thumbnails, and proprietary MrSID wavelet compression for maps and other dense line drawn images.

Software

Adobe Acrobat 4.0

Adobe's scanning solutions use the PDF file format. This format has relatively weak compression properties, but it has some overwhelming advantages: full-featured viewers/Plug-Ins very widely distributed on Mac/Windows/Unix, a published file format, and an industry devoted to PDF management. A PDF file will readily handle multiple-page documents; the viewers provide powerful document navigation tools.

Adobe's corporate and scanning focus is on text, not images. Most of their product emphasis is on post-acquisition manipulation of text documents rather than the image acquisition process I'm interested in. This includes a novel OCR approach in which the resulting document can be an integrated blend of recognized text and non-recognized text images! My current interest is primarily in image acquisition.

Adobe has two scanning products, neither of which is adequately described on their abysmal web pages. Adobe Acrobat Capture is their higher-end ($600 limited license) Windows product for acquiring large numbers of documents (as of March 2000 they've just changed the name and doubled the price). It will turn a stream of TIFF files into a stream of PDF files. Although the web site description is almost useless, the 1997 manual is online. (Warning: Adobe uses the word "Capture" in the Acrobat product to refer to the OCR transformation of PDF files; but their Acrobat Capture product is also their volume oriented image acquisition product.)

Adobe Acrobat 3.0 and 4.0 (not to be confused with the free Acrobat Reader software) is a lower-end ($100 to 200 depending on promotions) Mac/Windows product. Acrobat's primary focus is on manipulating and managing PDF files, but it has several acquisition tools. Adobe has deliberately limited Acrobat's acquisition abilities so as not to compete with Acrobat Capture; with some modest changes Acrobat would be quite a powerful product. The available tools include:

Acrobat Scan - This simply activates your TWAIN (scanner) software. Acrobat accepts the scanner output, and, with user assistance, collates the individual scans. Each page scanned requires user intervention; there's no support for a sheet-fed higher volume process. (For that functionality Adobe wants you to buy Acrobat Capture.)
Image Import - If you produce a collection of TIFF files, you can import them all at once into Acrobat. This will assemble them into a single PDF file. You can select subsections of this file and save them separately, or you can rearrange documents using Acrobat thumbnails.
Drag and Drop onto icon - You can drag and drop a collection of TIFF files onto the Acrobat icon. Acrobat will then turn them into individual PDF files. You must name each prior to saving them however; Acrobat will not simply use the original name with a PDF extension (again, Adobe wants you to buy Capture to do that).
Drag and Drop onto open page - If you have an open Acrobat document and drag and drop files onto it, the files are concatenated (not sure how order is set).

Once you have a PDF document, you can optionally use the "Capture" tool within Adobe Acrobat to transform the PDF file into some mixture of text and image (see scanning/capture tip). The text can then be processed by a text indexing engine for powerful searching and retrieval (Adobe Catalog). You can also use the standard Acrobat tools to add web links, digitally sign, add security features, etc. In the words of Ted Aaron Eytan (edited and paraphrased in parts):

this is one of the best OCR apps I've used, better than Omni-page. It's relatively hassle free because it doesn't prompt you for every misrecognition, it simply leaves the graphic in and blends it with text. You can simply copy and paste into a word processor if you need to.

I use Acrobat pro to scan important documents, and I can put related documents together as multi-page PDF files. For example: one file with all relevant MD licensing info including my DEA license, MD license etc, that I just print or fax on demand to whoever needs it. Using Acrobat you can clip out pages and put them into custom PDF files that you then distribute as needed, so they are easy to catalogue.

Acrobat does provide some limited facilities for adding indexing information to pages (author, keywords, title); however their search tool does not use this information! I found their document management tools to be very feeble, so I use my own.

ScanSoft Pagis Pro 3.0

A big and somewhat slow application. Pagis Pro is intrusive software; the document browser tries to take over the native Win file browser. I tested by scanning into a folder that's within a web server directory space. When I also tried scanning into the same folder from Acrobat I induced severe instability in Acrobat.

The tools for adding metadata (keywords, etc) are weak. Pagis Pro uses a proprietary XIFF file format which is efficient but not widely known. XIFF allows multiple pages to be bundled and has good compression and a good editor. Free viewers are available, and there's a XIFF plug-in viewer for Netscape/IE that works quite well.

The workflow for managing document collation and assembly is stronger than Acrobat; more comparable to Adobe Capture.

ScanSoft manufactures the PaperPort software as well, but it appears to be focusing its efforts on Pagis Pro. PaperPort has a free viewer but no Plug-In, so I did not consider it.

Misc Products

FileMaker Pro with Troi Grabber (FileMaker Pro TWAIN acquisition plug-in)
ImageMagick a UNIX utility that can supposedly convert TIFF to PDF. Not useful for me, but the approach is interesting given Adobe's very high-end approach to Acrobat Capture.
DocuLex, Inc. Document Imaging Solution: PDF.Capture, bundle with Ricoh scanner
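
As a rough sketch of the ImageMagick route: its `convert` command takes an input file and an output file and picks the output format from the extension, so a small script can turn a batch of TIFFs into PDFs that keep their original base names. The file names below are invented, the helper names are hypothetical, and I haven't verified the output quality; the commands are only executed if you call run_all() yourself.

```python
import shutil
import subprocess
from pathlib import Path


def tiff_to_pdf_commands(tiff_paths, program="convert"):
    """Return one command per TIFF, writing a PDF with the same base name."""
    commands = []
    for tiff in map(Path, tiff_paths):
        commands.append([program, str(tiff), str(tiff.with_suffix(".pdf"))])
    return commands


def run_all(commands):
    """Run the commands, but only if the converter is actually installed."""
    if not commands or shutil.which(commands[0][0]) is None:
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Hypothetical scan files in the current directory.
cmds = tiff_to_pdf_commands(["notes_991125.tif", "bill_991128.tif"])
for cmd in cmds:
    print(" ".join(cmd))
```

Note that on Windows a system utility is also called `convert`, so the `program` argument may need to be the full path to the ImageMagick executable.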

Hardware

For document management the key item is probably the time from initiating work until a scanned document is available for handling. PPM (pages per minute) ratings are an important ingredient.

There are tons of image scanners for under $500, but document handling is a really different application, and there it's a different story. The HP ScanJets are awful. The Fujitsu 15C is the next one I'm evaluating. After that prices start at $2000 and climb very quickly.

UMax scanners are reportedly too slow; Canon, Xerox and HP are faster. Some scanners may require DOS based drivers!!

HP 6350c scanner, SCSI, with page feed. Street < $500. I hate it. Buggy, awkward, and difficult to configure software. A very poor low-capacity and unreliable sheet feeder. An obscure and undocumented methodology for assigning SCSI IDs. No pass through SCSI connector and no obvious way to disable SCSI termination.
Fujitsu 15C: a workgroup scanner, $800. Built like a tank. Lowish resolution is adequate for documents but not for image work. TWAIN drivers not compatible with Windows 2000 and no upgrades promised. Outputs only TIFF and BMP. Complex Kodak software is an older version (2.0) of current (2.5) release.
Scanners recommended by Archive Power systems for document acquisition
Scanners tested with Pagis Pro 3.0

Links and References

General links and references not included elsewhere.

Digital Libraries: A Selected Resource Guide: good overview, most materials not on web however.
D-Lib - Ready Reference
Turning pages within a digital reproduction (Library of Congress, NDLP): Handling images without a collating file format.
JBIG FAQ
Imaging & Document Solutions Magazine
DigiPaper - Cornell/Xerox project, apparently uses XIFF as file format.
Newsgroups: comp.doc.management
ScanSoft Pagis Pro 3.0
- Pagis Email Archive
- XIFF Plug-In (hard to find documentation)
DjVu
- Root Technologies' DjVu Conversion Service: Why they favor DjVu over PDF.
- Technical article on DjVu has interesting discussion of formats
PDFZone: non-Adobe info
Imaging Magazine discussion of high-end PDF use
Scanned Images: halftones
Document Management Alliance (DMA)
DMIA - Document Management Industries Association
Document Management Avenue

History

March 20, 2000: Painful personal experience with the HP ScanJet ADF 6390.
Dec 5, 1999: extensive revisions, some corrections, more Acrobat news, settling down into my ongoing approach.
Nov 28, 1999: Acrobat update
Nov 27, 1999: incorporate digital library expertise, new file formats identified. Now getting to be a reasonable resource.
Nov 25, 1999: personal experience
Oct 24, 1999: more info from newsgroups, identify some consumer products, locate links to more serious scanning/acquisition tools. Begin trolling newsgroups.
Oct 15, 1999: initial version, no content.

Author: John G. Faughnan. The views and opinions expressed in this page are strictly those of the page author. Pages are updated on an irregular schedule; suggestions/fixes are welcome but they may take weeks to years to be incorporated. Anyone may freely link to anything on this site and print any page; no permission is needed for citing, linking, printing, or distributing printed copies.