Rev: 01 Nov 2004.
A resource page describing my personal experience in creating a low-cost web-friendly document management system for our home LAN/intranet. Document acquisition is by scanner. I'm interested in scanning either "textual material represented as images" (OCR susceptible) or handwriting/drawings (not susceptible to OCR, similar to scanning maps or line art).
In assessing solutions I look first at the file formats and associated browser Plug-Ins, then at corresponding software, and lastly for compatible scanners. Most of my information comes from web research on newsgroups, corporate scanning/document management vendors, file format/compression technologies, and digital library projects. I am particularly indebted to experts who have corrected errors and updated my comments.
I've been trying 300 dpi B/W scans on my handwritten pen material. Pencil does not scan as well as ink in B/W, but with grayscale the images are too large. The HP PrecisionScan Pro software does better on pencil if set to maximal sharpening. This gives quite good results with PDF files.
In addition, this is what AT&T recommends for scanning prior to conversion to their DjVu format:
For best results, color images should be scanned at 300 dots per inch at 24 bits per pixels and saved in uncompressed, lossless format such as TIFF, PPM, or BMP (24 bits) ... a typical page at that resolution will occupy 25MB before compression.
... set the gamma of scanned images to 1/2.2 to get acceptable results on all platforms (Adobe PhotoShop on Windows does that by default).
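The gamma advice above can be illustrated with a small sketch. It assumes 8-bit samples and the usual power-law curve, out = 255 * (in / 255) ** gamma with gamma = 1/2.2. The function and table names are hypothetical, and this is not taken from AT&T's own tools.

```python
def apply_gamma(value, gamma=1 / 2.2, max_value=255):
    """Map one sample (0..max_value) through a power-law gamma curve."""
    if not 0 <= value <= max_value:
        raise ValueError("sample out of range")
    normalized = value / max_value
    return round(max_value * normalized ** gamma)


# Build the lookup table once and reuse it for every pixel of a scan.
GAMMA_TABLE = [apply_gamma(v) for v in range(256)]
```

Black and white stay where they are; only the mid-tones are lifted, which is why an uncorrected scan tends to look too dark on some platforms.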
GIF images can be readable when scanned at 150 dpi.
Scanners produce TIFF images, which can bundle multiple scanned pages into a single file. These have some minimal lossless compression, but at text-scanning resolutions they're huge: typically 20MB for an 8x10" 24bit color 300 dpi page. Displaying such images taxes system RAM, storing them eats drive space, and sending them strains bandwidth.
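The size figures are easy to check with a little arithmetic. This is a minimal sketch, assuming an uncompressed image with each row padded to a whole byte; the function name is my own, not part of any scanner software.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size of a scanned page in bytes, before any compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is rounded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


color_8x10 = uncompressed_bytes(8, 10, 300, 24)       # 24bit color
color_letter = uncompressed_bytes(8.5, 11, 300, 24)   # 24bit color, 8.5x11"
bitonal_letter = uncompressed_bytes(8.5, 11, 300, 1)  # 1 bit per pixel (B/W)

for label, size in [("8x10 color", color_8x10),
                    ("8.5x11 color", color_letter),
                    ("8.5x11 bitonal", bitonal_letter)]:
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

The 8x10" color page comes to 21,600,000 bytes and the 8.5x11" color page to 25,245,000 bytes, in line with the 20MB and 25MB figures quoted above. The same letter page in bitonal is about 1 MB before compression, which is why B/W scanning is so much more manageable.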
Various lossy compression techniques are used to make these images more manageable. The best modern approaches use different techniques for textual material (edges, repeating letters, little color) and for pictorial material (blurry, lots of colors). It turns out that there are a lot of compromises involved in handling textual materials displayed as images.
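To see why textual material rewards its own technique, here is a minimal sketch of run-length coding on a single bitonal scan line. It is not CCITT-4 or JB2 themselves, only the basic run-length idea that fax-style codes build on, and the sample rows are made up for illustration.

```python
from itertools import groupby


def run_lengths(row):
    """Collapse a row of 0/1 pixels into (pixel_value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(row)]


# Hypothetical scan line: white margin, two pen strokes, white margin.
text_like_row = [0] * 40 + [1] * 3 + [0] * 10 + [1] * 4 + [0] * 43

# Worst case for run-length coding: every pixel differs from its neighbor.
noisy_row = [i % 2 for i in range(100)]

print(len(run_lengths(text_like_row)), "runs for the text-like row")
print(len(run_lengths(noisy_row)), "runs for the noisy row")
```

Both rows hold 100 pixels, but the text-like row collapses to a handful of runs while the noisy row does not shrink at all. Pictorial material behaves more like the second row, which is why it is handed to a different compressor.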
Wavelet compression is 'hot' for images (like maps), but textual materials displayed as images do better with CCITT-4 or JBIG2/JB2. Much of what I'm interested in scanning is handwritten. I wonder if the correct image format is what's used for maps (wavelet compression at the Library of Congress)?
The ideal format does not exist. It would be in the public domain and would be built into browsers. Next best are formats with inexpensive licenses and well done Plug-Ins.
There are many other image formats besides the ones listed here, but I don't want to have to handle each page individually; I want to be able to 'collate' pages. That narrows the field to formats that are optimized for scanning (or faxing -- same difference).
format | advantages | disadvantages | source | notes |
---|---|---|---|---|
TIFF | public domain | huge files if 24bit color | | Might work with bitonal images at 300 dpi. |
XIFF | inexpensive to use; good compression; nice Plug-In; reasonably fast compression | uncertain future; only one vendor | Pagis Pro | XIFF (eXtended TIFF) uses standard compression schemes: CCITT-4 for the linework (text), JPEG for the contone (pictures). Handwriting/line drawings are handled via JPEG. DigiPaper, a Cornell/Xerox project, apparently uses XIFF as its file format. |
DjVu | best compression; best images; computationally intensive compression | uncertain future | AT&T | DjVu uses new compression schemes: JB2 for the linework and another called IW44 (wavelet) for the "contone" (pictures). Serious commercial implementations are very expensive. |
PDF | established standard, many persons have Plug-Ins or viewers, cross platform, published specs, excellent OCR integration | JPEG only compression, single channel, text images fuzzy | Adobe | See Adobe Acrobat 4.0. |
JBIG2* | emerging standard | not done yet, text oriented, less useful for line art or handwritten? | | Replaces CCITT-4? William Rucklidge: "XIFF includes a bi-level compression format which was part of Xerox's proposal for JBIG2; DjVu includes a bi-level compression format which was part of AT&T's proposal for JBIG2. You can think of the final JBIG2 as the merger of the best ideas in those two proposals, with a number of other components added (for instance, special modes for coding halftones), plus the most popular parts of JBIG1 and G4/MMR." |
JPEG2000* | emerging standard, specification finalized 12/1999. | | | See EETimes story. Wavelet compression and support for mixing compression types to eliminate the JPEG fuzzy-text problem. |
* JBIG2 and JPEG2000 are very closely related. I'm still not sure what the difference is. The 'official site' is members only.
Digital Library projects, typically managed by university libraries (U of Michigan, Berkeley) or the Library of Congress, are very web focused and have to manage a great deal of material. Their choices are interesting.
Uses PDF (1 page per PDF), or 3 GIF levels (100%, 75%, 50%) or OCR text for display. TIFF for archival storage.
An interesting range of choices, depending on the medium. They've had the common problem with "standard" TIFF headers. Archival images are done in TIFF at 300 dpi and 24bit color. Display images are JPEG/GIF for pictures and thumbnails, and proprietary MrSID wavelet compression for maps and other dense line-drawn images.
Adobe's scanning solutions use the PDF file format. This format has relatively weak compression properties, but it has some overwhelming advantages: full-featured viewers/Plug-Ins very widely distributed on Mac/Windows/Unix, a published file format, and an industry devoted to PDF management. A PDF file will readily handle multiple-page documents; the viewers provide powerful document navigation tools.
Adobe's corporate and scanning focus is on text, not images. Most of their product emphasis is on post-acquisition manipulation of text documents rather than the image acquisition process I'm interested in. This includes a novel OCR approach in which the resulting document can be an integrated blend of recognized text and non-recognized text images! My current interest is primarily in image acquisition.
Adobe has two scanning products, neither of which is adequately described on their abysmal web pages. Adobe Acrobat Capture is their higher-end ($600 limited license) Windows product for acquiring large numbers of documents (as of March 2000 they've just changed the name and doubled the price). It will turn a stream of TIFF files into a stream of PDF files. Although the web site description is almost useless, the 1997 manual is online. (Warning: Adobe uses the word "Capture" in the Acrobat product to refer to the OCR transformation of PDF files, but their Acrobat Capture product is also their volume-oriented image acquisition product.)
Adobe Acrobat 3.0 and 4.0 (not to be confused with the free Acrobat Reader software) is a lower-end ($100 to 200 depending on promotions) Mac/Windows product. Acrobat's primary focus is on manipulating and managing PDF files, but it has three acquisition tools. Adobe has deliberately limited Acrobat's acquisition abilities so as not to compete with Acrobat Capture; with some modest changes Acrobat would be a quite powerful product. The available tools include:
Once you have a PDF document, you can optionally use the "Capture" tool within Adobe Acrobat to transform the PDF file into some mixture of text and image (see scanning/capture tip). The text can then be processed by a text indexing engine for powerful searching and retrieval (Adobe Catalog). You can also use the standard Acrobat tools to add web links, digitally sign, add security features, etc. In the words of Ted Aaron Eytan (edited and paraphrased in parts):
this is one of the best OCR apps I've used, better than Omni-page. It's relatively hassle free because it doesn't prompt you for every misrecognition, it simply leaves the graphic in and blends it with text. You can simply copy and paste into a word processor if you need to.
I use Acrobat pro to scan important documents, and I can put related documents together as multi-page PDF files. For example: one file with all relevant MD licensing info including my DEA license, MD license etc, that I just print or fax on demand to whoever needs it. Using Acrobat you can clip out pages and put them into custom PDF files that you then distribute as needed, so they are easy to catalogue.
Acrobat does provide some limited facilities for adding indexing information to pages (author, keywords, title); however, their search tool does not use this information! I found their document management tools to be very feeble, so I use my own.
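As an illustration of the kind of lookup a search tool could build from that indexing information, here is a minimal sketch of a keyword index. It is not my actual system and not anything Acrobat provides; the file names, field names, and keywords are all hypothetical.

```python
from collections import defaultdict


def build_keyword_index(documents):
    """Map each lowercase keyword to the sorted list of files that carry it."""
    index = defaultdict(set)
    for filename, metadata in documents.items():
        for keyword in metadata.get("keywords", []):
            index[keyword.lower()].add(filename)
    return {keyword: sorted(files) for keyword, files in index.items()}


# Hypothetical metadata records, one per scanned PDF.
documents = {
    "licenses.pdf": {"title": "Licensing info", "keywords": ["License", "medical"]},
    "notes-2004.pdf": {"title": "Handwritten notes", "keywords": ["notes", "medical"]},
    "untagged.pdf": {"title": "No keywords yet"},
}

index = build_keyword_index(documents)
print(index["medical"])
```

A plain dictionary like this is enough for a home collection: keywords are folded to lowercase so lookups are case-insensitive, and documents without keywords are simply skipped.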
A big and somewhat slow application. Pagis Pro is intrusive software; the document browser tries to take over the native Win file browser. I tested by scanning into a folder that's within a web server directory space. When I also tried scanning into the same folder from Acrobat I induced severe instability in Acrobat.
The tools for adding metadata (keywords, etc) are weak. Uses a proprietary XIFF file format which is efficient but not widely known. XIFF allows multiple pages to be bundled, with good compression and an editor. Free viewers. There's a XIFF plug-in viewer for Netscape/IE that works quite well.
The workflow for managing document collation and assembly is stronger than Acrobat's; it is more comparable to Adobe Capture.
ScanSoft manufactures the PaperPort software as well, but it appears to be focusing its efforts on Pagis Pro. PaperPort has a free viewer but no Plug-In, so I did not consider it.
For document management the key item is probably time from initiating work until a scanned document is available for handling. PPM (pages per minute) ratings are an important ingredient.
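A rough way to turn a pages-per-minute (PPM) rating into that elapsed time is sketched below. The warm-up and per-page processing terms are my own assumptions about where the rest of the time goes, and the numbers in the example are hypothetical, not measurements of any scanner.

```python
def seconds_until_available(pages, pages_per_minute,
                            warmup_seconds=0.0,
                            processing_seconds_per_page=0.0):
    """Estimate the time from starting a job until the document can be handled."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    scan_seconds = pages * 60 / pages_per_minute
    return warmup_seconds + scan_seconds + pages * processing_seconds_per_page


# Hypothetical 10-page document on a 15 PPM scanner, with a 20 second
# warm-up and 2 seconds of software processing per page.
estimate = seconds_until_available(10, 15, warmup_seconds=20,
                                   processing_seconds_per_page=2)
print(f"{estimate:.0f} seconds")
```

Even in this made-up example the raw scanning accounts for only half of the total, so the PPM rating is an ingredient rather than the whole answer.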
There are tons of image scanners for under $500, but document handling is a really different application, and there it's a different story. The HP ScanJets are awful. The Fujitsu 15C is the next one I'm evaluating. After that prices start at $2000 and climb very quickly.
UMax scanners are reportedly too slow; Canon, Xerox and HP are faster. Some scanners may require DOS based drivers!!