PDF Document Management Summary:
When I upgraded my scanner to one with automatic document feed, I lost the use of my old scanner/OCR software that created searchable PDFs. I would like to start using the ADF scanner for my latest batch of bills and assorted scannable paperwork (now about an inch high) and shred them, but I need to know that at some point in the near future I can split, compress, rejoin, and OCR them into searchable PDF like I had before.
Right now my hunch is that I can, as long as I scan everything between 150x150 and 300x300 dpi in 16-color grayscale. Anyone who can confirm or deny that a good OCR tool will happily consume these types of documents will win a free beer (Baltimore-area redeemable).
I can invest quite a few hours into research to make up for the time savings of being able to batch scan a huge pile of papers into a single PDF file instead of putting each on the plate and waiting a minute for the next page as I have been doing for the last two years. I could probably recoup 8 hours of time savings in six months if I figure this out right.
So here's what I got from everybody.
On scanning to a searchable PDF:
If I didn't already have the functionality, the responses I got would have convinced me that it was a pipe dream that no software could possibly provide. In reality, I can make a searchable PDF out of any document in my house right now with my old scanner and don't want to give it up.
Simon in particular gave me a detailed description of why it is impossible to do with PDFs, which was quite an entertaining read. :-)
Recommendations included:
Abbyy Fine Reader (http://www.abbyy.com/products/)
- Might be AppleScriptable or Automatorable (is that a word?)
IrisLink/ReadIrisPro
(http://www.irislink.com/c2-73/Readiris-Pro-11-for-Mac-OCR-software.aspx)
- Good cost: $130-152.
- A version of IRIS comes embedded in the HP toolkit, but it refuses to create searchable PDFs. It will create only TXT/Word docs.
- Only the server version does compression, and the server version is Windows-only.
- Check software websites and MacWorld for deals.
OmniPage/OmniScan (http://www.versiontracker.com/dyn/moreinfo/macosx/13115)
- Mentioned warily but not endorsed by anyone.
- Someone even mentioned some hatred.
- Best-known, but poorly rated on VersionTracker.
Acrobat Pro (http://www.adobe.com/products/acrobatpro/)
- Mentioned once, in passing, probably because it didn't meet my cheap-ass restriction. I might have to investigate it anyway if nothing else does searchable PDFs.
Tesseract
(http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html)
- On August 30, 2006 Google released this to Open Source.
- Does not include a page layout analysis module (yet).
- Not so great on grayscale documents.
- Something to keep an eye on.
On tools for compressing my PDFs down to a reasonable size and merging pages (everyone addressed these in the same paragraphs, strangely enough):
Mac OS X 10.4/Preview/Automator
- Automator can build and save a workflow for reuse. The actions you need are "Combine PDF Pages," "Compress Images in PDF Document," and "Apply Quartz Filter to PDF Document (filter: reduce file size)."
- Compression seems to work better than ImageMagick.
- If I do get searchable PDFs, I don't want the compression to remove the 'searchable' part.
- There are downloadable Automator workflows available (http://www.apple.com/downloads/macosx/automator/)
ImageMagick
- Can combine and adjust JPG compression level of PDFs.
- May sacrifice quality that Preview and Acrobat Pro do not.
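For anyone wondering what the ImageMagick route might look like in practice, here is a rough Python sketch that only builds the `convert` command line. The 150 dpi and quality 50 defaults are my guesses, not tested settings, and the file names are made up. As far as I know ImageMagick re-renders each PDF page as an image, so any searchable text layer would not survive this step.

```python
def imagemagick_compress_cmd(source, output, dpi=150, quality=50):
    """Build a `convert` command that re-renders a PDF as grayscale JPEG pages.

    The dpi and quality defaults are assumptions for illustration only.
    """
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    return [
        "convert",
        "-density", str(dpi),      # resolution used to rasterize the input PDF
        source,
        "-colorspace", "Gray",     # keep the pages grayscale
        "-compress", "JPEG",
        "-quality", str(quality),  # lower means smaller files, more artifacts
        output,
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of running it.
    print(" ".join(imagemagick_compress_cmd("scan.pdf", "scan_small.pdf")))
```

Printing the command first makes it easy to try a few quality values by hand before committing a whole batch to them.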
PDFLab (http://www.iconus.ch/fabien/pdflab/)
- Capable of simple merges.
pdftk (http://www.accesspdf.com/pdftk/)
- CLI tools. This is a pro and a con.
- Works with shell scripts or Automator.
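Since pdftk is a command-line tool, the split and rejoin steps could be scripted. Below is a minimal Python sketch, assuming pdftk is installed and on the PATH. It uses pdftk's `cat` and `burst` operations; the helper names, file names, and output pattern are hypothetical.

```python
import shutil
import subprocess


def pdftk_merge_cmd(inputs, output):
    """Build the pdftk command that joins PDFs in the order given."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["pdftk", *inputs, "cat", "output", output]


def pdftk_burst_cmd(source, pattern="page_%03d.pdf"):
    """Build the pdftk command that splits a PDF into one file per page."""
    return ["pdftk", source, "burst", "output", pattern]


def run(cmd):
    """Run the command if pdftk is available; otherwise just report it."""
    if shutil.which(cmd[0]) is None:
        print("pdftk not found; would run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(pdftk_merge_cmd(["bill_p1.pdf", "bill_p2.pdf"], "bill.pdf")))
    print(" ".join(pdftk_burst_cmd("batch_scan.pdf")))
```

Neither operation should re-encode the scanned images, so this covers splitting and joining only. Compression and OCR would still need another tool.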
Combine PDFs 2.1
(http://www.monkeybreadsoftware.de/Freeware/CombinePDFs.shtml)
- From their page: This application uses the PDF libraries of the Mac OS X version you have. Depending on which version you have some bugs may show. PDF files may miss parts of the content or may not be searchable.
- Works with 10.2 or newer.
PDF Shrink
(http://www.apagoinc.com/prodhome.php?prodid=30)
- Will take multiple input PDF files and combine those into one larger PDF document.
PdfCompress
(http://www.versiontracker.com/dyn/moreinfo/macosx/15836)
- Not much known about it, need to do more research.
PDF Merge
(http://www.apagoinc.com/prodhome.php?prodid=17)
- More advanced tools for merging, need to do more research.
On tools for managing documents:
yep
(http://www.yepthat.com/)
- Tool for managing paper documents in OS X
- Has tags (but then what app doesn't these days?)
VooDooPad
(http://www.flyingmeat.com/voodoopad/)
- Desktop wiki.
- Can have PDF files embedded in it.
On unique solutions:
I got a suggestion to hook the HP scanner up to the PowerBook and use it with the Canon software and TWAIN. I'll be looking into this.
On the HP OfficeJet series:
I would describe the HP OfficeJet 6310 as an ADF-enabled color scanner that works well on OS X.
- Annoying as scanner/fax when far away
- Not annoying when nearby
- Not annoying as printer no matter where it is
- Sucks down ink like water
Thanks to Neil Strand, Nadine, Sweth Chandramouli, Simon Slavin, Jennifer Mullen, Jon Lasser, and Brad Knowles.
Background question:
For the last few years I've been using a Canon CanoScan LiDE 50 to manage all my documents. Using the CanoScan Toolbox X on my PowerBook G4, I can scan a full sheet on the plate in about 60 seconds that gets output at 150x150 resolution to a grayscale high-compression single-page PDF file. Each page is about 100-150k and the clear text can be searched with grep. Not bad for a scanner that only cost $79 new.
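The grep trick above amounts to a raw byte search across a directory of PDFs. Here is a small Python sketch of the same idea, roughly what `grep -l` does. It assumes, as with grep, that the OCR text is stored uncompressed inside the file. The function name and directory are made up for illustration.

```python
from pathlib import Path


def pdfs_containing(directory, phrase):
    """Return sorted names of PDFs whose raw bytes contain the phrase.

    Like `grep -l`, this is case-sensitive and only finds text that is
    stored uncompressed in the file.
    """
    needle = phrase.encode("latin-1")
    hits = []
    for path in sorted(Path(directory).glob("*.pdf")):
        if needle in path.read_bytes():
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    # Hypothetical directory of scans in the current folder.
    for name in pdfs_containing(".", "Invoice"):
        print(name)
```

If a tool compresses the text stream inside the PDF, neither grep nor this sketch will find anything, which is worth checking before trusting any new OCR output.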
Although I can't link the pages together in a single document without buying an expensive copy of Adobe Acrobat, I can just stick them in a directory together and have a small electronic form of my documents.
So in short, I'm really happy with the document management I get from the CanoScan software, but not so happy with the fact that I have to feed each sheet individually onto the scanner.
Fast forward to today, when Suzy and I picked up an HP OfficeJet 6310 network-attached all-in-one printer/fax/copier/scanner to replace her old USB Lexmark printer that has seen better days. It's a sweet unit with a document feeder (instead of one sheet at a time), and can scan to my laptop and print from anywhere in the house. Unfortunately the feeder scan output is an uncompressed, unsearchable grayscale PDF that averages about 700k per page.
I know that's not a big deal these days, but it means I can fit four or five times fewer documents on the same media. Also, not being able to search my documents is a problem because I tend to scan in batches over the course of a month before giving the files better names, and need to access some specific document in the meantime. I was only able to finish my taxes this year thanks to grep, and I don't want to give that up.
My question is: is there a good piece of PDF management software for Mac or Linux that will allow me to take the PDF output from the HP and:
- Piece together multiple pages into a single document
- Pull apart multi-page scans into separate documents without losing resolution
- OCR to make that resulting document searchable
- Compress the pages to around 200k max each
- Cost less than the printer/fax/scanner cost in the first place ($250)
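To put numbers on the storage penalty and the 200k target, here is a quick Python calculation. The per-page sizes come from this post. The 700 MB disc and the reading of "k" as kilobytes are my assumptions.

```python
def pages_per_media(media_mb, page_kb):
    """Whole pages that fit on media of the given size (1 MB = 1024 KB)."""
    return (media_mb * 1024) // page_kb


# Per-page sizes quoted in the post, in KB.
PAGE_SIZES_KB = {
    "HP feeder, uncompressed": 700,
    "CanoScan, largest pages": 150,
    "target maximum": 200,
}

if __name__ == "__main__":
    media_mb = 700  # assumption: one 700 MB CD-R
    for label, size_kb in PAGE_SIZES_KB.items():
        print(f"{label}: {pages_per_media(media_mb, size_kb)} pages")
    ratio = PAGE_SIZES_KB["HP feeder, uncompressed"] / PAGE_SIZES_KB["CanoScan, largest pages"]
    print(f"HP pages are about {ratio:.1f}x larger than the largest CanoScan pages")
```

Against the smaller 100k CanoScan pages the gap would be wider still, so the four-to-five-times figure is the conservative end.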
If anyone has any suggestions, please let me know or tag it on your del.icio.us account with some combination of ocr and pdf.
Unless noted, all content on epistolary.org is © Copyright 1999-2008 to Rob Carlson with all rights reserved. All information is verified when possible, cited as appropriate and applied in the real world at your own risk.
Send all feedback to rob@vees.net.