Scanning
I took the book and scanned it with the Sharp MX-M350N copier at work: it's a big copy machine that can also make black-and-white scans, and mail you the images. With a little bit of tinkering, I got it to scan at 600 dpi, in the correct size for this small paperback book, and to save everything in one big multi-page TIFF file. Within 20 minutes or so I had scanned all 270 pages, and gotten my hand sore from pressing down on the spine of the book.
The scanner's output ain't pretty:
- adjacent odd and even pages are combined
- there are ugly dark creases between pages and around the edges
- pages are slightly rotated and off-center
- there are lots of speckles of noise/dust throughout
Extracting page images
First thing I did was to extract images of the individual pages from the multi-page TIFF file. (If your scanner produces separate image files for each page, then you don't need to do this.) The fastest way to split the TIFF up under Linux is with the tiffsplit command:
$ tiffsplit scan.tiff page-
This ran amazingly fast, and produced a whole bunch of files, page-aaa.tif through page-afg.tif.
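tiffsplit names its output with a lexically ordered three-letter suffix, so the last file name tells you how many images came out. Here's a minimal sketch of that mapping; the function name tiffsplit_suffix is made up for illustration and is not part of the tiffsplit tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of libtiff): map a zero-based image index
# to the three-letter suffix tiffsplit appends (0 -> aaa, 1 -> aab, 26 -> aba).
tiffsplit_suffix() {
    alphabet=abcdefghijklmnopqrstuvwxyz
    n=$1
    first=$(( n / 676 ))          # 676 = 26 * 26
    second=$(( (n / 26) % 26 ))
    third=$(( n % 26 ))
    for digit in "$first" "$second" "$third"; do
        # cut -c is 1-based, so shift the 0-based digit by one
        printf '%s' "$(printf '%s' "$alphabet" | cut -c "$((digit + 1))")"
    done
    printf '\n'
}

tiffsplit_suffix 0      # prints aaa
tiffsplit_suffix 136    # prints afg
```

Since index 136 maps to afg, page-aaa.tif through page-afg.tif is 137 scanned images.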
So now I had a whole bunch of TIFF page images. What next?
ScanTailor is awesome
As I mentioned above, there are lots of ugly features of the scanned book. It turns out that there's an amazing open-source program to automate all of the tedious cleanup tasks: ScanTailor. It's only at version 0.9.5 so far, but it's really stable and can do a lot of things. With a few clicks, ScanTailor will automatically split pairs of pages, automatically deskew page images (rotate them correctly), automatically figure out where the margins are (not perfect, but worked for about 98% of the pages in my book), lay out the extracted images into nice uniform pages with margins, and even remove some of the noisy dust flecks from the images (although this tends to eat punctuation marks sometimes)!
This involves a lot of heavy-duty image processing, so it took a while to run even on a fairly fast desktop computer. Basically, I set up ScanTailor to clean up the scanned images, and left it alone for half an hour or so...
At the end of it all, ScanTailor produces a whole bunch of new, cleaned-up TIFF images, 0000_page-aaa.tiff through 0274_page-afg.tiff.
Putting everything together
Now we want to put these cleaned-up pages all together into a single document. I decided to use the DjVu format since it offers really good compression for text, better than PDF.
First I need to convert all these TIFF images to DjVu's format. I decided to write a little Makefile to do this, in ScanTailor's output directory:
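# Cleaned-up page images from ScanTailor. $(sort ...) keeps them in
# filename order, which not every GNU make version guarantees for wildcard.
TIFF = $(sort $(wildcard *.tiff))
INDIVDJVU = $(patsubst %.tiff,%.djvu,$(TIFF))
OUTPUTDJVU = complete.djvu

all: $(INDIVDJVU) $(OUTPUTDJVU)

# Bundle the single-page files into one multi-page document.
$(OUTPUTDJVU): $(INDIVDJVU)
	djvm -c $@ $^

# Encode one monochrome TIFF as DjVu (recipe lines must start with a tab).
%.djvu: %.tiff
	cjb2 $< $@

This Makefile uses DjVu's cjb2 utility to convert monochrome TIFF images into DjVu's format, and then it uses djvm to combine them all together into one DjVu file.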
I ran make all, and after a bit of processing, I had the complete book in the file complete.djvu. It's about 6.5 MB in size... very good for a 270-page book at 600 dpi!
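To put that size in perspective, here's a quick back-of-the-envelope calculation. It assumes 275 page images (ScanTailor's output runs from 0000 to 0274) and treats a megabyte as 1024 KB; both are assumptions for the arithmetic and not exact measurements.

```shell
#!/bin/sh
# Rough per-page size of complete.djvu.
# Assumptions: "about 6.5 MB" taken as 6.5 * 1024 KB, and 275 page images
# (0000 through 0274 in ScanTailor's output).
size_kb=$(awk 'BEGIN { printf "%.0f", 6.5 * 1024 }')
pages=275
per_page=$(awk -v kb="$size_kb" -v n="$pages" 'BEGIN { printf "%.1f", kb / n }')
echo "$per_page KB per page"
```

That comes to roughly 24 KB for each 600 dpi page.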
To do...
It'd be nice if I could select and search for text in this book. Right now it's just a series of images! The next thing I'll do is try to figure out how to do optical character recognition (OCR) on the images. There are some interesting projects like OCRopus and Tesseract that are making open-source OCR software.
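As a starting point, here's a rough, untested sketch of what a Tesseract pass over the cleaned-up pages might look like. It assumes the tesseract command-line tool is installed and relies only on its basic "tesseract IMAGE OUTPUTBASE" form, which writes OUTPUTBASE.txt. The ocr_pages function and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run Tesseract on each page image given as an argument.
# With DRY_RUN=1 it only prints the commands it would run.
ocr_pages() {
    for img in "$@"; do
        base=${img%.tiff}    # 0000_page-aaa.tiff -> 0000_page-aaa
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "tesseract $img $base"
        else
            tesseract "$img" "$base"    # writes $base.txt
        fi
    done
}

# Preview the commands for the first and last pages without running OCR.
DRY_RUN=1
ocr_pages 0000_page-aaa.tiff 0274_page-afg.tiff
```

This only produces plain text files next to the images; getting that text into the DjVu file itself is a separate step that this sketch doesn't cover.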