Wednesday, March 3, 2010

Make your own electronic book

Ever wanted to make your own electronic book?  No?  Me neither!  But then I found an out-of-print book at the library, which I really wanted a copy of.  Here's how I scanned it and cleaned it up into a nice usable document, using my desktop computer running Ubuntu.

Scanning

I took the book and scanned it with the Sharp MX-M350N copier at work: it's a big copy machine that can also make black-and-white scans, and mail you the images.  With a little bit of tinkering, I got it to scan at 600 dpi, in the correct size for this small paperback book, and to save everything in one big multi-page TIFF file.  Within 20 minutes or so I had scanned all 270 pages, and gotten my hand sore from pressing down on the spine of the book.

The scanner's output ain't pretty:
  • adjacent odd and even pages are combined
  • there are ugly dark creases between pages and around the edges
  • pages are slightly rotated and off-center
  • there are lots of speckles of noise/dust throughout

Extracting page images

The first thing I did was to extract images of the individual pages from the multi-page TIFF file.  (If your scanner produces separate image files for each page, then you don't need to do this.)  The fastest way to split the TIFF up under Linux is with the tiffsplit command:
$ tiffsplit scan.tiff page-
This ran amazingly fast, and produced a whole bunch of files, page-aaa.tif through page-afg.tif.
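
As a sanity check on those file names: tiffsplit numbers its output with a three-letter suffix that counts up like a base-26 number, so the last suffix tells you how many images were written. Here's a minimal sketch in plain POSIX shell; suffix_to_index is a name I made up for illustration, it isn't part of tiffsplit.

```shell
# Hypothetical helper: turn a tiffsplit-style suffix (aaa, aab, ...)
# into a zero-based index by reading it as a base-26 number.
suffix_to_index() {
    letters=abcdefghijklmnopqrstuvwxyz
    rest=$1
    index=0
    while [ -n "$rest" ]; do
        ch=${rest%"${rest#?}"}        # first character of what is left
        rest=${rest#?}                # drop that character
        before=${letters%%"$ch"*}     # letters that come before ch
        index=$(( index * 26 + ${#before} ))
    done
    echo "$index"
}
```

Running suffix_to_index afg prints 136, so page-aaa.tif through page-afg.tif is 137 images, which roughly fits a 270-page book scanned two pages at a time.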

So now I had a whole bunch of TIFF page images.  What next?

ScanTailor is awesome

As I mentioned above, there are lots of ugly features of the scanned book.  It turns out that there's an amazing open-source program to automate all of the tedious cleanup tasks: ScanTailor.  It's only at version 0.9.5 so far, but it's really stable and can do a lot of things:



With a few clicks, ScanTailor will automatically split pairs of pages, automatically deskew page images (rotate them correctly), automatically figure out where the margins are (not perfect, but worked for about 98% of the pages in my book), lay out the extracted images into nice uniform pages with margins, and even remove some of the noisy dust flecks from the images (although this tends to eat punctuation marks sometimes)!

This involves a lot of heavy-duty image processing, so it took a while to run even on a fairly fast desktop computer.  Basically, I set up ScanTailor to clean up the scanned images, and left it alone for half an hour or so...

At the end of it all, ScanTailor produces a whole bunch of new, cleaned-up TIFF images, 0000_page-aaa.tiff through 0274_page-afg.tiff.

Putting everything together

Now we want to put these cleaned-up pages all together into a single document.  I decided to use the DjVu format since it offers really good compression for black-and-white text, usually better than what you get from a PDF of the same scans.

First I need to convert all these TIFF images to DjVu's format.  I decided to write a little Makefile to do this, in ScanTailor's output directory:
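# Collect ScanTailor's output pages; sort so the page order is deterministic
TIFF = $(sort $(wildcard *.tiff))
INDIVDJVU = $(patsubst %.tiff,%.djvu,$(TIFF))
OUTPUTDJVU = complete.djvu

.PHONY: all
all: $(INDIVDJVU) $(OUTPUTDJVU)

# Bundle the per-page files into one multi-page document
$(OUTPUTDJVU): $(INDIVDJVU)
	djvm -c $@ $^

# Encode one monochrome TIFF as one DjVu page (recipe lines must start with a tab)
%.djvu: %.tiff
	cjb2 $< $@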
This Makefile uses DjVu's cjb2 utility to convert the monochrome TIFF images into DjVu's format, and then it uses djvm to combine them all together into one DjVu file.
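
If you'd rather not use make, roughly the same two steps can be sketched as a shell function. This is only a sketch under a few assumptions: build_djvu and the CJB2 and DJVM override variables are names I invented for illustration, and unlike make it only checks whether a page's .djvu file already exists, not whether it is older than its TIFF.

```shell
# Hypothetical helper: build_djvu OUTPUT [DIR]
# Encodes every DIR/*.tiff with cjb2, then bundles the pages with djvm -c.
# CJB2 and DJVM are made-up override variables (handy for dry runs).
build_djvu() {
    out=$1
    dir=${2:-.}
    set --                            # reuse "$@" as the list of page files
    for tiff in "$dir"/*.tiff; do     # the glob expands in sorted order
        [ -e "$tiff" ] || continue    # no match: the pattern stays literal
        page=${tiff%.tiff}.djvu
        # Skip pages that were already encoded on an earlier run
        [ -e "$page" ] || "${CJB2:-cjb2}" "$tiff" "$page" || return 1
        set -- "$@" "$page"
    done
    if [ "$#" -eq 0 ]; then
        echo "no .tiff files found in $dir" >&2
        return 1
    fi
    "${DJVM:-djvm}" -c "$out" "$@"
}
```

Calling build_djvu complete.djvu in ScanTailor's output directory should give the same result as make all. Because ScanTailor prefixes its file names with zero-padded numbers, the sorted glob keeps the pages in reading order.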

I ran make all, and after a bit of processing, I had the complete book in the file complete.djvu. It's about 6.5 MB in size... very good for a 270-page book at 600 dpi!

To do...


It'd be nice if I could select and search for text in this book.  Right now it's just a series of images!  The next thing I'll do is try to figure out how to do optical character recognition (OCR) on the images.  There are some interesting projects like Ocropus and Tesseract that are making open-source OCR software.
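
I haven't tried this yet, but a first experiment could look something like the sketch below. It assumes a tesseract build that can read ScanTailor's TIFF output directly; ocr_pages and the TESSERACT override variable are names I made up for illustration. It only writes one plain-text file per page, and getting that text into the DjVu file itself is a separate problem.

```shell
# Hypothetical helper: ocr_pages [DIR]
# Runs tesseract on every DIR/*.tiff and leaves a .txt file next to each page.
# TESSERACT is a made-up override variable for pointing at another binary.
ocr_pages() {
    dir=${1:-.}
    for tiff in "$dir"/*.tiff; do
        [ -e "$tiff" ] || continue    # no match: the pattern stays literal
        # tesseract takes an output base name and appends .txt itself
        "${TESSERACT:-tesseract}" "$tiff" "${tiff%.tiff}" || return 1
    done
}
```

Running ocr_pages in ScanTailor's output directory would leave files like 0000_page-aaa.txt beside the page images.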
