Ana Mardoll's Ramblings: eReader: pBook to eBook Conversion

So you have a paper book (pBook) that you want in eBook form and you've been poking the publisher in vain for days, weeks, months, or even years and nothing has come of it. And you're looking at that pBook thinking, you know, I could maybe turn that into an eBook myself, but you're not sure how to go about it cheaply. Well, I'm going to walk you through the cheapest, crappiest possible way to turn a pBook into an eBook shy of typing the darn thing in yourself.

WARNINGS!

You are going to ruin your paper book copy in the process of this method.
You are not going to get a high quality ePub format out of this method; you're going to get a series of scanned images that will be lumped into an image-based PDF. That means that the final book will have no text reflow and will only be readable on a device that supports image-based PDFs. If you're okay with that, continue on; if you're not okay with it, or you don't know what that means, STOP. The reason you're getting an image-based PDF instead of a high quality ePub is that the scanners and software that can perform OCR (optical character recognition), i.e. pull the image-words out of a scan and render them as text-words, are expensive. Most people don't have the cash to invest in really good hardware and software to properly convert paper books, nor to pay a conversion company to do it. So this method is really for picture books only unless you're prepared to put in the money and muscle to go all the way. (See my note at the end.)
This method takes a long time and is really only for the obsessive-compulsive eReader.

Okay. Let's check our equipment list. I use:

Canon PIXMA MX860 Wireless All-In-One Printer. I use this because it was simply what I had on hand at the time; the most important feature is its stack-feed scanning capability.
X-Acto Knife. This will be useful when cutting the strings for sewn books.
Bulk rename utility. This is freeware, and will be used to rename your scans.
Irfanview image editing software. This is freeware, and will be used to batch trim the scans.
Quick PDF tools. This is freeware, and will be used to compile the final book.
A book. For this example, we will use James Lileks' delightful The Gallery of Regrettable Food.


Step 1: Destruction

First we're going to need to separate the book into individual pages. This is a long process that requires patience.

If the book is sewn, flip gently through looking for the straps that occur in the spine every few pages; lay the book flat and slice the straps with the X-Acto knife. This will liberate a few pages; gently disengage them from the spine, but leave them in position as you work through the book. Once you've cut all the straps, flip back to the beginning of the book and gently begin to remove the still-attached pages. The easiest way to do this is to hold a hand flat on the rest of the pages while you gently tear the attached page or cut it with the X-Acto knife. Pages that came from the sewn binding are usually TWO pages with the sewing down the center and will have to be torn from their other half -- the easiest way to facilitate a clean tear is to bend the pages along their middle crease. There are YouTube videos for all this, but as long as you are slow and careful, you should be fine.

If the book is glued in place, you will have to get a hairdryer on the lowest setting and gently blow on the spine to loosen the book pages. As the glue softens, pull the pages out one by one. Be careful not to smear the glue on yourself or the other pages; lay the pages aside so that the glue can dry. Once again, there are YouTube videos for this; for some books, it may be easiest to soften the glue to remove the spine and then just cut the edge with all the glue on it with a heavy duty paper cutter. You can find these for use at some libraries and some places that offer lamination services, like Mardel or Hobby Lobby. Kinkos probably has them, too.

Step 2: Scanning

Unless you have a duplex scanner, you'll need to scan one side of the book and then the other side. (I.e., odd numbered pages and then even numbered ones.) The Canon scanner I use does have duplex scanning, but only for 8.5x11 inch (letter-size) sheets, which most books aren't. Slap a pile of paper on the scanner feed (usually you won't be able to do the whole book in one go -- for this example, I divided the pile into four parts), and let the scanner do its work.

NOTE: Don't leave the room while this is happening. You really want to be able to catch feed jams when they happen, not three hours later when the paper has been permanently creased.

The scanner will have scanned all your odd-numbered sheets, and likely will have named them something like "IMG_01, IMG_02, IMG_03...". This is wrong, and we need to rename them now so that we can add the even-numbered sheets in later.

Create a new folder. Call it "Odd Numbers" or something similar. Put all your scans so far in that folder. Now right click on the folder and choose "Bulk Rename Here". (If you don't see that option, reinstall the Bulk Rename Utility and try again.)

In order to sort everything properly later, we need to give everything the same name plus an appended page number. The Bulk Rename Utility will let us do this easily. You need to set the following fields:

File (2): A fixed name will go here. I used "Lilek - "
Numbering (10): The important things here are the Pad (select 3, for an XXX format) and the Increment (select 2, since we've skipped a page per scan).

Set the Start number to 1, and locate the book's "Page 1" via the page numbers on each sheet. Since book numbers frequently don't show until several pages in, you might (a) have to count backwards (in this case, the first numbered page was Page 9, so I had to count back from there) and (b) have some "leftover" front pages that shade back into the negative numbers. You can deal with those later -- the important thing now is to make sure that your file naming matches the book numbers.

Select all the files in the Bulk Rename GUI and check their projected names, shown in green on the right. Once everything lines up properly, hit "Rename" and poof.
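
If you'd rather script the rename than click through the GUI, the same naming scheme (fixed name, Pad of 3, Increment of 2) can be sketched in Python. This is only an illustration of the scheme, not what the Bulk Rename Utility does internally; the function names, the ".jpg" extension, and the sample scanner names are all assumptions, and the sort assumes your scanner zero-pads its own file names.

```python
from pathlib import Path


def planned_names(scans, prefix="Lilek - ", start=1, step=2, pad=3):
    """Map each scan (in scanner order) to its page-numbered name.

    Mirrors the settings above: fixed name, Pad 3, Increment 2.
    Use start=1 for the odd-numbered folder and start=2 for the even one.
    """
    plan = {}
    # Assumption: scanner names are zero-padded, so sorting gives scan order.
    for index, scan in enumerate(sorted(scans)):
        page = start + index * step
        suffix = Path(scan).suffix  # keep whatever extension the scanner used
        plan[scan] = f"{prefix}{page:0{pad}d}{suffix}"
    return plan


def rename_folder(folder, **kwargs):
    """Apply the plan to every file in `folder` (hypothetical helper)."""
    folder = Path(folder)
    files = [p.name for p in folder.iterdir() if p.is_file()]
    for old, new in planned_names(files, **kwargs).items():
        (folder / old).rename(folder / new)


if __name__ == "__main__":
    # Dry run on made-up scanner names: look at the plan before renaming.
    sample = ["IMG_01.jpg", "IMG_02.jpg", "IMG_03.jpg"]
    for old, new in planned_names(sample).items():
        print(old, "->", new)
```

Printing the plan first plays the same role as the green preview column: you get to see the projected names before anything is actually renamed.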

Now is a great time for you to actually look through your scans and triple check that the page numbers and file names match. It's entirely possible that your scanner missed a page and it's a lot easier to find it now than later.

Once you're done with the odd-numbered pages, take your stack of pages, turn them over, and slap them back into the scanner to get the even-numbered pages. Once you have all the scans done, create a second folder, name it "Even Numbers", put all the scans in there, and use the Bulk Rename Utility to re-number them properly. Once again, double check that everything is in there and that the scanner didn't miss a page -- mine missed 2 pages of ~200 in this example.

NOTE: Do not combine the two folders yet -- keep the even and odd numbered scans separate.
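
A quick sanity check at this point is to compare how many scans landed in each folder, since every sheet should produce one odd and one even image. Here is a rough Python sketch of that check; the folder names follow the ones suggested above, while the function names and the list of image extensions are assumptions. Matching counts don't prove nothing was missed (both passes could have skipped a page), so this supplements flipping through the scans rather than replacing it.

```python
from pathlib import Path

# Assumption: the scanner saves one of these common image types.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tif", ".tiff")


def count_scans(folder):
    """Count the image files sitting directly inside `folder`."""
    return sum(
        1
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def compare_counts(odd_count, even_count):
    """Say whether the two scanning passes produced the same number of files."""
    if odd_count == even_count:
        return "ok"
    short = "odd" if odd_count < even_count else "even"
    return f"{short} folder is short by {abs(odd_count - even_count)} scan(s)"


if __name__ == "__main__":
    odd, even = Path("Odd Numbers"), Path("Even Numbers")
    if odd.is_dir() and even.is_dir():
        print(compare_counts(count_scans(odd), count_scans(even)))
    else:
        print("Run this from the directory that holds both scan folders.")
```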

Step 3: Trimming

So now you've got your scans, but since your printer is cheap, you've got all this white space around the pictures that you don't want or need. You're going to want to trim that space and you're not going to want to do it 200 separate times.

So. First things first. Go through your pictures with Windows Photo Viewer or whatever you have standard on your machine and make sure all the pages are oriented right side up. Sometimes your scanner flips stuff upside down just to make your life interesting.

Now. Look through your even and odd numbered folders for white space patterns. Some of your pictures will have white space on the top of the picture, some will have it on the bottom. In general, this divide will occur between the even numbered scans and the odd ones, but there will probably be a few oddballs that need moving from one side or the other. Isolate all your "white space on top" files in one folder and all your "white space on bottom" files in the other.

Right click one of the images to "Open With" Irfanview.
Select File --> Batch Conversion.
In the dialog that opens, click the "Advanced" tab.

What we're interested in is the CROP option up in the left hand corner. Let's talk about those values. The X-Position/Y-Position simply tells the program where in the OLD image to start the NEW image. The Width/Height determines the width/height TOTAL of the new image.

Width-wise, I've told the program to take the old images and crop off everything to the left of the 180 pixel mark (X-Pos = 180) and everything to the right of the old 2410 pixel mark, which makes the new Width 2410 - 180 = 2230. Height-wise, I've told the program to start at the top of the image (Y-Pos = 0) and gobble up everything 2460 pixels down (Height = 2460), but then to crop everything below that. This is for all my "white space at the bottom and on the sides" images.

So how did I get the pixel widths? Just open the image in MS Paint or something similar, hover your cursor around the sides, and watch the status bar at the bottom:

If you enlarge the image above, you'll see two sets of numbers at the bottom. The 2550x3300 is the picture's total width and height; the 180x2237 is where my cursor was hovering at time of taking the screenshot. That 180 is the same 180 I used for my X-Pos above.
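
The arithmetic for turning the hovered edge coordinates into the four crop fields is simple enough to sketch in Python. This is just a worked example using the numbers from this tutorial; the function name `crop_box` is made up for illustration and has nothing to do with Irfanview itself.

```python
def crop_box(left, top, right, bottom):
    """Turn edge positions (in pixels) into X-Pos, Y-Pos, Width, Height."""
    if right <= left or bottom <= top:
        raise ValueError("right/bottom must be larger than left/top")
    return left, top, right - left, bottom - top


if __name__ == "__main__":
    # Edges from the example above: keep pixels 180..2410 across
    # and 0..2460 down.
    x_pos, y_pos, width, height = crop_box(180, 0, 2410, 2460)
    print(f"X-Pos={x_pos} Y-Pos={y_pos} Width={width} Height={height}")
    # prints "X-Pos=180 Y-Pos=0 Width=2230 Height=2460"
```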

Once you've set the Advanced options, click OK. Select all the images in the Batch Conversion files at the top and hit Add. The Input files dialogue at the bottom should populate. Set the output directory you want (I recommend a third folder called "LilekBatch") and then select "Start Batch". A dialogue will pop up telling you that the files are converting.

Once that's done, we need to do the same thing with the "white space on the top and sides" files. It's the same exact process, but with a different Y-Pos value. (Your X-Pos values will probably be the same if you used the scanner's physical guide-rails to keep everything centered when you were scanning.)

Note that the only difference here is that the Y-Pos starting position has moved. The total width and height should remain the same. Go ahead and let the batch converter dump the new files in with the last batch conversion ("LilekBatch").
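
Before starting the second batch, it can help to confirm that the shifted crop still fits inside the scan. The Python sketch below does that for a 2550x3300 scan; the helper names are invented for illustration, and the Y-Pos of 840 is a hypothetical value (3300 - 2460, i.e. it assumes the page sits flush with the bottom edge of the scan), so measure your own.

```python
def shifted_down(box, y_pos):
    """Same X-Pos, Width and Height as `box`, but with a new Y-Pos."""
    x, _, width, height = box
    return x, y_pos, width, height


def fits(image_size, box):
    """True if the (X-Pos, Y-Pos, Width, Height) box lies inside the image."""
    image_width, image_height = image_size
    x, y, width, height = box
    return (
        x >= 0
        and y >= 0
        and x + width <= image_width
        and y + height <= image_height
    )


if __name__ == "__main__":
    scan_size = (2550, 3300)             # full scan, as reported by MS Paint
    bottom_box = (180, 0, 2230, 2460)    # first batch: white space at bottom
    # Hypothetical Y-Pos for the second batch: white space at the top.
    top_box = shifted_down(bottom_box, 840)
    print(top_box, fits(scan_size, top_box))
```

If `fits` comes back False, the Y-Pos is too large for the chosen Height and the crop would run off the bottom of the scan.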

Step 4: Merging

Now you should have a big folder of images that have been scanned in, named according to their page number in the original book, and carefully trimmed of white space. Open the first image in Windows Photo Viewer and start flipping through the images to make sure everything is as it should be. Check this carefully -- an ounce of prevention is worth a pound of cure in this case.

Once you're satisfied that everything is in exactly the correct order and scanned as perfectly as you want, select every image in the folder and right click on the first one in the series. Select Quick PDF Tools --> Convert --> Image to PDF, and select that you want a single PDF file. Give the file a name and presto.
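
If flipping through ~200 images by eye sounds error-prone, a small script can at least flag page numbers that are absent from the combined folder before you run the merge. This is a rough Python sketch that assumes file names ending in the page number, like "Lilek - 009.jpg"; the function names are made up, and it only catches missing files, not pages that were numbered wrong.

```python
import re


def page_number(filename):
    """Pull the trailing page number out of a name like 'Lilek - 009.jpg'."""
    match = re.search(r"(\d+)\.\w+$", filename)
    if match is None:
        raise ValueError(f"no page number in {filename!r}")
    return int(match.group(1))


def missing_pages(filenames):
    """List page numbers absent between the lowest and highest page found."""
    pages = {page_number(name) for name in filenames}
    if not pages:
        return []
    return [p for p in range(min(pages), max(pages) + 1) if p not in pages]


if __name__ == "__main__":
    # Made-up folder listing with page 3 missing.
    names = ["Lilek - 001.jpg", "Lilek - 002.jpg", "Lilek - 004.jpg"]
    print(missing_pages(names))  # prints [3]
```

Because the names are padded to three digits, plain alphabetical order is also page order, which is what keeps the images in sequence when they are selected for the merge.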

Now you have an image-based PDF format eBook out of your sadly destroyed pBook. Slap that baby into Calibre and back it up to a cloud storage area or something, because you don't want to lose it after all that work!

Cover in Adobe Reader

Page 9 in Adobe Reader

Cover on Sony PRS 950 (Portrait View)

Page 9 on Sony PRS 950 (Landscape View)

There it is in all of its "Meh" glory on my Sony reader. (I warned you that this was a lot of work just to satisfy an obsessive-compulsive need. No sense in complaining now.) Enjoy.

Step 5: Recycling

What do you do with your ruined pBook? Well, you can try to bind it back together if you really want, but in most cases this will be utterly impossible. You can re-purpose the pages into delightful paper crafts. Or you can recycle the pages in the recycle bin and let the city take them to be turned into recycled paper. You can shred them and use them for kitty litter, although now that I say that, it occurs to me that you probably shouldn't because god knows what the dye in the book will do to their little paws. Ditto for using the paper as a firestarter in the fireplace. Just please don't throw the book in the garbage.

NOTE: Going a Step Further with OCR Software

If you are seriously interested in going to the next step in scanning and springing for an OCR (optical character recognition) program to turn your PDF images into Word text documents that can be converted into ePub, the ABBYY software FineReader 10 is currently on sale for $170 rather than its usual $400.

I bought this software on the recommendation of a cut-and-scan enthusiast friend, and it does do a surprisingly good job with a really high level of accuracy, but there are still going to be a lot of "hand work" and manual corrections needed before the book is 100% perfect. And an awful lot of the outcome is going to depend on the quality of your scanner -- with my scanner, I have to go back and re-scan 5% of a book to get the best, cleanest OCR results.

I don't really recommend investing in this software unless you're just absolutely fanatical about conversion like I am, but if you *are* like me, the sale is quite a value at the moment and I felt I should mention it.

12 comments:

Redwood Rhiadra said...: You can actually find out the pixel values within IrfanView itself. Just click any point on the image, and look at the title bar - it will show the XY coordinates (and also the color).; 7/07/2011 1:40 PM
Ana Mardoll said...: Redwood Rhiadra Ah, thank you! I don't use Irfanview much (well, I didn't before I found out about their AWESOME BATCH FUNCTION which will guarantee that I use it lots more in the future), so I didn't realize that. Thanks!; 7/07/2011 1:43 PM
Cupcakedoll said...: Or for text-only novels, the labor-intensive method:
1) Scan each page on a scannerbed, using a scanner/reader program that creates a .doc or similar of the results.

2) Set the book in some kind of bookholder above your keyboard. Get your scanned version in a .doc on the screen. Read through it, correcting the mistakes by hand. Then re-re-read for typos.

3) transfer to pdf and there you go.

This takes a good few work-hours, but--depending on the length of the book-- should be possible within the time of an interlibrary loan checkout. If, for some reason, you really want a copy of something that costs a few hundred dollars and is so far out of print it can see the curvature of the Earth. =); 7/08/2011 12:02 AM
peterpatrickgo said...: That's so brilliant..; 7/08/2011 2:34 AM
Pthalo said...: That's really interesting. I've scanned (short) things in before, but not converted them to ebook, and I never took out the pages, I just scanned them and put up with a little hard to read text here and there.

I've heard that tesseract is a good, free ocr program, but I don't know how good it is. http://code.google.com/p/tesseract-ocr/ It runs on windows and linux, and probably on mac. It's in the ubuntu repositories, so easy to install on that platform at least. I installed it, but I don't have anything scanned on my computer right now in black and white .tif format.; 7/08/2011 3:42 PM
Elfwreck said...: Thank you for the reminder!; 7/08/2011 5:32 PM
Kadia said...: Wow, that's pretty neat. Wouldn't it be faster -- in the long-run -- to remove your own eyeballs and replace them with LCD display spheres?; 7/08/2011 7:58 PM
Inquisitive Raven said...: Hmm....I already have Omnipage for OCR software and it can convert images straight from the scanner, thereby skipping a step. So where do you get a scanner with an automated sheet feeder for input? I've got a multifunction scanner/copier printer, but I have to put the pages in one at a time, two at a time for smaller page sizes, e.g. mass paperbacks. OTOH, if I'm putting the pages in one at a time, I can get away with not cutting apart smaller books or books with lie flat bindings.; 7/10/2011 12:57 AM
Bob said...: Rather than using a scanner, try a digital camera. Lots of stuff on the web on how to do this. If it's good enough for google :) Besides, it's probably quicker and the book can stay in a pristine state :); 7/13/2011 4:53 PM
