Monday, July 11, 2011

Tech Note: Complete Books

[post 163]

I'm about to overwhelm you with a slew of free, complete books related to the history of clowns in the theatre (and, yes, I expect you to read every single word of every single book), but for now I'll be happy if you just read the next paragraph:

Google is not the first to do so, but it is scanning tons of books that are in the public domain and putting them on the web for free download. A scan is simply a digital image of each successive page of the book. Think of it as a snapshot you might take with your camera. The computer does not initially recognize the text as characters and words, only as a pattern of pixels on the page. This may be fine, since you can still read the book, but what you can't do is search for specific names or copy text from the book into a separate document. To turn that image into a searchable text document, someone has to run it through OCR (optical character recognition) software, and the accuracy will vary with the quality of the source material. Whether you're getting an image-only book or a searchable one is not necessarily advertised, so you may need to use the word search feature in a program such as Adobe Acrobat to see if it can find a word on the first page; if not, all you have is an image of the book. If this paragraph were 99% accurate, it would still have 11 errors in it! So in reality, once these scans are converted to text, the results need to be carefully proofread by human eyes (knot just run threw spelt Czech), and you can bet Google ain't doing that. When I post a Google book, I do run it through OCR for you if it needs it, so it is searchable. I do not, however, have the time to spell check and proofread the book for you. Sorry...

Okay, you can stop reading now. If, however, this kind of tech stuff interests you, or you're planning to convert some books or magazines of your own, then read on. A few more things to know:

Public Domain
Copyright laws vary from country to country and are of course subject to rulings by individual judges as to what constitutes "fair use." In the United States, material published before 1923 is in the public domain because its copyright has expired, but more recent material may be public as well because the copyright holders failed to renew their copyright. It is not necessarily easy to find out a book's copyright status, as the Library of Congress charges an arm and a leg, though I did hear of a new web site, currently in beta, designed to help in the process.

Google is also putting large portions of copyrighted texts online as what they call "previews," claiming this is fair use because they are only excerpting them. This is frankly a dubious argument, given the size of their "excerpts," but they are a powerful organization and have managed to strong-arm many publishers into going along. I have mixed feelings about this. As a reader / researcher, I love the access. As a writer / creator, I want to be paid for my work.

Scanning a Book
Old books are old. The pages yellow, type flecks off, the paper disintegrates. People write in them, adding notes or underlining key phrases. Stick the book on a scanner and it won't lie flat, giving you scanned text that is seriously curved. Scanning is done by humans, and some Google Books are comically off-kilter. In other words, the poor image quality of many scans makes accurate OCR problematic, to put it mildly. Yes, you can take each page into Photoshop and make adjustments to alignment, color and contrast, even erase what doesn't belong (see below), but like I said, you can bet Google ain't doing that.
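To put a rough number on "problematic": the 99% figure mentioned earlier sounds great until you multiply it out. Here is a minimal Python sketch of that arithmetic. It assumes accuracy is measured per character, and the two lengths (a paragraph of about 1,100 characters, a book of about 500,000) are made-up round numbers for illustration, not measurements of any real scan.

```python
def expected_ocr_errors(char_count, accuracy):
    """Rough count of wrong characters to expect at a given per-character accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(char_count * (1.0 - accuracy))


# Assumed lengths, for illustration only.
paragraph_chars = 1_100    # one longish paragraph
book_chars = 500_000       # a hypothetical 250-page book at ~2,000 characters per page

print(expected_ocr_errors(paragraph_chars, 0.99))  # 11
print(expected_ocr_errors(book_chars, 0.99))       # 5000
```

Even at 99.9%, that hypothetical book would still carry around 500 wrong characters, which is why a clean-looking OCR result can hide a lot of trouble.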

Optical character recognition software used to cost $800 (OmniPage), but now it's built into Adobe Acrobat Pro. The results usually look good at first, and as a bonus it does "deskew" the pages for you, but if you export the .pdf file as a text document and run it through spell check, you may be in for a rude surprise. Yes, tons of errors — and those are just the ones spell check catches. Thus the need for a human proofreader.
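Why does spell check miss so much? A spell checker only asks whether each word exists, not whether it is the right word. The sketch below is a toy version of that idea in Python: `unknown_words` and the tiny `DICTIONARY` set are hypothetical names invented for this example, not part of Acrobat or any real spell checker, and the "OCR output" is made up.

```python
import re


def unknown_words(text, dictionary):
    """Return words from text that are not in dictionary, in order, without repeats."""
    seen = set()
    flagged = []
    for word in re.findall(r"[A-Za-z']+", text):
        key = word.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            flagged.append(word)
    return flagged


# Toy dictionary (an assumption for this example; a real one has many thousands of entries).
DICTIONARY = {
    "the", "clown", "ran", "through", "modern", "spell", "check",
    "not", "just", "run", "knot", "threw", "spelt", "czech",
}

# Typical OCR confusions: "h" read as "b", "m" read as "rn".
print(unknown_words("Tbe clown ran through the rnodern spell check", DICTIONARY))
# ['Tbe', 'rnodern']

# Every word here is a real word, so nothing is flagged.
print(unknown_words("knot just run threw spelt Czech", DICTIONARY))
# []
```

The first kind of error gets caught; the second sails straight through, and only a human reader will notice it.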

Copying and Pasting Text
It is not necessarily easy to copy and paste large sections of text from a pdf into a text document. The workaround is to use Adobe Acrobat or another program that can export the entire document to plain text, html, rtf, and/or a Word doc file.

How to Do It
(Yep, this has zero to do with physical comedy!)
The vast majority of complete books on this blog were not converted by me, but when I do digitize books, this is how I do it; other tips more than welcome!
• I usually scan at 600 dpi for greater accuracy, though this also picks up more schmutz, so I'm not sure it's necessarily better than 300 dpi.
• Unless you actually rip the pages out of the original book (some libraries frown on this), your pages will probably not come out properly aligned. In Photoshop, I use the ruler tool to trace a line I know should be horizontal or vertical (e.g., the baseline of a line of text), and then realign the page by clicking on the IMAGE pull-down menu, then IMAGE ROTATION, then ARBITRARY, then OK.
• I convert to black & white (IMAGE / MODE / GRAYSCALE) and then improve the contrast with first an auto levels adjustment (IMAGE / ADJUSTMENTS / LEVELS / AUTO), followed by a custom curves adjustment (IMAGE / ADJUSTMENTS / CURVES) that usually looks something like the graph you see to the right (depending on the scan).
• I crop each page to the same size: marquee select at fixed dimensions, then IMAGE / CROP.
• I save the file as a flattened jpeg at maximum resolution.
• In Adobe Acrobat Pro, from the FILE pull-down menu I choose CREATE PDF / MERGE FILES INTO A SINGLE PDF. You drag your jpegs into a window, put them into the exact order you want, and once you click on COMBINE FILES it creates the pdf document for you.
• Finally, to convert it into a searchable text document, click on DOCUMENT / OCR TEXT RECOGNITION / RECOGNIZE TEXT USING OCR.
• You then might want to export it as an .rtf (rich text) document and run it through a spell check program.
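
For anyone who would rather script this than click through menus, two of the steps above come down to simple arithmetic: measuring how crooked a page is from a traced baseline, and stretching the contrast so the darkest ink becomes black and the yellowed paper becomes white. The Python sketch below shows only that arithmetic. It is not how Photoshop does it internally; the function names, the sample coordinates, and the pixel values are all invented for illustration, and a real script would apply these to actual image data with an imaging library.

```python
import math


def skew_angle_degrees(x1, y1, x2, y2):
    """Angle of the line (x1, y1) -> (x2, y2) relative to horizontal, in degrees.

    Uses image coordinates, where y grows downward, so a positive result
    means the right end of the traced line sits lower than the left end.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def auto_levels(pixels, clip_percent=0.5):
    """Stretch 8-bit grayscale values so the darkest maps to 0 and the lightest to 255.

    clip_percent of the values at each end are ignored when picking the
    darkest and lightest points, so a few stray specks do not set the range.
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    cut = int(len(ordered) * clip_percent / 100)
    low = ordered[cut]
    high = ordered[len(ordered) - 1 - cut]
    if high == low:
        return list(pixels)  # flat image, nothing to stretch
    scale = 255 / (high - low)
    return [min(255, max(0, round((p - low) * scale))) for p in pixels]


# A text baseline traced from (100, 400) to (2100, 435): made-up coordinates.
angle = skew_angle_degrees(100, 400, 2100, 435)
print(f"{angle:.2f}")  # 1.00

# Faded ink (60) on yellowed paper (200): made-up grayscale values.
print(auto_levels([60, 80, 200, 190, 200, 60]))
# [0, 36, 255, 237, 255, 0]
```

A baseline that drops 35 pixels over 2,000 is only about one degree off, yet that is easily enough to look crooked on the page. To straighten it you would rotate the image by the same amount in the opposite direction.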
