Showing posts with label Tech Notes. Show all posts

Monday, July 11, 2011

Tech Note: Complete Books

[post 163]

I'm about to overwhelm you with a slew of free, complete books related to the history of clowns in the theatre (and, yes, I expect you to read every single word of every single book), but for now I'll be happy if you just read the next paragraph:

Google is not the first to do so, but it is scanning tons of books that are in the public domain and putting them on the web for free download. A scan is simply a digital image of each successive page of the book. Think of it as a snapshot you might take with your camera. The computer does not initially recognize the text as characters and words, only as random pixels on the page. This may be fine, since you can still read the book, but what you can't do is search for specific names or copy text from the book into a separate document. Which kind of book you're getting is not necessarily advertised, so you may need to use the word search feature in programs such as Adobe Acrobat to see if it can find a word on the first page; if not, all you have is an image of the book. Someone has to run OCR (optical character recognition) software to turn that image into a searchable text document but, depending on the quality of the source material, the accuracy level will vary. If this paragraph were 99% accurate, it would still have 11 errors in it! So in reality, once these scans are converted to text, the results need to be carefully proofread by human eyes (knot just run threw spelt Czech), and you can bet Google ain't doing that. When I post a Google book, I do run it through OCR for you if it needs it, so it is searchable. I do not, however, have the time to spell check and proofread the book for you. Sorry...
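For the curious, here is a rough sketch of the arithmetic behind that kind of claim. It assumes accuracy is measured per character and that errors are spread evenly, which no OCR program promises; the function name and the sample sizes are my own, purely for illustration.

```python
def expected_ocr_errors(char_count: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return char_count * (1.0 - accuracy)


# Hypothetical sizes: a short paragraph, a full page, and a 300-page book.
for label, chars in [("paragraph", 1_100), ("page", 2_000), ("book", 600_000)]:
    errors = expected_ocr_errors(chars, 0.99)
    print(f"{label}: about {errors:.0f} errors at 99% accuracy")
```

Even a rate that sounds excellent adds up to thousands of mistakes over a whole book, which is why the proofreading matters.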

Okay, you can stop reading now. If, however, this kind of tech stuff interests you (perhaps you're planning to convert some books or magazines of your own), then read on... A few more things to know:

Public Domain
Copyright laws vary from country to country and are of course subject to rulings by individual judges as to what constitutes "fair use." Basically, in the United States, material published before 1923 is out of copyright and therefore in the public domain, but more recent material may be public as well because the copyright holders failed to renew their copyright. It is not necessarily easy to find out a book's copyright status, as the Library of Congress charges an arm and a leg, though I did hear of a new web site, currently in beta, designed to help in the process: www.durationator.com.

Google is also putting large portions of copyrighted texts online as what they call "previews," claiming this is fair use because they are only excerpting it. This is frankly a dubious argument, given the size of their "excerpts," but they are a powerful organization and have managed to strong-arm many publishers into going along. I have mixed feelings about this. As a reader / researcher, I love the access. As a writer / creator, I want to be paid for my work.



Scanning a Book
Old books are old. The pages yellow, type flecks off, the paper disintegrates. People write in them, adding notes or underlining key phrases. Stick the book on a scanner and it won't lie flat, giving you scanned text that is seriously curved. Scanning is done by humans, and some Google Books are comically off-kilter. In other words, the poor image quality of many scans makes accurate OCR problematic, to put it mildly. Yes, you can take each page into Photoshop and make adjustments to alignment, color and contrast, even erase what doesn't belong (see below), but like I said, you can bet Google ain't doing that.

OCR
Optical character recognition software used to cost $800 (OmniPage), but now it's built into Adobe Acrobat Pro. The results usually look good at first, and as a bonus it does "deskew" the pages for you, but if you export the .pdf file as a text document and run it through spell check, you may be in for a rude surprise. Yes, tons of errors — and those are just the ones spell check catches. Thus the need for a human proofreader.
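Why does spell check miss so much? A minimal sketch, assuming a spell checker that does nothing more than look each word up in a word list (real ones are smarter, but the blind spot is the same). The tiny word list and the function name here are made up for the example.

```python
import re

# Tiny stand-in dictionary; a real spell checker ships a far larger word list.
KNOWN_WORDS = {
    "not", "knot", "just", "run", "through", "threw",
    "spell", "spelt", "check", "czech", "the", "clown",
}


def flag_misspellings(text: str) -> list[str]:
    """Return the words a plain dictionary-lookup spell check would flag."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in KNOWN_WORDS]


# Typical OCR garble: these are not words, so they get caught.
print(flag_misspellings("tbe c1own"))  # ['tbe', 'c1own']

# Every word here is a real word, so nothing is flagged.
print(flag_misspellings("knot just run threw spelt Czech"))  # []
```

The second case is the dangerous one: when OCR turns one real word into another, only a human reading for sense will notice.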

Copying and Pasting Text
It is not necessarily easy to copy and paste large sections of text from a pdf into a text document. The workaround is to use Adobe Acrobat or another program that can export the entire document to plain text, html, rtf, and/or a Word doc file.

How to Do It
(Yep, this has zero to do with physical comedy!)
The vast majority of complete books on this blog were not converted by me, but when I do digitize books, this is how I do it; other tips more than welcome!
• I usually scan at 600 dpi for greater accuracy, though this also picks up more schmutz, so I'm not sure it's necessarily better than 300 dpi.
• Unless you actually rip the pages out of the original book (some libraries frown on this), your pages will probably not come out properly aligned. In Photoshop, I use the ruler tool to trace a line I know should be horizontal or vertical (e.g., the baseline of a line of text), and then realign the page by clicking on the IMAGE pull-down menu, then IMAGE ROTATION, then ARBITRARY, then OK.
• I convert to black & white (IMAGE / MODE / GRAYSCALE) and then improve the contrast with first an auto levels adjustment (IMAGE / ADJUSTMENTS / LEVELS / AUTO), followed by a custom curves adjustment (IMAGE / ADJUSTMENTS / CURVES) that usually looks something like the graph you see to the right (depending on the scan).
• I crop each page to the same size: marquee select at fixed dimensions, then IMAGE / CROP.
• I save the file as a flattened jpeg at maximum quality.
• In Adobe Acrobat Pro, from the FILE pull-down menu I choose CREATE PDF / MERGE FILES INTO A SINGLE PDF. You drag your jpegs into a window, put them into the exact order you want, and once you click on COMBINE FILES it creates the pdf document for you.
• Finally, to convert it into a searchable text document, click on DOCUMENT / OCR TEXT RECOGNITION / RECOGNIZE TEXT USING OCR.
• You then might want to export it as an .rtf (rich text) document and run it through a spell check program.
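
If you're wondering what the alignment and contrast steps in the list above boil down to, here is a minimal sketch of the underlying math. It is not Photoshop's actual code: the function names are my own, the deskew assumes you have two points on a line that should be horizontal, and the levels stretch is a plain linear one (Photoshop's auto levels also clips a small fraction of the darkest and lightest pixels, which this ignores).

```python
import math


def skew_angle_degrees(x1: float, y1: float, x2: float, y2: float) -> float:
    """Angle of the line (x1, y1)-(x2, y2) relative to horizontal, in degrees.

    Coordinates are image pixels with y increasing downward, so a positive
    result means the traced baseline drops toward the right.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def auto_levels(pixels: list[int]) -> list[int]:
    """Stretch 8-bit gray values so the darkest becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


# Hypothetical ruler line traced along a baseline on a 600 dpi scan.
angle = skew_angle_degrees(100, 500, 2100, 535)
print(f"Page is skewed by {angle:.2f} degrees; rotate by the opposite amount.")

# A washed-out scan whose grays only span 60..200.
print(auto_levels([60, 120, 200]))  # [0, 109, 255]
```

A skew of a single degree looks tiny, but across a full page it moves the end of a line of text by dozens of pixels, which is plenty to confuse OCR.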

Saturday, January 23, 2010

Complete Book: Memoirs of Joseph Grimaldi, edited by Charles Dickens

[post 059]

Today I introduce yet another new feature to this blog, a complete book in the form of a pdf file suitable for reading online, downloading, or printing. Because of legal issues, most if not all books presented here will be from the pre-copyright era, roughly a century or more ago, and therefore of a historical nature.

We start off with a classic, the Memoirs of Joseph Grimaldi, edited by none other than Charles Dickens (pseudonym Boz). Grimaldi (1779–1837) was perhaps the most celebrated clown who ever lived, the clown credited with elevating the craft to an art form, the man from whom latter-day clowns derived the nickname "joey." If you want a quick introduction to Grimaldi, go to post 002 on this blog and take a look at chapter five (pp.8–14) from my book Clowns.

How these memoirs apparently came about is its own story, here summed up by our good friend Dr. Wikipedia:

The book's accuracy is not entirely clear, since it went through a number of revisions, not all with Grimaldi's input. Grimaldi's original manuscript, which he mostly dictated, was about 400 pages; he completed it in December 1836. The original "excessively voluminous" version was apparently not good enough for publication, and in early 1837 he signed a contract with a collaborator, the obscure Grub Street writer Thomas Egerton Wilks, to "rewrite, revise, and correct" the manuscript. However, two months after signing the contract, Grimaldi died, and Wilks finished the job on his own, not only cutting and condensing the original but introducing extra material based on his conversations with Grimaldi. Wilks made no indication which parts of his production were actually written by Grimaldi and which parts were original to Wilks. He also chose to change Grimaldi's first-person narration to the third person.

In September 1837, Wilks offered the Memoirs to Richard Bentley, publisher of the magazine Bentley's Miscellany. Bentley bought it, after securing the copyright from Grimaldi's estate, but he thought it was still too long and also badly edited, so he asked one of his favorite young writers, the novelist Charles Dickens, then twenty-five years old, to re-edit and re-write it. At first Dickens was not inclined to take the job, and he wrote to Bentley in October 1837:

"I have thought the matter over, and looked it over, too. It is very badly done, and is so redolent of twaddle that I fear I cannot take it up on any conditions to which you would be disposed to accede. I should require to be assured three hundred pounds in the first instance without any reference to the sale -- and as I should be bound to stipulate in addition that the book should not be published in numbers I think it would scarcely serve your purpose."

However, Bentley agreed to Dickens' terms (a guarantee of three hundred pounds and an agreement to publish the book all at once, and not in monthly numbers.) Dickens signed a contract in November 1837, and completed the job in January 1838, mostly by dictation. Dickens seems never to have seen Grimaldi's original manuscript (which remained in the hands of the executor), but only worked from Wilks' version, which he heavily edited and re-wrote. Bentley published it in two octavo volumes in February 1838.

How faithful this twice-edited, twice-rewritten version is to the original cannot now be determined, since the original manuscript was sold at an estate sale in 1874 and has never been seen since.

Tech Note: The scan of this book is by Google, which you may have heard is ruffling a lot of feathers by trying to digitize every book they can get their hands on, copyright be damned. As far as I can tell, what they do is scan the book as an image, that's all, nothing but a bunch of dumb pixels that don't even know they're banding together to form language. Google makes no attempt to perform OCR (optical character recognition), which would translate the image of text into individual letters and words a computer can recognize separate from one another, thus allowing for searching topics, copying & pasting, editing, etc. The reason they don't do this is that OCR software is not 100% accurate, especially when applied to old books, so for it to come out right someone would have to spend hours... and hours... and hours proofreading the entire book. Unfortunately, an old scanned book is harder on the eyes than one converted to crisp, clear text but, you know what they say, you get what you pay for.




Monday, July 6, 2009

(Forced) Blog Vacation

[post 019]

Talk about technical difficulties!

Traveling across eight countries in a matter of a month or so allowed me to see a whole lot of bloggable performance, but left me very little time to edit video or to write. The result was only one post, though a substantial one: my recent report on the Antibes Street Theatre festival. I thought I'd catch up now that I'm in one place here in Dikili, Turkey, where all I'm doing is saving you, your grandchildren, and planet Earth from extinction and all that other messy stuff. In other words, teaching a lot and learning even more at the Climate Advocacy Institute, where I'm heading a Bloomfield College Creative Arts & Technology team working in collaboration with OSI (the Open Society Institute), Tactical Tech Collective, 350.org, and IndyAct, training activists to prepare for COP 15, the United Nations Climate Change Conference this December in Copenhagen.


All fine and good, but disaster has struck, at least in terms of this blog. For starters, internet service here has been spotty at best, at times non-existent. To make matters worse, YouTube is blocked in Turkey and the proxy servers (vtunnel; hidemyass; etc.) aren't doing the trick this year. Yes, it's hard to do this kind of blog without YouTube! All of which I might be able to work around, but right after finishing the Antibes post my lovely Mac laptop died, dead as a doornail. In NYC I live five blocks from an Apple Store; here I'm five countries away. This means I am suffering the ultimate indignity of having to type on a Windows computer, and with a Turkish keyboard no less. But the really sad news is that my Mac had a ton of performance video and half-written blog posts on it, which may indeed be lost forever, but I won't know for sure until I reach London or New York. Yeah, yeah, I know I should have backed it up, but I wanted to travel light for a change and not lug a FireWire drive around for ten weeks.

So the blog will be back at full delta force, maybe not The Day After Tomorrow, but certainly by the last week of July; meanwhile, if you have any London physical comedy recommendations for me for the week of July 20th, drop me a line!

Update: Back in NYC, and yes it was total hard drive failure, lost some good stuff but did have some things backed up. Hope to have computer back by end of the week and start posting soon thereafter. — jt (7-30)


Sunday, June 7, 2009

Weekly Blog Bulletin

[post 016]

And by weekly, I do mean bi-weekly.


So, dear friends, Barack, Michelle and I are in Paris but, man, things are so busy that we don't even get to spend any quality time together. Had to ix-nay the prez on helping out on the trip to Normandy (or as they say in France, Normandie), but at least I got to show him around the Pompidou. Yes, the life of a chief executive is a hectic one, whether you're running a country or a physical comedy blog. Speaking of which, back to work...


______________________________

Quote of the Week

"What you have to do is create a character. Then the character just does his best, and there's your comedy. No begging." -- Buster Keaton

New to the blog?

Check out the intro in the sidebar somewhere to the right... >>>>

This blog is best viewed on Firefox. Please do yourself a favor: get Firefox! (Yes, it's free.)

Still coming soon to a blog with the same URL as this one
• Waiting for Godot New York vs. London Ultimate Smackdown with them fightin' supertramps Bill Irwin, Nathan Lane, Patrick Stewart & Ian McKellen, plus Peter Brook goes mano à mano with Beckett in Paris
• Live video report from Déantibulations, the street theatre festival of Antibes, France
• Complete coverage of this summer's Jacques Tati Exposition in Paris

What's New this Week
New posts: The Julians Acrobats, Dick Van Dyke on slapstick, and a Feydeau Performance Report
New sidebars: Blog Post History, My Other Blogs, the Visitor Counter, and Followers.

Keep Showing Me Your Shows!
I'm still in Europe (Paris at the moment) and will be here and in Turkey thru July 25th, so I welcome any suggestions of shows to see, especially physical comedy and physical theatre, but other arts events as well. Here’s my remaining schedule:
June 9 – June 12: Amsterdam
June 13 – June 17: Berlin & Poznan
June 18 – June 22: SW Turkey
June 22 – June 26: Istanbul
June 27 – July 18: Dikili (Turkey) = work
July 19 – July 25: London
July 26: back in New York

So drop me a line if you have any tips for me....

Tech Notes
• I have Comments turned on but for some reason it's giving an error message.
• The blog title field seems empty (no title in your browser tab) because Blogspot has been refusing to let me load my banner instead of the title text. Right now I have a workaround that keeps a minimal title (just a period) and allows me to load the banner.

I hope to get both of these fixed soon, but I've learned the hard way that technology can make you less creative and less productive, not more, especially if you let it bog you down spending days on end trying to solve some glitch. I'm seduced by technology but at times miss the old days when if you wanted to write, you wrote, if you wanted to do a show you just rehearsed and did it. (Or didn't rehearse.)

I could, for example, have spent another six months figuring out blog coding to get this site to look the way I really want it to look, but I've wisely made a content-first vow and will deal with design/tech issues as I go. For example, I don't like this narrow two-column Blogger template, but every time I tried something different I ran into glitches that I'd waste a day or two trying to solve, and meanwhile no blog. So eventually this may have a wider three-column layout and a snazzier design, but one thing at a time, eau-quais?