Digitising Books, One Word at a Time

By Greg McNevin

May 28, 2007: CAPTCHA boxes are used all over the web now. Adding a comment to a public discussion or signing up for an account at a web site will usually see you re-typing obscured characters from an image. But can these boxes be used for more than security?

They sure can, says Carnegie Mellon computer scientist Luis von Ahn, who has proposed that a new CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) system be used to retype words from digitised books that optical character recognition (OCR) systems struggle with.

Called reCAPTCHA, the system is similar in spirit to the SETI@home project, which lets users donate their unused processor cycles to process information gathered in the search for intelligent life. Instead of searching for extraterrestrials, though, reCAPTCHA takes unrecognised words scanned from books and delivers them in a verification box, increasing security on the web and digitising text for the Internet Archive project one word at a time.

“I think it’s a brilliant idea - using the Internet to correct OCR mistakes,” said Internet Archive director Brewster Kahle in a statement. “This is an example of why having open collections in the public domain is important. People are working together to build a good, open system.”

With an estimated 60 million CAPTCHAs being solved a day, wide adoption of the system would translate to around 150,000 hours of work a day going towards improving the digital archive.
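As a rough check on those figures, the short Python sketch below works out how long each CAPTCHA would have to take for 60 million a day to add up to 150,000 hours. The function names are illustrative only, and the per-CAPTCHA time is inferred from the two numbers above rather than quoted from the reCAPTCHA project.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Only the two constants come from the article; everything else is illustrative.

SECONDS_PER_HOUR = 3600

CAPTCHAS_PER_DAY = 60_000_000      # estimate quoted in the article
CLAIMED_HOURS_PER_DAY = 150_000    # estimate quoted in the article


def implied_seconds_per_captcha(captchas_per_day, hours_per_day):
    """Average solving time implied by a daily CAPTCHA count and a daily hour total."""
    return hours_per_day * SECONDS_PER_HOUR / captchas_per_day


def hours_of_work(captchas_per_day, seconds_each):
    """Total human effort per day, in hours, for a given average solving time."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


if __name__ == "__main__":
    seconds = implied_seconds_per_captcha(CAPTCHAS_PER_DAY, CLAIMED_HOURS_PER_DAY)
    print(f"Implied time per CAPTCHA: {seconds} seconds")
    print(f"Hours per day at that rate: {hours_of_work(CAPTCHAS_PER_DAY, seconds)}")
```

In other words, the two estimates are consistent if a typical CAPTCHA takes about nine seconds to solve.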

Now, normal CAPTCHA systems require the obscured text to be known beforehand for the system to work. Since the system's premise is to decipher what OCR scanners cannot, a slightly different approach needed to be developed for reCAPTCHA to work.

“Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known,” reads the reCAPTCHA website. “The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one.”
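The paired-word check described in that quote can be sketched in a few lines of Python. This is a minimal illustration, not reCAPTCHA's actual code: the class and method names are hypothetical, and the tallying of answers from several users to pick a most common reading is an assumption added here, not something the quote states.

```python
from collections import Counter
from dataclasses import dataclass, field


def normalise(text):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return text.strip().lower()


@dataclass
class Challenge:
    control_answer: str  # word whose correct reading is already known
    unknown_id: str      # hypothetical identifier for a scanned word OCR could not read


@dataclass
class PairedWordChecker:
    # Maps each unknown word's identifier to a tally of accepted readings.
    # Tallying across users is an assumption made for this sketch.
    guesses: dict = field(default_factory=dict)

    def submit(self, challenge, control_typed, unknown_typed):
        """Return True if the user passes; record their reading of the unknown word."""
        if normalise(control_typed) != normalise(challenge.control_answer):
            # Failed the known word: reject and discard the other answer.
            return False
        tally = self.guesses.setdefault(challenge.unknown_id, Counter())
        tally[normalise(unknown_typed)] += 1
        return True

    def best_guess(self, unknown_id):
        """Most frequently accepted reading so far, or None if there is none."""
        tally = self.guesses.get(unknown_id)
        if not tally:
            return None
        return tally.most_common(1)[0][0]


if __name__ == "__main__":
    checker = PairedWordChecker()
    challenge = Challenge(control_answer="archive", unknown_id="scan-0001")
    print(checker.submit(challenge, "archvie", "morning"))   # wrong control word
    print(checker.submit(challenge, "Archive", "morning"))   # correct control word
    print(checker.best_guess("scan-0001"))
```

The key design point is that the user cannot tell which of the two words is the known one, so passing the test means making an honest attempt at both.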

Ingenious. So while PC users devote their spare machine time to searching for ET and PS3 owners help Folding@home fold proteins in the hunt for cures to diseases, the rest of us can contribute to the preservation of the written word and keep Web 2.0 secure at the same time.
