Using reCAPTCHA to help digitize books

10Jan09

We’ve reactivated reCAPTCHA on our new domain. CAPTCHAs are the distorted images found on registration forms that help determine whether a user is a human or a computer program, such as a spam bot. Carnegie Mellon’s reCAPTCHA takes this function to a new level by using human-generated inputs to help digitize old books.

As recaptcha.net explains, Optical Character Recognition (OCR) cannot reliably read every word in scanned book images.

reCAPTCHA takes these unreadable words and uses them to generate CAPTCHA images. When human users solve the CAPTCHA, their responses help decipher the unreadable words. (In case you’re wondering how it works, users are shown two word images: one that OCR has already read successfully and one that it has not. When a user gets the known word right, the system assumes he/she is probably right about the other word too. Responses from many users are aggregated to improve confidence in the digitization.)
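For the curious, here is a rough Python sketch of the two-word idea described above. It is only an illustration, not reCAPTCHA's actual code: the class name `UnknownWordTally`, the word IDs, and the thresholds `min_votes` and `min_agreement` are all made up for this example.

```python
from collections import Counter


class UnknownWordTally:
    """Hypothetical helper: collects guesses for words OCR could not read."""

    def __init__(self, min_votes=3, min_agreement=0.75):
        # Both thresholds are illustrative assumptions, not real reCAPTCHA values.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        self.votes = {}  # unknown word id -> Counter of normalized guesses

    @staticmethod
    def _normalize(text):
        return text.strip().lower()

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user passes, i.e. typed the known word correctly."""
        if self._normalize(control_answer) != self._normalize(control_truth):
            return False  # failed the known word, so ignore the other guess
        counter = self.votes.setdefault(unknown_id, Counter())
        counter[self._normalize(unknown_answer)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed reading, or None if there is not enough agreement yet."""
        counts = self.votes.get(unknown_id, Counter())
        total = sum(counts.values())
        if total < self.min_votes:
            return None
        word, top = counts.most_common(1)[0]
        return word if top / total >= self.min_agreement else None
```

In this sketch, a guess for the unreadable word only counts when the same user typed the known word correctly, and a reading is accepted only after enough users agree on it.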

reCAPTCHA currently helps digitize books from the Internet Archive and old editions of the New York Times.

We use reCAPTCHA in two areas of the site: user registration and the Phylo Forum (where visitors can post messages without registering).
