Book Scan Wizard and the Internet Archive team up to form a massive book scanning robot!
Book Scan Wizard is the software at the heart of the DIY Book Scanner that I’ve been working on for the last month (and blogging about here). It greatly eases the process of scanning pages of text from books and turning them into something actually usable. Scanning a book cheaply and sloppily is pretty easy; the real difficulty is turning your crappy scans or photos of pages into something usable by others. Book Scan Wizard was written to solve this problem, and it does an astounding job of it.
It was just announced a couple of hours ago that the brand new version of Book Scan Wizard will support automatic uploads to the Internet Archive, a non-profit that, you guessed it, archives the Internet and also acts as a repository for all kinds of digital information. Even better, it does this for free, supports the public domain, and allows works to be released under Creative Commons licenses. The podcasts that I’ve done in the past and the various public events (such as Cory Doctorow doing a public reading) are all permanently stored on the Internet Archive.
The Internet Archive has been working with various groups for quite a while to create a repository of free texts, fulfilling the promise of making the cultural legacy of the world’s books available to people (something that Google Books has failed to do, for all of its verbiage about scanning texts).
With the new version of Book Scan Wizard, or even by uploading directly to the Internet Archive, any PDF composed of images of book pages, or any organized zip file of page images, will be processed automatically. The Internet Archive’s servers perform optical character recognition (OCR) on the book and make PDF, EPUB, Kindle (MOBI), DAISY, DjVu, and plain text copies of the entire book available for download by anyone, anywhere. You can see a sample book from this process to get a better idea. All of this happens within a few hours of the upload. This is free OCR for anyone in the world.
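If you want to skip the wizard and build the "organized zip file" by hand, here is a rough sketch of what that might look like in Python. To be clear, this is my own illustration and not how Book Scan Wizard does it: the function name `build_page_zip`, the `book` prefix, and the zero-padded naming scheme are all assumptions I made up so that the pages sort in reading order. Check the Internet Archive's own upload instructions for the exact layout it expects.

```python
import zipfile
from pathlib import Path

# Image types treated as page scans (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def build_page_zip(image_dir, zip_path, prefix="book"):
    """Pack page images into a zip under zero-padded sequential names.

    Returns the list of names written, in page order.
    The naming scheme is hypothetical, chosen only so pages sort correctly.
    """
    # Sort by filename so the pages keep the order they were scanned in.
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not pages:
        raise ValueError(f"no page images found in {image_dir}")

    names = []
    # ZIP_STORED: image files are already compressed, so just store them.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for number, page in enumerate(pages, start=1):
            name = f"{prefix}_{number:04d}{page.suffix.lower()}"
            archive.write(page, arcname=name)
            names.append(name)
    return names
```

Calling something like `build_page_zip("scans", "mybook_images.zip", prefix="mybook")` (both paths are placeholders) would leave you with a single zip of renamed pages that you can then upload however you like.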
I hope that I don’t have to spell out the implications of software that makes it easy to turn scanner images into books, combined with the resources, availability, and now processing power of the Internet Archive. This is a game changer in many ways. While this is likely to be controversial to some people, the benefits of having this end-to-end chain of tools, from physical books to electronic texts that can be read on any computer or mobile device, are just amazing.