You are cleaning out the attic of your house and find a diary from the early 1900s written by a distant relative. What do you do with the diary? How do you make it useful to the general public? Donating it to a museum or an archive is a good start. However, for the diary to be useful to a wider audience, it needs to be transcribed. A transcribed document can be made full-text searchable, its text can be copied, and the entire document becomes accessible to a wider audience. Transcription can be a time-consuming and painstaking process. But once a document has been transcribed, its usefulness increases dramatically.
Optical Character Recognition (OCR) software is a useful tool for speeding up transcription. OCR is the conversion of images of text into machine-readable text. However, even using OCR software is a time-consuming process. OCR is not 100% accurate, and its output usually needs to be proofread against the original. Additionally, most OCR software is designed to work with typewritten text in good condition, which makes it of little use for a handwritten diary that may be falling apart.
Given the time-consuming but beneficial nature of transcription, it isn’t surprising that some online crowdsourced transcription projects have been launched. Projects like Transcribe Bentham ask the general public to help transcribe original historical documents. In the case of Transcribe Bentham, the manuscript papers written by Jeremy Bentham have been placed online, and volunteers are invited to transcribe this wealth of material. Users’ transcriptions will eventually be stored and used to create a fully searchable online database. This project is in its early phases, but it is already seeing great user participation and results.
In addition to projects like Transcribe Bentham, which focus on a specific series of documents, Scripto, an open source tool, was recently launched by CHNM. Scripto is a crowdsourcing transcription tool that will eventually provide institutions with a platform to implement their own online transcription initiatives. Scripto is still under development but has the potential to make online transcription much easier.
Remember that old diary you found earlier? What might happen to it once it has been transcribed? One of the best-known digital examples of the use of transcribed material is the Old Bailey Project. Transcription has allowed for access to all of the surviving documents from the Old Bailey Proceedings and the Ordinary of Newgate's Accounts. All of these documents are keyword searchable, the digitized text has been marked up to facilitate organization of material by theme, and numerous contemporary maps and images accompany the historical records. The Old Bailey Project highlights the wealth of information that can be retrieved from transcribed historical documents. Transcription allows historical documents to be accessed, searched, manipulated, and mined for data, making them useful to a wider audience.
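To make "keyword searchable" a little more concrete, here is a minimal sketch of how transcribed pages might be indexed for keyword lookup. It is not how the Old Bailey Project or any other project mentioned here is actually built; the function names (`build_index`, `search`) and the diary text are invented purely for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page numbers where it appears."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        # Split the transcription into simple lowercase word tokens.
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(page_number)
    return index


def search(index, keyword):
    """Return the sorted page numbers containing the keyword (case-insensitive)."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical transcribed diary pages (invented sample text).
diary_pages = {
    1: "Rained all day. Father went to the mill.",
    2: "Helped Mother with the washing; the mill was closed.",
    3: "Walked to church with Aunt Mary.",
}

index = build_index(diary_pages)
print(search(index, "mill"))    # [1, 2]
print(search(index, "Church"))  # [3]
```

None of this is possible with a page image alone: the index can only be built once the handwriting has been turned into text, which is exactly what transcription provides.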
Transcription also has the potential to increase public engagement and involvement. Crowdsourced transcription initiatives that aim to involve the public can increase general knowledge about the past and help engage the general public in the historical process. Additionally, encouraging the public to be involved in transcription is essential for many smaller museums and archives. In smaller institutions, transcription is often completed by volunteers. Online applications have the potential to increase this volunteer pool and to assist heritage organizations in reaching out to wider audiences. Transcription is an invaluable part of processing text-based documents. As technological aids develop, transcription is becoming less labour intensive and more practical for a wider range of organizations.
Krista McCracken is a public history consultant and is currently working as a Digitization Facilitator for Knowledge Ontario.
Although they do not do handwritten manuscripts, another neat approach is the “Distributed Proofreaders” project, http://www.pgdpcanada.net/c/default.php. They use OCR to get a baseline text, and then have participants correct any errors.
The same with the National Library of Australia’s Newspaper collection. It’s one of the slickest examples of distributed OCR proofreading out there: http://newspapers.nla.gov.au/ndp/del/home
Transcription is a great way of preserving information that matters to a lot of people, particularly historical documents that resonate with a community. What’s more, just about anyone can do it, except of course for legal and medical transcription, which require a background in their respective fields.
It’s good to know about these open source transcription tools. I think there would be more advances in OCR and transcription technology if there were more open source projects like these.