Too Much Information: The Case for the Programming Historian

The Programming Historian

Depending on your vantage point, we have a looming opportunity – or a looming problem. Historical digital sources have reached a scale where they defy conventional analysis and now call out for computational analysis. The Internet Archive alone has 2.9 million texts, there are 2.6 million pages of historical newspapers archived at the Chronicling America site of the US Library of Congress, the McCord Museum at McGill University has over 80,000 historical photographs, and Google Books has now digitized fifteen million books out of their total goal of 130 million. Archives are increasingly committed to preserving cultural heritage materials in digital, rather than more traditional analog, forms. This is perhaps best exemplified in Canada by digitization priorities at Library and Archives Canada. The amount of accessible digital information continues to grow daily, making digital humanities projects increasingly feasible, and for that matter, necessary.

In this post, I will do two things. First, I will give a sense of how much information is out there and make the case for why Canadian historians need to start thinking about it. Second, I will introduce readers to the Programming Historian, a wonderful resource that puts you on the right track to a programming frame of mind.

TMI?

Too much information? (Photo of FEMA Publications Warehouse, WikiMedia Commons - http://bit.ly/zjmlYc)

Information overload is not new. People have long worried about the impact of too much information. In the 16th century, the German priest Martin Luther complained that the “multitude of books [were] a great evil”; in the 19th century, Edgar Allan Poe bemoaned that “[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age”; and as recently as 1970, American historian Lewis Mumford lamented that “the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance.” The rise of born-digital sources must thus be seen in this continuous context of hand-wringing over the expansion of information.

Despite the frustrations of microfilm for today’s historians, as well as the pitfalls of separating the wheat from the chaff amongst rising numbers of modern sources, historians have undoubtedly benefitted from these technical developments, perhaps disproportionately so for those engaged in social and cultural pursuits. Historians will profit meaningfully from born-digital sources as well. These, however, do present added – albeit surmountable – challenges due to their scope and production processes. Such sources do not always have attributable or reliable authorship and are often undated, but in aggregate they can give a sense of the zeitgeist of a time.

Library of Congress (Photo from WikiMedia Commons - http://bit.ly/ArU8YZ)

Storage prices are falling. For example, James Gleick, in his book The Information: A History, a Theory, a Flood, estimates that the Library of Congress collection is around 10TB (although the LOC itself claims around 200TB). These figures would once have been unimaginable, yet I can now pick up 10TB of data storage for under a thousand dollars. Born-digital collections are larger, of course: the LOC’s digital collection is 254TB, larger than its print holdings, and the Internet Archive now has 3 petabytes (PB) of information, growing at 12TB/month! In Canada, LAC has about 4TB of federal government web information and 7TB in its own internet archive. Information is also being preserved through programs such as the Roy Rosenzweig Center for History and New Media’s September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of this writing, the #Occupy archive. Online content is curated and preserved en masse: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts at collecting and preserving oral histories and personal recollections, which are then geo-tagged, transcribed, and placed online.

What can we do about this conventional and especially born-digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits.

Where to Start: Programming

By the end of the Programming Historian, you'll have a basic know-how of Python and will be able to tackle projects requiring textual analysis.

One important thing we can do with this deluge of information is learn how to interact with digital information on a mass scale. Luckily, we have a tremendous resource available to us: The Programming Historian, by William Turkel and Alan MacEachern, hosted on the Network in Canadian History & Environment (NiCHE) site. Why might you want to open up this free, open-access online book?

If you were to try to deal with born-digital sources in a traditional manner, you would spend A LOT of time flicking through websites. Much of this material hasn’t been curated, and realistically, you could not read every blog comment published on a given day in Canada, navigate all the tweets, and so forth. For this, you will need computational analysis.
The same holds true for the conventional array of information discussed above: if you want to use 2.6 million newspaper pages to their full potential, there must be a way to “distant read” them.
Digital history is ‘hot.’ The American Historical Association, meeting right now, is full of panels on it, and Twitter has been afire with talk of the field. Even if you do not necessarily see yourself using programming languages, it behooves you to be able to understand them.
And, most importantly, it isn’t that hard, and it doesn’t take that much time. You could move through the whole guide in a weekend, or – better yet – break it into small chunks, spending 20-30 minutes here and there.
Finally, I believe we’ll also have to equip the next generation of historians, as I’ve written about elsewhere on ActiveHistory.ca.

The Programming Historian is very straightforward, and by the end of it, you’ll be able to do the following:

Take a website and, in an automated, systematic fashion, extract all of the words from it for further analysis.
Establish word frequency, similar to what a Wordle word cloud displays (the possible utility of this is discussed elsewhere on this site). Indeed, you will be able to make your very own tag clouds!
Move beyond word frequency to see the keyword-in-context: if the word ‘aboriginal’ appears a hundred times in a given site, you can see exactly where each occurrence appears. This enables you to move very quickly to the relevant information.
Download and harvest information automatically. Say you find a large collection of a hundred websites. Rather than clicking repeatedly through each to download the information, a simple script can do it for you!
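
To give a flavour of what the first three items look like in practice, here is a minimal sketch in Python. It is not taken from the Programming Historian itself: the function names (extract_words, word_frequencies, keyword_in_context) and the sample page are my own illustrative assumptions, and it uses only the Python standard library. It works on a small HTML string held in memory; a real script would first download each page, for instance with urllib.request.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_words(html):
    """Return the lowercase words found in the visible text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z']+", " ".join(parser.chunks).lower())


def word_frequencies(words, stopwords=frozenset()):
    """Count how often each word occurs, ignoring any stopwords."""
    return Counter(w for w in words if w not in stopwords)


def keyword_in_context(words, keyword, width=3):
    """Return each occurrence of keyword with `width` words on either side."""
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            start = max(0, i - width)
            hits.append(" ".join(words[start:i + width + 1]))
    return hits


# Hypothetical sample page, standing in for a downloaded website.
SAMPLE_HTML = """
<html><head><title>Sample</title><style>p { color: red; }</style></head>
<body><p>The archive holds newspapers. The archive also holds photographs.</p>
<script>var x = 1;</script></body></html>
"""

if __name__ == "__main__":
    words = extract_words(SAMPLE_HTML)
    print(word_frequencies(words, stopwords={"the"}).most_common(3))
    for line in keyword_in_context(words, "archive", width=2):
        print(line)
```

The word list is kept in its original order so that the keyword-in-context step can look at each word's neighbours; the frequency count can then be fed into whatever tag cloud tool you prefer.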

Conclusion (and a proviso about why we don’t all have to be programmers!)

Don’t be afraid. It’s New Year’s, so why not make it your resolution as a historian to figure out some of these very basic steps? It could make you a better historian, or at the very least equip you to figure out what’s going on. In any case, it’s an additional tool in one’s toolkit. It is important to remember that, unlike the earlier social science histories of the 1970s that counted with computers (and which did revolutionize areas of historical inquiry), we can use broad analysis to find issues and then move dynamically down into context.

That all said, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), or a developed understanding of the methodological implications of community-based oral history, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Computational history – to use only a few examples – does not replace close reading, traditional archival inquiry, or going into communities to uncover notions of collective memory or trauma. Indeed, computational historians will play a facilitative role and provide a broader reading context; yet there will still be historians collecting relevant primary and secondary sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.
