Accessing Treasure Troves of Data: Empowering your own Research

By Ian Milligan

This post is a bit technical. My goal is to explain technical concepts related to digital history so people can save time and not have to rely on experts. The worst thing that could happen to digital history is for knowledge to consolidate among a handful of experts.

From the holdings of Library and Archives Canada, to the Internet Archive, or smaller repositories like digitized presidential diaries, or Roman Empire transcriptions, there are a lot of digitized primary sources out there on the Web. You don’t need to be a “digital historian” to realize that sometimes there is a benefit to having copies of these sources on your own computer. You can add them to your own research database, make them into Word Clouds (I know, they’re not perfect), or find ways to manipulate them with tools such as Voyant-Tools, spreadsheet software, or the many other tools that are available. If you can download sources, you may not have to physically travel to an archive, which to me suggests more democratic access to sources.

Digital historians have been working on teaching users how to access the databases that run online archival collections and how to harness this information for your own research. In this post, I want to give readers a quick overview of some of the resources out there that you can use to build your own repositories of information. If you ever find yourself clicking at your computer, hitting ‘right click’ and then ‘save page as,’ or downloading PDF after PDF after PDF… this post will help you better utilize your computer’s tools, making the digital research process a bit quicker.

So how can we download sources?

Look for the big red button

First, some websites have actually built an export feature into their databases. Say you’re researching the war dead of the First World War and wanted to get a database of every Canadian who died. You could go to the Commonwealth War Graves Commission, do an advanced search for those who served with the Canadian Forces, and in the results be presented with a list of 110,365 fatalities.

You can’t read those all through the web portal, but you might be able to do something with the data in Microsoft Excel. Luckily, before you freak out, you might notice this:

Click the big red button, and the data is downloaded.

This is becoming more common (and for the love of God, if you’re designing a database, you should put this in). The Epigraphic Database Heidelberg (Latin and Latin-Greek inscriptions found throughout the Roman Empire) has this as well, albeit a bit more hidden away.

But, unfortunately, not all websites make downloading data this easy. Fear not, there are other tools available:

Outwit Hub

A search result from Suda On Line. Imagine we just wanted to grab Adler numbers and Translations. With Outwit Hub, we can.

Let’s use a simpler example. Let’s go to Suda On Line and do a search. I searched for “pie” and received eleven results: this link brings you to the result.

Let’s say we wanted to make a spreadsheet that just had two values: the “adler number” and the “translation.” These are just two terms in this example: your own data might have its own specifics (surnames, for example, or service numbers). You could manually go down that list and type out every adler number, and then copy and paste every translation. But you’d be wasting time that you could instead spend with friends or family.

We can use Outwit Hub to capture this data with a few clicks. First we need to find the structure of the website, so we right-click in the web browser and click ‘view source,’ which brings up a window like this:

The HTML soup that runs the web.

When we look closer at each entry, we see code like this:

[code language=”html”]
<strong>Adler number: </strong>chi,63
<br/>
<strong>Translated headword: </strong>groundwards<br/>
<strong class="high">Vetting Status: high</strong><br/>
<strong>Translation: </strong><div class="translation">[Meaning] to the ground, into/towards [the] earth. Also [sc. attested is] <span style=’font-family:;’>?????? </span>, [also meaning] to the ground.</div>
[/code]

Basically, the Adler number always appears after

[code language=”html”]Adler number: </strong>[/code]

and before

[code language=”html”]<br/>[/code]

Likewise, the Translation always appears after

[code language=”html”]<div class="translation">[/code]

and before the next

[code language=”html”]</div>[/code]

We can point Outwit Hub at this website and tell it to grab all the numbers between those start points and end points, and likewise the translations, and put them into a spreadsheet.
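
For readers curious about what a scraper is doing under the hood, here is a minimal sketch in Python of the same “start marker, end marker” idea. It is not how Outwit Hub itself is implemented: the function name `between` is hypothetical, the sample is a shortened copy of the entry shown above, and the column headings are placeholders. It uses only Python’s standard library.

```python
import csv
import io
import re

# A shortened copy of one entry from the page source shown above.
SAMPLE = """<strong>Adler number: </strong>chi,63
<br/>
<strong>Translated headword: </strong>groundwards<br/>
<strong>Translation: </strong><div class="translation">[Meaning] to the ground, into/towards [the] earth.</div>
"""


def between(text, start, end):
    """Return every stretch of text found after `start` and before the next `end`."""
    # re.escape makes the markers match literally; (.*?) grabs as little as possible.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return [match.strip() for match in re.findall(pattern, text, flags=re.DOTALL)]


adler_numbers = between(SAMPLE, "Adler number: </strong>", "<br/>")
translations = between(SAMPLE, '<div class="translation">', "</div>")

# Write the pairs out as CSV text, which any spreadsheet program can open.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["adler_number", "translation"])  # placeholder column names
writer.writerows(zip(adler_numbers, translations))
print(buffer.getvalue())
```

On a real results page the same two calls would return one item per entry, which is why the start and end markers need to be text that appears around every entry in exactly the same way.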

A Quick Walkthrough

Install it from here, open it up, paste the URL. In our example it’s:

[code language=”html”]http://www.stoa.org/sol-bin/search.pl?search_method=QUERY&login=guest&enlogin=guest&page_num=1&user_list=LIST&searchstr=pie&field=any&num_per_page=10&db=REAL[/code]

In the top bar, click “scrapers” on the left column. Click “new” at the bottom, call your scraper “STOA,” and press OK. If it asks you to buy the product, just say no thanks. 🙂

It should look like this:

Now we want to fill out the form with our information from above. If you click on blank fields below, you can enter values. I’ve filled out the form below. Try replicating it:

When you click ‘Execute’ you’ll be brought to a table with your information!

You can then click ‘Catch’ at the bottom if you want to keep going with other websites (and then repeat the process to find more data), or you can click ‘Export’ at right to bring your information to a spreadsheet.

With a little play, you’ll be scraping like a pro!

Programming?

Sometimes, you want even more information. In that case, you’ll need to start programming. Again, don’t be scared: there are tutorials that are awesome and can help you find information.

There are four lessons at the Programming Historian that I want to highlight, but will leave you to navigate yourself.

Data Mining the Internet Archive Collection, by Caleb McDaniel. Following McDaniel’s lesson, you’ll be downloading metadata for the Anti-Slavery Collection at the Boston Public Library. It’s 7,571 items, too many for you to necessarily read yourself, so the lesson takes you through some automated ways to download them.

Automated Downloading with Wget, by yours truly, Ian Milligan. Wget is a blunt force instrument that can let you download tons of sources in one fell swoop. In the lesson, I show you how you could download everything from ActiveHistory.ca to your local computer.

Applied Archival Downloading with Wget, by Kellen Kurschinski, develops the lessons in the previous lesson further and shows how you could generate record codes for collections in the Canadian and Australian national libraries and begin to download sources very, very quickly.

And finally, Downloading Multiple Records Using Query Strings, by Adam Crymble, uses the Old Bailey Online database as a way to show how you could make your own collection of trials.

All of these examples could be extended to your own research interests.
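
As a small taste of the query-string idea, here is a sketch in Python (standard library only) that takes the Suda On Line search URL from the Outwit Hub walkthrough, pulls out its parameters, and builds the URL for another page of results by changing `page_num`. The helper name `page_url` is hypothetical, and the sketch assumes that `page_num` selects the results page, as its name suggests; check that against the site before relying on it. It only builds the URLs and does not download anything.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# The search URL from the walkthrough, split across lines for readability.
SEARCH_URL = (
    "http://www.stoa.org/sol-bin/search.pl?search_method=QUERY&login=guest"
    "&enlogin=guest&page_num=1&user_list=LIST&searchstr=pie&field=any"
    "&num_per_page=10&db=REAL"
)


def page_url(url, page):
    """Return a copy of `url` with its page_num parameter set to `page`."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # a dict such as {"searchstr": ["pie"], ...}
    params["page_num"] = [str(page)]  # assumption: page_num picks the results page
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


# Inspect what the query string is actually asking for.
params = parse_qs(urlsplit(SEARCH_URL).query)
print(params["searchstr"], params["num_per_page"])

# Eleven results at ten per page would need two pages.
for page in (1, 2):
    print(page_url(SEARCH_URL, page))
```

The same pattern, changing one parameter at a time and leaving the rest alone, is the general idea behind generating a long list of record URLs to hand to a downloader.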

Conclusions

I don’t want to mislead you: the first time you try to do some of this it will be hard. But the second time, it’s easier. The third, even easier… Soon you’ll be collecting historical sources on a massive scale.

Good luck! And if you run into trouble, drop a note in the comments. My goal is to make our research methods transparent, and more importantly, help you save time.

Ian Milligan is an assistant professor of Canadian and digital history at the University of Waterloo. He’s one of the co-authors of The Historian’s Macroscope, a forthcoming peer-reviewed handbook on digital history methodologies that will hopefully be appearing in 2015 *knock on wood*.

One thought on “Accessing Treasure Troves of Data: Empowering your own Research”

Conal Tuohy (@conal_tuohy) November 29, 2014 at 11:45 pm

Thanks Ian!

Another approach to downloading bulk records is to use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Many cultural heritage institutions do provide an OAI-PMH interface to their records, and installing and using an OAI-PMH harvester (such as jOAI) to query the OAI-PMH server is not too tricky. It’s even possible to construct an OAI-PMH interface for online services that don’t provide one, based on whatever web API they DO provide. I’ve blogged a bit about this at http://conaltuohy.com/blog/tag/oai-pmh/
