Illusionary Order: Cautionary Notes for Online Newspapers

by Ian Milligan on March 26, 2012

The splash page for the Globe and Mail's "Canada's Heritage Since 1844" website.

Online digitized newspapers are great. If you have access (either through a free database or via a personal or library subscription), you can quickly find the information you need: a search for a last name might help you find ancestors, a search for a specific event can provide historical context for it (e.g. the Christie Pits Riots, or a certain strike), and generally the results are beautiful, render relatively well, and are – crucially – immediate.

In some ways, however, poor and misunderstood use of online newspapers can skew historical research. In a conference presentation or a lecture, it’s not unknown to see the familiar yellow highlighting of found search words on projected images, a sign of how the original primary material was obtained. But this approach usually remains unspoken, without critical methodological reflection. As I hope to show here, using Pages of the Past uncritically for historical research is akin to using a volume of the Canadian Historical Review with 10% or so of the pages ripped out. Historians, journalists, policy researchers, genealogists, and amateur researchers need to have at least a basic understanding of what goes on inside the black box.

An example of a "results list" from the Globe and Mail's newspaper database. It all seems so orderly and systematic.

And the ensuing results, a newspaper article focused on the Artistic Woodwork Strike of 1973

An amazing array of information at your fingertips (but…)

In Canada, when one thinks of online digitized newspapers, the Toronto Star’s Pages of the Past and the Globe and Mail’s “Canada’s Heritage from 1844” often come to mind. There are other wonderful collections, of course, notably the incredible historical newspapers of British Columbia collection, but the Star and the Globe are most commonly used.

The Star and Globe can be accessed through an institutional or personal subscription (you can also access these two databases through libraries like the Toronto Public Library – with a valid library card). You can search by a specific word, or a specific phrase, and narrow it down by a date range. A keyword search (such as for “Artistic Woodwork” at right) and a date range can quickly take you to a seemingly systematic, quantified, and perhaps even complete listing of relevant articles. History laid before you, neatly ordered, from the comfort of your home, library, or office. Another click, and you’re brought to a PDF version of the scanned document: complete with placement, accompanying advertisements, etc.

An example of a feature, front-page, above-the-fold article on the Artistic Woodwork strike that does not appear in a keyword search.

But we need to use these databases with greater caution. In the example at right, the Globe and Mail’s database has correctly found a large feature article on the Artistic Woodwork strike of 1973. Yet that article is a continuation of a story that begins on Page One, and the front page itself does not appear in the search list. If you rely on the search engine alone, you miss the vivid headline, the picture, and the start of the story.

Why?

Primarily, the issue lies in faulty optical character recognition (OCR). This issue is not limited to these newspapers; it is an inherent flaw in large digitization projects. Tim Hitchcock has described the uncritical use of digitized sources as “roulette dressed up as scholarship,” as historians are “not even bothering to apply the kind of critical approach that historians built their professional authority upon.”

What about the specific case of the Toronto Star and the Globe and Mail online? These databases were assembled at the turn of the present century, and indeed, the Toronto Star is heralded by Paper of Record (the company responsible for the database creation) as the “first newspaper in the world to have its entire history … digitized.” It was created quickly, as Bruce Gillespie reported in 2003 in his “All the News That’s Fit to Scan”:

 Using technology developed in-house, Cold North Wind [Paper of Record's parent company] converts documents stored on rolled microfilm into digital computer files. It is an automated process that works quickly – Mr. Huggins says two million pages from The Toronto Star’s 110-year history were archived in less than four months.

This incredible speed and the use of microfilm originals come at a cost, however. The speed means that only basic OCR is used, and the microfilm introduces defects of its own. A keyword can be missed if it is hyphenated (a problem in narrow columns, where Woodwork might be split as Wood-work across two lines), if a microfilm streak obscures a letter, if the page was slightly tilted, or if the OCR just plain misses a character. This is currently unavoidable with large-scale digitization projects: I am currently OCRing a large collection of word-processed documents from 1997 onwards – about as perfect a sample as you can get – and while OCR accuracy under these ideal circumstances is well above 99%, it can never be perfect. Quite frankly, without human proofreading and additional layers of checking, you can never be completely confident of your accuracy. Furthermore, comprehensive database use requires at least a limited understanding of Natural Language Processing (NLP). NLP is a complicated field of research, and a proper search query would also need to be formulated to pick up alternates such as ‘Woodworking,’ etc. without unnecessary duplication of results.
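To make the hyphenation and word-variant problems concrete, here is a minimal Python sketch. It is not how the Star or Globe databases work internally (that remains a black box); the function names, the sample text, and the regular expression are all illustrative assumptions. The last function is a back-of-the-envelope calculation that assumes OCR errors strike each character independently, which real microfilm damage does not.

```python
import re


def dehyphenate(ocr_text: str) -> str:
    """Rejoin words split by a hyphen at a line break: 'Wood-\\nwork' -> 'Woodwork'.

    Caveat: this also fuses genuine compounds that happen to break at a
    line end ('well-\\nknown' -> 'wellknown'), so it is only a rough fix.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", ocr_text)


def normalize(ocr_text: str) -> str:
    """De-hyphenate, then collapse line breaks and repeated spaces."""
    return re.sub(r"\s+", " ", dehyphenate(ocr_text)).strip()


# Hypothetical query: matches 'Artistic Woodwork', 'Artistic Woodworking',
# 'Artistic Woodworker(s)' and 'Artistic Wood-work', ignoring case.
QUERY = re.compile(r"\bartistic\s+wood-?work(?:ing|ers?)?\b", re.IGNORECASE)


def find_hits(ocr_text: str) -> list[str]:
    """Return each matching phrase once per occurrence in the cleaned text."""
    return [m.group(0) for m in QUERY.finditer(normalize(ocr_text))]


def phrase_survival(char_accuracy: float, phrase: str) -> float:
    """Chance that every letter of `phrase` is read correctly, assuming
    independent per-character errors (a simplifying assumption)."""
    letters = len(phrase.replace(" ", ""))
    return char_accuracy ** letters


# Invented sample text, standing in for raw OCR output of a narrow column.
sample = (
    "Police clashed with pickets at the Artistic Wood-\n"
    "work plant.\nArtistic Woodworking Ltd. declined comment."
)
hits = find_hits(sample)                      # both spellings are found
missed = find_hits("Artistic Woodw0rk")       # an OCR misread still slips through
odds = phrase_survival(0.99, "Artistic Woodwork")  # roughly 0.85
```

Even at a hypothetical 99% per-character accuracy, a sixteen-letter phrase comes through intact only about 85% of the time under this simple model, and no amount of query cleverness recovers a word the OCR has misread.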

Another issue lies in the proprietary nature of the Star and Globe databases: I have been trying to track down their technical support team to discuss a research project, to no avail. E-mails often bounce back from the addresses provided on their search portals, and they can be a bit impenetrable. This is understandable, in a way: unlike other national newspaper projects, they are run by private companies.

So what can we do?

Now, with a strike (as in my example above), one could pop the date ranges in, go through each newspaper throughout the period, and explore specific events. This would avoid the above problem. But studies that purport to trace social or cultural trends over a long period of time can fall into the habit of relying on these databases without critical reflection. That’s not to say that they should not use them – we can find most articles, especially by the postwar period with its attendant better image quality. Indexes are hardly perfect alternatives. History has always had an element of serendipity.

Indeed, we cannot and should not abandon our use of digitized online databases. Despite their faults, they allow us to cover large swaths of time and space on a realistic timeline, and are much quicker than using microfilm. They also open up new frontiers of large-scale data and textual processing, although the current user interface and databases are not terribly amenable to this form of work.

But we do need to be cognizant. Dissertations and articles that extensively rely on these databases need to be up-front about the issue and at least mention how they have dealt with or recognized the very real and concrete limitations inherent in this form. In my on-going survey of English-language dissertations and other historical work, I have found that while these databases appear to be having some impact on citation counts, few scholars note their database use. Doctoral supervisors, journal editors, bloggers, public historians, etc. need to realize how these databases are potentially shaping professional and amateur historical inquiry in Canada.

So next time you’re using the databases, think about what’s going on. Are you getting everything? Are you missing something? Should you do some digging around a hotspot of hits on a given date? In all cases, we should be more up-front about the tools we’re using and how they might be shaping our research.
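One way to act on the "hotspot" question is to count keyword hits per date and then browse the full issues around any cluster by hand. The sketch below is only an illustration: the function name, the thresholds, and the dates are hypothetical, and it assumes you have copied the hit dates out of a results list yourself, since neither database is assumed to offer an export.

```python
from collections import Counter
from datetime import date, timedelta


def issues_to_browse(hit_dates, min_hits=2, window_days=1):
    """Return the sorted dates worth reading in full.

    Any date with at least `min_hits` keyword hits is treated as a hotspot,
    and the issues `window_days` before and after it are added too, since
    related stories may have been missed by the OCR-based search.
    """
    counts = Counter(hit_dates)
    browse = set()
    for day, n in counts.items():
        if n >= min_hits:
            for offset in range(-window_days, window_days + 1):
                browse.add(day + timedelta(days=offset))
    return sorted(browse)


# Hypothetical hit dates, not taken from the actual database.
hits = [date(1973, 11, 13), date(1973, 11, 13), date(1973, 11, 20)]
reading_list = issues_to_browse(hits)
```

Here the two hits on the same day mark a hotspot, so that issue and its neighbours go on the reading list, while the isolated hit a week later does not.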

ian March 26, 2012 at 8:04 am

Great post! It’s really true that powerpoints at conferences are now often filled with yellow highlighted search terms (which is great because I can get a much better sense of how the presenter did their research). I’m looking forward to seeing the final results of your analysis of the use of the Globe and the Star by Canadian historians.

Patti Kmiec March 26, 2012 at 8:50 am

Very interesting post. I’d be interested to see how these conclusions hold up when looking at more local, and less familiar, newspapers. In my own dissertation research I am relying very heavily on local newspapers from early Ontario, most of which are available digitally through ourontario.ca. I am finding many more problems with those that are not digitized (pages out of order, partial indexes, pages missing, incomplete collections, etc.), yet I rarely see these issues mentioned by historians who have only used the traditional sources. I have also had the complete opposite experience with the project co-ordinators who run ourontario.ca (which is part of Knowledge Ontario). They are amazing and reply to my inquiries literally within seconds on Facebook. I think that for smaller, local newspapers, digitization is the best thing to happen, although you could be right about larger, more well-known ones. I have a feeling that newspapers that have hardly been touched by historians (like the 19th-century Provincial Freeman and Voice of the Fugitive), but are now available digitally, are about to get the attention that is long overdue.

The biggest problem I have come across using digital sources in general is that my advising professors still send me to traditional archives to find things that are available digitally.

Sean Kheraj March 26, 2012 at 9:50 am

This is an important post, Ian. Thanks for getting the conversation started on this challenge for historical newspaper research.

This, of course, is a general challenge for all digital archive research. I agree with Tim Hitchcock’s remarks about uncritical digital research as a form of roulette, a random scatter-shot approach to scholarship. However, working with digital archives and large newspaper collections has got me thinking about the extent to which traditional analog newspaper research was perhaps even more random and unsystematic. Yes, we once scrolled through microfilm reels for endless hours scanning pages with our eyes in search of mental keywords hoping to stumble upon a gem of an article here and there. We also scanned editorial columns and other regular sections of newspapers to chart change over time and to track changes in current events. Digitized newspapers and searchable text, though flawed, have provided historians with a very powerful tool to help overcome what was arguably a very fallible and capricious analog research methodology.

Because my research includes digital and microfilm materials, I find myself swapping back and forth between different methodologies of searching and scanning my sources. The key for researchers using digital archives, I think, is to continue to exercise critical reading and research skills that should be developed in graduate training. As I think you argue well in this post, those skills are changing and we need to think about how to modify and adapt traditional historical research skills to new (and sometimes staggering) digital technologies that have substantially transformed the research process in a very short period of time. We also need to think critically about these new methodologies and write about them openly in our work.

Ian Milligan March 26, 2012 at 1:38 pm

Thanks for your comments, Ian, Patti, and Sean. All very appreciated!

Sean and Patti, I think you both hit on the important issue that conventional historical research was/is hardly any more rigorous or systematic than newspaper-based sources. Completely correct, and I’m not suggesting that we go back to the old microfilm approach. One of the major differences that I am trying to highlight here, however, is that at least with traditional microfilm we kind of ‘knew’ our research processes – knew our skimming, knew the fallibility, etc. – whereas the proprietary “black box” of digital archives gives an illusion of order and comprehensiveness. If these considerations became de rigueur in methodologies sections, I think our scholarship could be tightened up.

As for smaller newspapers, these are great resources! If there are substantial problems with print archives and collections, these should be noted by scholars as this has the potential to skew results.

Your experiences with ourontario.ca and Knowledge Ontario also suggest the importance of publicly funded knowledge disseminators and collectors. I know they’ve had previous funding issues, which worries me. If only our entire digitized apparatus in Canada could be so responsive and friendly!

Alastair Dunning March 27, 2012 at 7:22 am

Some really good points here.

From experiences on digitising the British Library newspapers (http://newspapers11.bl.uk/blcs/) and now a European gateway project (http://www.libereurope.eu/news/a-gateway-to-european-newspapers), it is really difficult for digital libraries and publishers to represent the gaps, imperfections and confusions in a physical collection.

The metadata models that are created in digitisation projects are based around what exists in physical form and can be digitised.

But if we want to provide scholarly transparency we need to come up with metadata and interfaces that show what has not been digitised (i.e. because of missing pages, uncollected material, ripped or poor-quality pages, lack of funding, or curatorial choice).
