Text mining European newspapers

By Tessa Hauswedell

Text mining European newspapers: a brief introduction

In what ways do digital archives affect the way we read historical newspapers? How does the increasing use of computer-aided text mining change the way we think about these newspapers? And how does using multilingual archives allow us to retrace the manifold ways in which newspapers from different European countries influenced each other?

This project analyses digitised newspaper archives from the nineteenth and twentieth centuries from Germany, Great Britain, Belgium and the Netherlands in order to study cross-cultural influences between these countries. It seeks to investigate the ways in which certain European cultures functioned as a ‘reference’ or model for other European countries at certain times. Did Great Britain envy the fabled and sophisticated French culture or did it seek to emulate it? How important was the cultural influence of Germany for the Netherlands? And vice versa, what kinds of social and cultural impulses did the Netherlands bestow on its larger European neighbours? Which particular aspects of a certain culture served as positive or negative reference points? Which connotations, perceptions, imaginaries and memories did these European countries share, and where did they differ? Rather than looking at newspapers in their national context, we place them in a transnational, cross-cultural context to understand how exchanges between comparatively smaller (Belgium, Netherlands) and larger (Germany, United Kingdom) European countries develop and change over time.

The role of newspapers

Newspapers are one of the many kinds of document which we can study to gain insights into the events, mentalities, attitudes and self-portrayal of a specific society at a given time. They play a role in shaping perceptions through the daily reporting of everyday events and occurrences, and they present us with a great level of detail on current affairs, which we can analyse over a longer stretch of time. In order to work with newspaper corpora that stretch over decades, it becomes necessary to harness the power of digital tools to help us find relevant information. In this way, digital tools can help us to discern larger trends, which are not recognisable by focusing on just one edition or a single newspaper article.

But in practice, obtaining access to large corpora and comparing them across linguistic borders means relying on archives which are maintained principally by national institutions and libraries and which are subject to their procedures, guidelines and selection processes. Any archive is always the result of a complex selection process in which certain publications are included at the expense of others. Such processes involve economic and cost factors, but also ideological choices about what is deemed ‘worthy’ of representing cultural heritage and what is left out. Predictably, European digitised newspaper holdings vary strongly in terms of quality, representativeness, quantity, availability and accessibility. The European initiative Europeana is one exception to the rule with its aim to create a truly European and free digital repository with access to multilingual newspaper collections. That said, Europeana relies on the co-operation and willingness of individual countries to make their collections accessible, resulting in an uneven distribution of material.

And so, a lively debate is taking place about how digital archives will determine how we read and find information. While some foresee a true democratisation of access, a new era of super-abundant sources which promises unrivalled availability, others are wary of the increasing privatisation of hitherto public material and remain sceptical. The American historian Roy Rosenzweig has aptly summarised the advantages and dangers of conducting digital research. As promising developments he notes better ‘capacity, accessibility, flexibility, diversity, manipulability, interactivity and hypertextuality’. Amongst the dangers or ‘hazards’ are issues of ‘quality, durability, readability, passivity but also inaccessibility’.[1] Rosenzweig’s list shows that each potential advantage has its inverse: better accessibility can also turn into inaccessibility, and better ‘flexibility’ might raise questions of ‘durability’. The verdict on the implications of using digital archives is still out, but the strategic decisions on the infrastructure of these archives, which are being taken at this stage, will decide how accessible and user-friendly they will prove to be. A similarly lively debate is taking place regarding the ‘promises’ and ‘hazards’ of digital text mining.

Digital tools

Digital text mining can be broadly ‘defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of software analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns.’ This definition rightly places the emphasis on what is arguably the most powerful aspect of using text mining to analyse texts: the ability to recognise patterns, distributions and networks over long stretches of time, which would not be discernible otherwise.
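To make the idea of recognising a distribution over time more concrete, here is a minimal sketch in Python. It is not the project's own tooling: the tiny corpus, the function name relative_frequency_by_year and the search term are all invented for illustration, and a real archive would supply OCR text and metadata instead. The sketch counts how often a term occurs per 1,000 words in each year, so that years with more surviving text do not automatically appear to mention the term more often.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (year, article text) pairs.
# The sentences are invented purely for illustration.
ARTICLES = [
    (1870, "The German army and German music were discussed in London."),
    (1870, "Trade with Belgium increased."),
    (1900, "German industry is a model for our own, wrote the editor."),
    (1900, "German science, German schools and German trade dominate the debate."),
]


def relative_frequency_by_year(articles, term):
    """Return {year: occurrences of `term` per 1,000 tokens}, sorted by year."""
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # total number of tokens per year
    for year, text in articles:
        # Very simple tokenisation: lower-case runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    return {year: 1000 * hits[year] / totals[year] for year in sorted(totals)}


trend = relative_frequency_by_year(ARTICLES, "german")
for year, rate in trend.items():
    print(year, round(rate, 1))
```

Even in this toy form, the interpretative step remains with the reader: a rising rate says only that a word became more frequent, not whether it was used as a positive or a negative reference point.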

Critics of text mining take issue with its perceived lack of methodological rigour and maintain that it lacks a true method and is, instead, defined by the capabilities of digital tools. But this misses the point that using any tool still places the interpretative onus on the people who use it. While digital tools can draw attention to interlinkages in public debates and discern trends or patterns which would be difficult to detect by reading these sources manually, it is still up to the researcher to draw meaningful insights from them.

[1] Source: Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2006), p. 3.