Entity-centric historical text mining

By Mariona Coll Ardanuy

The volume of digitized sources such as newspapers, magazines, pamphlets, and other historical materials has been growing steadily in recent years. There is no need to emphasize the importance and impact of this trend in historical research. Nevertheless, many of the potential advantages that digitization has to offer have not yet been exploited. Since its origins, text mining has sought ways to improve the exploration of large collections of texts. We propose and adapt new and existing technologies from the natural language processing field in order to reveal hidden relations and connections in the data that could hardly be found by means of traditional approaches. In our project we explore sophisticated ways to approach the historical sources, focusing on two kinds of entities: places and people.

Place-centric text mining

Cultures cannot be described independently of the temporal and spatial framework in which they originated, developed, and blossomed. Cultures, understood as collections of symbols, language, traditions, art, values, beliefs, and norms, arise and live at a particular point in time and in a particular place. AsymEnc aims at tracing the changing influence between cultures in contact through time and space, focusing on the period between 1815 and 1992. Given the profound relation between places and cultures, we believe it is of utmost importance to be able to identify automatically and correctly which places are mentioned in the texts. In this manner, the historian could easily access all the documents in which a certain place is mentioned. The biggest challenge of this task is that placenames are very often ambiguous. Our goal is to pair each placename found in the texts with its unique set of geographical coordinates, so that we can then visualize them on a map. Below, we provide two examples which illustrate both the difficulty and the motivation of the task.

Example 1

At the mention of the word ‘London’, the capital of England comes first to almost everybody’s mind. At the mention of the word ‘Toronto’, the biggest city in Canada comes first to almost everybody’s mind. However, it is not always the case that placenames such as ‘London’ or ‘Toronto’ refer to their most probable interpretation. Let us consider the following news article:

Two CNR trains running between London and Toronto and passing through St. Mary’s at 8:05 a.m. and 8:20 p.m., and which did not stop at the depot here, have been advised to do so.[1]

It is evident from this news article that either ‘London’ cannot refer to the capital of England, or ‘Toronto’ cannot refer to the city in Canada, since the two cities are obviously not connected by train. There are several clues that indicate that the two mentioned places are in Ontario, Canada. One clue is the fact that this piece of news was published in a South Western Ontario newspaper. Another clue has to do with the relevance of the place: the more relevant it is, the more likely it is to make it into the media. Whereas the second biggest and most important ‘London’ is the one in Canada, with more than 350,000 inhabitants in 2011, the second biggest ‘Toronto’ is in New South Wales and has a population of around 5,500 inhabitants, which makes it a very poor candidate. Once we agree that ‘London’ and ‘Toronto’ correspond to cities in Ontario (Canada), coherence dictates that the highly ambiguous term ‘St. Mary’s’ should correspond to the small town in Ontario that happens to be on the train line between Toronto and London (see map in Figure 1).

Figure 1: Google Maps snapshot with the disambiguated places ‘London’, ‘Toronto’, and ‘St. Mary’s’ from the first example.

Example 2

Going back further in history, we might encounter texts which depict a different world from the one we know today. Some placenames might have become obsolete over the years, some places might have changed their relevance in the world, and some places might no longer exist at all.

Marshal Zhukov had reached the middle Oder at a number of places between a point 15 miles east of Kustrin and Glogau. He was within 9 miles of Frankfurt and had also advanced the north-west flank of his salient.[2]

In this example, a ‘Frankfurt’ is mentioned which does not refer to its most common referent. Whereas an inattentive reader might think of the bigger and more influential Frankfurt am Main, context again informs us that the text talks instead about Frankfurt an der Oder, a town located on the Oder river, about 30 kilometers away from another of the places mentioned in the text, Kustrin, and in the direction of the last mentioned place, Glogau. These three clues inform an alert reader that ‘Frankfurt’ in this text refers to the town on the German-Polish border rather than to Germany’s financial capital (see map in Figure 2). This text shows at the same time that placenames are subject to change throughout history. The Polish cities Głogów and Kostrzyn nad Odrą appear in the text in their then German forms: ‘Glogau’ and ‘Kustrin’.

Figure 2: Google Maps snapshot with the river Oder painted in blue and the disambiguated placenames ‘Frankfurt’, ‘Glogau’, and ‘Kustrin’.
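
The distance clue in this example can be checked with a few lines of code. The following is only a minimal sketch: it uses the standard haversine formula for great-circle distance, and the coordinates are approximate values rounded by hand for illustration, not data taken from the project.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate (latitude, longitude) pairs, rounded for illustration only.
KUSTRIN = (52.59, 14.67)         # today Kostrzyn nad Odrą
FRANKFURT_ODER = (52.34, 14.55)
FRANKFURT_MAIN = (50.11, 8.68)

dist_oder = haversine_km(*FRANKFURT_ODER, *KUSTRIN)
dist_main = haversine_km(*FRANKFURT_MAIN, *KUSTRIN)

print(f"Frankfurt an der Oder to Kustrin: {dist_oder:.0f} km")
print(f"Frankfurt am Main to Kustrin:     {dist_main:.0f} km")
```

The first distance comes out at a few tens of kilometers and the second at several hundred, which is the kind of contrast an alert reader exploits without computing anything.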

In natural language processing, the task of identifying placenames in text is called location recognition, and the task of recognizing the referent behind each place mention is called location disambiguation. We identify the locations in the texts by means of a combined approach using both statistics and heuristics. Then, once all placenames have been identified, we determine the referent behind each identified location and assign it its pair of geographical coordinates. We have already mentioned that the geographical coordinates of a location are its unique identifier. There are basically two kinds of external resources that can be used to obtain this information: gazetteers (geographical dictionaries that usually accompany maps) and encyclopedic databases. We chose to use Wikipedia, the most complete and up-to-date online encyclopedia, which offers geographical coordinates for the vast majority of entries and has a redirection system by which different names of the same entity are linked (e.g. ‘Petrograd’ is redirected to ‘Saint Petersburg’).
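
To make the role of redirects concrete, here is a small sketch of how a historical name could be normalized to a canonical title before its coordinates are looked up. The two tables are hand-made stand-ins, not Wikipedia data or a Wikipedia API; the function names `resolve` and `locate` and the rounded coordinates are assumptions made for this illustration.

```python
# Hypothetical, hand-made stand-ins for a redirect table and a coordinate table.
REDIRECTS = {
    "Petrograd": "Saint Petersburg",
    "Glogau": "Głogów",
    "Kustrin": "Kostrzyn nad Odrą",
}

# Approximate (latitude, longitude) pairs, rounded for illustration only.
COORDINATES = {
    "Saint Petersburg": (59.94, 30.31),
    "Głogów": (51.66, 16.08),
    "Kostrzyn nad Odrą": (52.59, 14.67),
}


def resolve(name, redirects=REDIRECTS, max_hops=5):
    """Follow redirects until a title with no further redirect is reached."""
    for _ in range(max_hops):
        if name not in redirects:
            return name
        name = redirects[name]
    return name  # stop after max_hops to avoid looping on circular redirects


def locate(name):
    """Return (canonical title, coordinates), or None as coordinates if unknown."""
    title = resolve(name)
    return title, COORDINATES.get(title)


for mention in ["Petrograd", "Glogau", "Kustrin"]:
    print(mention, "->", locate(mention))
```

Resolving the name first means that the old and the current form of a placename end up attached to the same pair of coordinates.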

In order to map each placename to its coordinates we designed an algorithm that emulates the decision-making process of humans. We asked ourselves: how do humans reach the conclusion that the ‘London’ in the first example corresponds to the city in Canada and not the one in England? What clues make an attentive reader realize that the ‘Frankfurt’ mentioned in the second example corresponds to the town bordering Poland, and not to the city in the state of Hessen? We learned from real-data examples how such a process could take place and designed an algorithm that reflects it. The possibilities of exploring historical texts with this technology are many. Assigning a pair of coordinates to each mentioned location allows us to place them on a map. To name one example, by mapping texts from Dutch newspapers from the 19th century we would be able to obtain a sort of cognitive map of the world known to the Dutch citizens of that time. Each point on the map would be a container of information relating to each place, such as the most relevant or common concepts discussed in the same articles where the placename appears, or the link to the article where it is contained so that the scholar would be able to refer to the original source. A timeline would enable us to see how the image of Europe has changed through time according to our texts.
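
As an illustration of how such clues might be combined, the sketch below resolves the three placenames of the first example jointly. It is not the algorithm used in the project: the scoring rule (log population as a stand-in for relevance, minus a penalty on the total pairwise distance as a stand-in for coherence), the weight `km_weight`, and the toy gazetteer with its rough population figures and coordinates are all assumptions made for this example.

```python
from itertools import combinations, product
from math import asin, cos, log10, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs."""
    (lat1, lon1), (lat2, lon2) = p, q
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


# Toy gazetteer: (referent, (lat, lon), rough population). Illustrative values only.
GAZETTEER = {
    "London": [("London, England", (51.51, -0.13), 8_000_000),
               ("London, Ontario", (42.98, -81.25), 350_000)],
    "Toronto": [("Toronto, Ontario", (43.65, -79.38), 2_600_000),
                ("Toronto, New South Wales", (-33.01, 151.59), 5_500)],
    "St. Mary's": [("St. Marys, Ontario", (43.26, -81.14), 6_500),
                   ("St Mary's, Isles of Scilly", (49.92, -6.30), 1_700)],
}


def disambiguate(mentions, gazetteer=GAZETTEER, km_weight=0.001):
    """Pick one referent per mention so that the combination scores best."""
    best_score, best_combo = float("-inf"), None
    # Try every combination of candidate referents (fine for a handful of mentions).
    for combo in product(*(gazetteer[m] for m in mentions)):
        relevance = sum(log10(pop) for _, _, pop in combo)
        spread = sum(haversine_km(a[1], b[1]) for a, b in combinations(combo, 2))
        score = relevance - km_weight * spread
        if score > best_score:
            best_score, best_combo = score, combo
    return {m: (name, coords) for m, (name, coords, _) in zip(mentions, best_combo)}


result = disambiguate(["London", "Toronto", "St. Mary's"])
for mention, (referent, coords) in result.items():
    print(f"{mention}: {referent} {coords}")
```

Taken alone, ‘London’ resolves to the English capital under this rule, because only relevance is available; it is the presence of the other two placenames that pulls all three towards Ontario. Exhaustive search over combinations is only practical for short texts, and a weight tuned on one example will not necessarily suit another.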

People-centric text mining

People are often at the core of the events reported in news articles. This is particularly interesting in historical research. As more and more historical newspapers are digitized, new possibilities arise to explore individual biography and societal history in a way that was unfeasible until recent years. Newspapers have traditionally been the platform by means of which ordinary people have become public figures, and therefore the digitization of newspapers can help us look at history in ways that would not have been possible before. For the first time, historians can have access to texts that were never before available other than through a visit to the respective archive; historians interested in a particular person in history may no longer need to close-read the original manuscripts in search of mentions of this person, but may be able to search a collection of digitized newspapers for the name of interest.

High-quality person mining, though, is at the moment difficult to achieve, due to the ambiguity which is very often found in person names. For example, a historian who is interested in the British First Lord of the Treasury between 1887 and 1891, i.e. William H. Smith, might find in the newspapers of the time occurrences of other William H. Smiths, such as a Governor of Alabama, a politician of the Canadian province of Nova Scotia, or maybe a William H. Smith who appeared only once in a newspaper for having won a quilt in a raffle. Person names are not uniformly ambiguous. Very uncommon names such as ‘Edward Schillebeeckx’ are virtually non-ambiguous, meaning that there exist very few people in the world sharing this name. Conversely, very common names such as ‘John Smith’ are shared by many people, leading to great confusion. We designed an algorithm that exploits the relation between how ambiguous a person name is and how many entities it is likely to refer to. The aim of the person name disambiguation task is to group all documents according to the entities this name refers to. Our algorithm decides whether two mentions of the same person name refer to the same entity based on two factors: the social relations found in the same articles and the similarity of the semantic contexts of each name in the articles.
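
The two factors can be sketched as a simple pairwise decision. This is an illustrative toy, not the project's algorithm: the choice of Jaccard overlap for the social relations, cosine similarity over word counts for the semantic context, the equal weighting, the `ambiguity` scale from 0 (very rare name) to 1 (very common name), and the threshold formula are all assumptions, and the three mentions are invented.

```python
from collections import Counter
from math import sqrt


def jaccard(a, b):
    """Overlap between two sets of co-occurring person names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def same_entity(m1, m2, ambiguity):
    """Decide whether two mentions of one name refer to the same person.

    ambiguity: hypothetical score in [0, 1]; common names demand more evidence.
    """
    social = jaccard(m1["people"], m2["people"])
    context = cosine(Counter(m1["words"]), Counter(m2["words"]))
    evidence = 0.5 * social + 0.5 * context
    threshold = 0.1 + 0.5 * ambiguity
    return evidence >= threshold


# Invented mentions of 'William H. Smith' with placeholder co-occurring names.
TREASURY_1 = {"people": {"Person A", "Person B"},
              "words": ["treasury", "parliament", "government", "commons"]}
TREASURY_2 = {"people": {"Person A"},
              "words": ["parliament", "government", "leader", "commons"]}
RAFFLE = {"people": set(),
          "words": ["raffle", "quilt", "church", "fair"]}

print(same_entity(TREASURY_1, TREASURY_2, ambiguity=0.5))
print(same_entity(TREASURY_1, RAFFLE, ambiguity=0.5))
```

Grouping all documents by entity would then amount to clustering the mentions on top of such pairwise decisions. Tying the threshold to the ambiguity of the name means that weak evidence can suffice to merge two mentions of a rare name, while the same evidence is not enough for a common one.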

In the same way that when the focus is on the mentioned places we represent the collection of news as a map, when the focus is on the actors of the news stories we represent the collection of news as a social network. We think of a social network as a kind of social map of a particular period. Similar to a map, a social network is an abstraction which sets its focus on the actors instead of on the places. Likewise, we also think of this structure as a container of information: each node and relation in the social network contains information such as a link to the original sources and the list of relevant concepts discussed in the articles in which each person name appears. This method is intended as an exploratory technique to approach historical texts, assisting both in confirming and generating hypotheses.

Figure 3: Snapshot of a social network created from a selection from the St. Vither Volkszeitung (1955-1964).

Figure 3 shows the social network created from a selection of articles from the German-language Belgian newspaper St. Vither Volkszeitung between the years 1955 and 1964 with a focus on two actors: ‘Jean Monnet’ and ‘Robert Schuman’. This kind of network illustrates the utility that our method can have in biographical research. We focus on specific people, their relations, and their trajectories as seen by the public eye. By creating successions of monthly or yearly networks, we can obtain a sort of schematic public biography of the people in whom we are interested.
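
A succession of yearly networks can be sketched with plain dictionaries. The example below is a simplification under stated assumptions: it links two people whenever their names appear in the same article, which is only one possible definition of a relation, it assumes the person names have already been recognized and disambiguated, and the three articles and the placeholder names ‘Person X’ and ‘Person Y’ are invented.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy input: (year, article id, person names found in the article).
ARTICLES = [
    (1955, "a1", ["Jean Monnet", "Robert Schuman"]),
    (1955, "a2", ["Jean Monnet", "Robert Schuman", "Person X"]),
    (1956, "a3", ["Jean Monnet", "Person Y"]),
]


def yearly_networks(articles):
    """Return {year: {(person, person): [article ids]}}.

    Each edge keeps the ids of the articles that support it, so that a
    relation in the network can be traced back to the original sources.
    """
    networks = defaultdict(lambda: defaultdict(list))
    for year, article_id, people in articles:
        # Sort names so that each pair is stored under a single, stable key.
        for pair in combinations(sorted(set(people)), 2):
            networks[year][pair].append(article_id)
    return networks


nets = yearly_networks(ARTICLES)
for year in sorted(nets):
    print(year)
    for (person_a, person_b), sources in sorted(nets[year].items()):
        print(f"  {person_a} <-> {person_b}: {len(sources)} article(s) {sources}")
```

Comparing the edges of one person from year to year gives the kind of schematic trajectory described above, and the stored article ids keep every relation linked to its sources.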

Figure 4 does not have a specific focus on one person, but shows a people-centric approach to a particular episode in history, in this case European integration. It is created from a selection of all articles from the Dutch newspaper De Telegraaf from the year 1953 which contained the word ‘Europe’. By exploring the resulting structure of nodes and relations we find some expected results which prove the validity of the method and, more interestingly, some unexpected results that are worth investigating, since they may question well-established hypotheses about this episode in history.

Figure 4: A part of the social network created from De Telegraaf (1953).


The main goal behind this work is to illustrate how techniques and methods from the natural language processing field can strengthen historical research. With the growing emergence of digital humanities, many voices have expressed the fear of a decay of traditional historical research and close reading. Through our work, we want to show how using information extraction and text mining methods on historical data can be positive for historical research and can reveal new perspectives on the text collections that would otherwise remain hidden.

[1]Source: Text from 1964, contained in SoutwesternOntario.ca on January 22nd, 2014.

[2]Source: The National Archives http://filestore.nationalarchives.gov.uk/pdfs/small/cab-65-49-wm-45-14-14.pdf