Week Beginning 19th October 2015

I was on holiday last week and I spent the whole of this week at the D2E conference (http://www.helsinki.fi/varieng/d2e/) in Helsinki. It was a really interesting and informative event, and I learned lots of new things about corpus linguistics, visualisation techniques, statistics and the general management and research potential of large English language datasets.

The conference began with an excellent plenary by Tony McEnery of Lancaster University. It was an excellent introduction to the conference and also to corpus linguistics in general, and it really set the tone for the whole event. He pointed out that collocation is psychologically meaningful – that words appearing near each other are significant and that there is proof that this is how the brain works. He also stated that corpora can provide evidence for historians – actual proof that words were used in the ways historians think they were. A very important point Tony made was that ‘distant reading’ via n-grams and other visualisations is not enough: ‘close reading’ of the actual data is also vital. Distant reading is a useful tool for getting ideas, but it is important to link back to the real data to check that theories actually pan out.

Tony also noted that a problem with working with language data is that language is not static, and this can make the management and use of the data tricky. He gave the example of variation in the use of the word ‘rogue’: its meaning has changed significantly from the 1700s to the present day, so simply searching for occurrences of the word will not give meaningful results. There are also problems with variant spellings and other issues such as the metaphorical use of words. Tony pointed out that using Google’s n-gram viewer to look at the usage of musical instruments over history is problematic, as ‘trumpet’ can be used metaphorically and when people use ‘drum up’ they’re not talking about the actual instrument. A good quote from Tony’s talk is “aggregated data hides meaningful variation”. He also pointed out that words sometimes change their usage – for example, ‘prostitute’ changed from a verb to a noun – and part-of-speech taggers can get completely thrown by this and tag things incorrectly. Word meanings also change over time, which can be a big problem for semantic taggers. Tony noted that the word ‘multiplex’ is tagged as being related to cinema, but in the 17th century the word had a numerical meaning. Religion was of huge importance in the 17th century, yet modern-day semantic taggers can give a complete miscategorisation of the 17th-century worldview. Changing usage of words over time can also be a consequence of censorship rather than an actual change in the language.

Other highlights of the conference for me included a session by James McCracken of the OED about integrating historical frequency data as a means of showing which new words are significant. James was looking at whether OED entry size could be used to show the significance of a word – generally, the larger the entry the more important the word. He also demonstrated some excellent animated treemaps of the HTOED, plotting change over time. The interface allowed James to click a ‘play’ button and for various parts of the map to grow and shrink to represent significance over time. It looked like these visualisations were created in d3 and I may have to see if I can create something similar for the Thesaurus data at Glasgow.
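I haven’t looked into how the OED treemaps were actually built (they appeared to be d3), but as a first experiment the basic idea – one treemap per time slice, which could then be animated – can be sketched in Python using matplotlib and the squarify package. This is just a minimal sketch with entirely invented category names and sizes, not the real HTOED data.

```python
# Minimal sketch: one treemap frame per time slice, saved as an image;
# the frames could later be stitched into an animation or driven by a
# 'play' button in a web interface. Category names and sizes are invented.
import matplotlib.pyplot as plt
import squarify  # pip install squarify (assumed available)

snapshots = {
    1700: {"Faith": 120, "War": 80, "Trade": 40, "Science": 15},
    1800: {"Faith": 100, "War": 90, "Trade": 70, "Science": 40},
    1900: {"Faith": 70, "War": 85, "Trade": 90, "Science": 95},
}

for year, cats in snapshots.items():
    fig, ax = plt.subplots(figsize=(6, 4))
    squarify.plot(sizes=list(cats.values()), label=list(cats.keys()), ax=ax)
    ax.set_title("Category sizes in {} (invented data)".format(year))
    ax.axis("off")
    fig.savefig("treemap_{}.png".format(year))
    plt.close(fig)
```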

There was also a very interesting session on the analysis of Twitter data, specifically geocoded Twtter data from the US and the UK (geocoded data is a subset of Twitter data as users have to have this feature turned on on their devices). The corpus created features 8.9 billion words and 7 million users, with every message time stamped and geocoded. This allows the researchers to map the frequency of words over time by US county – e.g. the use of the word ‘snow’. There was also some mention of YouTube subtitles, which are apparently being added automatically to videos. The Twitter visualisations used cumulative maps over time, with points being added to the map at each time period. I’ll have to look into the creation of these too.
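I don’t know how those maps were actually produced, but the cumulative idea itself is straightforward: at each time step, plot every point up to and including that step. A rough Python/matplotlib sketch (with made-up longitude/latitude points and no proper map projection or basemap) might look like this.

```python
# Rough sketch of a cumulative point map: one frame per time period,
# each frame keeping all points from earlier periods. The tweet points
# below are invented (year, longitude, latitude) tuples.
import matplotlib.pyplot as plt

tweets = [
    (2012, -0.13, 51.51),
    (2012, -2.58, 51.45),
    (2013, -2.25, 53.48),
    (2014, -4.25, 55.86),
]

fig, ax = plt.subplots()
for period in sorted(set(year for year, lon, lat in tweets)):
    # cumulative: keep everything up to and including this period
    pts = [(lon, lat) for year, lon, lat in tweets if year <= period]
    ax.clear()
    ax.scatter([p[0] for p in pts], [p[1] for p in pts], s=10)
    ax.set_title("Tweets up to {}".format(period))
    fig.savefig("cumulative_{}.png".format(period))
plt.close(fig)
```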

Another project looked at using text mining to build a corpus, and used a number of tools that I will have to investigate further, including an OCR normaliser (http://tedunderwood.com/2013/12/10/a-half-decent-ocr-normalizer-for-english-texts-after-1700/), a part-of-speech tagger called TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), a piece of topic modelling software called Mallet (http://mallet.cs.umass.edu/topics.php) and a clustering algorithm for historical texts (http://www.linguistics.ucsb.edu/faculty/stgries/research/2012_STG-MH_VarNeighbClustering_OxfHBHistEngl.pdf). The paper also included a discussion of how pattern mining could be used.
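Of these, Mallet is probably the one I’ll try first. It has a documented command-line interface (import-dir and train-topics), so a pipeline could drive it from Python along the following lines – the paths are placeholders, and I’m assuming the OCR-normalised texts have already been written out as plain-text files.

```python
# Sketch of driving Mallet's topic modelling from Python via its
# command-line interface. MALLET_BIN and the file paths are placeholders;
# the texts/ directory is assumed to hold plain-text files that have
# already been through OCR normalisation.
import subprocess

MALLET_BIN = "/path/to/mallet/bin/mallet"  # placeholder

# 1. Import a directory of plain-text files into Mallet's internal format
subprocess.check_call([
    MALLET_BIN, "import-dir",
    "--input", "texts/",
    "--output", "corpus.mallet",
    "--keep-sequence",
    "--remove-stopwords",
])

# 2. Train a topic model and write out the topic keys and per-document topics
subprocess.check_call([
    MALLET_BIN, "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
])
```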

Jonathan Hope of Strathclyde University gave an excellent demonstration and discussion of a web-based text tagger called Ubiquity (http://vep.cs.wisc.edu/ubiq/), which allows users to upload any texts and have them tagged, either based on the categories of the DocuScope dictionary (see http://www.cmu.edu/hss/english/research/docuscope.html), or by uploading your own set of rules. Jonathan showed how this could work using Hamlet, and the tagger very quickly produced a nicely tagged version of the text that allowed a user to very easily highlight different linguistic phenomena such as questions. Jonathan also showed how the results from multiple documents could then be plotted using a rather fancy 3D scatterplot diagram.

Another paper mentioned the English Short Title Catalogue (http://estc.bl.uk/F/?func=file&file_name=login-bl-estc), which I hadn’t heard of before and which is a searchable collection of over 480,000 items from 1473 to 1800. The GDELT project (http://www.gdeltproject.org/) was also mentioned – a realtime network diagram of global human society. Another paper discussed developing an interface for historical sociolinguistics, which the researchers had created using the Bootstrap framework (http://getbootstrap.com/), which I really must try out some time. Their interface allowed different search results to be loaded into different tabs for comparison, which was a nice feature, and also included a very nice animated plot of data over time, with play and pause buttons and facilities to drill down into the specific data.

Gerold Schneider gave a very good demonstration of tools and methods for processing and visualising large corpora using corpora made available through the Dependency Bank project (http://www.es.uzh.ch/Subsites/Projects/dbank.html). The talk discussed a machine learning tool called Lightside (http://ankara.lti.cs.cmu.edu/side/download.html) that can be used to do things such as telling from the text of political speeches whether the speaker is a Republican or a Democrat. He also showed results that had been created using googleVis – an interface for ‘R’ to allow the Google Charts API to be used (see https://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html). Gerold also demonstrated a nice visualisation that animated to show how points change over time, with play and pause buttons – this definitely seems to be a visualisation style that is popular at the moment and I’d like to implement such an interface too. Another talk mentioned WebCorp (http://www.webcorp.org.uk/live/), a 1.3 billion word corpus that can concordance the web in real-time, while another discussed MorphoQuantics (http://morphoquantics.co.uk/), a corpus of type and token frequencies of affixes in the spoken component of the BNC.
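I haven’t used Lightside itself, but the underlying task in that example – predicting party affiliation from speech text – is standard supervised text classification, and the general approach can be sketched in Python with scikit-learn (used here purely as a stand-in, not what Gerold used). The training speeches and labels below are invented.

```python
# Sketch of the general technique (supervised text classification) behind
# the Lightside example, using scikit-learn as a stand-in. The speeches
# and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

speeches = [
    "we must cut taxes and shrink the federal government",
    "we will invest in public healthcare for working families",
]
parties = ["Republican", "Democrat"]

# Bag-of-words features weighted by tf-idf, fed into a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(speeches, parties)

print(model.predict(["lower taxes will set the economy free"]))
```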

Jane Winters of UCL gave an excellent talk about dealing with big humanities data, stating that the big problem with a lot of humanities data is not necessarily its scale but its ‘messiness’: very often it is not consistent, and this can be for a variety of reasons, such as differences in print quality over time. Jane talked about Connected Histories (http://www.connectedhistories.org/), which is a federated search of many different online historical resources, allowing for searches of people, places and keywords over time. She also mentioned Neo4j (http://neo4j.com/), a graph database and visualisation library that one of the projects was using. An example of the sorts of visualisations it can create can be found on this page: http://graphgist.neo4j.com/?_ga=1.193256747.785518942.1445934388#!/gists/a7ad4d70f6f993bb864f38f4825d1989. Another project Jane has been involved with is ‘Digging into Parliamentary Data’ (http://dilipad.history.ac.uk/). Fraser and I are already using some of the outputs of this project for our Hansard visualisations – for example the data the project generated that assigned a political party to each speaker in Hansard – and it was very interesting to hear more about the project. Jane also mentioned a search facility that was developed from the project that allows users to search the parliamentary proceedings of the Netherlands, the UK and Canada, which can be accessed here: http://search.politicalmashup.nl. It’s a very useful resource.

Another speaker discussed using Wmatrix (http://ucrel.lancs.ac.uk/wmatrix/), a web interface for the USAS and CLAWS taggers, and the ConcGram concordancer, which appears to only be available on CD-ROM (https://benjamins.com/#catalog/software/cls.1/main). Mark Davies gave a wonderful talk about the various corpora that he hosts at BYU (see http://corpus.byu.edu/), including the 1.9 billion word Wikipedia corpus. He also explained how users can create and share their own corpus using the tools he has created, focussing on a specific theme or genre, sharing the results and comparing the use of language within this corpus with the resource as a whole.

I also attended an interesting talk on Middle English alchemical texts, which mentioned the digital editions of Isaac Newton’s alchemical works that are available at Indiana University (http://webapp1.dlib.indiana.edu/newton/). The final plenary was given by Päivi Pahta and looked at multilingualism in historical English texts. She mentioned a tool called Multilingualiser, which can find and tag foreign words in historical texts. The tool doesn’t appear to be available online yet but is being developed at Helsinki. The tool was used to look for occurrences of two or more foreign words together, which were then tagged for their language. In the late modern period Latin and French were the most used, with an increase in German towards the end due to its importance in science. The project also stored a lot of metadata, such as genre and intended readership, and found that the types of texts with the most foreign words were letters, academic writing and travelogues. Three types of usage were found – conventionalised expressions (e.g. ‘terra firma’, ‘carte blanche’), prefabricated expressions (e.g. proverbs and quotations) and free expressions (longer passages). The results of some of the queries of foreign words were plotted on a ternary plot diagram, which was a nice way to visualise the data.

Another speaker discussed the OE corpus, pointing out that it consists of 3.5 million words, represents all the evidence we have for English between 600 and 1150, and is all available in digital form (for a price). However, there is no metadata, so nothing on genre or anything like that. The OE part of the Helsinki Corpus does include fairly rich metadata, but represents only 15% of the whole. The speaker pointed out that metadata is really important to allow grouping by genre and to enable the producer and receiver to be considered.

Another paper discussed using Twitter as a corpus, this time using the Twitter streaming API, which makes available about 1% of the total dataset. However, only 1.6% of Tweets have geocoding data, and Twitter only started tagging Tweets for language in 2013. The speaker was interested in English Tweets in Finland and used the Twitter language tag and geocoding data to define a dataset. He also used a Python-based language identification system (https://github.com/saffsd/langid.py) and a part-of-speech tagger that had been developed specifically for Twitter data (http://www.ark.cs.cmu.edu/TweetNLP/) to further classify the data.
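For reference, langid.py has a very simple Python API – classify() returns a language code and a score, and set_languages() can restrict the candidate set. The tweet text below is just a made-up example.

```python
# Sketch of basic langid.py usage: classify() returns a (language, score)
# pair, and set_languages() narrows the candidate languages. The example
# tweet text is made up.
import langid  # pip install langid

lang, score = langid.classify("Tämä on suomenkielinen twiitti")
print(lang, score)  # expected to print 'fi' plus a score

# Restrict classification to, say, Finnish, Swedish and English
langid.set_languages(["fi", "sv", "en"])
print(langid.classify("This tweet is written in English"))
```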

The conference concluded with a wonderfully crowd-pleasing talk by Marc Alexander, focussed primarily on the Historical Thesaurus, but also discussing Mapping Metaphor, the Samuels project and the Linguistic DNA project. Marc demonstrated some very nice visualisations showing how the contents of the HT categories have changed over time. These were heat map based visualisations that animated to show which parts of English expanded or contracted the most at specific points in time. I would really like to develop some interactive versions of these visualisations and include them on the HT website and I’ll need to speak to Marc about whether we can do this now, or possibly get some funding to further develop the online Thesaurus resource to incorporate such interfaces.
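I don’t have Marc’s code or data, but the general form of those heat maps – categories along one axis, time periods along the other, colour showing growth or shrinkage – is easy to prototype in Python with matplotlib. The numbers below are entirely invented; an interactive web version for the HT site would presumably need something like d3 instead.

```python
# Sketch of a category-by-period heat map of growth/shrinkage, with
# invented percentages standing in for the real Historical Thesaurus data.
import matplotlib.pyplot as plt
import numpy as np

categories = ["Faith", "War", "Trade", "Science"]   # invented labels
periods = ["1500s", "1600s", "1700s", "1800s"]
# rows = categories, columns = periods; values = % change in category size
change = np.array([
    [ 10,   5,  -5, -15],
    [  5,  15,   0,  -5],
    [  0,  10,  20,  15],
    [  5,  10,  25,  40],
])

fig, ax = plt.subplots()
im = ax.imshow(change, cmap="RdBu_r", aspect="auto")
ax.set_xticks(range(len(periods)))
ax.set_xticklabels(periods)
ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories)
fig.colorbar(im, ax=ax, label="% change (invented)")
ax.set_title("Change in category size over time (illustrative only)")
plt.show()
```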

All in all it was a really great conference and I feel that I have learned a lot that will be put to good use in current and future projects. It was also good to meet with other researchers at the conference, particularly the Varieng people as I will be working towards hosting the Helsinki Corpus on a server here at Glasgow in the coming weeks. Marc, Fraser and I had a very useful working lunch one day with Terttu Nevalainen and Matti Rissanen where we discussed the corpus and how Glasgow might host it. I’m hoping that we will be able to launch this before Christmas.