I was on holiday last week and I spent the whole of this week at the D2E conference (http://www.helsinki.fi/varieng/d2e/) in Helsinki. It was a really interesting and informative event, and I learned lots of new things about corpus linguistics, visualisation techniques, statistics and the general management and research potential of large English language datasets. The conference began with an excellent plenary by Tony McEnery of Lancaster University, which served as an introduction both to the conference and to corpus linguistics in general and really set the tone for the whole event. He pointed out that collocation is psychologically meaningful – that words appearing near each other are significant, and that there is evidence this is how the brain works. He also stated that corpora can provide evidence for historians – actual proof that words were used as historians think they were. A very important point Tony made was that ‘distant reading’ via n-grams and other visualisations is not enough: ‘close reading’ of the actual data is also vital. Distant reading is a useful tool for generating ideas, but it is important to link back to the real data to check that theories actually pan out. Tony also noted that a problem with working with language data is that language is not a static data point, and this can make the management and use of the data tricky. He gave the example of variation in the use of the word ‘rogue’: its meaning has changed significantly from the 1700s to the present day, so simply searching for occurrences of the word will not give meaningful results. There are also problems with variant spellings and other issues such as the metaphorical use of words. Tony pointed out that using Google’s n-gram viewer to look at the usage of musical instruments over history is problematic, as ‘trumpet’ can be used metaphorically and when people use ‘drum up’ they’re not talking about the actual instrument.
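To make the collocation idea concrete, here's a quick Python sketch of the basic operation involved: counting the words that appear within a few words of a node word. The sample sentence and window size are my own invented examples, not anything from Tony's talk.

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words co-occurring with a node word within +/- window tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# Invented example sentence, tokenised naively on whitespace.
tokens = "the rogue elephant and the rogue trader met the honest trader".split()
print(collocates(tokens, "rogue").most_common(2))
```

Real collocation tools would add significance measures (MI, log-likelihood) on top of the raw counts, but the windowed counting step is the core of it.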
A good quote from Tony’s talk is “aggregated data hides meaningful variation”. He also pointed out that sometimes words change their usage, for example ‘prostitute’ changed from a verb to a noun. Part of Speech taggers can get completely thrown by this and tag things incorrectly. Word meanings also change over time and this can be a big problem for semantic taggers. Tony noted that the word ‘multiplex’ is tagged as being related to cinema but in the 17th century this word had a numerical meaning. Religion in the 17th century was of absolutely huge importance but modern day semantic taggers can give a complete miscategorisation of the 17th century worldview. Changing usage of words over time can also be a consequence of censorship rather than an actual change in language.
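The ‘multiplex’ example suggests why period-aware tagging matters. Here's a toy Python sketch of a date-aware semantic lookup; the category labels and cut-off years are entirely made up for illustration, and real semantic taggers such as USAS are far more sophisticated than this.

```python
# Hypothetical period-aware semantic lookup: the same word maps to a
# different semantic tag depending on the date of the text being tagged.
# Category names and start years below are invented for illustration.
SEM_TAGS = {
    "multiplex": [(1600, "NUMBER:manifold"), (1980, "ENTERTAINMENT:cinema")],
}

def sem_tag(word, year):
    """Return the latest tag whose start year is <= the text's year."""
    tag = "UNKNOWN"
    for start, label in SEM_TAGS.get(word, []):
        if year >= start:
            tag = label
    return tag

print(sem_tag("multiplex", 1650))  # numerical sense for a 17th-century text
print(sem_tag("multiplex", 2015))  # cinema sense for a modern text
```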
Other highlights of the conference for me included a session by James McCracken of the OED about integrating historical frequency data as a means of showing which new words are significant. James was looking at whether the OED entry size could be used to show the significance of a word – generally, the larger the entry the more important the word. He also demonstrated some excellent animated treemaps of the HTOED, plotting change over time. The interface allowed James to click a ‘play’ button and for various parts of the map to grow and shrink to represent significance over time. It looked like these visualisations were created in d3 and I may have to see if I can create something similar for the Thesaurus data at Glasgow.
There was also a very interesting session on the analysis of Twitter data, specifically geocoded Twitter data from the US and the UK (geocoded data is a subset of Twitter data, as users have to have this feature enabled on their devices). The corpus created features 8.9 billion words and 7 million users, with every message time stamped and geocoded. This allows the researchers to map the frequency of words over time by US county – e.g. the use of the word ‘snow’. There was also some mention of YouTube subtitles, which are apparently being added automatically to videos. The Twitter visualisations used cumulative maps over time, with points being added to the map at each time period. I’ll have to look into the creation of these too.
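The cumulative maps presumably work by accumulating counts per region at each time step. Here's a rough Python sketch of that accumulation step, using invented record fields and data rather than anything from the actual corpus:

```python
from collections import defaultdict

# Each record: (year, county, text) -- a stand-in for the time-stamped,
# geocoded tweets described above (field layout and data are invented).
tweets = [
    (2012, "Cook", "heavy snow today"),
    (2012, "Erie", "no snow here"),
    (2013, "Cook", "more snow again"),
    (2013, "Cook", "sunny day"),
]

def cumulative_counts(records, word):
    """Cumulative per-county counts of a word, period by period."""
    running = defaultdict(int)
    frames = []
    for year in sorted({r[0] for r in records}):
        for y, county, text in records:
            if y == year and word in text.split():
                running[county] += 1
        frames.append((year, dict(running)))
    return frames

print(cumulative_counts(tweets, "snow"))
```

Each ‘frame’ is then what gets drawn on the map for that time period, with points never disappearing once added.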
Another project was looking at using text mining to build a corpus, and it used a number of tools that I will have to investigate further, including an OCR normaliser (http://tedunderwood.com/2013/12/10/a-half-decent-ocr-normalizer-for-english-texts-after-1700/), a part of speech tagger called TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), a piece of Topic Modelling software called Mallet (http://mallet.cs.umass.edu/topics.php) and a clustering algorithm for historical texts (http://www.linguistics.ucsb.edu/faculty/stgries/research/2012_STG-MH_VarNeighbClustering_OxfHBHistEngl.pdf). There was also a discussion of how pattern mining could be used.
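I haven't tried the OCR normaliser yet, but the general idea of that kind of tool can be sketched in a few lines of Python; the substitution rules below are simplified examples of my own, not the linked tool's actual rules:

```python
import re

# Not the tool linked above -- just an illustration of the kind of
# clean-up an OCR normaliser for older English texts might perform.
RULES = [
    ("ſ", "s"),    # long s, often misrecognised by OCR
    ("vv", "w"),   # double-v printed for w in early texts
]

def normalise(text):
    for old, new in RULES:
        text = text.replace(old, new)
    # Rejoin words hyphenated across line breaks: "occa-\nsion" -> "occasion"
    text = re.sub(r"-\n\s*", "", text)
    return text

print(normalise("the ſad occa-\nsion of the vvar"))
```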
Jonathan Hope of Strathclyde University gave an excellent demonstration and discussion of a web-based text tagger called Ubiquity (http://vep.cs.wisc.edu/ubiq/), which allows users to upload any texts and have them tagged, either based on the categories of the DocuScope dictionary (see http://www.cmu.edu/hss/english/research/docuscope.html) or by uploading your own set of rules. Jonathan showed how this could work using Hamlet, and the tagger very quickly produced a nicely tagged version of the text that allowed a user to very easily highlight different linguistic phenomena such as questions. Jonathan also showed how the results from multiple documents could then be plotted using a rather fancy 3D scatterplot diagram.
Another paper mentioned the English Short Title Catalogue (http://estc.bl.uk/F/?func=file&file_name=login-bl-estc), which I hadn’t heard of before: a searchable collection of over 480,000 items from 1473 to 1800. The GDELT project (http://www.gdeltproject.org/) was also mentioned – a realtime network diagram of global human society. Another paper discussed developing an interface for historical sociolinguistics, which the researchers had created using the Bootstrap framework (http://getbootstrap.com/), which I really must try out some time. Their interface allowed different search results to be loaded into different tabs for comparison, which was a nice feature, and also included a very nice animated plot of data over time, with play and pause buttons and facilities to drill down into the specific data.
Gerold Schneider gave a very good demonstration of tools and methods for processing and visualising large corpora using corpora made available through the Dependency Bank project (http://www.es.uzh.ch/Subsites/Projects/dbank.html). The talk discussed a machine learning tool called Lightside (http://ankara.lti.cs.cmu.edu/side/download.html) that can be used to do things such as telling from the text of political speeches whether the speaker is a Republican or a Democrat. He also showed results that had been created using GoogleVis – an interface for ‘R’ to allow the Google Charts API to be used (see https://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html). Gerold also demonstrated a nice visualisation that animated to show how points change over time with play and pause buttons – this definitely seems to be a visualisation style that is popular at the moment and I’d like to implement such an interface too. Another talk mentioned WebCorp (http://www.webcorp.org.uk/live/), a 1.3 billion word corpus that can concordance the web in real-time, while another talk discussed MorphoQuantics (http://morphoquantics.co.uk/), a corpus of type and token frequencies of affixes in the spoken component of the BNC.
Jane Winters of UCL gave an excellent talk about dealing with big humanities data, stating that the big problem with a lot of humanities data is not necessarily its scale but its ‘messiness’: very often it is not consistent, and this can be for a variety of reasons, such as the differences in print quality over time. Jane talked about ConnectedHistories (http://www.connectedhistories.org/), which is a federated search of many different online historical resources, allowing for searches of people, places and keywords over time. She also mentioned Neo4J (http://neo4j.com/), a graph database and visualisation library that one of the projects was using. An example of the sorts of visualisations it can create can be found on this page: http://graphgist.neo4j.com/#!/gists/a7ad4d70f6f993bb864f38f4825d1989. Another project Jane has been involved with is ‘Digging into Parliamentary Data’ (http://dilipad.history.ac.uk/). Fraser and I are already using some of the outputs of this project for our Hansard visualisations, for example the data the project generated that assigned a political party to each speaker in Hansard, and it was very interesting to hear more about the project. Jane also mentioned a search facility that was developed from the project that allows users to search the parliamentary proceedings of the Netherlands, UK and Canada and can be accessed here: http://search.politicalmashup.nl. It’s a very useful resource.
Another speaker discussed using Wmatrix (http://ucrel.lancs.ac.uk/wmatrix/), a web interface for the USAS and CLAWS taggers, and the ConcGram concordancer, which appears to only be available on CD-Rom (https://benjamins.com/#catalog/software/cls.1/main). Mark Davies gave a wonderful talk about the various corpora that he hosts at BYU (see http://corpus.byu.edu/), including the 1.9 billion word Wikipedia corpus. He also explained how users can create and share their own corpus using the tools he has created, focussing on a specific theme or genre, sharing the results and comparing the use of language within this corpus with the resource as a whole.
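Comparisons of a subcorpus against a whole resource, of the sort Wmatrix supports, are typically based on Dunning's log-likelihood statistic for keyness. Here's a minimal Python version of that calculation; the frequency figures in the example are invented:

```python
import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning's log-likelihood for comparing a word's frequency in a
    subcorpus (a) against a reference corpus (b), the statistic commonly
    used for keyness in corpus linguistics."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# A word occurring 120 times in a 50,000-word subcorpus vs 300 times in a
# 2,000,000-word reference corpus (illustrative numbers only).
print(log_likelihood(120, 50_000, 300, 2_000_000))
```

A value above roughly 15.13 is significant at p < 0.0001, so a word with a score in the hundreds is strongly ‘key’ in the subcorpus.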
I also attended an interesting talk on Middle English alchemical texts, which mentioned the digital editions of Isaac Newton’s alchemical works that are available at Indiana University (http://webapp1.dlib.indiana.edu/newton/). The final plenary was given by Päivi Pahta and looked at multilingualism in historical English texts. She mentioned a tool called Multilingualiser, which can find and tag foreign words in historical texts. The tool doesn’t appear to be available online yet but is being developed at Helsinki. The tool was used to look for occurrences of two or more foreign words together and these were then tagged for their language. In the late modern period Latin and French were the most used, with an increase in German towards the end due to its importance in science. The project also stored a lot of metadata, such as genre and intended readership and found the types of texts with most foreign words were letters, academic writing and travelogues. Three types of usage were found – conventionalised expressions (e.g. ‘terra firma’, ‘carte blanche’), prefabricated expressions (e.g. proverbs and quotations) and free expressions (longer passages). The results of some of the queries of foreign words were plotted on a ternary plot diagram, which was a nice way to visualise the data.
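The ‘two or more foreign words together’ heuristic is simple enough to sketch. Here's a Python illustration that scans a language-tagged token stream for such runs; the tokens and tags below are invented examples, not output from the Multilingualiser:

```python
def foreign_runs(tagged, min_len=2):
    """Find runs of two or more consecutive non-English tokens in a
    stream of (token, language_tag) pairs."""
    runs, current = [], []
    for token, lang in tagged + [("", "en")]:  # sentinel flushes last run
        if lang != "en":
            current.append((token, lang))
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = []
    return runs

# Invented example: 'la' marks Latin tokens in an otherwise English text.
tokens = [("back", "en"), ("on", "en"), ("terra", "la"), ("firma", "la"),
          ("at", "en"), ("last", "en")]
print(foreign_runs(tokens))
```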
Another speaker discussed the OE corpus, pointing out that it consists of 3.5 million words and represents all the evidence we have for English between 600 and 1150, and that it is all available in digital form (for a price). However, there is no metadata, so nothing on genre or anything like that. The OE part of the Helsinki Corpus does include fairly rich metadata, but represents only 15% of the whole. The speaker pointed out that metadata is really important to allow grouping by genre and to enable the producer and receiver to be considered.
Another paper discussed using Twitter as a corpus, this time using the Twitter streaming API, which makes available about 1% of the total dataset. However, only 1.6% of Tweets have geocoding data and Twitter only started tagging Tweets for language in 2013. The speaker was interested in English Tweets in Finland and used the Twitter language tag and geocoding data to define a dataset. He also used a Python based language identification system (https://github.com/saffsd/langid.py) and a part of speech tagger that had been developed specifically for Twitter data (http://www.ark.cs.cmu.edu/TweetNLP/) to further classify the data.
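Defining the dataset from the language tag and geocoding data amounts to a simple filter. Here's a Python sketch using a hypothetical, much-simplified record layout (the real streaming API objects carry many more fields than this):

```python
# Hypothetical, simplified tweet records; the real API objects are far
# richer, but 'lang' and 'coordinates' fields do exist in them.
tweets = [
    {"text": "Moi kaikki!", "lang": "fi", "coordinates": (60.17, 24.94)},
    {"text": "Hello Helsinki", "lang": "en", "coordinates": (60.17, 24.94)},
    {"text": "Hello world", "lang": "en", "coordinates": None},
]

def english_geocoded(records):
    """Keep only tweets tagged as English that also carry geocoding data."""
    return [t for t in records
            if t["lang"] == "en" and t["coordinates"] is not None]

print([t["text"] for t in english_geocoded(tweets)])
```

The resulting subset could then be passed to langid.py and the Twitter-specific POS tagger for further classification, as the speaker described.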
The conference concluded with a wonderfully crowd-pleasing talk by Marc Alexander, focussed primarily on the Historical Thesaurus, but also discussing Mapping Metaphor, the Samuels project and the Linguistic DNA project. Marc demonstrated some very nice visualisations showing how the contents of the HT categories have changed over time. These were heat map based visualisations that animated to show which parts of English expanded or contracted the most at specific points in time. I would really like to develop some interactive versions of these visualisations and include them on the HT website and I’ll need to speak to Marc about whether we can do this now, or possibly get some funding to further develop the online Thesaurus resource to incorporate such interfaces.
All in all it was a really great conference and I feel that I have learned a lot that will be put to good use in current and future projects. It was also good to meet with other researchers at the conference, particularly the Varieng people as I will be working towards hosting the Helsinki Corpus on a server here at Glasgow in the coming weeks. Marc, Fraser and I had a very useful working lunch one day with Terttu Nevalainen and Matti Rissanen where we discussed the corpus and how Glasgow might host it. I’m hoping that we will be able to launch this before Christmas.
I worked on quite a number of different projects this week. My first task was to set up a discussion forum for Sean Adams’ Academic Publishing conference website. The website is another WordPress powered site and I hadn’t worked with any forum plugins before, so it was interesting to learn about this. I settled on the widely adopted ‘bbpress’ plugin, which turned out to be very straightforward to set up and integrates nicely with WordPress. I had to tweak the University theme I’d created a little so that the various sections displayed properly, but after that all appeared to be working well. I also spent some time continuing to contribute to the new Burns bid for Gerry Carruthers. I’d received some feedback on my first version of the Technical Plan and based on this and some updated bid documentation I created a second version. I also participated in some email discussions about other parts of the bid too. It seems to be shaping up very well. Ann Fergusson from SND contacted me this week as someone had spotted a missing section of text in one of the explanatory pages. I swiftly integrated the missing text and all is now well again.
On Tuesday I had a meeting with Susan and Magda about the Scots Thesaurus. We went through some of the outstanding tasks and figured out how and when these would be implemented. The biggest one is the creation of a search variants table, which will allow any number of different spellings to be associated with a lexeme, enabling it to be found by the search option. However, Magda is going to rework a lot of the lexemes over the coming weeks so I’m going to hold off on implementing this feature until this work has been completed.
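The variants table idea can be sketched quite simply. Here's a Python illustration, with made-up Scots spellings and an illustrative structure rather than the real database schema:

```python
# Illustrative data only: a lexeme table plus a separate variants table
# mapping any number of spellings to a lexeme id, so a search on any
# variant spelling finds the entry.
lexemes = {1: "bairn", 2: "kirk"}
search_variants = [
    (1, "bairn"), (1, "barne"), (1, "bern"),
    (2, "kirk"), (2, "kyrk"),
]

def search(term):
    """Look a search term up via the variants table."""
    ids = {lex_id for lex_id, variant in search_variants if variant == term}
    return sorted(lexemes[i] for i in ids)

print(search("barne"))  # finds 'bairn' via a variant spelling
```

In the database this would be a one-to-many table joined to the lexeme table, but the lookup logic is the same.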
I also had a Mapping Metaphor task to do this week: updating the database with new data. Wendy has been continuing to work with the data, adding in directionality, dates and sample lexemes and Ellen sent me a new batch. It has been a while since I’d last uploaded new data and it took me a while to remember how my upload script worked, but after I’d figured that out everything went smoothly. We now have information about 16,378 metaphorical connections in the system and 12,845 sample lexemes linked into the Historical Thesaurus.
I’m going to be on holiday next week and the following week I’m going to be at a conference so there will be no more from me until after I return.
This week was a return to something like normality after the somewhat hectic time I had in the run-up to the launch of the Scots Thesaurus website last week. I spent a bit of further time on the Scots Thesaurus project, making some tweaks to things that were noticed last week and adding in some functionality that I didn’t have time to implement before the launch. This included differentiating between regular SND entries and supplemental entries in the ‘source’ links and updating the advanced search functionality to enable users to limit their search by source. I also spent the best part of a day working on the Technical Plan for the Burns people, submitting a first draft and a long list of questions to them on Monday. Gerry and Pauline got back to me with some replies by the end of the week and I’ll be writing a second version of the plan next week.
On Friday we had a team meeting for the Metaphor in the Curriculum project. We spent a couple of hours going over the intended outputs of the project and getting some more concrete ideas about how they might be structured and interconnected, and also about timescales for development. It’s looking like I will be creating some mockups of possible exercise interfaces in early November, based on content that Ellen is going to send to me this month. I will then start to develop the app and the website in December, with testing and refinement in January or thereabouts.
I also spent some time this week working on the Medical Humanities Network website for Megan Coyer. I have now completed the keywords page, the ‘add and edit keywords’ facilities and I’ve added in options to add and edit organisations and units. I think that means all the development work is now complete! I’ll still need to add in any site text when this has been prepared and I’ll need to remove the ‘log in’ pop-up when the site is ready to go live. Other than that my development work on this project is now complete.
Continuing on a Medical Humanities theme, I spent a few hours this week working on some of the front end features for the SciFiMedHums website, specifically features that will allow users to browse the bibliographical items by things like years and themes. There’s still a lot to implement but it’s coming along quite nicely. I also helped Alison Wiggins out with a new website she’s wanting to set up. It’s another WordPress based site and the bare-bones site is now up and running and ready for her to work with when she has the time available.
On Friday afternoon I received my new desktop PC for my office and I spent quite a bit of the afternoon getting it set up, installing software, copying files across from my old PC and things like that. It’s going to be so good to have a PC that doesn’t crash if you tell it to open Excel in the afternoons!