Week Beginning 26th November 2018

I continued to work on the outstanding ‘stories’ for the Romantic National Song Network this week, completing work on another story using the storymap.js library.  I have now completed seven of these stories, which is more than half of the total the project intends to make.

On Monday I met with E Jamieson, the new RA on the SCOSYA project, to discuss the maps we are going to make available to the public.  We discussed various ways in which the point-based data might be extrapolated, such as heat maps and Voronoi diagrams.  I found a nice example of a Leaflet.js / D3.js based Voronoi diagram that I think could work very well for the project (see https://chriszetter.com/voronoi-map/examples/uk-supermarkets/) so I might start to investigate using such an approach.  I think we’d want to be able to colour-code the cells, and other D3.js examples of Voronoi diagrams suggest that this is possible (see this one: ).  We also discussed how the more general public views of the data (as opposed to the expert view) might work.  The project team like the interface offered by this site: https://ygdp.yale.edu/phenomena/done-my-homework, although we want something that presents more of the explanatory information (including maybe videos) via the map interface itself.  It looks like the storymap.js library (https://storymap.knightlab.com/) I’m using for RNSN might actually work very well for this.  For RNSN I’m using the library with images rather than maps, but it was primarily designed for use with maps, and could hopefully be adapted to work with a map showing data points or even Voronoi layers.

I spent a further couple of days this week working on the HT / OED data linking task.  This included reading through and giving feedback on Fraser’s abstract for DH2019 and updating the v1 / v2 comparison script to add in an additional column to show whether the v1 match was handled automatically or manually.  I also created a new script to look at the siblings of categories that contain monosemous forms to see whether any of these might have matches at the same level.  This script takes all of the monosemous matches as listed in the monosemous QA script and for each OED and HT category finds their unmatched siblings that don’t otherwise also appear in the list.  The script then iterates through the OED siblings and for each of these compares the contents to the contents of each of the HT siblings.  If there is a match (matches for this script being anything that’s green, lime green, yellow or orange) the row is displayed on screen.  Where there are multiple monosemous categories at the same level the siblings will be analysed for each of the categories, so there is some duplication.  E.g. the first monosemous link is ‘OED category 2797|03 (n) deep place or part matches HT category 1017|03 (n) deep place/part’ and there are two unmatched OED siblings (‘shallow place’ and ‘accumulation of water behind barrier’), so these are analysed.  But the next monosemous category (OED category 2803|07 (n) bed of matches HT category 1024|07 (n) bed of) is at the same level, so the two siblings are analysed again.  This happens quite a lot, but even so there are still some matches that this script finds that wouldn’t otherwise have been found due to changes in category number.  I’ve made a count of the total unique matches (all colours) and it’s 162.  I fear we are getting to the point where writing scripts to identify matches takes longer than identifying the matches manually would, though: it took several hours to write this script for 162 potential matches.
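
As a rough illustration of the sibling comparison described above, here is a minimal Python sketch.  It is not the actual script: the data structures, the category IDs and the word lists are all hypothetical, and it simply treats any shared lexeme as a potential match rather than reproducing the green / lime green / yellow / orange scoring.  It does show how counting unique pairs removes the duplication that arises when siblings are analysed more than once.

```python
from itertools import product


def shared_words(oed_words, ht_words):
    """Return the lexemes two categories have in common (case-insensitive)."""
    return {w.lower() for w in oed_words} & {w.lower() for w in ht_words}


def unique_candidates(groups, min_shared=1):
    """Compare every unmatched OED sibling with every unmatched HT sibling.

    `groups` is a list of (oed_siblings, ht_siblings) pairs, one per
    monosemous match; each side maps a (hypothetical) category ID to its
    list of words.  The same siblings can turn up in several groups, so
    results are keyed by (oed_id, ht_id) to count each pair only once.
    """
    found = {}
    for oed_siblings, ht_siblings in groups:
        for (oed_id, oed_words), (ht_id, ht_words) in product(
            oed_siblings.items(), ht_siblings.items()
        ):
            common = shared_words(oed_words, ht_words)
            if len(common) >= min_shared:
                found[(oed_id, ht_id)] = sorted(common)
    return found


# Hypothetical sibling data: IDs and words are invented for illustration.
oed = {"oed-a": ["shallow", "shoal"], "oed-b": ["backwater"]}
ht = {"ht-x": ["Shoal", "flat"], "ht-y": ["pool"]}

# The same siblings are analysed twice, as happens with two monosemous
# categories at the same level, but only one unique pair is reported.
unique = unique_candidates([(oed, ht), (oed, ht)])
print(len(unique), unique)
```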

I also created a script that lists all of the non-matched OED and HT categories, split into various smaller lists, such as main categories or sub-categories, and on Wednesday I attended a meeting with Marc and Fraser to discuss our next steps.  I came out of the meeting with another long list of items to try and tackle, and I spent some of the rest of the week going through the list.  I ticked off the outstanding green, lime green and yellow matches on the lexeme pattern matching, sibling matching and monosemous matching scripts.

I then updated the sibling matching script to look for matches at any subcat level, but unfortunately this didn’t really uncover much new, at least initially.  It found just one extra green and three yellows, although the 86 oranges look like they would mostly be ok too, with manual checking.  I went over my script and it was definitely doing what I expected it to do, namely: get all of the unmatched OED cats (e.g.|05.04 (vt)); for subcats get all of the unmatched HT subcats of the maincat in the same POS (e.g. all the unmatched subcats of that are vt); list all of the subcats; if one of the stripped headings matches or has a Levenshtein score of 1 then this is highlighted in green and its contents are compared.
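
The heading test in the last step can be sketched in Python as follows.  This is only an illustration under stated assumptions: the real script’s rules for ‘stripping’ a heading aren’t given here, so the strip_heading function below (lowercase, drop everything that isn’t a letter or digit) is a guess, and the function names are my own.

```python
import re


def strip_heading(heading):
    """Assumed stripping rule: lowercase and drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", heading.lower())


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def headings_match(oed_heading, ht_heading, max_distance=1):
    """True if the stripped headings are identical or one edit apart."""
    distance = levenshtein(strip_heading(oed_heading), strip_heading(ht_heading))
    return distance <= max_distance


print(headings_match("bed of", "Bed of"))
print(headings_match("bed of", "beds of"))
print(headings_match("deep place or part", "deep place/part"))
```

Under this assumed stripping rule the third pair is two edits apart, so it falls outside the threshold of 1 and would have to be picked up by comparing contents instead.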

I then updated the script so that it didn’t compare category headings at all, but instead only looked at the contents.  In this script each possible match appears in its own row (e.g. cat 120031 appears 4 times, once as an orange, 3 times as purple).  It has brought back 8 greens, 1 lime green, 4 yellows and 1617 oranges.

I then updated the monosemous QA script to identify categories where the monosemous form has dates that match and one further date matches, the idea being if these criteria are met the category match is likely to be legitimate.  This was actually really difficult to implement and took most of a day to do.  This is because the identification of monosemous forms was done at a completely different point (and actually by a completely different script) to the listing and comparing of the full category contents.  I had to rewrite large parts of the function that gets and compares lexemes in order to integrate the monosemous forms.  The script now makes all monosemous forms in the OED word list for each category bold and compares these forms and their dates to all of the HT words in the category.  A count of all of the monosemous forms that match an HT form in terms of stripped / pattern matched content and start date is stored.  If this count is 1 or more and the count of ‘Matched stripped lexemes (including dates)’ is 2 or more then the match is bumped up to yellow.  This has identified 512 categories, which is about a sixth of the total OED unmatched categories with words, which is pretty good.
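
The promotion rule can be summarised in a short Python sketch.  Again this is an illustration rather than the real code: the function names, the (form, start date) tuples and the example words and dates are all hypothetical, stripping is reduced to trimming and lowercasing, and I’m assuming that only the start date is compared.

```python
def strip_form(word):
    """Assumed stand-in for the real stripping / pattern matching."""
    return word.strip().lower()


def promote_to_yellow(oed_words, ht_words, monosemous_forms):
    """Apply the rule: at least 1 matched monosemous form, at least 2 matches.

    `oed_words` and `ht_words` are lists of (form, start_date) tuples;
    `monosemous_forms` is the set of OED forms flagged as monosemous.
    """
    ht_keys = {(strip_form(form), date) for form, date in ht_words}
    mono = {strip_form(form) for form in monosemous_forms}

    # OED lexemes whose stripped form and start date both match an HT lexeme.
    matched = [
        (form, date)
        for form, date in oed_words
        if (strip_form(form), date) in ht_keys
    ]
    # How many of those matched lexemes are monosemous forms.
    mono_matched = sum(1 for form, _ in matched if strip_form(form) in mono)

    return mono_matched >= 1 and len(matched) >= 2


# Invented example data, purely for illustration.
oed = [("Alpha", 1400), ("beta", 1500), ("gamma", 1600)]
ht = [("alpha", 1400), ("beta", 1500)]

print(promote_to_yellow(oed, ht, {"alpha"}))  # monosemous form matches, 2 matches in total
print(promote_to_yellow(oed, ht, {"gamma"}))  # monosemous form has no HT match
```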

Other tasks this week included creating a new (and possibly final) blog post for the REELS project, dealing with some App related questions from someone in MVLS, having a brief meeting with Clara Cohen from English Language to discuss the technical aspects of a proposal she’s putting together and making a few further tweaks to the Bilingual Thesaurus website.

On Friday I attended the Corpus Linguistics in Scotland event at Edinburgh University.  There were 12 different talks over the course of the day on a broad selection of subjects.  As I’m primarily interested in the technologies used rather than the actual subject matter, here are some technical details.  One presenter used Juxta (https://www.juxtaeditions.com/) to identify variation in manuscripts.  Another used TEI to mark up pre-modern manuscripts for lexicographical use (looking at abbreviations, scribes, parts of speech, gaps, occurrences of particular headwords).  Another speaker had created a plugin for the text editor Emacs that allows you to look at things like word frequencies, n-grams and collocations.  A further speaker handled OCR using Google Cloud Vision (https://cloud.google.com/vision/), which can take images and analyse them in lots of ways, including extracting the text.  A couple of speakers used AntConc (http://www.laurenceanthony.net/software/antconc/) and another couple used the newspaper collections available through LexisNexis (https://www.lexisnexis.com/ap/academic/form_news_wires.asp) as source data.  Other speakers used WordSmith Tools (https://www.lexically.net/wordsmith/), Sketch Engine (https://www.sketchengine.eu) and WMatrix (http://ucrel.lancs.ac.uk/wmatrix/).  It was very interesting to learn about the approaches taken by the speakers.