Week Beginning 28th January 2019

Last Friday afternoon I met with Charlotte Methuen to discuss a proposal she’s putting together.  It’s an AHRC proposal, but not a typical one as it’s in collaboration with a German funding body and it has its own template.  I had agreed to write the technical aspects of the proposal, which I had assumed would involve a typical AHRC Data Management Plan, but the template didn’t include such a thing.  It did however include other sections where technical matters could be added, so I wrote some material for these sections.  As Charlotte wanted to submit the proposal for internal review by the end of the week I needed to focus on my text at the start of the week, and spent most of Monday and Tuesday working on it.  I sent my text to Charlotte on Tuesday afternoon and made a few minor tweaks on Wednesday, and everything was finalised soon after that.  Now we’ll just need to wait and see whether the project gets funded.

I also continued with the HT / OED linking process this week.  Fraser had clarified which manual connections he wanted me to tick off, so I ran these through a little script that resulted in another 100 or so matched categories.  Fraser had also alerted me to an issue with some OED categories.  Apparently the OED people had duplicated an entire branch of the thesaurus (03.01.07.06 and 03.01.04.06) but had subsequently made changes to each of these branches independently of the other.  This means that for a number of HT categories there are two potential OED category matches, and the words (and information relating to the words, such as dates) found in each of these may differ.  It’s going to be a messy issue to fix.  I spent some time this week writing scripts that will help us to compare the contents of the two branches to work out where the differences lie.  First of all I wrote a script that displays the full contents (categories and words) contained in an OED category in tabular format.  For example, passing the category 03.01.07.06 lists the 207 categories found therein, together with all of the words contained in these categories.  For comparison, 03.01.04.06 contains 299 categories.

I then created another script that compares the contents of any two OED categories.  By default, it compares the two categories mentioned above, but any two can be passed, for example to compare things lower down the hierarchy.  The script extracts the contents of each chosen category and looks for exact matches between the two sets.  The script looks for an exact match of the following in combination (i.e. all must be true):

  1. length of path (so xx.xx and yy.yy match but xx.xx and yy.yy.yy don’t)
  2. length of sub (so a sub of xx matches yy but a sub of xx doesn’t match xx.yyy)
  3. POS
  4. Stripped heading

In such cases the categories are listed in a table together with their lexemes, and the lexemes are also then compared.  If a lexeme from cat1 appears in cat2 (or vice-versa) it is given a green background.  If a lexeme from one cat is not present in the other it is given a red background, and all lexemes are listed with their dates.  Unmatched categories are listed in their own tables below the main table, with links at the top of the page to each.  03.01.04.06 has 299 categories and 03.01.07.06 has 207 categories.  Of these there would appear to be 209 matches, although some of these are evidently duplicates, given that the smaller branch only contains 207 categories.  Some further investigation is required, but it does at least look like the majority of categories in each branch can be matched.
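
To make the matching rules a bit more concrete, here is a minimal Python sketch of the comparison described above.  It is not the script I actually wrote: the `Category` record, the function names and the way headings are stripped are all assumptions made for illustration, and I’ve taken ‘length’ of a path or sub to mean the number of dot-separated parts.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout, for illustration only.
Category = namedtuple("Category", ["path", "sub", "pos", "heading", "lexemes"])


def strip_heading(heading):
    # Assumed normalisation: lower-case, letters and digits only.
    return "".join(ch for ch in heading.lower() if ch.isalnum())


def match_key(cat):
    # All four criteria must agree, so they are combined into one tuple.
    path_len = len(cat.path.split("."))
    sub_len = len(cat.sub.split(".")) if cat.sub else 0
    return (path_len, sub_len, cat.pos, strip_heading(cat.heading))


def compare_branches(cats1, cats2):
    # Index the second branch by key so each lookup is a dictionary access.
    by_key = defaultdict(list)
    for cat in cats2:
        by_key[match_key(cat)].append(cat)

    matched, unmatched1 = [], []
    for cat1 in cats1:
        partners = by_key.get(match_key(cat1), [])
        if not partners:
            unmatched1.append(cat1)
        for cat2 in partners:
            matched.append((cat1, cat2))

    keys1 = {match_key(cat) for cat in cats1}
    unmatched2 = [cat for cat in cats2 if match_key(cat) not in keys1]
    return matched, unmatched1, unmatched2


def colour_lexemes(cat1, cat2):
    # Green if the lexeme is in both categories, red if it is in only one.
    shared = set(cat1.lexemes) & set(cat2.lexemes)
    every = set(cat1.lexemes) | set(cat2.lexemes)
    return {lex: ("green" if lex in shared else "red") for lex in every}
```

Note that in this sketch a category whose key is shared by several categories in the other branch ends up in more than one matched pair, so the number of pairs can exceed the number of categories in the smaller branch.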

I also updated the lists of unmatched categories to incorporate the number of senses for each word.  The overview page now gives a list of the number of times words appear in the unmatched category data.  Of the 2155 OED words that are currently in unmatched OED categories we have 1763 words with 1 unmatched sense, 232 words with 2 unmatched senses, 75 words with 3 unmatched senses, 36 words with 4 unmatched senses, 15 words with 5 unmatched senses, 18 words with 6 unmatched senses and 16 words with 8 unmatched senses.  I also updated the full category lists linked to from this summary information to include the count of senses (unmatched) for each individual OED word, so for example for ‘extra-terrestrial’ the following information is now displayed: extra-terrestrial (1868-1969 [1963-]) [1 unmatched sense].
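
The sense counts boil down to counting twice: once per word, then once per count.  A rough Python sketch follows, assuming (hypothetically) that the unmatched data can be flattened into a list holding one entry per unmatched sense; the function names are my own and don’t reflect the real scripts.

```python
from collections import Counter


def sense_distribution(unmatched_senses):
    # unmatched_senses: one entry per unmatched sense, so a word with
    # three unmatched senses appears three times (assumed input shape).
    senses_per_word = Counter(unmatched_senses)
    # How many words have 1 unmatched sense, how many have 2, and so on.
    distribution = Counter(senses_per_word.values())
    return senses_per_word, distribution


def sense_label(word, count):
    # Builds the bracketed suffix shown next to each word.
    plural = "" if count == 1 else "s"
    return f"{word} [{count} unmatched sense{plural}]"
```

For example, `sense_label("extra-terrestrial", 1)` gives `extra-terrestrial [1 unmatched sense]`; the dates shown in the real output are left out of this sketch.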

Also this week I tweaked some settings relating to Rob Maslen’s ‘Fantasy’ blog, investigated some categories that had been renumbered erroneously in the Thesaurus of Old English and did a bit more investigation into travel and accommodation for the Bergamo conference.

I split the remainder of my time between RNSN and SCOSYA.  For RNSN I had been sent a sizable list of updates that needed to be made to the content of a number of song stories, so I made the necessary changes.  I had also been sent an entirely new timeline-based song story, and I spent a couple of hours extracting the images, text and audio from the PowerPoint presentation and formatting everything for display in the timeline.

For SCOSYA I spent some time further researching Voronoi diagrams and began trying to update my code to work with the current version of D3.js.  It turns out that there have been many changes to the way in which D3 implements Voronoi diagrams since the code I based my visualisations on was released.  For one thing, ‘d3-voronoi’ is going to be deprecated and replaced by a new module called d3-delaunay.  Information about this can be found here: https://github.com/d3/d3-voronoi/blob/master/README.md.  There is also now a specific module for applying Voronoi diagrams to spheres using coordinates, called d3-geo-voronoi (https://github.com/Fil/d3-geo-voronoi).  I’m now wondering whether I should start again from scratch with the visualisation.  However, I also received an email from Jennifer raising some issues with Voronoi diagrams in general so we might need an entirely different approach anyway.  We’re going to meet next week to discuss this.