Week Beginning 11th February 2019

I continued with the HT / OED linking tasks for a lot of this week, dealing not only with categories but also the linking of lexemes within linked categories.  We’d previously discovered that the OED had duplicated an entire branch of the HT:  their 03.01.07.06 was structurally the same as their 03.01.04.06, but the lexemes contained in the two branches didn’t match up exactly due to subsequent revisions.  We had decided to ‘quarantine’ the 03.01.07.06 branch to ensure none of its contents are accidentally matched up.  I did so by adding a new ‘quarantined’ column to the ‘category_oed’ table.  It’s ‘N’ by default and ‘Y’ for the 207 categories in this branch.  All future lexeme matching scripts will be set to ignore this branch.
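
The quarantine flag works something like the following sketch, which uses Python’s sqlite3 module purely for illustration (the real database, and any schema details beyond the ‘category_oed’ table and the ‘quarantined’ column, are my own assumptions):

```python
# Illustrative only: a tiny in-memory table standing in for 'category_oed'.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE category_oed (id INTEGER PRIMARY KEY, path TEXT)")
cur.executemany("INSERT INTO category_oed (id, path) VALUES (?, ?)",
                [(1, "03.01.04.06"), (2, "03.01.07.06"), (3, "03.01.07.06.01")])

# Add the new column, 'N' by default...
cur.execute("ALTER TABLE category_oed ADD COLUMN quarantined TEXT DEFAULT 'N'")
# ...and flag the duplicated branch and everything beneath it.
cur.execute("""UPDATE category_oed SET quarantined = 'Y'
               WHERE path = '03.01.07.06' OR path LIKE '03.01.07.06.%'""")

# Future matching scripts then simply exclude the flagged rows.
cur.execute("SELECT id FROM category_oed WHERE quarantined = 'N'")
print([r[0] for r in cur.fetchall()])  # only the unquarantined category
```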

I also created a ‘gap matching’ script.  This grabs every unmatched OED category that has a POS and contains words (excluding the quarantined categories) – there are 950 in total.  For each of these the script grabs the OED categories with an ID one lower and one higher than the category’s ID, and only returns them if both have the same POS and contain words.  For example, with OED 2560 ‘relating to dry land’ (aj) the previous category is 2559 ‘partially’ and the next is 2561 ‘spec’.  It then checks whether these are both matched up to HT categories.  In this case they are, the former to 910 ‘partially’ and the latter to 912 ‘specific’.  The script then notes whether there is a gap in the HT numbering, which there is here, and checks that the category in the gap has the same POS.  In this example 911 is the gap and its category (‘pertaining to dry land’) is an Aj, so this category is returned in its own column, along with a count of its words and a list of them.
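
The gap check can be sketched roughly as follows; the function name, dictionaries and data shapes are simplified stand-ins for the real database queries:

```python
# A rough sketch of the gap-matching check, with assumed data structures.
def find_gap_match(oed_id, oed_cats, ht_cats, matches):
    """oed_cats / ht_cats map id -> {'pos': ..., 'has_words': bool};
    matches maps already-linked OED ids to HT ids."""
    cat = oed_cats[oed_id]
    before, after = oed_cats.get(oed_id - 1), oed_cats.get(oed_id + 1)
    # Both neighbours must exist, share the POS and contain words.
    for n in (before, after):
        if not n or n['pos'] != cat['pos'] or not n['has_words']:
            return None
    ht_before, ht_after = matches.get(oed_id - 1), matches.get(oed_id + 1)
    # Both neighbours must be matched, with a gap in the HT numbering.
    if ht_before is None or ht_after is None or ht_after - ht_before < 2:
        return None
    # Take the first HT category in the gap and check its POS.
    gap = ht_cats.get(ht_before + 1)
    if gap and gap['pos'] == cat['pos']:
        return ht_before + 1
    return None

# The example from above: OED 2560 sits between matched neighbours
# 2559 -> HT 910 and 2561 -> HT 912, leaving HT 911 in the gap.
oed_cats = {i: {'pos': 'aj', 'has_words': True} for i in (2559, 2560, 2561)}
ht_cats = {911: {'pos': 'aj', 'has_words': True}}
matches = {2559: 910, 2561: 912}
print(find_gap_match(2560, oed_cats, ht_cats, matches))  # → 911
```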

There are, however, some things to watch out for.  On a few occasions there is more than one HT category in the gap.  For example, for the OED category 165009 ‘enter upon command’ the ‘before’ category matches HT category 157423 and the ‘after’ category matches 157445, meaning there are several categories in the gap.  Currently in such cases the script just grabs the first HT category in the gap.  Linked to this (though not always because of it), some HT categories in the gap are already linked to other OED categories.  I’ve put in a check for this so they can be checked manually.

There are 169 gaps to explore, and in 14 of these the HT category in the gap is already matched to something else.  There are also two cases where the identified HT category in the gap is the wrong POS, and these are also flagged.  Many of the potential matches are ones that fell through the cracks because their lexemes were too different to match up automatically, generally because there are only 1-3 matching words in the category.  The matches look pretty promising, and will just need to be checked over manually before I tick a lot of them off.

Also this week, I updated the ‘match lexemes’ script output to ignore a final ‘s’ and an initial ‘to’.  I also added in counts of matched and unmatched words.  We were right to be concerned about duplicate words, as the ‘total matched’ figures for OED and HT lexemes are not the same, meaning a single OED word can match multiple HT words (or vice-versa).  After running the script, here are some stats:

01: There are 347312 matched HT words and 40877 unmatched HT words, and 347947 matched OED words and 27840 unmatched OED words.

02: There are 110510 matched HT words and 13184 unmatched HT words, and 110651 matched OED words and 9226 unmatched OED words.

03: There are 201653 matched HT words and 22319 unmatched HT words, and 201994 matched OED words and 15777 unmatched OED words.
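
The stripping of the final ‘s’ and initial ‘to’ might look something like this (the function name and exact rules are my own sketch; the real script’s normalisation may differ):

```python
# Hypothetical sketch of the form-stripping used before comparing lexemes.
def strip_form(word):
    w = word.lower().strip()
    if w.startswith('to '):   # ignore an initial 'to'
        w = w[3:]
    if w.endswith('s'):       # ignore a final 's'
        w = w[:-1]
    return w

print(strip_form('to sparkle'))                        # → sparkle
print(strip_form('heavens') == strip_form('heaven'))   # → True
```

The second comparison shows why stripping increases matches but also creates collisions: distinct citation forms can share a stripped form.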

I then created a script that lists all duplicate lexemes in HT and OED categories.  There shouldn’t really be any duplicate lexemes in categories as each word should only appear once in each sense.  However, my script uncovered rather a lot of duplicates.  This is going to have an impact on our lexeme matching scripts as our plans were based on the assumption that each lexeme form would be unique in a category.  My script gives four different lists for both HT and OED categories:  All categories comparing citation form, all categories comparing stripped form, matched categories comparing citation form and matched categories comparing stripped form.  The output lists the lexeme ID and either fulldate in the case of HT or GHT dates 1 and 2 in the case of OED so it’s easier to compare forms.
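
The duplicate check within a category can be sketched like this (the data shapes and helper names are assumptions; the real script works over the database and lists IDs and dates alongside the forms):

```python
# Sketch: find forms occurring more than once within a single category.
from collections import Counter

def duplicate_forms(lexemes, key=lambda lex: lex[1]):
    """lexemes: list of (id, form, dates) tuples for one category."""
    counts = Counter(key(lex) for lex in lexemes)
    return sorted(form for form, n in counts.items() if n > 1)

def stripped(form):
    # crude stand-in for the full stripping rules
    return form[:-1] if form.endswith('s') else form

category = [(19331, 'heaven', 'OE-1860'), (19332, 'heavens', 'OE-')]
print(duplicate_forms(category))                                # → []
print(duplicate_forms(category, key=lambda l: stripped(l[1])))  # → ['heaven']
```

This toy example also shows why the stripped-form duplicate counts come out higher than the citation-form ones.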

For all HT categories there are 576 duplicates using citation form and 3316 duplicates using the stripped form.  The majority of these are in matched categories (550 and 3264 respectively).  In the OED data things get much, much worse.  For all OED categories there are 5662 duplicates using citation form and 6896 duplicates using the stripped form.  Again, the majority of these are in matched categories (5634 and 6868 respectively).  This is going to need some work in the coming weeks.

As we can’t currently rely on the word form in a category to be unique, I decided to make a new script that matches lexemes in matched categories using both their word form and their date.  It matches both stripped word form and start date (the first bit of HT fulldate against the GHT1 date) and is looking pretty promising, with matched figures not too far off those found when comparing stripped word form on its own.  The script lists the HT word / ID and date and its corresponding OED word / ID and date in both the HT and OED word columns.  Any unmatched HT or OED words are then listed in red underneath.

Here are some stats (with the figures for matching by stripped form alone in brackets for comparison):

01: There are 335195 (347312) matched HT words and 52994 (40877) unmatched HT words, and 335196 (347947) matched OED words and 40591 (27840) unmatched OED words.

02: There are 106845 (110510) matched HT words and 16849 (13184) unmatched HT words, and 106842 (110651) matched OED words and 13035 (9226) unmatched OED words.

03:  There are 193187 (201653) matched HT words and 30785 (22319) unmatched HT words, and 193186 (201994) matched OED words and 24585 (15777) unmatched OED words.

I’m guessing that the reason the numbers of HT and OED matches aren’t exactly the same is duplicates with identical dates somewhere.  But still, the matches are much more reliable.  However, there still appear to be several issues relating to duplicates.  Some OED duplicates are carried over from HT duplicates – e.g. ‘stalagmite’ in HT 3142 ‘stalagmite/stalactite’:  duplicates appear in both HT and OED, and the forms in each set have matching dates, so they are matched up without issue.  But sometimes the OED has changed a form, which has resulted in a duplicate being created.  E.g. for HT 5750 ‘as seat of planet’ there are two OED ‘term’ words; the second one (ID 252, date a1625) should actually match the HT word ‘termin’ (ID 19164, date a1625).  In HT 6506 ‘Towards’ the OED has two ‘to the sun-ward’ forms, but the latter (ID 1806, date a1711) seems to have been changed from the HT’s ‘sunward’ (ID 20940, date a1711), which is a bit weird.  There are also some cases where the wrong duplicate is matched, often due to OE dates.  For example, in HT category 5810 ‘Sky, heavens’ (n), ‘heaven’ (HT 19331, dates OE-1860) is set to match OED 399 ‘heaven’ (dates OE-).  But HT ‘heavens’ (19332, dates OE-) is also set to match OED 399 ‘heaven’, as the stripped form is ‘heaven’ and the start date matches.  The OED duplicate ‘heaven’ (ID 433, dates OE-1860) doesn’t get matched, as the script finds the 399 ‘heaven’ first and goes no further.  In this case the OED duplicate ‘heaven’ also appears to have been created by the OED removing the final ‘s’ from the second form.
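
The first-match behaviour behind the ‘heaven’ issue can be seen in a stripped-down sketch of the form-and-date matching; the function name and data shapes are my own, and the dates are simplified:

```python
# Illustrative sketch: pair each HT word with the FIRST OED word sharing
# both stripped form and start date. Nothing prevents two HT duplicates
# from hitting the same OED word, which reproduces the issue above.
def match_words(ht_words, oed_words):
    """ht_words / oed_words: lists of (id, stripped_form, start_date)."""
    pairs = []
    for ht_id, form, date in ht_words:
        for oed_id, oed_form, oed_date in oed_words:
            if oed_form == form and oed_date == date:
                pairs.append((ht_id, oed_id))
                break  # first match wins; the script goes no further
    return pairs

ht = [(19331, 'heaven', 'OE'), (19332, 'heaven', 'OE')]  # 'heavens' strips to 'heaven'
oed = [(399, 'heaven', 'OE'), (433, 'heaven', 'OE')]
print(match_words(ht, oed))  # → [(19331, 399), (19332, 399)]; OED 433 is never matched
```

One possible fix would be to mark each OED word as used once it has been matched, so the second HT duplicate falls through to OED 433.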

On Friday I met with Marc to discuss all of the above, and we made a plan about what to focus on next.  I’ll be continuing with this next week.

Also this week I did some more work for the DSL people.  I reviewed some documents Ann had sent me relating to IT infrastructure, spoke to Rhona about some future work I’m going to be doing for the DSL that I can’t really go into any detail about at this stage, and created a couple of new pages for the website that will go live next week.  I also updated the way the DSL’s entry page works so that a dictionary ID (e.g. ‘dost24821’) can be passed to the page in addition to the current method of passing a dictionary (e.g. ‘dost’) and an entry href (e.g. ‘milnare’).

I also gave some advice to the RA on the SCOSYA project who is working on reshaping the Voronoi cells to fit the coastline of Scotland more closely, advised a member of staff in History who wants to rework an existing database, spoke to Gavin Miller about his new Glasgow-wide Medical Humanities project, and completed the migration of the RNSN timeline data from Google Docs to locally hosted JSON files.