Week Beginning 18th March 2019

This week I spent a lot of time continuing with the HT/OED linking task, tackling the outstanding items on my ‘to do’ list before I met with Marc and Fraser on Friday.  This included the following:

Re-running category pattern matching scripts on the new OED categories: The bulk of the category matching scripts rely on matching the HT’s oedmaincat field against the OED’s path field (and then doing other things like comparing category contents). However, these scripts aren’t very helpful with the new OED category table as the path has changed for a lot of the categories. The script that seemed the most promising was number 17 in our workflow document, which compares first dates of all lexemes in all unmatched OED and HT categories and doesn’t check anything else. I’ve created an updated version of this that uses the new OED data, and the script only brings back unmatched categories that have at least one word with a GHT date. Interestingly, the new data has fewer unmatched categories featuring GHT dates than the old data (591 as opposed to 794). I’m not really sure why this is, or what might have happened to the GHT dates. The script brings back five 100% matches (only 3 more than the old data, all but one containing just one word) and 52 matches that don’t meet our criteria (down from 56 with the old data), so it was not massively successful.
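
To illustrate the idea behind the date-based comparison, here’s a minimal sketch, assuming each unmatched category is represented as a simple dict with an ‘id’ and a list of the GHT first dates of its lexemes (the structure and field names are purely illustrative, not the actual database columns):

```python
def date_match_score(oed_dates, ht_dates):
    """Return the proportion of OED first dates also found in the HT category."""
    if not oed_dates:
        return 0.0
    ht_set = set(ht_dates)
    return sum(1 for d in oed_dates if d in ht_set) / len(oed_dates)

def find_candidate_matches(unmatched_oed, unmatched_ht, threshold=1.0):
    """Pair up unmatched OED and HT categories purely on lexeme first dates."""
    candidates = []
    for oed_cat in unmatched_oed:
        for ht_cat in unmatched_ht:
            score = date_match_score(oed_cat['dates'], ht_cat['dates'])
            if score >= threshold:
                candidates.append((oed_cat['id'], ht_cat['id'], score))
    return candidates

# Example with made-up data: one OED category whose three first dates all
# appear in an HT category, giving a 100% match.
oed = [{'id': 'oed1', 'dates': [1440, 1590, 1623]}]
ht = [{'id': 'ht9', 'dates': [1440, 1590, 1623, 1701]}]
print(find_candidate_matches(oed, ht))   # [('oed1', 'ht9', 1.0)]
```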

Ticking off all matching HT/OED lexemes rather than just those within completely matched categories: 627863 lexemes are now matched. There are 731307 non-OE words in the HT, so about 86% of these are ticked off. There are 751156 lexemes in the new OED data, so about 84% of these are ticked off. Whilst doing this task I noticed another unexpected thing about the new OED data: the numbers of categories in ’01’ and ‘02’ have decreased while the number in ‘03’ has increased. In the old OED data we have the following number of matched categories:

01: 114968

02: 29077

03: 79282

In the new OED data we have the following number of matched categories:

01: 109956

02: 29069

03: 84260

The totals match up, other than the 42 matched categories that have been deleted in the new data, so (presumably) some categories have changed their top level. Matching up the HT and OED lexemes has introduced a few additional duplicates, caused when a ‘stripped’ form means that multiple words within a category match the same word. There aren’t too many, but they will need to be fixed manually.
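
As a rough illustration of how stripped-form matching can produce duplicates, the following sketch uses a made-up strip_form() normalisation (lower-casing and dropping non-alphabetic characters), which is only a guess at the sort of stripping involved:

```python
import re

def strip_form(word):
    """Illustrative 'stripping': lower-case and drop any non-alphabetic characters."""
    return re.sub(r'[^a-z]', '', word.lower())

ht_words = ['some-one', 'someone']   # two HT words in the same category
oed_word = 'someone'

matches = [w for w in ht_words if strip_form(w) == strip_form(oed_word)]
print(matches)   # both HT words match the single OED word, creating a duplicate
```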

Identifying all words in matched categories that have no GHT dates and seeing which of these can be matched on stripped form alone: I created a script to do this, which lists every unmatched OED word that doesn’t have a GHT date in every matched OED category and then tries to find a matching HT word from the remaining unmatched words within the matched HT category. Perhaps I misunderstood what was being requested, because no matches are returned in any of the top-level categories. But then maybe OED words that don’t have a GHT date are likely to be new words that aren’t in the HT data anyway?
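
Here’s a hedged sketch of what the script attempts, using the same sort of illustrative strip_form() normalisation as above and plain lists of words standing in for the unmatched OED and HT lexemes in a pair of matched categories:

```python
import re

def strip_form(word):
    """Illustrative 'stripping': lower-case and drop any non-alphabetic characters."""
    return re.sub(r'[^a-z]', '', word.lower())

def match_dateless_oed_words(oed_words, ht_words):
    """
    For each unmatched OED word without a GHT date, look for a remaining
    unmatched HT word in the corresponding matched HT category with the same
    stripped form.  Returns (oed_word, ht_word) pairs.
    """
    available = {strip_form(w): w for w in ht_words}
    pairs = []
    for oed_word in oed_words:
        key = strip_form(oed_word)
        if key in available:
            pairs.append((oed_word, available.pop(key)))
    return pairs

print(match_dateless_oed_words(['witch-hunt'], ['witch hunt', 'witchery']))
# [('witch-hunt', 'witch hunt')]
```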

Creating a monosemous script that finds all unmatched HT words that are monosemous and checks whether there are any matching OED words that are also monosemous: Again, I think the script I created will need more work. It is currently set to only look at lexemes within matched categories. It finds all the unmatched HT words that are in matched categories, then checks how many times each word appears amongst the unmatched HT words in matched categories of the same POS. If the word only appears once then the script looks within the matched OED category to find a currently unmatched word that matches. At the moment the script does not check whether this OED word is monosemous, as I figured that if the word matches and is in a matched category it’s probably a correct match. Of the 108212 unmatched HT words in matched categories, 70916 are monosemous within their POS, and of these 14474 can be matched to an OED lexeme in the corresponding OED category.
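
The logic can be sketched roughly as follows, assuming the unmatched HT words are available as (word, POS, category id) tuples and the unmatched OED words are grouped by matched HT category; all the names and structures here are illustrative rather than the actual tables:

```python
from collections import Counter

def monosemous_matches(unmatched_ht, oed_by_ht_cat):
    # Count how often each (word, POS) pair occurs among the unmatched HT words
    # in matched categories; a count of 1 means the word is monosemous there.
    counts = Counter((word, pos) for word, pos, _ in unmatched_ht)
    matches = []
    for word, pos, ht_cat in unmatched_ht:
        if counts[(word, pos)] != 1:
            continue
        # Look for a currently unmatched OED word of the same form in the
        # matched OED category corresponding to this HT category.
        for oed_word in oed_by_ht_cat.get(ht_cat, []):
            if oed_word.lower() == word.lower():
                matches.append((word, pos, ht_cat))
                break
    return matches

# Example with made-up data: 'fern' appears twice so is skipped; 'bracken'
# is monosemous and has a matching OED word in its matched category.
unmatched_ht = [('fern', 'n', 'cat1'), ('fern', 'n', 'cat2'), ('bracken', 'n', 'cat1')]
oed_by_ht_cat = {'cat1': ['bracken', 'frond']}
print(monosemous_matches(unmatched_ht, oed_by_ht_cat))   # [('bracken', 'n', 'cat1')]
```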

Deciding which OED dates to use: I created a script that gets all of the matched HT and OED lexemes in one of the top-level categories (e.g. 01) and then, for each matched lexeme, works out the difference between OED sortdate and HT firstd (sortdate-firstd if sortdate is later, otherwise firstd-sortdate), works out the difference between OED enddate and HT lastd in the same way, and adds these two differences together to get the overall difference. It then sorts the data on the overall difference and displays all lexemes in a table ordered by largest difference, with additional fields containing the start difference, end difference and total difference for info. I did, however, encounter a potential issue: not all HT lexemes have a firstd and lastd. E.g. words that are ‘OE-‘ have nothing in firstd and lastd but instead have ‘OE’ in the ‘oe’ column and ‘_’ in the ‘current’ column. In such cases the difference between HT and OED dates is massive, but not accurate. I wonder whether using HT’s apps and appe columns might work better.
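
A small sketch of the difference calculation, assuming each matched pair is a dict holding the OED sortdate/enddate and HT firstd/lastd values (and glossing over the empty firstd/lastd problem noted above):

```python
def date_differences(pair):
    """Return (start difference, end difference, total difference) for a matched pair."""
    start_diff = abs(pair['sortdate'] - pair['firstd'])
    end_diff = abs(pair['enddate'] - pair['lastd'])
    return start_diff, end_diff, start_diff + end_diff

pairs = [
    {'lexeme': 'example', 'sortdate': 1500, 'enddate': 1900, 'firstd': 1510, 'lastd': 1850},
]

# Sort the matched lexemes by the overall difference, largest first, as in the script.
for p in sorted(pairs, key=lambda p: date_differences(p)[2], reverse=True):
    start, end, total = date_differences(p)
    print(p['lexeme'], start, end, total)   # example 10 50 60
```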

Looking at lexemes that have an OED citation after 1945, which should be marked as ‘current’:  I created a script that goes through all of the matched lexemes and lists all of the ones that either have an OED sortdate greater than 1945 or an OED enddate greater than 1945 where the matched HT lexeme does not have the ‘current’ flag set to ‘_’.  There are 73919 such lexemes.
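
A minimal sketch of that check, again with illustrative field names for the OED dates and the HT ‘current’ flag:

```python
def should_be_current(row):
    """True if the lexeme has a post-1945 OED date but the HT 'current' flag isn't set to '_'."""
    cited_after_1945 = (row.get('sortdate') or 0) > 1945 or (row.get('enddate') or 0) > 1945
    return cited_after_1945 and row.get('current') != '_'

rows = [
    {'lexeme': 'example', 'sortdate': 1950, 'enddate': None, 'current': 'b'},
]
flagged = [r['lexeme'] for r in rows if should_be_current(r)]
print(flagged)   # lexemes that should probably have 'current' set to '_'
```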

On Friday afternoon I had a meeting with Marc and Fraser where we discussed the above and our next steps.  I now have a further long ‘to do’ list, which I will no doubt give more information about next week.

Other than HT duties I helped out with some research proposals this week.  Jane Stuart-Smith and Eleanor Lawson are currently putting a new proposal together and I helped to write the data management plan for this.  I also met with Ophira Gamliel in Theology to discuss a proposal she’s putting together.  This involved reading through a lot of materials and considering all the various aspects of the project and the data requirements of each, as it is a highly multifaceted project.  I’ll need to spend some further time next week writing a plan for the project.

I also had a chat with Wendy Anderson about updating the Mapping Metaphor database and the possibility of moving the site to a different domain. I also met with Gavin Miller to discuss the new website I’ll be setting up for his new Glasgow-wide Medical Humanities Network, and I ran some queries on the DSL database in order to extract entries that reference the OED for some work Fraser is doing.

Finally, I had to make some changes to the links from the Bilingual Thesaurus to the Middle English Dictionary website. The site has had a makeover, and is looking great, but unfortunately when they redeveloped the site they didn’t put redirects in place from the old URLs to the new ones. This is pretty bad as it means anyone who has cited or bookmarked a page will end up with broken links, not just the BTh. I would imagine entries have been cited in countless academic papers, and all of these citations will now be broken, which is not good. Anyway, I’ve fixed the MED links in BTh now. Unfortunately there are two forms of link in the database, for example: http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466 and http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact. I’m not sure why this is the case, and I’ve no idea what the ‘byte’ number refers to in the second link type. The first type includes the entry ID, which is still used in the new MED URLs. This means I can get my script to extract the ID from the URL in the database and then replace the rest with the new URL, so the above becomes https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466 as the target for our MED button and links directly through to the relevant entry page on their new site.
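
The rewriting for the first (‘type=id’) form of link can be sketched like this; the regular expression simply pulls the MED entry ID out of the old URL and appends it to the new base URL shown above (the actual script may well be structured differently):

```python
import re

OLD_PATTERN = re.compile(r'med-idx\?type=id&id=(MED\d+)')
NEW_BASE = 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/'

def rewrite_med_link(old_url):
    """Turn an old 'type=id' MED URL into the equivalent new entry URL."""
    match = OLD_PATTERN.search(old_url)
    if match:
        return NEW_BASE + match.group(1)
    return None   # the 'type=byte' links can't be rewritten this way

old = 'http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466'
print(rewrite_med_link(old))
# https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466
```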

Unfortunately there doesn’t seem to be any way to identify an individual entry page for the second type of link, which means there is no way to link directly to the relevant entry page. However, I can link to the search results page by passing the headword, and this works pretty well. So, for example, the three words on this page: https://thesaurus.ac.uk/bth/category/?type=search&hw=2&qsearch=catourer&page=1#id=1393 have the second type of link, but if you click one of the buttons you’ll find yourself at the search results page for that word on the MED website, e.g. https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer.
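
A small sketch of building that fallback search link, with the query-string parameters copied from the example search URL above:

```python
from urllib.parse import quote

SEARCH_BASE = 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary'

def med_search_link(headword):
    """Build a headword-search URL on the new MED site for links lacking an entry ID."""
    return f'{SEARCH_BASE}?utf8=%E2%9C%93&search_field=hnf&q={quote(headword)}'

print(med_search_link('Catourer'))
# https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer
```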