Week Beginning 10th September 2018

This was the sort of week when I worked on many different projects.  I created new ‘copyright’ and ‘terms of use’ pages for the DSL website and also made a few other tweaks that had been requested.  I created a second version of the Data Management Plan for Matthew Sangster’s project, based on feedback from him and the PI at Stirling, Katie Halsey.  I had an email discussion with a member of staff in MVLS about an app one of her students would like to publish, and I spoke to Zanne Domoney-Lyttle about a website she would like to factor into a funding proposal she is putting together.

On Wednesday I met with the project team for the Kirkcudbrightshire place-names project, which is just starting up.  I’d already set up a version of the REELS system for the project to use, so this was an opportunity to meet the team and go through how to use the content management system.  It was a useful session as a couple of technical issues cropped up that I needed to fix after the meeting, namely:

  1. The ‘add element’ feature wasn’t working.  It turned out that I’d forgotten to migrate the contents of the ‘language’ table over from the REELS system, and because the system expected data that wasn’t there, the element boxes didn’t load.  This has now been sorted.
  2. I was asked to migrate the contents of the ‘sources’ table over from REELS, which I did.  These can now be viewed through the KCB CMS by clicking the ‘browse sources’ link.  However, a lot of these are not going to be relevant as they’re specifically about Berwickshire.
  3. When demonstrating the REELS place-name search facilities I noted that a quick search for ‘t*’ was bringing back place-names that didn’t start with ‘t’.  I found out why: the quick search also searches elements, so any place-name with an element starting with ‘t’ was also returned.  This is a bit confusing, so perhaps we want to limit the quick search to headwords only (see the first sketch after this list).  However, you can use the advanced search to search specifically for ‘Current place-names’, or indeed you can use the ‘browse’ feature to bring back current place-names starting with a particular letter.
  4. I noticed at the meeting that the CMS automatically calculates the altitude of a place, and I had a feeling this was using Google Maps.  As it has been months since I set the facility up I had to check to make sure.  It turns out this part of the site does indeed use Google Maps, and there are issues with using this service now, as I discussed last week.  The CMS connects to the Google Maps API, passes the latitude and longitude to the service, and Google returns the altitude for that location (a sketch of this lookup appears after this list).  However, I realised there is no need to worry about this feature (or the Google Map embedded in the ‘edit record’ page) breaking, as the system is already set up to use my Google account, which has an associated credit card.  I wasn’t aware until now that it would potentially be using my credit card, but there you go.  However, as the only place we use Google Maps is in the CMS, which can only be accessed by the project teams of REELS and KCB, I don’t think I’ll ever face a bill.  The stats show that in the past 30 days there have been 278 calls to the Google Maps API and 8 calls to the Elevation API, and the free tier allows up to 28,000 calls to the former and 40,000 calls to the latter.  So unless we have a particularly malicious member of staff who sits and refreshes their page thousands of times, I think I’m safe!
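To illustrate the headword-only restriction suggested in point 3, here is a minimal sketch of the two search behaviours.  The table and column names (placenames, placename_elements, elements) are made up for the example; the real REELS schema will differ.

```python
# Current behaviour: headwords AND elements are searched, so a
# place-name whose element starts with 't' is also returned for 't*'.
def quick_search(conn, term):
    like = term.replace('*', '%')  # 't*' becomes the SQL wildcard 't%'
    return conn.execute(
        """SELECT DISTINCT p.id, p.headword
             FROM placenames p
             LEFT JOIN placename_elements pe ON pe.placename_id = p.id
             LEFT JOIN elements e ON e.id = pe.element_id
            WHERE p.headword LIKE ? OR e.element LIKE ?""",
        (like, like)).fetchall()

# The proposed restriction: match on the headword alone.
def quick_search_headwords_only(conn, term):
    like = term.replace('*', '%')
    return conn.execute(
        "SELECT id, headword FROM placenames WHERE headword LIKE ?",
        (like,)).fetchall()
```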
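And for point 4, the altitude lookup works along these lines.  The Elevation API endpoint is Google’s real one; the key and the example coordinates below are placeholders.

```python
import requests

GOOGLE_API_KEY = "YOUR_KEY_HERE"  # the key tied to the billing account

# Pass a latitude/longitude pair to the Google Maps Elevation API and
# read back the elevation (in metres) for that location.
def get_altitude(lat, lng):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/elevation/json",
        params={"locations": f"{lat},{lng}", "key": GOOGLE_API_KEY},
        timeout=10)
    data = resp.json()
    if data.get("status") == "OK" and data["results"]:
        return data["results"][0]["elevation"]
    return None

# e.g. get_altitude(54.839, -4.049) for a point in Kirkcudbrightshire
```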

I also spent some time this week going through the updates to all of the ‘Seeing Speech’ and ‘Dynamic Dialects’ pages that Eleanor had sent me and setting up the content.  This included creating new versions of image files without the big, thick borders, creating new MP4 versions of some video files that were in a format HTML5 can’t support natively, and formatting all of the text for the new pages.  The latter also involved amalgamating many small pages into single longer pages, as longer scrolling pages tend to be preferred these days due to touchscreens.  The new site isn’t live yet, and there are still some changes to be made to the homepage text and other pages, but the bulk of the new site is now in place.  Hopefully we’ll be able to go live with the new design in the coming weeks.

The rest of my time this week was spent on Historical Thesaurus duties.  I had a productive meeting with Marc and Fraser on Tuesday, and devoted a lot of my time this week to writing scripts to help match up the HT and OED data.  This included creating a new statistics page that lists stats about the HT and OED categories and lexemes and what still needs to be matched up.  As part of this task Marc wanted to know how many HT categories only contain OE words, and how many are empty.  The latter was easy to do but the former was rather tricky, as it meant going through every HT category and then every lexeme in each of these categories to check for the presence of non-OE words.  This took too long to do on the fly so instead I updated the database to include a new ‘OE only’ field.  Running the script to generate data for this field took about 20 minutes, but now that the data is in the database it’s really quick to query.  It turns out there are 3175 HT categories that only contain OE words.
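For reference, that one-off job is roughly the following.  This is only a sketch: the ‘categories’ and ‘lexemes’ tables and the ‘oe’ flag are stand-ins for whatever the real HT schema uses.

```python
# Populate the new 'OE only' flag in one pass, then query it cheaply.
# Table and column names here are assumptions, not the real schema.
def populate_oe_only(conn):
    cur = conn.execute("SELECT id FROM categories")
    for (catid,) in cur.fetchall():
        total, oe_count = conn.execute(
            "SELECT COUNT(*), SUM(CASE WHEN oe = 1 THEN 1 ELSE 0 END) "
            "FROM lexemes WHERE category_id = ?", (catid,)).fetchone()
        # Flag categories that have words and where every word is OE.
        oe_only = 1 if total and total == oe_count else 0
        conn.execute("UPDATE categories SET oe_only = ? WHERE id = ?",
                     (oe_only, catid))
    conn.commit()

# Afterwards the count is a single quick query:
#   SELECT COUNT(*) FROM categories WHERE oe_only = 1;  -- 3175
```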

I also wrote a script that addresses the issue of lexemes not being matched up because of pesky apostrophes.  We’ve also matched up lots of new categories since I last did a lexeme match, so I thought I’d run another.  The script finds every HT category that has been matched to an OED category, brings back all of the unmatched words in both the HT and OED categories and then compares the ‘stripped’ fields for each to identify words that should be linked together.  I ran the script across all matched categories and it has identified 24,795 words that are not currently matched but should be (i.e. their category is matched and the contents of the ‘stripped’ field in the HT and OED word tables are identical).  I haven’t ticked these off yet, but it’s a nice number of new matches.
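The core of the matching pass looks something like this.  Again a sketch with assumed names (ht_categories, ht_lexemes, oed_lexemes, a ‘stripped’ column on each word table), not the actual script.

```python
# For every matched category pair, index the unmatched OED words by
# their stripped form, then look up each unmatched HT word in it.
def find_lexeme_matches(conn):
    matches = []
    pairs = conn.execute(
        "SELECT id, oed_catid FROM ht_categories "
        "WHERE oed_catid IS NOT NULL").fetchall()
    for ht_cat, oed_cat in pairs:
        oed_by_stripped = {
            stripped: wid for wid, stripped in conn.execute(
                "SELECT id, stripped FROM oed_lexemes "
                "WHERE category_id = ? AND matched = 0", (oed_cat,))
        }
        for ht_id, stripped in conn.execute(
                "SELECT id, stripped FROM ht_lexemes "
                "WHERE category_id = ? AND oed_lexeme_id IS NULL",
                (ht_cat,)).fetchall():
            if stripped in oed_by_stripped:
                matches.append((ht_id, oed_by_stripped[stripped]))
    return matches  # 24,795 pairs on the current data
```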

I also created a script that, for each unmatched OED subcategory, finds its parent category.  If the parent has been matched to an HT category, the script returns all of that HT category’s subcategories to see if there is one with the same name as the unmatched OED subcategory.  This has actually worked very well.  There are 4666 OED subcats that have a POS.  Of these, 3158 have a parent that has been matched to an HT category.  Comparing the ‘stripped’ headings of the unmatched subcats in each of these HT maincats with the OED subcat headings gives 2992 matches.  I updated the script to mark off the matches, but then something odd happened: when it marked off the matches it only reported 2710 subcat matches, which was a bit concerning, so I’ve reverted to a backup version of the category table that I’d made.
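In outline the subcat matching works like this; the table and column names (parent_id, stripped_heading, maincat_id and so on) are guesses for the purposes of the sketch.

```python
# For each unmatched OED subcat with a POS: find its parent maincat;
# if the parent is matched to an HT category, collect any unmatched HT
# subcats under it with the same stripped heading.
def match_subcats_via_parent(conn):
    candidates = []
    for oed_id, parent_id, heading in conn.execute(
            "SELECT id, parent_id, stripped_heading FROM oed_categories "
            "WHERE is_subcat = 1 AND pos IS NOT NULL AND matched = 0"
            ).fetchall():
        row = conn.execute(
            "SELECT id FROM ht_categories WHERE oed_catid = ?",
            (parent_id,)).fetchone()
        if row is None:
            continue  # parent maincat not matched to HT yet
        hits = [h[0] for h in conn.execute(
            "SELECT id FROM ht_categories WHERE maincat_id = ? "
            "AND oed_catid IS NULL AND stripped_heading = ?",
            (row[0], heading))]
        if hits:
            candidates.append((oed_id, hits))
    return candidates  # 2992 OED subcats with at least one hit
```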

In order to investigate this discrepancy I updated the script so that any OED subcat that matches multiple HT subcats is now logged and listed at the bottom of the page, together with counts of the duplicates and the total number of duplicates found (391).  If you search the page for one of these IDs you can see where the duplicates occur, e.g. the OED subcat with ID 58953 (‘types of’) within ‘clothing for body or trunk’ matches nine subcats within the joined HT maincat.  This is because we’re looking at all subcats at all levels, and ‘types of’ crops up several times at different levels.  I have therefore added another check that identifies whether a match has the same subcat number.  If it does, ‘Subs match too’ appears in purple next to the green ‘Match’ text.  This text appears for both single and multiple matches.
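The extra check is simple enough; sketched below with the same assumed names, where ‘subcat_number’ stands in for whatever the real sub-number field is called.

```python
# Where an OED subcat matches several HT subcats on heading alone,
# prefer any whose subcat number also matches ('Subs match too').
def filter_by_subcat_number(conn, oed_id, ht_ids):
    oed_sub = conn.execute(
        "SELECT subcat_number FROM oed_categories WHERE id = ?",
        (oed_id,)).fetchone()[0]
    return [ht_id for ht_id in ht_ids
            if conn.execute(
                "SELECT subcat_number FROM ht_categories WHERE id = ?",
                (ht_id,)).fetchone()[0] == oed_sub]
```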

I’ve also added some counts at the bottom of the page, above the list of duplicates.  These appear in purple.  There is a count of the matches where there are no duplicates; these are probably safe to tick off as proper matches.  There are 2601 in total, out of 2992 subcat matches.  Each of these is marked in the output with the purple text ‘One match.  Safe to log?’.

There is also a count of the possible matches where the subcat number is the same in both the HT and OED data (where ‘Subs match too’ appears in the output, as mentioned above).  This is useful in identifying which of the duplicates might be the correct ones.  There are 1732 matches where the sub numbers match, including both duplicates and individual matches.  Where there is a single match but the subs don’t match (e.g. 143009 “one’s lot” matching 136436 “one’s lot”) it is because the subcat order has been messed about with (in this example the OED subcat number is 02 while the HT subcat number is 02.02).

I think it should be relatively safe to log all occurrences where there is one match, whether the subcat number is the same or not.  This would tick off 2601 categories.  I think it should also be pretty safe to tick off matches where there are duplicates but the subcat number also matches.  I’m not entirely sure how many that would tick off, but I would imagine it would be a fairly sizable portion of the 391 duplicates.

I also updated the script I created last week that displays unmatched HT categories that have an ‘oedmaincat’ and therefore should be possible to match up to an OED category.  Content is now displayed as a table to hopefully make it easier to read.  I’ve added in a count of the words in the HT and OED categories and also the last word in each category, together with its dates.  Where a category has multiple potential matches the first column has a red background colour and a ‘Y’ in it.  I think it will be possible to automatically figure out the correct one for most of these multiples based on the words.  E.g. the first category is HT 39514 ‘one who’ and its last word (well, only word) is ‘Malacologist’.  Of the nine possible OED matches there is one whose last word is also ‘Malacologist’, so that is no doubt the correct match.  However, adding in the words shows that some potential direct matches have different contents, e.g. the first row ‘causing discomfort’ has 4 words but the matching OED category only has 3 (the OED omits ‘discomfortable’).  There is also often variation in the final words, usually in spelling or use of punctuation, e.g. 15759 ‘by occult methods’ has ‘point the bone’ while in the OED it’s ‘to point the (death) bone’.  Using the ‘stripped’ field will catch a lot of these (e.g. ‘R.S.P.B’ and ‘RSPB’) but not all of them.  Sometimes the word is completely different, e.g. 31915 ‘pediculus corporis/body-louse’ has as its last word ‘typhus-louse’ while the corresponding OED category has the rather wonderful ‘pants rabbits’.

I made some further updates to this script to give cells a green background if the HT and OED word counts match and also if the last word (stripped) matches, so you can see where the strong potential matches are.  This works for categories where there are duplicate possibilities too.  I’ve also added some stats to the bottom of the page.  There are a total of 920 potential matches and of these 43 have multiple possibilities.  Of these, 32 have identical last words and are therefore probably the correct matches.  Overall there are 708 strong matches (i.e. with the same number of words and the same last word), including going through the multiples.  I would say it is probably safe to tick these 708 off.  However, the output of this script overlaps with the output of the previous one.  It is possible that most or even all of the matches identified by this script are already identified by the parent category match script, e.g. OED 43618 ‘shells’ is matched to HT 39522 ‘shells’ while it is also matched by the parent category match script.
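The ‘strong match’ test behind the green highlighting amounts to this; the ‘sortorder’ column and the table names are assumptions about the schema.

```python
# A match is 'strong' when both categories have the same number of
# words and the same last word, compared on the stripped form.
def count_and_last_word(conn, table, catid):
    count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE category_id = ?",
        (catid,)).fetchone()[0]
    row = conn.execute(
        f"SELECT stripped FROM {table} WHERE category_id = ? "
        "ORDER BY sortorder DESC LIMIT 1", (catid,)).fetchone()
    return count, (row[0] if row else None)

def is_strong_match(conn, ht_cat, oed_cat):
    return (count_and_last_word(conn, "ht_lexemes", ht_cat)
            == count_and_last_word(conn, "oed_lexemes", oed_cat))
```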

I also created a script that lists all matched maincats and gives a count of the total number of subcats in each (not differentiating between matched and unmatched subcats).  Note that for HT data I’ve used the full ‘T’ numbers of the maincat to find its subcats rather than using the ‘oedmaincat’ field.  I’ve highlighted the rows where the numbers of subcats in the HT and OED data don’t match.  Where there are more HT subcats than OED subcats the background colour is the green of the HT header.  Where there are more OED subcats than HT subcats the background colour is the blue of the OED header.
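Sketched out, the subcat-counting script does something like the following; the ‘tnum’ field for the HT ‘T’ numbers and the OED ‘parent_id’ are again guesses at the schema.

```python
# For each matched maincat pair, count the subcats on each side so
# mismatched rows can be highlighted in the output table.
def subcat_counts(conn):
    rows = []
    for ht_id, tnum, oed_id in conn.execute(
            "SELECT id, tnum, oed_catid FROM ht_categories "
            "WHERE is_subcat = 0 AND oed_catid IS NOT NULL").fetchall():
        # HT subcats share the maincat's full 'T' number.
        ht_subs = conn.execute(
            "SELECT COUNT(*) FROM ht_categories "
            "WHERE tnum = ? AND is_subcat = 1", (tnum,)).fetchone()[0]
        oed_subs = conn.execute(
            "SELECT COUNT(*) FROM oed_categories "
            "WHERE parent_id = ?", (oed_id,)).fetchone()[0]
        rows.append((ht_id, oed_id, ht_subs, oed_subs))
    return rows  # highlight rows where ht_subs != oed_subs
```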

The final script I created identifies gaps in the matched OED categories.  Currently the script orders the matched categories in the HT category table by the OED catid.  Where there is a gap between the previous OED catid and the current OED catid (e.g. OED catid 24 and 26) the script displays the HT and OED category information for the previous and next matched categories and then lists the unmatched OED categories that appear in the gap (a sketch of this gap-finding logic follows the list below).  However, this is complicated by two things:

  1. Quite often the gap in OED numbering is caused by OED categories that have no POS and will therefore never be matched. I’ve marked these in the output of the script with a bold ‘No POS’.
  2. The ‘next’ matched category is often of a different part of speech.  Where this happens we should be able to figure out whether the missing categories that have a POS belong with the ‘previous’ or the ‘next’ category, as their POS will likely match one or the other.
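The gap-finding pass itself is roughly as follows; table and column names are once again assumptions for the sake of the sketch.

```python
# Order the matched OED catids; wherever consecutive matched ids are
# not adjacent, list the unmatched OED categories in between, flagging
# any with no POS (these will never be matched).
def find_gaps(conn):
    matched = [r[0] for r in conn.execute(
        "SELECT oed_catid FROM ht_categories "
        "WHERE oed_catid IS NOT NULL ORDER BY oed_catid")]
    gaps = []
    for prev, nxt in zip(matched, matched[1:]):
        if nxt - prev <= 1:
            continue  # ids are adjacent, so no gap here
        missing = [(cid, heading, pos if pos else "No POS")
                   for cid, heading, pos in conn.execute(
                       "SELECT id, heading, pos FROM oed_categories "
                       "WHERE id > ? AND id < ? ORDER BY id",
                       (prev, nxt))]
        gaps.append((prev, nxt, missing))
    return gaps
```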

This will need further discussion when I meet with Marc and Fraser again next week.  My final HT task of the week was to set up a basic interface for the new ‘Thesaurus’ portal site that we’re going to launch.  It still needs a lot of work (and some content) but it’s beginning to take shape.