Week Beginning 5th October 2020

This was a four-day week for me as I’d taken Friday off as it was an in-service day at my son’s school before next week’s half-term, which I’ve also taken off.  I had rather a lot to try and get done before my holiday so it was a pretty intense week, split mostly between the Historical Thesaurus and the Anglo-Norman Dictionary.

For the Historical Thesaurus I continued with the preparations for the second edition, starting off by creating a little statistics page that lists all of the words and categories that have been updated for the second edition and the changelog codes that have been applied to them.  Marc had sent a list of all of the category number sequences that we have updated so I then spent a bit of time updating the database to apply the changelog codes to all of these categories.  It turns out that almost 200,000 categories have been revised and relocated (out of about 235,000) so it’s pretty much everything.  At our meeting last week we had proposed updating the ‘new OED words’ script I’d written to separate out some potential candidates into an eighth spreadsheet (these are words that have a slash in them, which now get split up on the slash, with each part compared against the HT’s search words table to see whether it already exists).  Whilst working through some of the other tasks I realised that I hadn’t included the unique identifiers for OED lexemes in the output, which was going to make it a bit difficult to work with the files programmatically, especially since there are some occasions where the OED has two identical lexemes in a category.  I therefore updated my script and regenerated the output to include the lexeme ID, making it possible to differentiate identical lexemes and also making it easier to grab dates for the lexeme in question.
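The slash check itself is simple enough to sketch.  Below is a minimal illustration in Python of the rule as described above, not the actual script; ‘ht_search_words’ stands in for the HT’s search words table:

# Minimal sketch of the slash-splitting check, not the actual script;
# ht_search_words stands in for the HT's search words table, assumed
# here to be a set of lowercase word forms.
def slash_parts_in_ht(lexeme, ht_search_words):
    """Split a slashed OED lexeme on '/' and return the parts that
    already exist among the HT search words."""
    parts = [p.strip() for p in lexeme.split('/') if p.strip()]
    return [p for p in parts if p.lower() in ht_search_words]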

The issue of there being multiple identical lexemes in an OED category was a strange one.  For example, one category had two ‘Amber pudding’ lexemes.  I wrote a script that extracted all of these duplicates (there are possibly a hundred or so of them), along with other OED lexemes that appear to have no associated dates, and I passed these over to Marc and Fraser for them to have a look at.  After that I worked on a script to go through each of the almost 12,000 lexemes that we have identified as OED lexemes that are definitely not present in the HT data, extract their OED dates and then format these as HT dates.
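The duplicate check is straightforward to illustrate.  Here is a hedged sketch in Python, assuming each row carries a category ID, the word form and the lexeme’s unique ID (the column layout is illustrative, not the real schema):

from collections import Counter

# Sketch of the duplicate check: any (category, word) pair occurring
# more than once is a duplicate, like the two 'Amber pudding' lexemes.
def find_duplicates(rows):
    """rows: iterable of (category_id, word, lexeme_id) tuples."""
    counts = Counter((cat, word) for cat, word, _ in rows)
    return [key for key, n in counts.items() if n > 1]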

The script generates date entries as they would be added to the HT lexeme dates table (used for timelines), the HT fulldate field (used for display) and the HT firstdate and lastdate fields (used for searching).  Dates earlier than 1150 are stored as their actual values in the dates table, but are stored as ‘650’ in the ‘firstdate’ field and are displayed as ‘OE’ in the ‘fulldate’.  Dates after 1945 are stored as ‘9999’ in both the dates table and the ‘lastdate’ field.  Where there is a ‘yearend’ in the OED date (i.e. the date is a range) this is stored as the ‘year_b’ in the HT date and appears after a slash in the ‘fulldate’, following the rules for slashes.  If the date is the last date then the ‘year_b’ is used as the HT lastdate.  If the ‘year_b’ is after 1945 but the ‘year’ isn’t then ‘9999’ is used.  So for example ‘maiden-skate’ has a last date of ‘1880/1884’, which appears in the ‘fulldate’ as ‘1880/4’ and the ‘lastdate’ is ‘1884’.  Where there is a gap of more than 150 years between dates the connector between them is a dash, and where the gap is less than this it is a plus.

One thing that needed further work was how we handle multiple post-1945 dates.  In my initial script if there are multiple post-1945 dates then only one of these is carried over as an HT date, and it’s set to ‘9999’.  This is because all post-1945 dates are stored as ‘9999’ and having several of these didn’t seem to make sense and confused the generation of the fulldate.  There was also an issue with some OED lexemes only having dates after 1945.  In my first version of the script these ended up with only one HT date entry of 9999 and 9999 as both firstdate and lastdate, and a fulldate consisting of just a dash, which was not right.  After further discussion with Marc I updated the script so that in such cases the date information that is carried over is the first date (even if it’s after 1945) and a dash to show that it is current.  For example, ‘ecoregion’ previously had a ‘fulldate’ of ‘-‘, one HT date of ‘9999’ and a start date of ‘9999’, and in the updated output has a fulldate of ‘1962-‘, two HT dates and a start date of 1962.  Where a lexeme has a single date this also now has a specific end date rather than it being ‘9999’.  I passed the output of the script over to Marc and Fraser for them to work with whilst I was on holiday.
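The slash and connector rules are compact enough to show in code.  The following is an illustrative reconstruction in Python of the rules as described above, not the actual import script:

# Illustrative reconstruction of the date rules described above,
# not the actual HT import script.

def slash_form(year, year_b):
    """Abbreviate a range for the 'fulldate', e.g. 1880/1884 -> '1880/4',
    by dropping the leading digits that year_b shares with year."""
    a, b = str(year), str(year_b)
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a + '/' + b[i:]

def connector(prev_year, next_year):
    """A gap of more than 150 years gets a dash, anything shorter a plus."""
    return '-' if next_year - prev_year > 150 else '+'

def display_year(year):
    """Pre-1150 dates display as 'OE'; post-1945 dates are stored as 9999."""
    return 'OE' if year < 1150 else str(year)

print(slash_form(1880, 1884))  # '1880/4', as in the 'maiden-skate' example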

For the Anglo-Norman Dictionary I continued to work on the entry page.  I added in the cognate references (i.e. references to other dictionaries), which proved to be rather tricky due to the way they have been structured in the Editors’ XML files (in the current live site the cognate references are stored in a separate hash file and are somehow injected into the entry page when it is generated, but we wanted to rationalise this so that the data that appears on the site is all contained in the Editors’ XML where possible).  The main issue was with how the links to other online dictionaries were stored, as it was not entirely clear how to generate actual links to specific pages in these resources from them.  This was especially true for links to FEW (I have no idea what FEW stands for as the acronym doesn’t appear to be expanded anywhere, even on the FEW website).

They appear in the Editors’ XML like this:

<FEW_refs siglum="FEW" linkable="yes"><link_form>A</link_form><link_loc>24,1a</link_loc></FEW_refs>

Which ends up linking to here:

https://apps.atilf.fr/lecteurFEW/lire/volume/240/page/1

And this:

<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>

Which ends up linking to here:

https://apps.atilf.fr/lecteurFEW/lire/volume/90/page/231

Based on this my script for generating links needed to:

  1. Store the base URL https://apps.atilf.fr/lecteurFEW/lire/volume
  2. Split the <link_loc> on the comma
  3. Multiply the part before the comma by 10 (so 24 becomes 240, 9 becomes 90 etc.)
  4. Strip out any non-numeric characters from the part after the comma (i.e. getting rid of ‘a’ and ‘b’)
  5. Generate the full URL, such as https://apps.atilf.fr/lecteurFEW/lire/volume/240/page/1, using these two values.

After discussion with Heather and Geert at the AND it turned out to be even more complicated than this, as some of the references are further split into subvolumes using a slash and a Roman numeral, so we have things like ‘15/i,108b’ which then needs to link to https://apps.atilf.fr/lecteurFEW/lire/volume/151/page/108.  It took some time to write a script that could cope with all of these quirks, but I got there in the end.
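For anyone curious, here is a sketch in Python of the complete link-generation logic, including the subvolume handling; the Roman-numeral mapping is an assumption inferred from the ‘24’ and ‘15/i’ examples rather than anything documented:

import re

BASE = 'https://apps.atilf.fr/lecteurFEW/lire/volume'

# Assumed mapping from Roman-numeral subvolumes to digits, inferred
# from '24' -> 240 and '15/i' -> 151; not officially documented.
ROMAN = {'i': 1, 'ii': 2, 'iii': 3, 'iv': 4}

def few_url(link_loc):
    """Turn a <link_loc> value such as '24,1a' or '15/i,108b' into a URL."""
    vol_part, page_part = link_loc.split(',', 1)
    if '/' in vol_part:
        vol, sub = vol_part.split('/', 1)
        volume = int(vol) * 10 + ROMAN[sub.lower()]
    else:
        volume = int(vol_part) * 10
    page = re.sub(r'\D', '', page_part)  # strip 'a'/'b' etc.
    return '%s/%s/page/%s' % (BASE, volume, page)

print(few_url('24,1a'))      # .../volume/240/page/1
print(few_url('9,231b'))     # .../volume/90/page/231
print(few_url('15/i,108b'))  # .../volume/151/page/108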

Also this week I updated the citation dates so they now display their full information, with ‘MS:’ where required and superscript text.  I then finished work on the commentaries, adding in all of the required formatting (bold, italic, superscript etc.) and links to other AND entries and out to other dictionaries.  Where the commentaries are longer than a few lines they are cut off and an ‘expand’ button is shown.  I also updated the ‘Results’ tab so it shows you the number of results in the tab header, and added in the ‘entry log’ feature that tracks which entries you have looked at in a session.  The number of these also appears in the tab header, and I’m personally finding it a very useful feature as I navigate around the entries for test purposes.  The log entries appear in the order you opened them and there is no scrolling of entries, as I would imagine most people are unlikely to have more than 20 or so listed.  You can always clear the log by pressing the ‘Clear’ button.  I also updated the entry page so that the cross references in the ‘browse’ now work.  If the entry has a single cross reference then this is automatically displayed when you click on its headword in the ‘browse’, with a note at the top of the page stating it’s a cross reference.  If the entry has multiple cross references these are not all displayed; instead, links to each entry are shown.  There are two reasons for this: firstly, displaying multiple entries can result in long and complicated pages that may be hard to navigate; secondly, the entry page as it currently stands was designed to display one entry, and uses HTML IDs to identify certain elements.  An HTML ID must be unique on a page, so if multiple entries were displayed things would break.  There is still a lot of work to do on the site, but the entry page is at least nearing completion.  Below is a screenshot showing the entry log, the cognate references and the commentary complete with formatting and the ‘Expand’ option:

I also worked on some other projects this week.  For Books and Borrowing I set up a user account for a volunteer and talked her through getting access to the system.  For the Mull / Ulva site I automatically generated historical forms for all of the place-names that had come from the GB1900 crowdsourced data.  These are now associated with the ‘OS 6 inch 2nd edn’ source and about 1670 names have been updated, although many of these are abbreviations like ‘F.P.’.  I also updated the database and the CMS to incorporate a new field for deciding which ‘front-end’ the place-name will be displayed on.  This is a drop-down list that can be selected when adding or editing a place-name, allowing you to choose from ‘Both’, ‘Mull / Ulva only’ and ‘Iona only’.  There is still a further option for stating whether the place-name appears on the website or not (‘on website: Y/N’) so it will be possible to state that a place-name is associated with one project but shouldn’t appear on that project’s website.  I also updated the search option on the ‘Browse placenames’ page to allow a user to limit the displayed place-names to those that have ‘front-end display’ set to one of the options.  Currently all place-names are set to ‘Mull / Ulva only’.  With this all in place I then created user accounts for the CMS for all of the members of the Iona project team who will be using this CMS to work with the data.  I also made a few further tweaks to the search results page of the DSL.  After all of this I was very glad to get away for a holiday.