Week Beginning 14th September 2020

This was another busy week involving lots of projects.  For the Books and Borrowing project I wrote an import script to process the Glasgow Professors borrowing records, comprising more than 7,000 rows in a spreadsheet.  It was tricky to integrate this with the rest of the project’s data and it took about a day to write the necessary processing scripts.  I can only run the scripts on the real data in the evening, as I need to take the CMS offline to do so; otherwise changes made to the database whilst I’m integrating the data would be lost.  Unfortunately it took three attempts to get the import to work properly.

There are a few reasons why this data was particularly tricky.  Firstly, it needed to be integrated with the existing Glasgow data, rather than being a ‘fresh’ upload to a new library.  This caused some problems, as my scripts that match up borrowing records and borrowers were getting confused with the existing Student borrowers.  Secondly, the spreadsheet was not in page order for each register – the order appears to have been ‘10r’, ‘10v’, then ‘11r’ and so on, with ‘1r’ coming after ‘19v’.  This is presumably to do with Excel ordering the numbers as text.  I tried reordering on the ‘sort order’ column but this also ordered things strangely (all the numbers beginning with 1, then all the numbers beginning with 2 and so on).  I tried changing the data type of this field to a number rather than text, but that just resulted in Excel giving errors in all of the fields.  What this meant was that I needed to sort the data in my own script before I could use it (otherwise the ‘next’ and ‘previous’ page links would all have been wrong), and it took time to implement this.  However, I got there in the end.
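To give an idea of the kind of sorting required, here is a minimal sketch in JavaScript (not the actual import script, and with a hypothetical ‘folio’ field) of a comparator that sorts folio references like ‘10r’ in proper page order rather than as text:

// Hypothetical rows with the folio reference stored as text, as in the spreadsheet
const rows = [
  { folio: '10r' }, { folio: '10v' }, { folio: '11r' },
  { folio: '19v' }, { folio: '1r' }, { folio: '1v' }
];

// Split each folio reference into its numeric part and recto/verso letter,
// then compare the numbers numerically and the letters alphabetically
function folioCompare(a, b) {
  const parse = (row) => {
    const m = row.folio.match(/^(\d+)([rv]?)$/);
    return { num: parseInt(m[1], 10), side: m[2] };
  };
  const pa = parse(a), pb = parse(b);
  return pa.num - pb.num || pa.side.localeCompare(pb.side);
}

rows.sort(folioCompare);
// Result: 1r, 1v, 10r, 10v, 11r, 19v – page order, not Excel's text order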

I also continued working on the Historical Thesaurus database and front-end to allow us to use the new date fields and to keep track of which lexemes and categories have been updated in the new Second Edition.  I have now fully migrated my second edition test site to the new date system, including the advanced search for labels and both the ‘simple’ and ‘advanced’ date searches.  I have also created the database structure for dealing with second edition updates.  As we agreed at the last team meeting, the lexeme and category tables have each been given two new fields: ‘last_updated’, which holds a human-readable date (YYYY-MM-DD) that is automatically populated when a row is updated, and ‘changelogcode’, which holds the ID of the row in the new ‘changelog’ table that applies to the lexeme or category.  This new table consists of an ID, a ‘type’ (lexeme or category) and the text of the changelog.  I’ve created two changelogs for test purposes: ‘This word was antedated in the second edition’ and ‘This word was postdated in the second edition’.  I’ve realised that this structure means only one changelog can be associated with a lexeme, with a new one overwriting the old one.  A more robust system would record all of the changelogs that have been applied to a lexeme or category and the dates on which they were applied, and depending on what Marc and Fraser think I may update the system with an extra joining table that would allow this paper trail to be recorded.
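To illustrate the difference between the two structures in JavaScript terms (the field values here are hypothetical; the real data lives in the database tables described above):

// Current structure: one changelog per lexeme, overwritten on each change
const lexeme = { id: 123, changelogcode: 1, last_updated: '2020-09-14' };

// Possible joining-table structure: every changelog ever applied is recorded,
// along with the date it was applied, preserving the full paper trail
const lexemeChangelogs = [
  { lexeme_id: 123, changelog_id: 1, applied: '2020-09-14' },
  { lexeme_id: 123, changelog_id: 2, applied: '2020-10-01' }
];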

For now I’ve updated two lexemes in category 1 to use the two changelogs for test purposes.  I’ve updated the category browser in the front end to add in a ‘2’ in a circle wherever a ‘second edition’ changelog ID is present.  These have tooltips that display the changelog text when hovered over, as the following screenshot demonstrates:

I haven’t added these circles to the search results or the full timeline visualisations yet, but it is likely that they will need to appear there too.
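For anyone curious how the circles are wired up, the logic is roughly as follows (a simplified sketch rather than the actual front-end code; the data shape and class name are my own inventions here):

// Hypothetical lexeme data, with the changelog text already looked up
// from the 'changelog' table via the changelogcode field
const lexemes = [
  { word: 'example', changelog: 'This word was antedated in the second edition' },
  { word: 'another', changelog: null }
];

// Append a circled '2' with a hover tooltip wherever a changelog is present
lexemes.forEach((lex) => {
  const item = document.createElement('span');
  item.textContent = lex.word;
  if (lex.changelog) {
    const badge = document.createElement('sup');
    badge.textContent = '2';
    badge.className = 'second-edition-badge'; // hypothetical class for the circle styling
    badge.title = lex.changelog;              // shown as a tooltip on hover
    item.appendChild(badge);
  }
  document.body.appendChild(item);
});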

I also spent some time working on a new script for Fraser’s Scots Thesaurus project.  This script allows a user to select an HT category and bring back all of the words contained in it.  It then queries the DSL for each of these words and returns a list of the entries that contain at least two of the category’s words somewhere in their text.  The script outputs the name of the category that was searched, a list of the returned HT words so you can see exactly what is being searched for, and the DSL entries featuring at least two of the words, in a table with fields such as the source dictionary, part of speech, a link through to the DSL entry and the headword.  I may have to tweak this further next week, but it seems to be working pretty well.
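The core of the matching boils down to something like this sketch (with made-up data; the real script queries the DSL rather than an in-memory array):

// Words from the selected HT category and candidate DSL entries (made-up data)
const words = ['bannock', 'scone', 'girdle'];
const entries = [
  { headword: 'bannock', text: 'A kind of cake baked on a girdle, thicker than a scone.' },
  { headword: 'kebbuck', text: 'A whole cheese.' }
];

// Keep only those entries whose text contains at least two of the category's words
const matches = entries.filter((entry) => {
  const text = entry.text.toLowerCase();
  const hits = words.filter((w) => text.includes(w.toLowerCase()));
  return hits.length >= 2;
});
// Here only the 'bannock' entry qualifies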

I spent most of the rest of the week working on the redevelopment of the Anglo-Norman Dictionary.  We had a bit of a shock at the start of the week because the entire old site was offline and inaccessible.  It turned out that the domain name subscription had expired, and thankfully it was possible to renew it and the site became available again.  I spent a lot of time this week continuing to work on the entry page, trying to untangle the existing XSLT script and work out how to apply the necessary rules to the editors’ version of the XML, which differs from the system version of the XML that was generated by an incomprehensible and undocumented series of processes in the old system.

I started off with the references located within the variant forms.  In the existing site these link through to source texts, with information appearing in a pop-up when the reference is clicked on.  To get these working I needed to figure out where the list of source texts was stored and also how to make the references appear properly.  The Editors’ XML and the System XML differ in structure, and only the System XML actually contains the text that appears as the link.  So, for example, while the System XML has:

<cit> <bibl siglum="Secr_waterford1" loc="94.787"><i>Secr</i> <sc>waterford</sc><sup>1</sup> 94.787</bibl> </cit>

The Editors’ XML only has:

<varref> <reference><source siglum="Secr_waterford1" target=""><loc>94.787</loc></source></reference> </varref>

This means that the text to display and its formatting (<i>Secr</i> <sc>waterford</sc><sup>1</sup>) are not available to me.  Thankfully I managed to track down an XML file containing the list of texts, which includes this formatting and also all of the information that should appear in the pop-up that opens when the link is clicked, e.g.:

<item id="Secr_waterford1" cits="552">
  <siglum><i>Secr</i> <span class="sc">WATERFORD</span><sup>1</sup></siglum>
  <bibl>Yela Schauwecker, <i>Die Diätetik nach dem ‘Secretum secretorum’ in der Version von Jofroi de Waterford: Teiledition und lexikalische Untersuchung</i>, Würzburger medizinhistorische Forschungen 92, Würzburg, 2007</bibl>
  <date>c.1300 (text and MS)</date>
  <deaf><a href="/cgi-bin/deaf?siglum=SecrSecrPr2S">SecrSecrPr<sup>2</sup>H</a></deaf>
  <and><a href="/cgi-bin-s/and-lookup-siglum?term='Secr_waterford1'">citations</a></and>
</item>

I imported all of the data from this XML file into the new system and created new endpoints in the API to access it.  There is one that brings back the slug (the text used in the URL) and the ‘siglum’ text of all sigla, and another that brings back all of the information about a given siglum.  I then updated the ‘Entry’ page to add in the references in the forms, using a bit of JavaScript to grab the ‘all sigla’ output of the API and add the full siglum text in for each reference as required.  You can also now click on a reference to open a pop-up that contains the full information.
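The JavaScript side of this works roughly as follows (a sketch only – the endpoint URL, response shape and data attribute are assumptions rather than the finished API):

// Fetch the 'all sigla' endpoint: assumed to return an array of
// { slug, siglum } objects, where 'siglum' holds the formatted HTML
fetch('/api/sigla')
  .then((response) => response.json())
  .then((sigla) => {
    // Index the sigla by slug for quick lookup
    const bySlug = new Map(sigla.map((s) => [s.slug, s.siglum]));
    // Assume each reference link carries its siglum slug in a data attribute
    document.querySelectorAll('a[data-siglum]').forEach((link) => {
      const html = bySlug.get(link.dataset.siglum);
      if (html) link.innerHTML = html; // e.g. <i>Secr</i> <span class="sc">WATERFORD</span><sup>1</sup>
    });
  });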

I then turned my attention to the cognate references section, where there were also some issues with the Editors’ XML not including information that is in the System XML.  The structure of the cognate references in the System XML is like this:

<xr_group type="cognate" linkable="yes"> <xr><ref siglum="FEW" target="90/page/231" loc="9,231b">posse</ref></xr> </xr_group>

Note that there is a ‘target’ attribute that provides a link.  The Editors’ XML does not include this information – here’s the same reference:

<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>

There’s nothing in there that I can use to ascertain the correct link to add in.  However, I found a ‘hash’ file called ‘cognate_hash’ which, when extracted, contains a list of cognate references and targets.  These don’t include entry identifiers, so I’m not sure how they were connected to entries, but by combining the ‘siglum’ and the ‘loc’ it looks like it might be possible to find the target, e.g.:

<xr_group type="cognate" linkable="yes">
  <xr>
    <ref siglum="FEW" target="90/page/231" loc="*9,231b">posse</ref>
  </xr>
</xr_group>

I’m not sure why there’s an asterisk, though.  I also found another hash file called ‘commentary_hash’ which I guess contains the commentaries that appear in some entries but not in their XML.  We’ll probably need to figure out whether we want to properly integrate these with the Editors’ XML as well.
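If the siglum-plus-loc approach works out, the lookup could be as simple as this sketch (the record shape is an assumption based on the extracted hash file):

// Hypothetical records extracted from the 'cognate_hash' file
const hashEntries = [
  { siglum: 'FEW', loc: '*9,231b', target: '90/page/231' }
];

// Key the targets on siglum plus loc, stripping the stray leading asterisk
const targets = new Map(
  hashEntries.map((e) => [e.siglum + '|' + e.loc.replace(/^\*/, ''), e.target])
);

// Look up the target for a reference taken from the Editors' XML
console.log(targets.get('FEW|9,231b')); // '90/page/231'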

I completed work on the ‘cognate references’ section, omitting the links out for now (I’ll add these in later), and then moved on to the ‘summary’ box that contains links through to lower sections of the entry.  Unfortunately the ‘sense’ numbers are something else that is not present in any form in the Editors’ XML.  In the System XML each sense has a number, e.g. ‘<sense n="1">’, but in the Editors’ XML there is no such number.  I spent quite a bit of time trying to increment a number in XSLT and apply it to each sense, but it turns out that XSLT variables are immutable, so you can’t increment a counter in the way that would be trivial inside a ‘for’ loop in other languages.

For now the numbers all just say ‘n’.  Again, I’m not sure how best to tackle this.  I could update the Editors’ XML to add in the numbers, meaning I would then have to edit the XML the editors are working on whenever they are ready to send it to me.  Or I could dynamically add in the numbers to the XML when the API outputs it.  Or I could dynamically add in the numbers in the client’s browser using JavaScript (see the sketch below).  I’m also still in the process of working on the senses, subsenses and locutions.  A lot of the information is now being displayed, for example the part of speech, the number (again, just ‘n’ for now), the translations, attestations, references, variants, legiturs, edglosses, locutions and xrefs.
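The third option would only be a few lines of JavaScript – something like this sketch, assuming the transformed entry HTML marks each placeholder ‘n’ with a (hypothetical) ‘sense-number’ class:

// Replace each placeholder 'n' with its position in the entry, counting from 1
document.querySelectorAll('.sense-number').forEach((el, index) => {
  el.textContent = index + 1;
});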

I still need to add in the non-locution xrefs, labels and some other things, but overall I’m very happy with the progress I’ve made this week.  Below is an example of an entry in the old site, followed by the same entry as it currently looks in the new test site I’m working on (be aware that the new interface is only a placeholder).  Before:

After: