Week Beginning 17th August 2020

I lost most of Tuesday this week to root canal surgery, which was uncomfortable and exhausting but thankfully not too painful.  Unfortunately my teeth are still not right and I now have a further appointment booked for next week, but at least the severe toothache that I had previously has now stopped.

I continued to work on the requirements document for the redevelopment of the Anglo-Norman Dictionary this week, and managed to send a first completed version of it to Heather Pagan for feedback.  It will no doubt need some further work, but it’s good to have a clearer picture of how the new version of the website will function.  Also this week I investigated another bizarre situation with the AND’s data.  I have access to the full dataset that is used to power the existing website, as a single XML file containing all of the entries.  The editors are also working on individual entries as single XML files that are then uploaded to the existing website using a content management system.  What we hadn’t realised until now is that the structure of the XML is transformed when an entry is ingested into the online system.  For example, the ‘language’ tag is changed from <language lang="M.E."/> to <lbl type="lang">M.E.</lbl>.  Similarly, part of speech is transformed from <pos type="s."/> to <gramGrp><pos>s.</pos></gramGrp>.  We have no idea why the developer of the system chose to do this, as it seems completely unnecessary and the process doesn’t appear to be documented anywhere.  The crazy thing is that the transformed XML then needs to be transformed again into HTML for display, so what appears on screen is two steps removed from the data the editors work with.  It also means that I don’t have access to the data in the form the editors are working with, so I can’t just take their edits and use them in the new site.

As we ideally want to avoid a situation where we have two structurally different XML datasets for the dictionary, I wanted to find a way to transform the data I have into the structure used by the editors.  I attempted to do this by looking at the code for the existing content management system to try to decipher where the XML gets transformed.  There is an option for extracting an entry from the online system for offline editing, and this transforms the XML into the format used by the editors.  I figured that if I could understand how this process works and replicate it, I would be able to apply it to the full XML dictionary file, giving me the complete dataset in the same format the editors are working with, which we could then use in the redevelopment.

It was not easy to figure out what the system is up to, but I managed to ascertain that when you enter a headword for export this triggers a Perl script, which in turn uses an XSLT stylesheet.  I managed to track down a version of this stylesheet that appears to have been last updated in 2014.  I then wrote a little script that takes the XML of the entry for ‘padlock’ as found in the online data and applies the stylesheet to it, in the hope that it would give me an XML file identical to the one exported by the CMS.
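For illustration, here’s a rough Python sketch of this kind of test, applying an XSLT stylesheet to a single entry with lxml.  The filenames are placeholders rather than the actual files, and this is a sketch of the approach rather than the script itself:

from lxml import etree

def transform_entry(entry_path, stylesheet_path, output_path):
    # Parse the stylesheet and compile it into a transformer
    transform = etree.XSLT(etree.parse(stylesheet_path))
    # Parse the entry's XML and apply the transformation
    result = transform(etree.parse(entry_path))
    # Save the transformed entry so it can be compared with the CMS export
    with open(output_path, "wb") as out:
        out.write(etree.tostring(result, pretty_print=True,
                                 xml_declaration=True, encoding="UTF-8"))

# Placeholder filenames for the 'padlock' test
transform_entry("padlock.xml", "cms-export-2014.xsl", "padlock-transformed.xml")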

The script executed successfully, but the resulting XML was not quite identical to the file exported by the CMS.  There was no ‘doctype’ or DTD reference, the ‘attestation’ ID was the entry ID with an auto-incrementing ‘C’ number appended to it (AND-201-02592CE7-42F65840-3D2007C6-27706E3A-C001) rather than the ID of the <cit> element (C-11c4b015), and <dateInfo> was not processed, with only the contents of the tags within <dateInfo> being displayed.

I’m not sure why these differences exist.  It’s possible I only have access to an older version of the XSLT file.  I’m guessing this must be the case because the missing or differently formatted data does not appear to be instated elsewhere (e.g. in the Perl script).  I then modified the XSLT file to ensure that the changes are applied: the doctype is added in, the ‘attestation’ ID is correct and the <dateInfo> section contains the full data.

I could try applying this script to every entry in the full data file I have, although I suspect there will be other situations that the XSLT file I have is not set up to process successfully.

I therefore investigated another alternative: writing a script that passes the headword of every dictionary item to the ‘Retrieve an entry for editing’ script in the CMS and saves the result for each.  I considered that this might be more likely to work reliably for every entry, but that we might run into issues with the server refusing so many requests.  After a few test runs, I set the script loose on all 53,000 or so entries in the system and although it took several hours to run, the process did appear to work for the most part.  I now have the data in the same structure as the editors work with, which should mean we can standardise on this format and abandon the XML structure used by the existing online system.
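The bulk export boils down to something along these lines.  The endpoint URL and parameter name below are invented placeholders rather than the real CMS addresses, and the pause length is just a guess at what keeps the server happy:

import time
import requests

CMS_EXPORT_URL = "https://example.org/and-cms/retrieve-entry"  # placeholder, not the real CMS URL

def export_all_entries(headwords, out_dir="exported-entries", pause=0.5):
    for headword in headwords:
        # Ask the CMS 'Retrieve an entry for editing' script for this headword
        response = requests.get(CMS_EXPORT_URL, params={"headword": headword})
        if response.status_code == 200:
            with open(f"{out_dir}/{headword}.xml", "w", encoding="utf-8") as out:
                out.write(response.text)
        else:
            print(f"Failed to retrieve '{headword}': {response.status_code}")
        # Pause between requests so the server isn't flooded
        time.sleep(pause)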

Also this week I fixed an issue with links through to the Bosworth and Toller Old English dictionary from the Thesaurus of Old English.  Their site has been redeveloped and they’ve changed the way their URLs work without setting up redirects from the old URLs, meaning all our links from words in the TOE to words on their site were broken.  URLs for their entries now just use a unique ID rather than the word (e.g. http://bosworthtoller.com/28286), which seems like a bit of a step backwards.  They’ve also got rid of length marks and are using acute accents on characters instead, which is a bit strange.  The change to an ID in the URL means we can no longer link to a specific entry, as we can’t possibly know what IDs they’re using for each word.  However, we can link to their search results page instead, e.g. http://bosworthtoller.com/search?q=sōfte works, and I updated the TOE to use such links.
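Updating the links is then just a case of building a search URL from the word, making sure any non-ASCII characters are percent-encoded.  A quick Python sketch of the idea (the function name is just illustrative):

from urllib.parse import quote

def bosworth_toller_search_url(word):
    # Percent-encode the word so characters like 'ō' are safe in the URL
    return "http://bosworthtoller.com/search?q=" + quote(word)

print(bosworth_toller_search_url("sōfte"))
# http://bosworthtoller.com/search?q=s%C5%8Dfte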

I also continued with the processing of OED dates for use in the Historical Thesaurus, after my date extraction script finished executing over the weekend.  This week I investigated OED dates that have a dot in them instead of a full date.  There are 4,498 such dates and these mostly have the lower date as the one recorded in the ‘year’ attribute by the OED, e.g. ‘138.’ is 1380 and ‘17..’ is 1700.  However, sometimes a specific date is given in the ‘year’ attribute despite the presence of a full stop in the date tag.  For example, one entry has ‘1421’ in the ‘year’ attribute but ‘14..’ in the date tag.  There are just over a thousand dates where there are two dots but the ‘year’ given does not end in ‘00’.  Fraser reckons this is to do with ordering the dates in the OED and I’ll need to do some further work on this next week.
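If the dots do simply stand in for missing final digits, deriving the lower date is straightforward.  A small Python sketch of that assumption:

def lower_bound_year(date_text):
    # Treat each dot as a zero, so '138.' becomes 1380 and '17..' becomes 1700
    digits = date_text.strip().replace(".", "0")
    return int(digits) if digits.isdigit() else None

assert lower_bound_year("138.") == 1380
assert lower_bound_year("17..") == 1700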

In addition to the above I continued to work on the Books and Borrowing project.  I made some tweaks to the CMS to make it easier to edit records.  When a borrowing record is edited the page now automatically scrolls down to the record that was edited.  This also happens for books and borrowers when accessed and edited from the ‘Books’ and ‘Borrowers’ tabs in a library.  I also wrote an initial script that will help to merge some of the duplicate author records we have in the system due to existing data in different formats being uploaded from different libraries.  The script strips all of the non-alpha characters from the forename and surname fields, makes them lower case and then joins them together.  So for example, author ID (AID) 111 has ‘Arthur’ as forename and ‘Bedford’ as surname while AID 1896 has nothing for forename and ‘Bedford, Arthur, 1668-1745’ as surname.  When stripped and joined together these both become ‘bedfordarthur’ and we have a match.
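A simplified Python sketch of the matching logic (rather than the actual script):

import re

def author_match_key(forename, surname):
    # Join surname and forename, strip non-alphabetic characters and lower-case
    combined = (surname or "") + (forename or "")
    return re.sub(r"[^a-zA-Z]", "", combined).lower()

# AID 111 and AID 1896 both reduce to the same key
assert author_match_key("Arthur", "Bedford") == "bedfordarthur"
assert author_match_key("", "Bedford, Arthur, 1668-1745") == "bedfordarthur"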

There are 162 matches that have been identified, some consisting of more than two matched author records.  I exported these as a spreadsheet.  Each row includes the author’s AID, title, forename, surname, othername, born and died (the dates containing ‘c’ where given), a count of the number of books the record is associated with, and the AID of the record that is set to be retained for the match.  This defaults to the first record in each batch, which also appears in bold to make it easier to see where a new batch of duplicates begins.

The editors can then go through this spreadsheet and reassign the ‘AID to keep’ field to a different row.  E.g. for Francis Bacon the AID to keep is given as 1460.  If the second record for Francis Bacon should be kept instead, the editor would just need to change the value in this column for all three Francis Bacons to the AID for that row, which is 163.  Similarly, if something has been marked as a duplicate and it’s wrong, the ‘AID to keep’ can be set accordingly.  E.g. there are four ‘David Hume’ records, but looking at the dates at least one of these is a different person.  To keep the record with AID 1610 separate, replace the 1623 in its ‘AID to keep’ column with 1610.  It is likely that this spreadsheet will also be used to manually split up the imported authors that just have all their data in the surname column.  Someone could, for example, take the record that has ‘Hume, David, 1560?-1630?’ in the surname column and split this into the correct columns.

I also generated a spreadsheet containing all of the authors that appear to be unique.  This will also need checking for other duplicates that haven’t been picked up, as there are a few.  For example, AID 1956 ‘Heywood, Thomas, d. 1641’ and 1570 ‘Heywood, Thomas, -1641.’ haven’t been matched because of that ‘d’.  Similarly, AID 1598 ‘Buffon, George Louis Leclerc, comte de, 1707-1788’ and 2274 ‘Buffon, Georges Louis Leclerc, comte de, 1707-1788.’ haven’t been matched up because one is ‘George’ and the other ‘Georges’.  Accented characters have also not been properly matched, e.g. AID 1457 ‘Beze, Theodore de, 1519-1605’ and 397 ‘Bèze, Théodore de, 1519-1605.’.  I could add in a Levenshtein test that matches up records that are one character different and update the script to properly take accented characters into account for matching purposes, or these are things that could just be sorted manually.
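Both of those refinements are simple enough to sketch.  Something like the following would fold accents to their base characters and treat keys within one edit of each other as matches (the function names here are just illustrative):

import unicodedata

def fold_accents(text):
    # Decompose accented characters and drop the combining marks,
    # so 'Bèze, Théodore' compares equal to 'Beze, Theodore'
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def edit_distance(a, b):
    # Standard Levenshtein distance between two strings
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb))) # substitution
        previous = current
    return previous[-1]

# The two Heywood keys differ by a single character, so would now match
assert edit_distance("heywoodthomasd", "heywoodthomas") == 1
assert fold_accents("Bèze, Théodore de") == "Beze, Theodore de"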

Ann Fergusson of the DSL got back to me this week after having rigorously tested the search facilities of our new test versions of the DSL API (V2 containing data from the original API and V3 containing data that has been edited since the original API was made).  Ann had spotted some unexpected behaviour in some of the searches and I spent some time investigating these and fixing things where possible.  There were some cases where incorrect results were being returned when a ‘NOT’ search was performed on a selected source dictionary, due to the positioning of the source dictionary in the query.  This was thankfully easy to fix.  There was also an issue with some exact searches of the full text failing to find entries.  When the full text is ingested into Solr all of the XML tags are stripped out, and if there are no spaces between tagged words then words end up squashed together.  For example: ‘Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>’.  With the tags (and punctuation) stripped out we’re left with ‘WestminsterB’, so an exact search for ‘westminster’ fails to find this entry.  A search for ‘westminsterb’ finds the entry, which confirms this.  I suspect this situation is going to crop up quite a lot, so I will need to update the script that prepares content for Solr to add spaces after tags before stripping them and then to remove multiple spaces between words.
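The planned change is roughly this, shown as a Python sketch of the spacing and stripping step rather than the actual indexing script (which also removes punctuation):

import re

def xml_to_plain_text(xml_fragment):
    # Add a space after every closing bracket so tags can't glue words together
    spaced = xml_fragment.replace(">", "> ")
    # Strip the tags themselves
    no_tags = re.sub(r"<[^>]+>", " ", spaced)
    # Collapse any runs of whitespace left behind
    return re.sub(r"\s+", " ", no_tags).strip()

sample = "Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>"
print(xml_to_plain_text(sample))  # 'Westminster B . Attrib' rather than 'WestminsterB. Attrib'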