Week Beginning 26th April 2021

I continued with the import of new data for the Dictionary of the Scots Language this week.  Raymond at Arts IT Support had set up a new collection and imported the full-text search data into the Solr server, and I tested this out via the new front-end I’d configured to work with the new data source.  I then began working on the import of the bibliographical data, but noticed that the file exported from the DSL’s new editing system didn’t feature an attribute denoting which source dictionary each record is from.  We need this as the bibliography search allows users to limit their search to DOST or SND.  The new IDs all start with ‘bib’ no matter what the source is.  I had thought I could use the ‘oldid’ to extract the source (db = DOST, sb = SND) but I realised there are also composite records where the ‘oldid’ is something like ‘a200’.  In such cases I didn’t think I had any data that I could use to distinguish between DOST and SND records.  The person in charge of exporting the data from the new editing system very helpfully agreed to add a ‘source dictionary’ attribute to all bibliographical records and sent me an updated version of the XML file.  Whilst working with the data I realised that all of the composite records are DOST records anyway, so I didn’t strictly need the ‘sourceDict’ attribute, but I think it’s better to have this explicitly as an attribute as differentiating between dictionaries is important.
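To illustrate the choice between the two approaches, here is a minimal Python sketch of how a source dictionary might be resolved for a bibliography record.  It is not the actual import script: the function name and its parameters are hypothetical, and the only details taken from the data are the ‘db’ and ‘sb’ prefixes and the ‘a200’ style of composite ID.

```python
from typing import Optional

# Prefix-to-dictionary mapping as described above (db = DOST, sb = SND).
PREFIX_TO_SOURCE = {"db": "DOST", "sb": "SND"}


def source_dictionary(old_id: str, source_dict: Optional[str] = None) -> Optional[str]:
    """Return the source dictionary for a bibliography record.

    An explicit source attribute (hypothetical parameter `source_dict`)
    always wins. Otherwise fall back to the first two characters of the
    old ID, returning None when the prefix is not recognised, as happens
    with composite IDs such as 'a200'.
    """
    if source_dict:
        return source_dict
    return PREFIX_TO_SOURCE.get(old_id[:2].lower())
```

With the prefix alone a composite ID such as ‘a200’ cannot be resolved, which is why an explicit attribute is the safer option.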

I imported all of the bibliographical records into the online system, including the composite ones as these are linked to from dictionary entries and are therefore needed, even though their individual parts are also found separately in the data.  However, I decided to exclude the composite records from the search facilities, otherwise we’d end up with duplicates in the search results.  I updated the API to use the new bibliography tables and I updated the new front-end so that bibliographical searches use the new data.  One thing that needs some further work is the display of individual bibliographies.  These are now generated from the bibliography XML via an XSLT whereas previously they were generated from a variety of different fields in the database.   The display doesn’t completely match up with the display on the live and Sienna versions of the bibliography pages and I’m not sure exactly how the editors would like entries to be displayed.  I’ll need further input from them on this matter, but the import of data from the new editing system has now been completed successfully.  I’d been documenting the process as I worked through it and I sent the documentation and all scripts I wrote to handle the workflow to the editors to be stored for future use.

I also worked on the Books and Borrowing project this week.  I received the last of the digitised images of borrowing registers from Edinburgh (other than one register which needs conservation work), and I uploaded these to the project’s content management system, creating all of the necessary page records.  We have a total of 9,992 page images as JPEG files from Edinburgh, totalling 105GB.  Thank goodness we managed to set up an IIIF server for the image files rather than having to generate and store image tilesets for each of these page images.  Also this week I uploaded the images for 14 borrowing registers from St Andrews and generated page records for each of these.

I had a further conversation with GIS expert Piet Gerrits for the Iona project and made a couple of tweaks to the Comparative Kingship content management systems, but other than that I spent the remainder of the week returning to the Anglo-Norman Dictionary, which I hadn’t worked on since before Easter.  To start with I went back through old emails and documents and wrote a new ‘to do’ list containing all of the outstanding tasks for the project, some 20 items of varying degrees of size and intricacy.  After some communication with the editors I began tackling some of the issues, beginning with the apparent disappearance of <note> tags from certain entries.

In the original editor’s XML (the XML as structured before it was uploaded into the old DMS) there were ‘edGloss’ notes tagged as ‘<note type="edgloss" place="inline">’ that were migrated to <edGloss> elements during whatever processing happened with the old DMS.  However, there were also occasionally notes tagged as ‘<note place="inline">’ that didn’t get transformed and remained tagged as this.

I’m not entirely sure how or where, but at some point during my processing of the data these ‘<note place="inline">’ notes have been lost.  It’s very strange as the new DMS import script is based entirely on the scripts I wrote to process the old DMS XML entries.  I tested the DMS import by uploading the old DMS XML version of ‘poer_1’ to the new DMS and the ‘<note place="inline">’ notes were retained, yet in the live entry for ‘poer_1’ the <note> text is missing.

I searched the database for all entries where the DMS XML as exported from the old DMS system contains the text ‘<note place="inline">’ and there are 323 entries, which I added to a spreadsheet and sent to the editors.  It’s likely that the new XML for these entries will need to be manually corrected to reinstate the missing <note> elements.  Some entries (as with ‘poer_1’) have several of these.  I still have the old DMS XML for these so it is at least possible to recover the missing tags.  I wish I could identify exactly when and how the tags were removed, but that would quite likely require many hours of investigation, as I already spent a couple of hours trying to get to the bottom of the issue without success.
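A check of this kind could also be run against pairs of old and new XML to see how many notes each entry has lost.  The following Python sketch is an illustration only and not the search I actually ran (that was a plain text match in the database).  The function names and the sample <entry> and <sense> wrappers are hypothetical; only the <note> element and its ‘place’ and ‘type’ attributes come from the data.

```python
import xml.etree.ElementTree as ET


def count_untyped_inline_notes(entry_xml: str) -> int:
    """Count <note place="inline"> elements that carry no 'type' attribute."""
    root = ET.fromstring(entry_xml)
    return sum(
        1
        for note in root.iter("note")
        if note.get("place") == "inline" and note.get("type") is None
    )


def lost_note_count(old_xml: str, new_xml: str) -> int:
    """Return how many untyped inline notes are in the old XML but not the new."""
    lost = count_untyped_inline_notes(old_xml) - count_untyped_inline_notes(new_xml)
    return max(0, lost)


# Hypothetical sample data, not taken from a real entry.
old_entry = (
    '<entry><sense>'
    '<note place="inline">kept by hand</note>'
    '<note type="edgloss" place="inline">editorial gloss</note>'
    '</sense></entry>'
)
new_entry = '<entry><sense/></entry>'
print(lost_note_count(old_entry, new_entry))
```

Comparing counts only shows that notes are missing from an entry, not where they belong, so reinstating them would still be a manual task.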

Moving on to a different issue, I changed the upload scripts so that the ‘n’ numbers are always fully regenerated automatically when a file is uploaded, as previously there were issues when an entry included a mixture of senses with and without ‘n’ numbers.  This means that any existing ‘n’ values are replaced, so it’s no longer possible to manually set the ‘n’ value.  Instead ‘n’ values for senses within a POS will always increment from 1 depending on the order they appear in the file, with ‘n’ being reset to 1 whenever a new POS is encountered.

Main senses in locutions were not being assigned an ‘n’ on upload, and I changed this so that they are assigned an ‘n’ in exactly the same way as regular main senses.  I tested this with the ‘descendre’ entry and it worked, although I encountered an issue.  The final locution main sense (to descend to (by way of inheritance)) had a POS of ‘sbst._inf.’ in its <senseInfo> whereas it should have been (based on the POS of the previous two senses) ‘sbst. Inf.’.  The script was therefore considering this to be a new POS and gave the sense an ‘n’ of 1.  In my test file I updated the POS and re-uploaded the file and the sense was assigned the correct value of 3 to its ‘n’, but we’ll need to investigate why a different form of POS was recorded for this sense.
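The renumbering rule, and the effect of the mismatched POS, can be sketched in Python as follows.  This is a simplified illustration rather than the upload script itself: it assumes each sense is a dictionary with a ‘pos’ key, which is a hypothetical structure, and it treats a POS as ‘new’ whenever it differs from the POS of the preceding sense.

```python
from typing import Dict, List


def renumber_senses(senses: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Regenerate 'n' for every sense, discarding any existing value.

    'n' counts up from 1 in file order and goes back to 1 each time the
    POS differs from that of the previous sense.
    """
    renumbered = []
    previous_pos = None
    n = 0
    for sense in senses:
        if sense["pos"] != previous_pos:
            n = 0  # a new POS restarts the count
            previous_pos = sense["pos"]
        n += 1
        renumbered.append({**sense, "n": str(n)})
    return renumbered


# Two senses share a POS; the third has the variant form of the same POS.
mismatched = [
    {"pos": "sbst. Inf.", "n": "5"},  # manually set value is overwritten
    {"pos": "sbst. Inf."},
    {"pos": "sbst._inf."},
]
corrected = [{"pos": "sbst. Inf."} for _ in range(3)]

print([s["n"] for s in renumber_senses(mismatched)])
print([s["n"] for s in renumber_senses(corrected)])
```

Because the comparison is an exact string match, any variation in how a POS is written is enough to restart the numbering.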

I also updated the front-end so that locution main senses with an ‘n’ now have the ‘n’ displayed (e.g. https://anglo-norman.net/entry/descendre) and wrote a script that will automatically add missing ‘n’ attributes to all locution main senses in the system.  I haven’t run this on the live database yet as I need further feedback from the editors before I do.  As the week drew to a close I worked on a method to hide sense numbers in the front-end in cases where there is only one sense in a part of speech, but I didn’t manage to get this completed and will continue with it next week.
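As a rough idea of the logic involved in hiding the numbers, here is a Python sketch.  It is not the unfinished front-end code: the function name and the POS labels in the example are hypothetical, and it assumes that the senses of a part of speech appear consecutively in the entry.

```python
from itertools import groupby
from typing import List


def display_numbers(pos_sequence: List[str]) -> List[str]:
    """Return the label to show beside each sense, in file order.

    Senses are grouped into consecutive runs sharing a POS. A run with a
    single sense gets an empty label (number hidden); longer runs are
    labelled 1, 2, 3 and so on.
    """
    labels: List[str] = []
    for _pos, run in groupby(pos_sequence):
        size = len(list(run))
        if size == 1:
            labels.append("")
        else:
            labels.extend(str(i) for i in range(1, size + 1))
    return labels


# Hypothetical POS labels: two senses in the first POS, one in the second.
print(display_numbers(["v.trans.", "v.trans.", "v.intrans."]))
```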