Week Beginning 14th August 2023

I was back at work this week after a lovely two-week holiday (although I did spend a couple of hours making updates to the Speech Star website whilst I was away).  After catching up with emails, getting back up to speed with where I’d left off and making a helpful new ‘to do’ list I got stuck into fixing the language tags in the Anglo-Norman Dictionary.

In June the editor Geert noticed that language tags had disappeared from the XML files of many entries.  Further investigation by me revealed that this probably happened during the import of data into the new AND system and had affected entries up to and including the import of R; entries that were part of the subsequent import of S had their language tags intact.  It is likely that the issue was caused by the script that assigns IDs and numbers to <sense> and <senseInfo> tags as part of the import process, as this script edits the XML.  Further testing revealed that the updated import workflow that was developed for S retained all language tags, as does the script that processes single and batch XML uploads as part of the DMS.  This means the error has been rectified, but we still need to fix the entries that have lost their language tags.

I was able to retrieve a version of the data as it existed prior to batch updates being applied to entry senses and from this I was able to extract the missing language tags for these entries.  I was also able to run this extraction process on the R data as it existed prior to upload.  I then ran the process on the live database to extract language tags from entries that featured them, for example entries uploaded during the import of S.  The script was also adapted to extract the ‘certainty’ attribute from the tags if present.  This was represented in the output as the number 50, separated from the language by a bar character (e.g. ‘Arabic|50’).  Where an entry featured multiple language tags these were separated by a comma (e.g. ‘Latin,Hebrew’).
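As a rough illustration of the output format, the extraction might look something like the following Python sketch. This is not the script that was actually run. It assumes the old tags were `<language>` elements with `lang` and `cert` attributes (the names used for the new tags), and the `<entry>` wrapper in the sample is made up.

```python
import xml.etree.ElementTree as ET


def extract_languages(entry_xml: str) -> str:
    """Return an entry's language tags in the spreadsheet format,
    e.g. 'Arabic|50' or 'Latin,Hebrew'."""
    root = ET.fromstring(entry_xml)
    parts = []
    # Assumed element and attribute names: <language lang="..." cert="...">
    for tag in root.iter("language"):
        lang = tag.get("lang")
        if not lang:
            continue
        cert = tag.get("cert")
        # Certainty, when present, follows the language after a bar character
        parts.append(f"{lang}|{cert}" if cert else lang)
    # Multiple languages are separated by a comma
    return ",".join(parts)


# Hypothetical sample entry with language tags in two senses
sample = (
    '<entry><sense><language lang="Latin"/></sense>'
    '<sense><language lang="Hebrew"/></sense></entry>'
)
print(extract_languages(sample))
```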

Geert made the decision that language tags, which were previously associated with specific senses or subsenses, should instead be associated with entries as a whole.  This structural change will greatly simplify the reinstatement of missing tags and it will also make it easier to add language tags to entries that do not already feature them.

The language data that I compiled was stored in a spreadsheet featuring three columns: Slug: the unique form of a headword used in entry URLs; Live Langs: language tags extracted from the live database; Old Langs: language tags extracted from the data prior to processing.  A fourth column was also added where manual overrides to the preceding two columns could be added by Geert.  This column could also be used to add entries that did not previously have a language tag but needed one.

Two further issues were addressed at this stage.  The first related to compound words, where the language applied to one part of the word.  In the original data these were represented by combining the language with ‘A.F.’, for example ‘M.E._and_A.F.’.  Continuing with this approach would make it more difficult to search for specific languages and the decision was made to only store the non-A.F. language with a note that the word is a compound.  This was encoded in the spreadsheet with a bar character followed by ‘Y’.  To make the data easier to process by machine the compound flag would always be the third part of the language data, whether or not certainty was present in the second part.  For example ‘M.E.|50|Y’ represents a word that is possibly from M.E. and is a compound while ‘M.E.||Y’ represents a word that is definitely from M.E. and is a compound.
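Keeping the compound flag in a fixed third position means a cell can be split without any guesswork. Here is a minimal Python sketch of that parsing. The `LangTag` class and `parse_lang_field` function are names invented for this illustration and are not part of the AND codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LangTag:
    lang: str
    uncertain: bool = False  # True when the second part is '50'
    compound: bool = False   # True when the third part is 'Y'


def parse_lang_field(field: str) -> List[LangTag]:
    """Split a spreadsheet cell such as 'Latin|50,Greek' into language tags."""
    tags = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue
        pieces = item.split("|")
        # Pad to three parts so 'Latin' and 'Latin|50' parse like 'Latin|50|Y'
        pieces += [""] * (3 - len(pieces))
        lang, cert, compound = pieces[:3]
        tags.append(LangTag(lang, cert == "50", compound == "Y"))
    return tags


print(parse_lang_field("M.E.|50|Y"))
print(parse_lang_field("M.E.||Y"))
```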

The second issue to be addressed was how to handle entries that featured languages but whose language tags were not needed.  In such cases Geert added the characters ‘$$’ to the fourth column.

The spreadsheet was edited by Geert and currently features 2741 entries that are to be updated.  Each entry in the spreadsheet will be edited using the following workflow:

  1. All existing language tags in the entry will be deleted. These generally occur in senses or subsenses, but some entries feature them in the <head> element.
  2. If the entry has ‘$$’ in column 4 then no further updates will be made
  3. If there is other data in column 4 this will be used
  4. If there is no data in column 4 then data from column 2 will be used
  5. If there is no data in columns 4 or 2 then data from column 3 will be used. Where there are multiple languages separated by a comma these will be split and treated separately.
  6. For each language the presence of a certainty value and / or a compound will be ascertained
  7. In the XML the new language tags will appear below the <head> tag.
  8. An entry will feature one language tag for each language specified
  9. The specific language will be stored in the ‘lang’ attribute
  10. Certainty (if present) will be stored in the ‘cert’ attribute which may only contain ‘50’ to represent ‘uncertain’.
  11. Compound (if present) will be stored in a new ‘compound’ attribute which may only contain ‘true’ to denote the word is a compound.
  12. For example, ‘Latin|50,Greek|50’ will be stored as two <language> tags beneath the <head> tag as follows: <language lang="Latin" cert="50" /><language lang="Greek" cert="50" /> while ‘M.E.||Y’ will be stored as: <language lang="M.E." compound="true" />
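
The workflow above can be sketched in Python roughly as follows. This is an illustration only, not the update script that was run on the database. It assumes `<head>` is a direct child of the entry's root element, and the function names and the sample entry are invented. Note that removing an element with ElementTree also discards any text that immediately follows it, which a real update would need to handle.

```python
import xml.etree.ElementTree as ET

SKIP = "$$"  # marker in column 4 meaning 'no language tags wanted'


def choose_source(override: str, live: str, old: str) -> str:
    """Steps 2-5: column 4 wins, then column 2 (live), then column 3 (old)."""
    if override == SKIP:
        return ""
    return override or live or old


def update_entry(entry_xml: str, override: str = "", live: str = "", old: str = "") -> str:
    root = ET.fromstring(entry_xml)
    # Step 1: delete every existing language tag, wherever it occurs
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "language":
                parent.remove(child)
    # Step 7: new tags go directly below <head> (assumed to be a child of root)
    head = root.find("head")
    position = list(root).index(head) + 1 if head is not None else 0
    for item in choose_source(override, live, old).split(","):
        if not item.strip():
            continue
        # Step 6: pad so certainty and compound are always parts two and three
        pieces = item.strip().split("|") + ["", ""]
        lang, cert, compound = pieces[:3]
        # Steps 8-11: one tag per language, with optional attributes
        el = ET.Element("language", {"lang": lang})
        if cert == "50":
            el.set("cert", "50")
        if compound == "Y":
            el.set("compound", "true")
        root.insert(position, el)
        position += 1
    return ET.tostring(root, encoding="unicode")


# Hypothetical entry with old tags in <head> and in a sense
entry = (
    '<main_entry><head><language lang="Latin"/></head>'
    '<sense><language lang="Greek"/></sense></main_entry>'
)
print(update_entry(entry, live="Latin|50,Greek|50"))
```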

I ran and tested the update on a local version of the data and the output was checked by Geert and me.  After backing up the live database I then ran the update on it and all went well.  The dictionary’s DTD also needed to be updated to ensure the new language tag can be positioned as an optional child element of the ‘main_entry’ element.  The DTD was also updated to remove language as a child of ‘sense’, ‘subsense’ and ‘head’.

Previously the DTD had a limited list of languages that could appear in the ‘lang’ attribute, but I’m uncertain whether this ever worked as the XML definitely included languages that were not in the list.  Instead I created a ‘picklist’ for languages that pulls its data from a list of languages stored in the online database.  We use this approach for other things such as semantic labels so it was pretty easy to set up.  I also added in the new optional ‘compound’ attribute.

With all of this in place I then updated the XSLT and some of the CSS in order to display the new language tags, which now appear as italicised text above any part of speech.  For example, an entry with multiple languages, one of which is uncertain: https://anglo-norman.net/entry/ris_3 and an entry that’s a compound with another language: https://anglo-norman.net/entry/rofgable.  Eventually I will update the site further to enable searches for language tags, but this will come at a later date.

Also this week I spent a bit of time in email conversations with the Dictionaries of the Scots Language people, discussing updates to bibliographical entries, the new part of speech system, DOST citation dates that were later than 1700 and making further tweaks to my requirements document for the date and part of speech searches based on feedback received from the team.  We’re all in agreement about how the new feature will work now, which means I’ll be able to get started on the development next week, all being well.

I also gave some advice to Gavin Miller about a new proposal he’s currently putting together, helped out Matthew Creasy with the website for his James Joyce Symposium, spoke to Craig Lamont about the Burns correspondents project and checked how the stats are working on sites that were moved to our newer server a while back (all thankfully seems to be working fine).

I spent the remainder of the week implementing a ‘cite this page’ feature for the Books and Borrowing project, and the feature now appears on every page that features data.  A ‘Cite this page’ button appears in the right-hand corner of the page title.  Pressing the button brings up a pop-up containing citation options in a variety of styles.  I’ve taken this from other projects I’ve been involved with (e.g. the Historical Thesaurus) and we might want to tweak it, but at the moment something along the lines of the following is displayed (full URL crudely ‘redacted’ as the site isn’t live yet):

Developing this feature has taken a bit of time due to the huge variation in the text that describes the page.  This can also make the citation rather long, for example:

Advanced search for ‘Borrower occupation: Arts and Letters, Borrower occupation: Author, Borrower occupation: Curator, Borrower occupation: Librarian, Borrower occupation: Musician, Borrower occupation: Painter/Limner, Borrower occupation: Poet, Borrower gender: Female, Author gender: Female’. 2023. In Books and Borrowing: An Analysis of Scottish Borrowers’ Registers, 1750-1830. University of Stirling. Retrieved 18 August 2023, from [very long URL goes here]

I haven’t included a description of selected filters and ‘order by’ options, but these are present in the URL.  I may add filters and orders to the description, or we can just leave it as it is and let people tweak their citation text if they want.
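
The pattern of the example citation above can be sketched as below. This is Python purely for illustration, not the site's actual code. The `cite_page` function and the example URL are invented, and treating the year as a separate value from the retrieval date is an assumption.

```python
from datetime import date

SITE_TITLE = "Books and Borrowing: An Analysis of Scottish Borrowers’ Registers, 1750-1830"
PUBLISHER = "University of Stirling"


def cite_page(description: str, url: str, retrieved: date, year: int) -> str:
    """Build a citation in the style of the example above.

    'description' is the text describing the page (e.g. the search summary);
    it varies hugely from page to page, which is why citations can get long.
    """
    # Day without a leading zero, then the full month name and year
    retrieved_text = f"{retrieved.day} {retrieved:%B %Y}"
    return (
        f"{description}. {year}. In {SITE_TITLE}. {PUBLISHER}. "
        f"Retrieved {retrieved_text}, from {url}"
    )


print(
    cite_page(
        "Advanced search for ‘Borrower gender: Female’",
        "https://example.org/search",  # placeholder, not the real URL
        date(2023, 8, 18),
        2023,
    )
)
```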

The ‘cite this page’ button appears on all pages that feature data, not just the search results; for example, it also appears on register pages and the list of book editions.  Hopefully the feature will be useful once the site goes live.