Week Beginning 16th January 2023

I divided my time primarily between the Anglo-Norman Dictionary and Books and Borrowing this week.  For the AND I implemented a new ‘citation editing’ feature that I’d written the specification for before Christmas.  This new feature allows an editor to bring up a list of all of the citations for a source text (similar to how this page in the front-end works: https://anglo-norman.net/search/citation/null/null/A-N_Falconry) and to then manually edit the XML for one or more citations or apply a batch edit to any selected citations, enabling the citation’s date, source text reference and/or location reference to be edited, potentially updating the XML for thousands of entries in one process.  It took a fair amount of time to implement the feature and then further time to test it.  Thorough testing was especially important as I didn’t want to risk an error corrupting thousands of dictionary entries.  I set up a version of the AND system and database on my laptop so I could work on the new code there without risk to the live site.

The new feature works pretty much exactly as I’d specified in the document I wrote before Christmas, but one difference is that I realised we already had a page in the Dictionary Management System that listed all sources – the ‘Browse Sources’ page.  Rather than have an entirely new ‘Edit citations’ page that would also begin by listing the sources I decided to update the existing ‘Browse Sources’ page.  This page still features the same tabular view of the source texts, but the buttons beside each text now include ‘Edit citations’.  Pressing on this will open the ‘Edit citations’ page for the source in question.  By default this lists all citations for the source ordered by headword.  Where an entry has more than one citation for a source these appear in the order they are found in the entry.  At the top of the page there is a button you can press to change the sorting to location in the source text.  This sorts the citations by the contents of the <loc> tag, displaying the headword for each entry alongside the citation.  Note that this sorting currently doesn’t produce an order that a human user would expect.  The <loc> field can contain a mixture of numbers and text and is therefore sorted as text.  This means numbers are sorted alphabetically, so all of the ones come before all of the twos etc.  E.g. 1, 10 and 1002 all come before 2.  I’ll need to investigate whether I can do something about this, maybe next week.
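One possible approach to the sorting problem is a ‘natural sort’, where runs of digits are compared as numbers and everything else as text.  The following is only a minimal sketch in Python, not the DMS code: the function name `natural_key` and the sample location values are made up for illustration.

```python
import re

def natural_key(loc: str):
    """Split a location string into text and number chunks so that
    numeric parts compare by value rather than character by character."""
    parts = re.split(r"(\d+)", loc)
    # Tag each chunk (0 = number, 1 = text) so that integers and strings
    # are never compared with each other directly.
    return [
        (0, int(p)) if p.isdecimal() else (1, p.lower())
        for p in parts
        if p != ""
    ]

# Hypothetical <loc> values mixing plain numbers and folio-style references.
locs = ["2", "10", "1002", "1", "f.3r", "f.12v"]
sorted_locs = sorted(locs, key=natural_key)
print(sorted_locs)
```

With this key, 2 sorts before 10 and 1002, and ‘f.3r’ sorts before ‘f.12v’.  Whether the same logic could be applied at the database level rather than in a script is a separate question.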

As my document had specified, you can batch edit and / or manually edit any listed citations.  Batch editing is controlled by the checkboxes beside each citation – any that are checked will have the batch edit applied to them.  The dark blue ‘Batch Edit Options’ section allows you to decide what details to change.  You can specify a new date (ideally using the date builder feature in the DMS to generate the required XML).  You can select a different siglum, which uses an autocomplete – start typing and select the matching siglum.  However, the problem with autocompletes is what happens if you manually edit or clear the field after selecting a value.  If you manually edit the text in this field after selecting a siglum, the previously selected siglum will still be used, as it’s not the contents of the text field that are used in the edit but a hidden field containing the ‘slug’ of the selected siglum.  An existing siglum selected from the autocomplete should always be used here to avoid this issue.  You can also specify new contents for the <loc> tag.  Any combination of the three fields can be used – just leave the ones you don’t want to update blank.

To manually edit one or more citations you can press the ‘Edit’ button beside the citation.  This displays a text area with the current XML for the citation in it.  You can edit this XML as required, but the editors will need to be careful to ensure the updated XML is valid or things might break.  The ‘Edit’ button changes to a ‘Cancel Edit’ button when the text area opens.  Pressing on this removes the text area.  Any changes you made to the XML in the text area will be lost and pressing the ‘Edit’ button again will reopen the text area with a fresh version of the citation’s XML.

It is possible to combine manual and batch edits but manual edits are applied first meaning if you manually edit some information that is also to be batch edited the batch edit will replace the manual edit for that information.  E.g. if you manually edit the <quotation> and the <loc> and you also batch edit the <loc> the quotation and loc fields will be replaced with your manual edit first and then the loc field will be overwritten with your batch edit.  Here’s a screenshot of the citation editor page, with one manual edit section open:

Once the necessary batch / manual changes have been made, pressing the ‘Edit Selected Citations’ button at the bottom of the page submits the data and at this point the edits will be made.  This doesn’t actually edit the live entry but takes the live entry XML, edits it and then creates a new Holding Area entry for each entry in question (Holding Area entries are temporary versions of entries stored in the DMS for checking before publication).  The process of making these holding area entries includes editing all relevant citations for each entry (e.g. the contents of each relevant <attestation> element) and checking and (if necessary) regenerating the ‘earliest date’ field for the entry as this may have changed depending on the date information supplied.  After the script has run you can then find new versions of the entries in the Holding Area, where you can check and approve the versions, making them live or deleting them as required.  I’ll probably need to add in a ‘Delete all’ option to the Holding Area as currently entries that are to be deleted need to be individually deleted, which would be annoying if there’s an entire batch to remove.
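The order in which manual and batch edits are applied to a citation, as described above, can be sketched roughly as follows.  This is an illustration in Python rather than the actual script: the function `apply_edits`, its parameters and the simplified <attestation> structure (with <quotation> and <loc> as direct children) are all assumptions.

```python
import xml.etree.ElementTree as ET

def apply_edits(attestation_xml, manual_xml=None, batch_loc=None):
    """Return the edited citation XML: manual edit first, then batch edit."""
    # Step 1: a manual edit, if present, replaces the whole citation XML.
    source = manual_xml if manual_xml is not None else attestation_xml
    # fromstring raises ParseError if the supplied XML is not well-formed.
    att = ET.fromstring(source)
    # Step 2: the batch edit then overwrites individual fields.
    if batch_loc is not None:
        loc = att.find("loc")
        if loc is None:
            loc = ET.SubElement(att, "loc")
        loc.text = batch_loc
    return ET.tostring(att, encoding="unicode")

# Simplified, made-up citation XML for illustration only.
original = "<attestation><quotation>old text</quotation><loc>12</loc></attestation>"
manual = "<attestation><quotation>new text</quotation><loc>13</loc></attestation>"
result = apply_edits(original, manual_xml=manual, batch_loc="14")
print(result)
```

Here the manually edited quotation survives, but the manually edited <loc> of 13 is overwritten by the batch value of 14.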

Through the version on my laptop I fully tested the process out and it all worked fine.  I didn’t actually test publishing any live entries that have passed through the citation edit process, but I have previewed them in the holding area and all look fine.  Once the entries enter the holding area they should be structurally identical to entries that end up in the holding area from the ‘Upload’ facility so there shouldn’t be any issues in publishing them.

After that I uploaded the new code to the AND server and began testing and tweaking things there before letting the AND Editor Geert loose on the new system.  All seemed to work fine with his first updates, but then he noticed something a bit strange.  He’d updated the date for all citations in one source text, meaning more than 1000 citations needed to be updated.  However, the new date (1212) wasn’t getting applied to all of the citations, and somewhere down the list the existing date (1213) took over.

After much investigation it turned out the issue was caused by a server setting rather than any problem with my code.  The server has a setting that limits the number of variables that can be inputted from a form to 1000.  The batch edit was sending more variables than this so only the first 1000 were getting through.  As the cutoff of input variables was automatically and silently made by the server my script was entirely unaware that there was any problem, hence the lack of visible errors.
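One way a script could detect this kind of silent truncation is for the form to state how many citation fields it is sending, so the receiving script can compare that number with what actually arrived.  The sketch below simulates this in Python.  The field names (`expected_count`, `citation_…`) and helper functions are hypothetical, and it assumes the server keeps the first fields it receives and drops the rest, so the count field has to be sent first.

```python
from itertools import islice

def build_form(citation_ids):
    """Build the submitted fields, with the expected count sent first."""
    form = {"expected_count": str(len(citation_ids))}
    for cid in citation_ids:
        form[f"citation_{cid}"] = "on"
    return form

def truncate(form, limit=1000):
    """Simulate a server that silently keeps only the first `limit` fields."""
    return dict(islice(form.items(), limit))

def missing_count(form):
    """Return how many citation fields were announced but never arrived."""
    expected = int(form["expected_count"])
    received = sum(1 for name in form if name.startswith("citation_"))
    return expected - received

full = build_form(range(1500))
lost = missing_count(truncate(full))
print(f"{lost} citation fields did not arrive")
```

A non-zero result would let the script stop and report the problem rather than process an incomplete batch.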

I can’t change the server settings myself but I managed to get someone in IT Support to update it for me.  With the setting changed the form submitted, but unfortunately after submission all it gave was a blank page so I had another issue to investigate.  It turned out to be an issue with the data.  There were two citations in the batch that had no dateInfo tag.  When specifying a date the script expects to find an existing dateInfo tag that then gets replaced.  As it found no such tag the script quit with a fatal error.  I therefore updated the script so that it can deal with citations that have no existing dateInfo tag.  In such cases the script now inserts a new dateInfo element at the top of the <attestation> XML.  I also added a count of the number of new holding area entries the script generates so it’s easier to check if any have somehow been lost during processing (which hopefully won’t happen now).
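The fallback for citations with no existing dateInfo tag could look something like the following.  Again this is a hedged sketch in Python and not the script itself: the function name `set_date_info`, the `id` attribute and the contents of the sample elements are placeholders, and <dateInfo> is assumed to be a direct child of <attestation>.

```python
import xml.etree.ElementTree as ET

def set_date_info(attestation, new_date_info):
    """Replace an existing <dateInfo>, or insert one at the top if missing."""
    existing = attestation.find("dateInfo")
    if existing is not None:
        # Swap the new element in at the position of the old one.
        index = list(attestation).index(existing)
        attestation.remove(existing)
        attestation.insert(index, new_date_info)
    else:
        # No existing tag: insert as the first child instead of failing.
        attestation.insert(0, new_date_info)

# A made-up citation with no <dateInfo> element.
att = ET.fromstring(
    "<attestation><quotation>text</quotation><loc>5</loc></attestation>"
)
set_date_info(att, ET.fromstring('<dateInfo id="d1">1212</dateInfo>'))
first_tag = att[0].tag
print(ET.tostring(att, encoding="unicode"))
```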

Whilst investigating this I also realised that when batch editing a date any entry that has more than one citation that is being edited will end up with the same ID used for each <dateInfo> element.  An ID should be unique and while this won’t really cause any issues when displaying the entries it might lead to errors or warnings in Oxygen.  I therefore updated the code to add the attestation ID to the supplied dateInfo ID when batch editing dates to ensure the uniqueness of the dateInfo ID.
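As a rough illustration of the ID fix, the snippet below combines a supplied dateInfo ID with each attestation’s own ID.  The attribute name `id`, the separator and the sample values are assumptions made for the example and don’t reflect the real AND XML.

```python
import xml.etree.ElementTree as ET

# A made-up entry with two citations that are both being batch edited.
entry = ET.fromstring(
    "<entry>"
    '<attestation id="att-1"><dateInfo id="old"/></attestation>'
    '<attestation id="att-2"><dateInfo id="old"/></attestation>'
    "</entry>"
)

supplied_id = "date-1212"  # hypothetical ID from the batch edit form
for att in entry.iter("attestation"):
    date_info = att.find("dateInfo")
    if date_info is not None:
        # Appending the attestation ID keeps each dateInfo ID unique.
        date_info.set("id", f"{supplied_id}-{att.get('id')}")

ids = [d.get("id") for d in entry.iter("dateInfo")]
print(ids)
```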

With all of this in place the new feature was up and running and Geert was able to batch edit the citations for several source texts.  However, he sent me a panicked email on Saturday to say that after submitting an edit every single entry in the AND was now not displaying anything other than the headword.  This was obviously a serious problem so I spent some time on Saturday investigating and fixing the issue.

The issue turned out to be nothing to do with my new system but was caused by an issue with one of the entry XML files that was updated through the citation editing system.  The entry in question was Assensement (https://anglo-norman.net/entry/assensement) which has an erroneous <label> element: <semantic value="=assentement?"/>.  This should not be a label and attributes are not allowed to start with an equals sign.  I must have previously stripped out such errors from our list of labels, but when the entry was published the label was reintroduced.  The DTD dynamically pulls in the labels and these are then used when validating the XML.  But as this list now included ‘=assentement?’ the DTD broke.  With the DTD broken the XSLT that transforms the entry XML into HTML wouldn’t run, meaning every single entry on the site failed to load.  Thankfully after identifying the issue it was quick to fix.  I simply deleted the erroneous label and things started working again, and Geert has updated the entry’s XML to remove the error.
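A possible safeguard would be to check each label before it is written into the dynamically generated DTD.  The sketch below assumes the labels end up as an enumerated list of attribute values, where each value has to be an XML name token (letters, digits, full stops, hyphens, underscores and colons, roughly speaking).  The regular expression is a simplified approximation of that rule, and the function name and the first two sample labels are invented for illustration.

```python
import re

# Simplified approximation of an XML name token: word characters
# plus '.', '-' and ':'.  The full XML rule allows a few more characters.
NMTOKEN_RE = re.compile(r"[\w.\-:]+")

def is_safe_label(label: str) -> bool:
    """Return True if the label looks usable as an enumerated DTD value."""
    return NMTOKEN_RE.fullmatch(label) is not None

# Two hypothetical labels plus the erroneous value from the entry.
labels = ["anat.", "law", "=assentement?"]
safe = [label for label in labels if is_safe_label(label)]
rejected = [label for label in labels if not is_safe_label(label)]
print("rejected:", rejected)
```

Rejected values could then be logged for an editor to look at instead of being pulled into the DTD.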

For the Books and Borrowing project I had a Zoom call with project PI Katie and Co-I Matt on Monday to discuss the front-end developments and some of the outstanding tasks left to do.  The main one is to implement a genre classification system for books, and we now have a plan for how to deal with this.  Genres will be applied at work level and will then filter down to lower levels.  I also spent some time speaking to Stirling’s IT people about setting up a Solr instance for the project, as discussed in posts before Christmas.  Thankfully this proved to be possible and by the end of the week we had a Solr instance that I was able to query from a script on our server.  Next week I will begin to integrate Solr queries with the front-end that I’m working on.  I also generated spreadsheets containing all of the book edition and book work data that Matt had requested and engaged in email discussions with Matt and Katie about how we might automatically generate Book Work records from editions and amalgamate some of the many duplicate book edition records that Matt had discovered whilst looking through the data.

Also this week I made a small tweak to the Dictionaries of the Scots Language, replacing the ‘email’ option in the ‘Share’ icons with a different option as the original option was no longer working.  I also had a chat with Jane Stuart-Smith about the website for the VARICS project, replied to a query from someone in Philosophy who had a website that was no longer working, replied to an email from someone who had read my posts about Solr and had some questions and replied to Sara Pons-Sanz, the organiser of last week’s Zurich event who was asking about the availability of some visualisations of the Historical Thesaurus data.  I was able to direct her to some visualisations I’d made a while back that we still haven’t made public (see https://digital-humanities.glasgow.ac.uk/2021-12-06/).

Next week I aim to focus on the development of the Books and Borrowing front-end and the integration of Solr into this.