Week Beginning 24th May 2021

I had my first dose of the Covid vaccine on Tuesday morning this week (the AstraZeneca one), so I lost a bit of time whilst going to get that done.  Unfortunately I had a bit of a bad reaction to it and ended up in bed all day Wednesday with a pretty nasty fever.  I had Covid in October last year but only experienced mild symptoms and wasn’t even off work for a day with it, so in my case the cure has been much worse than the disease.  However, I was feeling much better again by Thursday, so I guess I lost a total of about a day and a half of work, which is a small price to pay if it helps to ensure I don’t catch Covid again and (what would be worse) pass it on to anyone else.

In terms of work this week I continued to work on the Anglo-Norman Dictionary, beginning with a few tweaks to the data builder that I had completed last week.  I’d forgotten to add a bit of processing to the MS date that was present in the Text Date section to handle fractions, so I added that in.  I also updated the XML output so that ‘pref’ and ‘suff’ only appear if they have content now, as the empty attributes were causing issues in the XML editor.
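The rule for the optional attributes can be sketched in a few lines of Python. This is only an illustration and not the actual data builder code: the element name `date`, the function name `build_date_element` and the sample values are all assumptions, and only the attribute names 'pref' and 'suff' come from the project.

```python
import xml.etree.ElementTree as ET


def build_date_element(value, pref="", suff=""):
    """Build a date element, adding 'pref' and 'suff' only when they have content."""
    element = ET.Element("date")  # hypothetical element name, for illustration only
    element.text = value
    for name, content in (("pref", pref), ("suff", suff)):
        # Skip empty or whitespace-only values so no empty attribute is written.
        if content and content.strip():
            element.set(name, content)
    return element


plain = ET.tostring(build_date_element("1250"), encoding="unicode")
prefixed = ET.tostring(build_date_element("1250", pref="c."), encoding="unicode")
```

Here `plain` has no attributes at all, while `prefixed` carries only the 'pref' attribute.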

I then began work on the largest outstanding task I still have to tackle for the project: the migration of the textbase texts to the new site.  There are about 80 lengthy XML digital editions on the old site that can be searched and browsed, and I need to ensure these are also available on the new site.  I managed to grab a copy of all of the source XML files and I tracked down a copy of the script that the old site used to process the files.  At least I thought I had.  It turned out that this file actually references another file that must do most of the processing, including the application of an XSLT file to transform the XML into HTML, which is the thing I really could do with getting access to.  Unfortunately this file was not in the data from the server that I had been given access to, which somewhat limited what I could do.  I still have access to the old site and whilst experimenting with the old textbase I managed to make it display an error message that gives the location of the file: [DEBUG: Empty String at /var/and/reduce/and-fetcher line 486. ].  With this location available I asked Heather, the editor who has access to the server, if she might be able to locate this file and others in the same directory.  She had to travel to her University in order to be able to access the server, but once she did she was able to track the necessary directory down and get a copy to me.  This also included the XSLT file, which will help a lot.

I wrote a script to process all of the XML files, extracting titles, bylines, imprints, dates and copyright statements and splitting each file up into individual pages.  I then updated the API to create the endpoints necessary to browse the texts and navigate through the pages, for example the retrieval of summary data for all texts, or information about a specific text, or information about a specific page (including its XML).  I also began working on a front-end for the textbase, which is still very much in progress.  Currently it lists all texts with options to open a text at the first available page or select a page from a drop-down list of pages.  There are also links directly into the AND bibliography and DEAF where applicable, as the following screenshot demonstrates:

It is also possible to view a specific page, and I’ve completed work on the summary information about the text and a navbar through which it’s possible to navigate through the pages (or jump directly to a different page entirely).  What I haven’t yet tackled is the processing of the XML, which is going to be tricky and which I hope to delve into next week.  Below is a screenshot of the page view as it currently looks, with the raw XML displayed.

I also investigated and fixed an issue the editor Geert spotted, whereby the entire text of an entry was appearing in bold.  The issue was caused by an empty <link_form/> tag.  In the XSLT each <link_form> becomes a bold tag <b> with the content of the link form in the middle.  As there was no content it became a self-closed tag <b/> which is valid in XML but not valid in HTML, where it was treated as an opening tag with no corresponding closing tag, resulting in the remainder of the page all being bold.  I got around this by placing the space that preceded the bold tag “ <b></b>” within the bold tag instead “<b> </b>” meaning the tag is no longer considered empty and the XSLT doesn’t self-close it, but ideally if there is no <link_form> then the tag should just be omitted, which would also solve the problem.
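The difference between the two serialisations, and the "just omit the tag" approach, can be illustrated with Python's standard `xml.etree.ElementTree` module. This is a sketch of the general behaviour only: the site itself uses XSLT, and the helper name `render_link_form` is made up for the example.

```python
import xml.etree.ElementTree as ET


def render_link_form(text):
    """Return a <b> element for a link form, or None when there is nothing to show."""
    if text is None or not text.strip():
        # No content: omit the bold tag entirely rather than emit an empty one.
        return None
    bold = ET.Element("b")
    bold.text = text
    return bold


# An empty element is self-closed by the XML serialiser...
empty_bold = ET.Element("b")
as_xml = ET.tostring(empty_bold, encoding="unicode", method="xml")

# ...but the HTML serialiser writes explicit opening and closing tags.
as_html = ET.tostring(empty_bold, encoding="unicode", method="html")
```

With this, `as_xml` is the self-closed form that browsers misread as an unclosed opening tag, while `as_html` is the safe form with a matching closing tag.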

I also looked into an issue with the proofreader that Heather encountered.  When she uploaded a ZIP file with around 50 entries in it some of the entries wouldn’t appear in the output, but would just display their title.  The missing entries would be random without any clear reason as to why some were missing.    After some investigation I realised what the problem was:  each time an XML file is processed for display the DTD referenced in the file was being checked.  When processing lots of files all at once this was exceeding the maximum number of file requests the server was allowing from a specific client and was temporarily blocking access to the DTD, causing the processing of some of the XML files to silently fail.  The maximum number would be reached at a different point each time, thus meaning a different selection of entries would be blank.  To fix this I updated the proofreader script to remove the reference to the DTD from the XML files in the uploaded ZIP before they are processed for display.  The DTD isn’t actually needed for the display of the entry – all it does is specify the rules for editing it.  With the DTD reference removed it looks like all entries are getting properly displayed.
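A minimal sketch of the DTD fix in Python, assuming the reference is a simple DOCTYPE declaration with no internal subset (the proofreader's real language and code are not shown here, and the function names and sample file are hypothetical):

```python
import io
import re
import zipfile

# Matches a DOCTYPE declaration without an internal subset, for example:
# <!DOCTYPE entry SYSTEM "entry.dtd">
DOCTYPE_PATTERN = re.compile(r"<!DOCTYPE[^>\[]*>", re.IGNORECASE)


def strip_doctype(xml_text):
    """Remove the first DOCTYPE declaration so no DTD is fetched during parsing."""
    return DOCTYPE_PATTERN.sub("", xml_text, count=1)


def read_entries(zip_bytes):
    """Return {filename: XML text without DOCTYPE} for each .xml file in a ZIP."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                entries[name] = strip_doctype(archive.read(name).decode("utf-8"))
    return entries


# Build a small ZIP in memory to show the function in use.
sample_xml = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE entry SYSTEM "http://example.org/entry.dtd">\n'
    "<entry><title>abc</title></entry>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("abc.xml", sample_xml)
    archive.writestr("readme.txt", "not an entry")

cleaned = read_entries(buffer.getvalue())
```

Because the declaration is removed before parsing, processing a large batch no longer triggers one request per file for the DTD.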

Also this week I gave some further advice to Luca Guariento about a proposal he’s working on, fixed a small display issue with the Historical Thesaurus and spoke to Craig Lamont about the proposal he’s putting together.  Other than that I spent a bit of time on the Dictionary of the Scots Language, creating four different mockups of how the new ‘About this entry’ box could look and investigating why some of the bibliographical links in entries in the new front-end were not working.  The problem was being caused by the reworking of cref contents that the front-end does in order to ensure only certain parts of the text become a link.  In the XML the bib ID is applied to the full cref, (e.g. <cref refid="bib018594"><geo>Sc.</geo> <date>1775</date> <title>Weekly Mag.</title> (9 Mar.) 329: </cref>) but we wanted the link to only appear around titles and authors rather than the full text.  The issue with the missing links was cropping up where there is no author or title for the link to be wrapped around (e.g. <cit><cref refid="bib017755"><geo>Ayr.</geo><su>4</su> <date>1928</date>: </cref><q>The bag’s fu’ noo’ we’ll sadden’t.</q></cit>).  In such cases the link wasn’t appearing anywhere.  I’ve updated this now so that if no author or title is found then the link gets wrapped around the <geo> tag instead, and if there is no <geo> tag the link gets wrapped around the whole <cref>.
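The fallback order for the link can be sketched as follows. This is an illustrative Python version, not the front-end's actual code: the function name `choose_link_targets` and the tag name `author` are assumptions, while `<cref>`, `<title>` and `<geo>` are taken from the examples above.

```python
import xml.etree.ElementTree as ET

# Tags that should normally carry the link; "author" is an assumed tag name.
LINK_TAGS = ("title", "author")


def choose_link_targets(cref):
    """Pick the element(s) inside a <cref> that the bibliography link should wrap."""
    targets = [element for element in cref.iter() if element.tag in LINK_TAGS]
    if targets:
        return targets
    # No author or title: fall back to the first <geo> tag.
    geo = cref.find("geo")
    if geo is not None:
        return [geo]
    # No <geo> either: wrap the whole <cref>.
    return [cref]


with_title = ET.fromstring(
    '<cref refid="bib018594"><geo>Sc.</geo> <date>1775</date> '
    "<title>Weekly Mag.</title> (9 Mar.) 329: </cref>"
)
without_title = ET.fromstring(
    '<cref refid="bib017755"><geo>Ayr.</geo><su>4</su> <date>1928</date>: </cref>'
)

title_targets = [element.tag for element in choose_link_targets(with_title)]
geo_targets = [element.tag for element in choose_link_targets(without_title)]
```

For the two sample crefs, the first resolves to the title and the second falls back to the `<geo>` tag, so every cref ends up with a link somewhere.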

I also fixed a couple of advanced search issues that had been encountered with the new (and as yet not publicly available) site.  There was a 404 error that was being caused by a colon in the title.  The selected title gets added into the URL and colons are special characters in URLs, which was causing a problem.  However, I updated the scripts to allow colons to appear and the search now works.  It also turned out that the full-text searches were searching the contents of the <meta> tag in the entries, which is not something that we want.  I knew there was some other reason why I stripped the <meta> section out of the XML and this is it.  The contents of <meta> end up in the free-text search and are therefore both searchable and returned in the snippets.  To fix this I updated my script that generates the free-text search data to remove <meta> before the free-text search is generated.  This doesn’t remove it permanently, just in the context of the script executing.  I regenerated the free-text data and it no longer includes <meta>, and I then passed this on to Arts IT Support who have the access rights to update the Solr collection.  With this in place the advanced search no longer does anything with the <meta> section.
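The idea of removing <meta> only for the duration of the script can be sketched in Python as below. It is an assumption-laden illustration rather than the real generator script: the function name `free_text_for_search` and the sample entry structure are invented, and only the <meta> tag name comes from the data.

```python
import copy
import xml.etree.ElementTree as ET


def free_text_for_search(entry):
    """Return the searchable text of an entry with every <meta> section left out."""
    # Work on a copy so the original entry keeps its <meta> section.
    working = copy.deepcopy(entry)
    # Take a snapshot of the elements first so removal cannot disturb the traversal.
    for parent in list(working.iter()):
        for meta in parent.findall("meta"):
            # Note: any tail text directly after the removed element is dropped too.
            parent.remove(meta)
    # Join all remaining text and normalise the whitespace.
    return " ".join("".join(working.itertext()).split())


entry = ET.fromstring(
    "<entry><meta><note>internal</note></meta><sense>braw <b>day</b></sense></entry>"
)
search_text = free_text_for_search(entry)
```

The text passed to the search index no longer contains anything from <meta>, while `entry` itself is unchanged.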