Week Beginning 23rd January 2023

I spent much of the week working on the Books and Borrowing project, using the new Solr instance that the Stirling IT people set up for me last week. I spent some time creating a new version of the API that connects to Solr and setting up the Solr queries needed to search all fields of the index for a regular quick search and the ‘borrowed’ fields for date searches. This included returning the data necessary to provide the facetted search options on the search results page (i.e. filters for the search results). I also set up a new development version of the front-end, leaving my existing pre-Solr version in place in case things go wrong and I need to revert to it.
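
As a rough illustration of the kind of request the new API layer now sends, here is a minimal sketch of a quick search and a date search against Solr’s JSON API, written in Python. The Solr URL, core name and field names are placeholders of mine rather than the project’s actual configuration:

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/bnb/select"  # placeholder core name

    def quick_search(term, page=1, per_page=100):
        # quick search against a catch-all text field covering all indexed fields
        params = {
            "q": term,
            "rows": per_page,
            "start": (page - 1) * per_page,
            "wt": "json",
        }
        return requests.get(SOLR_SELECT, params=params).json()

    def date_search(year_from, year_to):
        # date search expressed as a range query on the 'borrowed' year field
        params = {
            "q": f"borrowed_year:[{year_from} TO {year_to}]",
            "rows": 100,
            "wt": "json",
        }
        return requests.get(SOLR_SELECT, params=params).json()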

As with the previous version of the site, you can perform a quick search, which searches the numerous fields specified in the requirements document. Date searches can be for a single date or a range. Text searches can use the wildcards * to match any number of characters (e.g. tes* will match all words beginning ‘tes’) and ? to match a single character (e.g. h?ll matches ‘hill’ and ‘hell’).

Currently the results still display the full records, with 100 records per page. I did consider changing this to a more compact view with a link to open the full record, but I haven’t implemented this yet. I might instead add an option to switch between a ‘compact’ and a ‘full’ view of the records, as I think having to click through to the full record each time one catches your interest would get a bit annoying.

There have been a lot of changes to the back end, even if the front-end doesn’t look that different. Behind the scenes the API now connects to the Solr instance, queries are formatted for and passed to Solr, and Solr returns the data. Solr is very fast, but loading 100 full records does still take some time. Queries to Solr currently return the 100 relevant borrowing IDs, and a second API call then retrieves all of the data for these 100 IDs from the regular database. A compact view could potentially rely solely on the data stored in Solr, which would be a lot quicker to load, if we want to pursue that option.
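
A rough sketch of that two-step retrieval, assuming a MySQL-style database connection behind the existing API; the core, table and field names are placeholders of mine rather than the project’s actual schema:

    import requests

    def get_page_ids(query, page, per_page=100):
        # step one: ask Solr for the matching borrowing IDs only, not full documents
        params = {
            "q": query,
            "fl": "id",
            "rows": per_page,
            "start": (page - 1) * per_page,
            "wt": "json",
        }
        data = requests.get("http://localhost:8983/solr/bnb/select", params=params).json()
        return [doc["id"] for doc in data["response"]["docs"]]

    def get_full_records(db, ids):
        # step two: pull the complete borrowing records from the relational database
        placeholders = ",".join(["%s"] * len(ids))
        cursor = db.cursor()
        cursor.execute(f"SELECT * FROM borrowings WHERE id IN ({placeholders})", ids)
        return cursor.fetchall()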

In addition to returning the IDs for the 100 borrowing records displayed on any one results page, Solr also returns the total number of matching borrowing records plus the facetted search information. For the moment the following information is included in the facetted data: borrowing year, library name, borrower gender, borrower occupation, author name, book language, place of publication and format. These appear as ‘Filter’ options down the left-hand side of the results page, currently as a series of checkboxes, each showing the name of the item in question and the number of results in which that item is found. Ticking a checkbox filters the results and the page reloads immediately, narrowing both the results and the filters so that only those that continue to match are displayed. You can then tick other checkboxes to narrow things further, or untick a checkbox to return to the unfiltered view.
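
Roughly speaking, the facet counts and filters map onto Solr’s facet and filter-query parameters; a minimal sketch, with field names and example values that are assumptions of mine rather than the real schema:

    import requests

    params = {
        "q": "virgil",
        "rows": 100,
        "facet": "true",
        "facet.field": ["borrowed_year", "library", "borrower_gender",
                        "borrower_occupation", "author", "book_language",
                        "place_of_publication", "book_format"],
        "facet.mincount": 1,
        # each ticked checkbox becomes a filter query (fq), which narrows both
        # the results and the facet counts returned for the other fields
        "fq": ['library:"Selkirk"', 'borrower_gender:"Female"'],
        "wt": "json",
    }
    results = requests.get("http://localhost:8983/solr/bnb/select", params=params).json()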

I think the filters are going to be hugely useful, but they’re not perfect yet. There are issues with the data for occupations and authors. This is because these fields have been stemmed by Solr for search purposes, meaning each value is broken down into individual word stems (e.g. ‘educ’ for ‘education’). I will fix this, but it will require me to regenerate the data and get the Stirling IT people to replace what is currently in Solr. I’ve also noticed that data from test libraries is in Solr too and I’ll need to ensure this gets removed.
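
One possible way to avoid the stemming problem when the data is regenerated, sketched here via Solr’s Schema API, is to keep the stemmed fields for searching and add separate non-analysed ‘string’ copies for faceting, so the filter labels show the original values. The field names are placeholders and this may well not be the exact approach taken:

    import requests

    schema_changes = {
        "add-field": [
            {"name": "borrower_occupation_facet", "type": "string",
             "stored": True, "multiValued": True},
            {"name": "author_facet", "type": "string",
             "stored": True, "multiValued": True},
        ],
        # copyField duplicates the raw incoming value, so the facet copy keeps
        # the full, unstemmed text while the source field remains searchable
        "add-copy-field": [
            {"source": "borrower_occupation", "dest": "borrower_occupation_facet"},
            {"source": "author", "dest": "author_facet"},
        ],
    }
    requests.post("http://localhost:8983/solr/bnb/schema", json=schema_changes)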

With all of this in place I then moved on to providing different sorting options for the search results, for example ordering the results by borrowed date, library or author name. This required some tweaking of the Solr queries and the API, and then some updates to the front-end to ensure the selected sorting option is applied and remembered. However, I did come across a limitation: Solr cannot order results by fields that contain multiple values. This means that for now sorting by things like author name and borrower occupation won’t work, as each of these can contain multiple values per record. I’ll therefore have to make concatenated, single-valued versions of these fields for sorting purposes and will do this when I regenerate the data.
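
The sorting itself comes down to Solr’s ‘sort’ parameter; a minimal sketch of how the options might map onto it, with ‘author_sort’ standing in for the planned concatenated, single-valued copy of the multi-valued author field (all names illustrative):

    import requests

    sort_options = {
        "date_asc": "borrowed_year asc",
        "date_desc": "borrowed_year desc",
        "library": "library asc",
        "author": "author_sort asc",   # single-valued copy needed for sorting
    }

    params = {
        "q": "virgil",
        "rows": 100,
        "sort": sort_options["author"],
        "wt": "json",
    }
    results = requests.get("http://localhost:8983/solr/bnb/select", params=params).json()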

This initial version of the facetted search results page displayed years in the same way as the other search filters: as a series of checkboxes with year labels and counts of results for each year. What I really wanted to do was display this as a bar chart instead, using the HighCharts library that I use for other visualisations in the front-end. Where the range of years is greater than a decade I wanted to group the years into decades and enable the user to press on a decade bar to view the results for the individual years within that decade, with the bar chart then displaying those years. I managed to get the ‘by decade’ bar chart working this week. You can hover over a bar to view the exact total for the decade, and you can click on a decade to filter the search to that decade. This is the bit I’m still working on: currently no bar chart is displayed for the selected decade and you need to use your browser’s ‘back’ button to return, but the filter itself does work. Eventually a bar chart with borrowings for each year in the decade will be displayed, together with a button for returning back, and in this view you will be able to click on a year bar to filter the results to the selected year. I’ll continue with this next week, and the ‘Year borrowed’ checkboxes will be removed once the bar chart is fully working. It took quite a while to get the bar chart working as there was a lot of logic to work out in order to group borrowings into decades and to accommodate gaps in the data (e.g. if there is no data for a decade we still need that decade to be displayed, otherwise the graph looks odd).
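
A minimal sketch of that grouping logic, assuming the year facet arrives as a mapping of year to count; decades with no borrowings are kept so the chart has no missing bars:

    def group_by_decade(year_counts):
        # sum the per-year facet counts into decades
        decades = {}
        for year, count in year_counts.items():
            decade = (int(year) // 10) * 10
            decades[decade] = decades.get(decade, 0) + count
        first, last = min(decades), max(decades)
        # fill gaps so a decade with no borrowings still appears with a zero bar
        return [(d, decades.get(d, 0)) for d in range(first, last + 10, 10)]

    # e.g. group_by_decade({"1751": 4, "1773": 9, "1775": 2})
    # -> [(1750, 4), (1760, 0), (1770, 11)]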

Below is a screenshot of the new front-end with facetted searching and the ‘year borrowed’ bar chart:

Also for the Books and Borrowing project this week I had a Zoom call with Katie and Matt to discuss genre classification (which has now been decided upon) and batch editing the book edition records to fix duplicates and to auto-generate book work records for any editions that need them. I also sent on some data exports containing the distinct book formats and places of publication in the system, as these will need some editorial work as well, and responded to a few questions from one of the project RAs, who wanted some queries run on the data for a library he has worked on.

Also this week I created an initial WordPress site for the VARICS project after the domain was set up by Russell McInnes, an IT guy from the College of Engineering who is helping out with Arts IT Support due to their staffing issues. Russell has been hugely helpful and it’s such a relief to have someone to work with again. I also spoke to Marc Alexander about some financial issues relating to a number of projects I’m involved with, as well as equipment I might need in the coming year and conferences I’d like to attend, and I made a tweak to the ‘email entry’ feature I’d changed on the DSL website last week. Next week I’ll be continuing to work on the B&B front-end.

Week Beginning 16th January 2023

I divided my time primarily between the Anglo-Norman Dictionary and Books and Borrowing this week. For the AND I implemented a new ‘citation editing’ feature that I’d written the specification for before Christmas. This new feature allows an editor to bring up a list of all of the citations for a source text (similar to how this page in the front-end works: https://anglo-norman.net/search/citation/null/null/A-N_Falconry) and to then manually edit the XML for one or more citations or apply a batch edit to any selected citations, enabling the citation’s date, source text reference and/or location reference to be edited and potentially updating the XML for thousands of entries in one process. It took a fair amount of time to implement the feature and then further time to test it. This was especially important as I didn’t want to risk an error corrupting thousands of dictionary entries. I set up a version of the AND system and database on my laptop so I could work on the new code there without risk to the live site.

The new feature works pretty much exactly as I’d specified in the document I wrote before Christmas, but one difference is that I realised we already had a page in the Dictionary Management System that listed all sources – the ‘Browse Sources’ page. Rather than create an entirely new ‘Edit citations’ page that would also begin by listing the sources, I decided to update the existing ‘Browse Sources’ page. This page still features the same tabular view of the source texts, but the buttons beside each text now include ‘Edit citations’. Pressing on this opens the ‘Edit citations’ page for the source in question. By default this lists all citations for the source ordered by headword. Where an entry has more than one citation for a source these appear in the order they are found in the entry. At the top of the page there is a button you can press to change the sorting to location in the source text. This sorts the citations by the contents of the <loc> tag, displaying the headword for each entry alongside the citation. Note that this sorting doesn’t currently order things the way a human reader would expect: the field can contain a mixture of numbers and text and is therefore sorted as text, meaning numbers are compared character by character, so all of the values beginning with ‘1’ come before any beginning with ‘2’ (e.g. 1, 10 and 1002 all come before 2). I’ll need to investigate whether I can do something about this, maybe next week.
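
One possible fix would be a ‘natural sort’ that splits each <loc> value into text and number parts and compares the number parts numerically; a rough sketch:

    import re

    def natural_key(loc):
        # split into alternating text and digit runs; digit runs compare as numbers
        return [int(part) if part.isdigit() else part.lower()
                for part in re.split(r"(\d+)", loc or "")]

    # sorted(["10", "2", "1002", "1", "fol. 3"], key=natural_key)
    # -> ['1', '2', '10', '1002', 'fol. 3']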

As my document had specified, you can batch edit and/or manually edit any listed citations. Batch editing is controlled by the checkboxes beside each citation – any that are checked will have the batch edit applied to them. The dark blue ‘Batch Edit Options’ section allows you to decide what details to change. You can specify a new date (ideally using the date builder feature in the DMS to generate the required XML). You can select a different siglum, which uses an autocomplete – start typing and select the matching siglum. The problem with autocompletes, however, is what happens if you manually edit or clear the field after selecting a value: if you edit the text in this field after selecting a siglum, the previously selected siglum will still be used, as it’s not the contents of the text field that are used in the edit but a hidden field containing the ‘slug’ of the selected siglum. A siglum selected from the autocomplete should therefore always be used here to avoid this issue. You can also specify new contents for the <loc> tag. Any combination of the three fields can be used – just leave the ones you don’t want to update blank.

To manually edit one or more citations you can press the ‘Edit’ button beside the citation.  This displays a text area with the current XML for the citation in it.  You can edit this XML as required, but the editors will need to be careful to ensure the updated XML is valid or things might break.  The ‘Edit’ button changes to a ‘Cancel Edit’ button when the text area opens.  Pressing on this removes the text area.  Any changes you made to the XML in the text area will be lost and pressing the ‘Edit’ button again will reopen the text area with a fresh version of the citation’s XML.

It is possible to combine manual and batch edits, but manual edits are applied first, meaning that if you manually edit some information that is also to be batch edited, the batch edit will overwrite the manual edit for that information. For example, if you manually edit the <quotation> and the <loc> and also batch edit the <loc>, the quotation and loc will first be replaced with your manual edits and the loc will then be overwritten by your batch edit. Here’s a screenshot of the citation editor page, with one manual edit section open:

Once the necessary batch / manual changes have been made, pressing the ‘Edit Selected Citations’ button at the bottom of the page submits the data, and at this point the edits are made. This doesn’t actually edit the live entries; instead it takes each live entry’s XML, edits it and creates a new Holding Area entry for each entry in question (Holding Area entries are temporary versions of entries stored in the DMS for checking before publication). The process of making these holding area entries includes editing all relevant citations for each entry (e.g. the contents of each relevant <attestation> element) and checking and, if necessary, regenerating the ‘earliest date’ field for the entry, as this may have changed depending on the date information supplied. After the script has run you can find the new versions of the entries in the Holding Area, where you can check them and either make them live or delete them as required. I’ll probably need to add a ‘Delete all’ option to the Holding Area, as currently entries need to be deleted individually, which would be annoying if there’s an entire batch to remove.

Through the version on my laptop I fully tested the process out and it all worked fine.  I didn’t actually test publishing any live entries that have passed through the citation edit process, but I have previewed them in the holding area and all look fine.  Once the entries enter the holding area they should be structurally identical to entries that end up in the holding area from the ‘Upload’ facility so there shouldn’t be any issues in publishing them.

After that I uploaded the new code to the AND server and began testing and tweaking things there before letting the AND Editor Geert loose on the new system.  All seemed to work fine with his first updates, but then he noticed something a bit strange.  He’d updated the date for all citations in one source text, meaning more than 1000 citations needed to be updated.  However, the new date (1212) wasn’t getting applied to all of the citations, and somewhere down the list the existing date (1213) took over.

After much investigation it turned out the issue was caused by a server setting rather than any problem with my code. The server has a setting that limits the number of variables that can be submitted from a form to 1000 (most likely PHP’s max_input_vars, which defaults to 1000). The batch edit was sending more variables than this, so only the first 1000 were getting through. As the server truncated the input variables automatically and silently, my script was entirely unaware that there was any problem, hence the lack of visible errors.

I can’t change the server settings myself, but I managed to get someone in IT Support to update it for me. With the setting changed the form submitted, but unfortunately all it then gave was a blank page, so I had another issue to investigate. This one turned out to be caused by the data: there were two citations in the batch that had no dateInfo tag. When specifying a date the script expects to find an existing dateInfo tag that then gets replaced, and as it found no such tag the script quit with a fatal error. I therefore updated the script so that it can deal with citations that have no existing dateInfo tag. In such cases the script now inserts a new dateInfo element at the top of the <attestation> XML. I also added a count of the number of new holding area entries the script generates so it’s easier to check whether any have somehow been lost during processing (which hopefully won’t happen now).

Whilst investigating this I also realised that when batch editing a date, any entry that has more than one citation being edited would end up with the same ID used for each <dateInfo> element. An ID should be unique, and while this won’t really cause any issues when displaying the entries, it might lead to errors or warnings in Oxygen. I therefore updated the code to append the attestation ID to the supplied dateInfo ID when batch editing dates, to ensure the uniqueness of each dateInfo ID.
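
Pulling the last two fixes together, here is a rough sketch of how the date edit might be applied to a single <attestation> element using lxml. The element names are those mentioned above, but the ‘id’ attribute name and the surrounding structure are assumptions of mine rather than the actual AND XML:

    from lxml import etree

    def apply_batch_date(attestation, new_date_xml):
        new_date = etree.fromstring(new_date_xml)

        # keep the dateInfo ID unique by appending the attestation's own ID
        att_id = attestation.get("id")
        if att_id is not None and new_date.get("id") is not None:
            new_date.set("id", f"{new_date.get('id')}_{att_id}")

        existing = attestation.find("dateInfo")
        if existing is not None:
            attestation.replace(existing, new_date)   # swap in the new date
        else:
            attestation.insert(0, new_date)           # no dateInfo: insert at the top
        return attestation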

With all of this in place the new feature was up and running and Geert was able to batch edit the citations for several source texts. However, he sent me a panicked email on Saturday to say that after submitting an edit every single entry in the AND was displaying nothing but its headword. This was obviously a serious problem, so I spent some time on Saturday investigating and fixing the issue.

The issue turned out to have nothing to do with my new system; it was caused by an error in one of the entry XML files that had been updated through the citation editing system. The entry in question was Assensement (https://anglo-norman.net/entry/assensement), which has an erroneous <label> element: <semantic value="=assentement?"/>. This should not be a label, and a label value is not allowed to start with an equals sign. I must have previously stripped such errors out of our list of labels, but when the entry was published the label was reintroduced. The DTD dynamically pulls in the labels and these are then used when validating the XML, but as the list now included ‘=assentement?’ the DTD broke. With the DTD broken the XSLT that transforms the entry XML into HTML wouldn’t run, meaning every single entry on the site failed to load. Thankfully, once the issue was identified it was quick to fix: I simply deleted the erroneous label and things started working again, and Geert has updated the entry’s XML to remove the error.

For the Books and Borrowing project I had a Zoom call with project PI Katie and Co-I Matt on Monday to discuss the front-end developments and some of the outstanding tasks left to do. The main one is to implement a genre classification system for books, and we now have a plan for how to deal with this: genres will be applied at work level and will then filter down to the lower levels. I also spent some time speaking to Stirling’s IT people about setting up a Solr instance for the project, as discussed in posts before Christmas. Thankfully this was possible, and by the end of the week we had a Solr instance up and running that I was able to query from a script on our server. Next week I will begin to integrate Solr queries with the front-end that I’m working on. I also generated spreadsheets containing all of the book edition and book work data that Matt had requested, and engaged in email discussions with Matt and Katie about how we might automatically generate Book Work records from editions and amalgamate some of the many duplicate book edition records that Matt had discovered whilst looking through the data.

Also this week I made a small tweak to the Dictionaries of the Scots Language, replacing the ‘email’ option in the ‘Share’ icons with a different option as the original was no longer working. I also had a chat with Jane Stuart-Smith about the website for the VARICS project, replied to a query from someone in Philosophy whose website was no longer working, answered an email from someone who had read my posts about Solr and had some questions, and replied to Sara Pons-Sanz, the organiser of last week’s Zurich event, who was asking about the availability of some visualisations of the Historical Thesaurus data. I was able to direct her to some visualisations I’d made a while back that we still haven’t made public (see https://digital-humanities.glasgow.ac.uk/2021-12-06/).

Next week I aim to focus on the development of the Books and Borrowing front-end and the integration of Solr into this.

Week Beginning 9th January 2023

I attended the workshop ‘The impact of multilingualism on the vocabulary and stylistics of Medieval English’ in Zurich this week. The workshop ran on Tuesday and Wednesday and I travelled to Zurich with my colleagues Marc Alexander and Fraser Dallachy on Monday. It was really great to travel to a workshop in a different country again, as I hadn’t been abroad since before lockdown. I’d never been to Zurich before and it was a lovely city. The workshop itself was great, with some very interesting papers and good opportunities to meet other researchers and discuss potential future projects. I gave a paper on the Historical Thesaurus, its categories and data structures, and how semantic web technologies might be used to more effectively structure, manage and share its semantically arranged dataset. It was a half-hour paper with ten minutes for questions afterwards and it went pretty well. The audience wasn’t especially technical and I’m not sure how interesting the topic was to most people, but it was well received. I’m also glad I had the opportunity both to attend the event and to research the topic, as I have greatly increased my knowledge of semantic web technologies such as RDF, graph databases and SPARQL, and as part of the research I wrote a script that generated an RDF version of the complete HT category data, which may come in handy one day.

I got back home just before midnight on the Wednesday and returned to normal work first thing on Thursday. This included submitting my expenses from the workshop, replying to a few emails that had come in regarding my office (it looks like the dry rot work is going to take a while to resolve, and it also looks like I’ll have to share my temporary office) and attempting to set up web hosting for the VARICS project, which Arts IT Support seem reluctant to do. I also looked into an issue with the DSL that Ann Ferguson had spotted and spoke to the IT people at Stirling about their progress with setting up a Solr instance for the Books and Borrowing project. I also replaced a selection of library register images with better versions for that project and arranged a meeting for next Monday with the project’s PI and Co-I to discuss progress with the front-end.

I spent most of Friday writing a Data Management Plan and attending a Zoom call for a new speech therapy project I’m involved with.  It’s an ESRC funding proposal involving Glasgow and Strathclyde and I’ll be managing the technical aspects.  We had a useful call and I managed to complete an initial version of the DMP that the PI is going to adapt if required.

Week Beginning 2nd January 2023

The first week back after the Christmas holidays was supposed to be a three-day week, but unfortunately after returning to work on Wednesday I came down with some sort of winter vomiting virus that affected me throughout Wednesday night, and I was off work on Thursday. I was still feeling very shaky on Friday but managed to do a full day’s work nonetheless.

My two days were mostly spent creating my slides for the talk I’m giving at a workshop in Zurich next week and then practising the talk. I also engaged in an email conversation about the state of Arts IT Support after the database on the server that hosts many of our most important websites went down on the first day of the Christmas holidays and remained offline for the best part of two weeks. This took down websites such as the Historical Thesaurus, Seeing Speech, The Glasgow Story and the Emblems websites, and I had to spend time over the holidays replying and apologising to people who contacted me about the sites being unavailable. As I don’t have command-line access to the servers there was nothing I could do to fix the issue, and despite several members of staff contacting Arts IT Support no response was received from them. The issue was finally resolved on the 3rd of January, but we have still received no communication from Arts IT Support to inform us that the issue has been resolved, to let us know what caused it, or to apologise for the incident, which is really not good enough. Arts IT Support are in a shocking state at the moment due to critical staff leaving and not being replaced, and I’m afraid the situation may not improve for several months yet, meaning issues with our websites are likely to continue in 2023.