Week Beginning 16th January 2023

I divided my time primarily between the Anglo-Norman Dictionary and Books and Borrowing this week.  For the AND I implemented a new ‘citation editing’ feature that I’d written the specification for before Christmas.  This new feature allows an editor to bring up a list of all of the citations for a source text (similar to how this page in the front-end works: https://anglo-norman.net/search/citation/null/null/A-N_Falconry) and then either manually edit the XML for one or more citations or apply a batch edit to any selected citations, enabling the citation’s date, source text reference and/or location reference to be edited and potentially updating the XML for thousands of entries in one process.  It took a fair amount of time to implement the feature and then further time to test it, which was especially important as I didn’t want to risk an error corrupting thousands of dictionary entries.  I set up a version of the AND system and database on my laptop so I could work on the new code there without risk to the live site.

The new feature works pretty much exactly as I’d specified in the document I wrote before Christmas, but one difference is that I realised we already had a page in the Dictionary Management System that listed all sources – the ‘Browse Sources’ page.  Rather than have an entirely new ‘Edit citations’ page that would also begin by listing the sources I decided to update the existing ‘Browse Sources’ page.  This page still features the same tabular view of the source texts, but the buttons beside each text now include ‘Edit citations’.  Pressing on this opens the ‘Edit citations’ page for the source in question.  By default this lists all citations for the source ordered by headword; where an entry has more than one citation for a source these appear in the order they are found in the entry.  At the top of the page there is a button you can press to change the sorting to location in the source text.  This sorts the citations by the contents of the <loc> tag, displaying the headword for each entry alongside the citation.  Note that this sorting doesn’t currently order things the way a human reader would expect: the field can contain mixtures of numbers and text and is therefore sorted as text, which means numbers are sorted alphabetically and all of the ones come before all of the twos, etc. (e.g. 1, 10 and 1002 all come before 2).  I’ll need to investigate whether I can do something about this, maybe next week.
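One possible fix would be to switch to a natural-order comparison when sorting the <loc> values, so that any embedded numbers are compared numerically rather than as text.  Here’s a minimal sketch using PHP’s built-in strnatcasecmp() with some made-up <loc> values – purely an illustration, not the actual DMS code:

<?php
// Illustrative sample of <loc> values containing mixtures of numbers and text.
$locs = ['1', '10', '1002', '2', 'fol. 3v', 'fol. 12r'];

// A plain text sort puts all of the ones before all of the twos:
sort($locs, SORT_STRING);
// 1, 10, 1002, 2, fol. 12r, fol. 3v

// A natural-order, case-insensitive sort treats the embedded numbers numerically:
usort($locs, 'strnatcasecmp');
// 1, 2, 10, 1002, fol. 3v, fol. 12r

print_r($locs);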

As my document had specified, you can batch edit and / or manually edit any listed citations.  Batch editing is controlled by the checkboxes beside each citation – any that are checked will have the batch edit applied to them.  The dark blue ‘Batch Edit Options’ section allows you to decide what details to change.  You can specify a new date (ideally using the date builder feature in the DMS to generate the required XML).  You can select a different siglum, which uses an autocomplete – start typing and select the matching siglum.  One problem with autocompletes, however, is what happens if you manually edit or clear the field after selecting a value: if you manually edit the text in this field after selecting a siglum the previously selected siglum will still be used, as it’s not the contents of the text field that are used in the edit but a hidden field containing the ‘slug’ of the selected siglum.  To avoid this issue an existing siglum should always be selected from the autocomplete.  You can also specify new contents for the <loc> tag.  Any combination of the three fields can be used – just leave the ones you don’t want to update blank.

To manually edit one or more citations you can press the ‘Edit’ button beside the citation.  This displays a text area with the current XML for the citation in it.  You can edit this XML as required, but the editors will need to be careful to ensure the updated XML is valid or things might break.  The ‘Edit’ button changes to a ‘Cancel Edit’ button when the text area opens.  Pressing on this removes the text area.  Any changes you made to the XML in the text area will be lost and pressing the ‘Edit’ button again will reopen the text area with a fresh version of the citation’s XML.

It is possible to combine manual and batch edits, but manual edits are applied first, meaning that if you manually edit some information that is also to be batch edited the batch edit will overwrite the manual edit for that information.  E.g. if you manually edit the <quotation> and the <loc> and you also batch edit the <loc>, the quotation and loc fields will be replaced with your manual edit first and then the loc field will be overwritten with your batch edit.  Here’s a screenshot of the citation editor page, with one manual edit section open:

Once the necessary batch / manual changes have been made, pressing the ‘Edit Selected Citations’ button at the bottom of the page submits the data, and at this point the edits are made.  This doesn’t actually edit the live entry but takes the live entry XML, edits it and then creates a new Holding Area entry for each entry in question (Holding Area entries are temporary versions of entries stored in the DMS for checking before publication).  The process of making these holding area entries includes editing all relevant citations for each entry (e.g. the contents of each relevant <attestation> element) and checking and (if necessary) regenerating the ‘earliest date’ field for the entry, as this may have changed depending on the date information supplied.  After the script has run you can then find new versions of the entries in the Holding Area, where you can check the versions and either make them live or delete them as required.  I’ll probably need to add in a ‘Delete all’ option to the Holding Area as currently entries that are to be deleted need to be deleted individually, which would be annoying if there’s an entire batch to remove.

Through the version on my laptop I fully tested the process out and it all worked fine.  I didn’t actually test publishing any live entries that have passed through the citation edit process, but I have previewed them in the holding area and all look fine.  Once the entries enter the holding area they should be structurally identical to entries that end up in the holding area from the ‘Upload’ facility so there shouldn’t be any issues in publishing them.

After that I uploaded the new code to the AND server and began testing and tweaking things there before letting the AND Editor Geert loose on the new system.  All seemed to work fine with his first updates, but then he noticed something a bit strange.  He’d updated the date for all citations in one source text, meaning more than 1000 citations needed to be updated.  However, the new date (1212) wasn’t getting applied to all of the citations, and somewhere down the list the existing date (1213) took over.

After much investigation it turned out the issue was caused by a server setting rather than any problem with my code.  The server has a setting that limits the number of variables that can be submitted from a form to 1000.  The batch edit was sending more variables than this, so only the first 1000 were getting through.  As the truncation of input variables was carried out automatically and silently by the server, my script was entirely unaware that there was any problem, hence the lack of visible errors.
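For reference, I believe the setting in question is PHP’s ‘max_input_vars’, which defaults to 1000.  Below is a minimal sketch of how a script could at least detect this kind of silent truncation by having the form state how many fields it submitted; the field names are hypothetical and this is not the actual DMS code:

<?php
// Detect silent form truncation, assuming the limit involved is PHP's
// max_input_vars (default 1000). The form would include a hidden
// 'expected_count' field holding the number of citation checkboxes it
// rendered; both field names here are hypothetical.
$limit    = (int) ini_get('max_input_vars');
$expected = isset($_POST['expected_count']) ? (int) $_POST['expected_count'] : 0;
$received = isset($_POST['citations']) ? count($_POST['citations']) : 0;

if ($received < $expected) {
    // Some variables were dropped by the server before the script ever saw them.
    die("Form truncated: expected $expected citations but received $received (max_input_vars is currently $limit).");
}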

I can’t change the server settings myself but I managed to get someone in IT Support to update it for me.  With the setting changed the form submitted, but unfortunately after submission all it gave was a blank page so I had another issue to investigate.  It turned out to be an issue with the data.  There were two citations in the batch that had no dateInfo tag.  When specifying a date the script expects to find an existing dateInfo tag that then gets replaced.  As it found no such tag the script quit with a fatal error.  I therefore updated the script so that it can deal with citations that have no existing dateInfo tag.  In such cases the script now inserts a new dateInfo element at the top of the <attestation> XML.  I also added a count of the number of new holding area entries the script generates so it’s easier to check if any have somehow been lost during processing (which hopefully won’t happen now).
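Below is a minimal sketch (not the actual DMS code) of the sort of logic involved, using PHP’s DOMDocument to insert a supplied <dateInfo> element at the top of any <attestation> that lacks one; the ‘id’ attribute names are assumptions for illustration:

<?php
// Ensure every <attestation> in an entry has a <dateInfo> child, inserting
// the supplied one at the top when it is missing. A sketch only.
function ensureDateInfo(DOMDocument $doc, string $newDateInfoXml): void
{
    foreach ($doc->getElementsByTagName('attestation') as $attestation) {
        if ($attestation->getElementsByTagName('dateInfo')->length > 0) {
            continue; // an existing <dateInfo> gets replaced elsewhere
        }
        // Build the new element from the supplied XML string and insert it
        // as the first child of the <attestation>.
        $fragment = $doc->createDocumentFragment();
        $fragment->appendXML($newDateInfoXml);
        $attestation->insertBefore($fragment, $attestation->firstChild);

        // Append the attestation ID to the dateInfo ID so that entries with
        // several edited citations don't end up with duplicate IDs (see the
        // ID fix described below).
        $dateInfo = $attestation->getElementsByTagName('dateInfo')->item(0);
        if ($dateInfo instanceof DOMElement && $attestation->hasAttribute('id')) {
            $dateInfo->setAttribute('id', $dateInfo->getAttribute('id') . '-' . $attestation->getAttribute('id'));
        }
    }
}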

Whilst investigating this I also realised that when batch editing a date, any entry that has more than one citation being edited will end up with the same ID used for each <dateInfo> element.  An ID should be unique, and while this won’t really cause any issues when displaying the entries it might lead to errors or warnings in Oxygen.  I therefore updated the code to append the attestation ID to the supplied dateInfo ID when batch editing dates, to ensure the uniqueness of each dateInfo ID.

With all of this in place the new feature was up and running and Geert was able to batch edit the citations for several source texts.  However, he sent me a panicked email on Saturday to say that after submitting an edit every single entry in the AND was now not displaying anything other than the headword.  This was obviously a serious problem so I spent some time on Saturday investigating and fixing the issue.

The issue turned out to be nothing to do with my new system but was caused by a problem with one of the entry XML files that had been updated through the citation editing system.  The entry in question was Assensement (https://anglo-norman.net/entry/assensement), which had an erroneous <label> element: <semantic value="=assentement?"/>.  This should not be a label, and label values are not allowed to start with an equals sign.  I must have previously stripped out such errors from our list of labels, but when the entry was published the label was reintroduced.  The DTD dynamically pulls in the labels and these are then used when validating the XML, but as this list now included ‘=assentement?’ the DTD broke.  With the DTD broken the XSLT that transforms the entry XML into HTML wouldn’t run, meaning every single entry on the site failed to load.  Thankfully, after identifying the issue it was quick to fix: I simply deleted the erroneous label and things started working again, and Geert has updated the entry’s XML to remove the error.
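To reduce the chance of something like this happening again it might be worth adding a sanity check on label values before they are written into the dynamically built DTD.  The sketch below is purely illustrative (the validity test is a simplified approximation of an XML name check) and is not the live AND code:

<?php
// Filter out label values that would not be valid XML names before they are
// written into the dynamically generated DTD. Simplified approximation only.
function filterDtdLabels(array $labels): array
{
    return array_values(array_filter($labels, function (string $label): bool {
        // Must start with a letter or underscore and must not contain
        // characters such as '=' or '?'.
        return (bool) preg_match('/^[A-Za-z_][A-Za-z0-9._-]*$/', $label);
    }));
}

// '=assentement?' would be rejected; 'law' and 'falconry' would pass.
print_r(filterDtdLabels(['law', '=assentement?', 'falconry']));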

For the Books and Borrowing project I had a Zoom call with project PI Katie and Co-I Matt on Monday to discuss the front-end developments and some of the outstanding tasks left to do.  The main one is to implement a genre classification system for books, and we now have a plan for how to deal with these.  Genres will be applied at work level and will then filter down to lower levels.  I also spent some time speaking to Stirling’s IT people about setting up a Solr instance for the project, as discussed in posts before Christmas.  Thankfully it was possible to get this set up and by the end of the week we had a Solr instance set up that I was able to query from a script on our server.  Next week I will begin to integrate Solr queries with the front-end that I’m working on.  I also generated spreadsheets containing all of the book edition and book work data that Matt had requested and engaged in email discussions with Matt and Katie about how we might automatically generate Book Work records from editions and amalgamate some of the many duplicate book edition records that Matt had discovered whilst looking through the data.
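Querying the Solr instance from a server-side script is straightforward as Solr exposes an HTTP API.  The sketch below is illustrative only – the core name and query are assumptions, and the field names are taken from the simplified JSON structure described in the ‘Week Beginning 14th November 2022’ post further down this page:

<?php
// Query the Solr select handler over HTTP and decode the JSON response.
// The host, core name ('bnb') and query are assumptions for illustration.
$params = http_build_query([
    'q'    => 'standardisedtitle:Roman',
    'rows' => 20,
    'wt'   => 'json',
]);
$results = json_decode(file_get_contents('http://localhost:8983/solr/bnb/select?' . $params), true);

echo $results['response']['numFound'] . " matching borrowing records\n";
foreach ($results['response']['docs'] as $doc) {
    echo $doc['bnid'] . ': ' . ($doc['standardisedtitle'] ?? '') . "\n";
}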

Also this week I made a small tweak to the Dictionaries of the Scots Language, replacing the ‘email’ option in the ‘Share’ icons with a different option as the original option was no longer working.  I also had a chat with Jane Stuart-Smith about the website for the VARICS project, replied to a query from someone in Philosophy who had a website that was no longer working, replied to an email from someone who had read my posts about Solr and had some questions and replied to Sara Pons-Sanz, the organiser of last week’s Zurich event who was asking about the availability of some visualisations of the Historical Thesaurus data.  I was able to direct her to some visualisations I’d made a while back that we still haven’t made public (see https://digital-humanities.glasgow.ac.uk/2021-12-06/).

Next week I aim to focus on the development of the Books and Borrowing front-end and the integration of Solr into this.

Week Beginning 9th January 2023

I attended the workshop ‘The impact of multilingualism on the vocabulary and stylistics of Medieval English’ in Zurich this week.  The workshop ran on Tuesday and Wednesday and I travelled to Zurich with my colleagues Marc Alexander and Fraser Dallachy on Monday.  It was really great to travel to a workshop in a different country again as I’d not been abroad since before Lockdown.  I’d never been to Zurich before and it was a lovely city.  The workshop itself was great, with some very interesting papers and good opportunities to meet other researchers and discuss potential future projects.  I gave a paper on the Historical Thesaurus, its categories and data structures and how semantic web technologies may be used to more effectively structure, manage and share the Historical Thesaurus’s semantically arranged dataset.  It was a half-hour paper with 10 minutes for questions afterwards and it went pretty well.  The audience wasn’t especially technical and I’m not sure how interesting the topic was to most people, but it was well received, and I’m glad I had the opportunity both to attend the event and to research the topic, as I have greatly increased my knowledge of semantic web technologies such as RDF, graph databases and SPARQL.  As part of the research I also managed to write a script that generated an RDF version of the complete HT category data, which may come in handy one day.

I got back home just before midnight on the Wednesday and returned to normal work first thing on Thursday.  This included submitting my expenses from the workshop and replying to a few emails that had come in regarding my office (it looks like the dry rot work is going to take a while to resolve and it also looks like I’ll have to share my temporary office) and attempting to set up web hosting for the VARICS project, which Arts IT Support seem reluctant to do.  I also looked into an issue with the DSL that Ann Ferguson had spotted and spoke to the IT people at Stirling about their current progress with setting up a Solr instance for the Books and Borrowing project.  I also replaced a selection of library register images with better versions for that project and arranged a meeting for next Monday with the project’s PI and Co-I to discuss progress with the front-end.

I spent most of Friday writing a Data Management Plan and attending a Zoom call for a new speech therapy project I’m involved with.  It’s an ESRC funding proposal involving Glasgow and Strathclyde and I’ll be managing the technical aspects.  We had a useful call and I managed to complete an initial version of the DMP that the PI is going to adapt if required.

Week Beginning 2nd January 2023

The first week back after the Christmas holidays was supposed to be a three-day week, but unfortunately after returning to work on Wednesday I started with some sort of winter vomiting virus that affected me throughout Wednesday night and I was off work on Thursday.  I was still feeling very shaky on Friday but I managed to do a full day’s work nonetheless.

My two days were mostly spent creating my slides for the talk I’m giving at a workshop in Zurich next week and then practising the talk.  I also engaged in an email conversation about the state of Arts IT Support after the database on the server that hosts many of our most important websites went down on the first day of the Christmas holidays and remained offline for the best part of two weeks.  This took down websites such as the Historical Thesaurus, Seeing Speech, The Glasgow Story and the Emblems websites, and I had to spend time over the holidays replying and apologising to people who contacted me about the sites being unavailable.  As I don’t have command-line access to the servers there was nothing I could do to fix the issue, and despite several members of staff contacting Arts IT Support no response was received from them.  The issue was finally resolved on the 3rd of January, but we have still received no communication from Arts IT Support to inform us that the issue has been resolved, to let us know what caused it or to apologise for the incident, which is really not good enough.  Arts IT Support are in a shocking state at the moment due to critical staff leaving and not being replaced, and I’m afraid it looks like the situation may not improve for several months yet, meaning issues with our websites are likely to continue in 2023.

Week Beginning 19th December 2022

This was the last week before the Christmas holidays, and Friday was a holiday.  I spent some time on Monday making further updates to the Speech Star data.  I fixed some errors in the data and made some updates to the error type descriptions.  I also made ‘poster’ images from the latest batch of child speech videos I’d created last week as this was something I’d forgotten to do at the time.  I also fixed some issues with the non-disordered speech data, including changing a dash to an underscore in the filenames of the files for one speaker as there had been a mismatch between filenames and metadata, causing none of the videos to open in the site.  I also created records for two projects (The Gentle Shepherd and Speak For Yersel) on this very site (see https://digital-humanities.glasgow.ac.uk/projects/last-updated/) as these are the projects I’ve been working on that have actually launched in the past year.  Other major ones such as Books and Borrowing and Speech Star are not yet ready to share.  I also updated all of the WordPress sites I manage to the latest version.

On Tuesday I travelled into the University to locate my new office.  My stuff had been moved across last week after a leak in the building resulted in water pouring through my office.  Plus work is ongoing to fix the dry rot in the building and I would have needed to move out for that anyway.  It took a little time to get the new office in order and to get my computer equipment set up, but once it was all done it was actually a very nice location – much nicer than the horrible little room I’m usually stuck in.

I spent most of Tuesday upgrading Google Analytics for all of the sites I manage that use it.  Google’s current analytics system is being retired in July next year and I decided to use the time in the run-up to Christmas to migrate the sites over to the new Google Analytics 4 platform.  This was a mostly straightforward process, although as usual Google’s systems feel clunky and counterintuitive at times.  It was also a fairly lengthy process as I had to update the code for each site in question.  Nevertheless I managed to get it done and informed all of the staff whose websites would be affected by the change.  I also had a further chat with Geert, the editor of the Anglo-Norman Dictionary, about the new citation editing feature I’m planning at the moment.

On Wednesday I had a meeting with prospective project partners in Strathclyde about a speech therapy proposal we’re putting together.  It was good to meet people and to discuss things.  I’ll be working on the Data Management Plan for the proposal after the holidays.  I spent the rest of the day working on my paper for the workshop I’m attending in Zurich in the second week of January.  I have now finished the paper, which is quite a relief.

On Thursday I spent some time working for the Dictionaries of the Scots Language.  I responded to an email from Ann Fergusson about how we should handle links to ancillary pages in the XML.  There are two issues here that need to be agreed upon.  The first issue is how to represent links to things other than entries in the entry XML.  We currently have the <ref> element that is used to link from one entry to another (e.g. <ref refid="snd00065761">Chowky</ref>).  We could use the HTML element <a> in the XML for links to things other than entries, but I personally think it’s best not to, as it’s better for XML elements to be meaningful when you look at them and the meaning of <a> isn’t especially clear.  It might be better to use <ref> with a different attribute instead of ‘refid’, for example <ref url="https://dsl.ac.uk/geographical-labels">.  Reusing <ref> means we don’t need to update the DTD (the rules that define which elements can be used where in the XML) to add a new element.

Of course other people may think that inventing our own way of writing HTML links is daft when everyone is already familiar with <a href="https://dsl.ac.uk/geographical-labels"> and we could use the latter if people prefer.  If this is the case we would need to update the DTD to allow such elements to be used.  If we didn’t update the DTD the XML files would fail to validate.

Whichever way is chosen, there is a second issue that will need to be addressed:  I will need to update the XSLT that transforms the XML into HTML to tell the script how to handle either a <ref> with a ‘url’ attribute or a <a> with an ‘href’ attribute.  Without updating the XSLT the links won’t work.  I can add such a rule in when we decide how best to represent links in the XML.

I also made a couple of tweaks to the wildcard search term highlighting feature I was working on last week and then published the update on the live DSL site.  Now when you perform a search for something like ‘chr*mas’ and then select an entry to view, any word that matches the wildcard pattern will be highlighted.  For example, go to this page: https://dsl.ac.uk/results/chr*mas/fulltext/withquotes/both/ and then select one of the entries and you’ll see the term highlighted in the entry page.

That’s all from me for this year.  Merry chr*mas one and all!

Week Beginning 12th December 2022

There was a problem with the server on which a lot of our major sites such as the Historical Thesaurus and Seeing Speech are hosted that started on Friday and left all of the sites offline until Monday.  This was a really embarrassing and frustrating situation and I had to deal with lots of emails from users of the sites who were unable to access them.  As I don’t have command-line access to the servers all I could do was report the issue via our IT Helpdesk system.  Thankfully by mid-morning on Monday the sites were all back up again, but the incident raised serious issues about the state of Arts IT Support, who are massively understaffed at the moment.  Arts IT also refused to set up hosting for a project that we’re collaborating with Strathclyde University on, and in fact stated that they would not set up hosting for any further websites, which will have a massive negative impact on several projects that are still in the pipeline and ultimately means I will not be able to work on any new projects until this is resolved.  The PI for the new project with Strathclyde is Jane Stuart-Smith, and thankfully she was also not very happy with the situation.  We arranged a meeting with Liz Broe, who oversees Arts IT Support, to discuss the issues and had a good discussion about how we ended up in this state and how things will be resolved.  In the short-term some additional support is being drafted in from other colleges while new staff will be recruited in the medium term, and Liz has stated that hosting for new websites (including the Strathclyde one) will continue to be offered, which is quite a relief.

I also discovered this week that there has been a leak in 13 University Gardens and water has been pouring through my office.  I was already scheduled to be moved out of the building due to the dry rot that they’ve found all the way up the back wall (which my office is on) but this has made things a little more urgent.  I’m still generally working from home every day except Tuesday and apparently all my stuff has been moved to a different building, so I’ll just need to see how the process has gone when I’m back in the University next week.

In terms of actual work this week, I spent a bit more time writing my paper about the Historical Thesaurus and Semantic Web technologies for the workshop in January.  This is coming together now, although I still need to shape it into a presentation, which will take time.  I also spent some time working on the Speech Star project, updating the speech error database to fix a number of issues with the data that Eleanor had spotted and then adding in new error type descriptions for new error types that had been included.  I also added in some ancillary page content and had a chat with Eleanor about the database system the website uses.

I also spent some time working for the DSL this week.  Rhona had noted that when you perform a full text or quotation search (i.e. a search using Solr) with wildcards (e.g. chr*mas) the search results display entries with snippets that highlight the whole word where the search string occurred (e.g. ‘Christmas’).  However, when clicking through to the entry page such highlighting was not appearing, even though highlighting in the entry page does work when performing a search without wildcards.

Highlighting in the entry page was handled by a jQuery plugin, but this was not written to take wildcards into consideration and only works on full words.  I spent some time trying to figure out how to get wildcard highlighting working myself using regular expressions, but I find regular expressions to be pretty awful to work with – an ancient relic left over from computing in the 1980s and although I managed to get something working it wasn’t ideal.  Thankfully I found an existing JavaScript library called https://markjs.io/ that can handle wildcard highlighting and I was able to replace the existing plugin with this script and update the code to work with it.  I tested this out on our test DSL site and all seems to work well.  I haven’t updated the live site yet, though, as the DSL team need to test the new approach out more fully in case they encounter any problems with it.  I also noticed that there was an issue with the quotation search whereby if you returned to the search results from an entry by clicking on the ‘return to results’ button you got an empty page.  I fixed this in both our live and test sites.

I also spent some time working for the Anglo-Norman Dictionary this week.  I updated the citation search on the public website.  Previously the citation text was only added into the search results if you also searched for a specific form within a siglum, for example https://anglo-norman.net/search/citation/%22tout%22/null/A-N_Falconry, and other citation searches (e.g. just selecting a siglum and / or a siglum date) would only return the entries the siglum appeared in, without the individual citations.  Now the citations appear in these searches too.  For example, all citations from A-N Falconry: https://anglo-norman.net/search/citation/null/null/A-N_Falconry and all citations where the citation date is 1400: https://anglo-norman.net/search/citation/null/1400.  This also means that when you view the citations by pressing on the ‘Search AND Citations’ button for a siglum in the bibliography you now see each citation for the listed entries.

I then spent most of a day thinking through all of the issues relating to the new ‘DMS citation search and edit’ feature that the editor wants me to implement and wrote an initial document detailing how the feature will work.  There has been quite a lot to think through and I thought it wise to document the feature rather than just launching into its creation without a clear plan.  I might have some time to start work on this next week as I’m working up to and including Thursday, but it depends how I get on with some other tasks I need to do for other projects.

Also this week I attended the Christmas lunch for the Books and Borrowing project in Edinburgh.  Unfortunately there was a train strike that day so I decided to get the bus through to Edinburgh.  The journey there was fine, taking about an hour and a half, but I got the 4pm bus on the way back and it was a nightmare, taking two hours and forty minutes.  I will never get the bus between Glasgow and Edinburgh anywhere near rush hour again.

Week Beginning 5th December 2022

I continued my research into RDF, the semantic web and linked open data and how they could be applied to the data of the Historical Thesaurus this week, in preparation for a paper I’ll be giving at a workshop in January and also to learn more about these technologies and concepts in general.  I followed a few tutorials about RDF, for example here https://cambridgesemantics.com/blog/semantic-university/learn-rdf/ and read up about linked open data, for example here https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/.  I also found a site that visualises linked open data projects here https://lod-cloud.net/.

I then manually created a small sample of the HT’s category structure, featuring multiple hierarchical levels and both main and subcategories using the RDF/XML format using the Simple Knowledge Organization System model.  This is a W3C standard for representing thesaurus data in RDF.  More information about it can be found on Wikipedia here: https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System and on the W3C’s website here https://www.w3.org/TR/skos-primer/ and here https://www.w3.org/TR/swbp-skos-core-guide/ and here https://www.w3.org/TR/swbp-thesaurus-pubguide/ and here https://www.w3.org/2001/sw/wiki/SKOS/Dataset.  I also referenced a guide to SKOS for Information Professionals here https://www.ala.org/alcts/resources/z687/skos.  I then imported this manually created sample into the Apache Jena server I set up last week to test that it would work, which thankfully it did.

After that I then wrote a small script to generate a comparable RDF structure for the entire HT category system.  I ran this on an instance of the database on my laptop to avoid overloading the server, and after a few minutes of processing I had an RDF representation of the HT’s hierarchically arranged categories in an XML file that was about 100MB in size.  I fed this into my Apache Jena instance and the import was a success.  I then spent quite a bit of time getting to grips with the SPARQL querying language that is used to query RDF data and by the end of the week I had managed to replicate some of the queries we use in the HT to generate the tree browser, for example ‘get all noun main categories at this level’ or ‘get all noun main categories that are direct children of a specified category’.
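By way of illustration, here is a minimal sketch of running one of those queries against the Fuseki endpoint from a PHP script.  Only skos:broader and skos:prefLabel come from the SKOS model itself; the dataset name, the ht: properties for part of speech and main/subcategory status and the category URI are illustrative assumptions rather than the structure I actually used:

<?php
// Fetch all noun main categories that are direct children of a given
// category from a Fuseki SPARQL endpoint. A sketch only.
$sparql = '
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ht:   <http://example.org/ht/schema#>
SELECT ?child ?label WHERE {
  ?child skos:broader <http://example.org/ht/category/012345> ;
         skos:prefLabel ?label ;
         ht:pos "n" ;
         ht:isMainCategory true .
} ORDER BY ?label';

$url = 'http://localhost:3030/ht/query?' . http_build_query(['query' => $sparql]);
$context = stream_context_create(['http' => [
    'header' => "Accept: application/sparql-results+json\r\n",
]]);
$results = json_decode(file_get_contents($url, false, $context), true);

foreach ($results['results']['bindings'] as $row) {
    echo $row['child']['value'] . ' ' . $row['label']['value'] . "\n";
}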

I then began experimenting with other RDF tools in the hope of being able to generate some nice visualisations of the RDF data, but this is where things came a bit unstuck.  I set up a nice desktop RDF database called GraphDB (https://www.ontotext.com/products/graphdb/) and also experimented with the Neo4J graph database (https://neo4j.com/) as my assumption had been that graph databases (which store data as dots and lines, like RDF triples) would include functionality to visualise these connections.  Unfortunately I have not been able to find any tools that allow you to just plug RDF data in and visualise it.  I found a Stack Overflow page about this (https://stackoverflow.com/questions/66720/are-there-any-tools-to-visualize-a-rdf-graph-please-include-a-screenshot) but none of the suggestions on the page seemed to work.  I tried downloading the desktop visualisation tool Gephi (https://gephi.org/) as apparently it had a plugin that would enable RDF data to be used, but the plugin is no longer available and other visualisation frameworks such as D3 do not work with RDF data but require the data to be migrated to another format first.  It seems strange that data structured in such a way as to make it ideal for network style visualisations should have no tools available to natively visualise the data and I am rather disappointed by the situation.  Of course it could just be that my Google skills have failed me, but I don’t think so.

In addition to the above I spent some time actually writing the paper that all of this will go into.  I also responded to a query from a researcher at Strathclyde who is putting together a speech and language therapy proposal and wondered whether I’d be able to help out, given my involvement in several other such projects.  I also spoke to the IT people at Stirling about the Solr instance for the Books and Borrowing project and made a few tweaks to the Speech Star project’s introductory text.

 

Week Beginning 28th November 2022

There was another strike day on Wednesday this week so it was a four-day week for me.  On Monday I attended a meeting about the Historical Thesaurus, and afterwards I dealt with some issues that cropped up.  These included getting an up to date dump of the HT database to Marc and Fraser, investigating a new subdomain to use for test purposes, looking into adding a new ‘sensitive’ flag to the database for categories that contain potentially offensive content, reminding people where our latest stats page is located and looking into connections between the HT and Mapping Metaphor datasets.  I also spent some more time this week researching semantic web technologies and how these could be used for thesaurus data.  This included setting up an Apache Jena instance on my laptop with a Fuseki server for querying RDF triples using the SPARQL query language.  See https://jena.apache.org/ and https://jena.apache.org/documentation/fuseki2/index.html for more information on these.  I played around with some sample datasets and thought about how our thesaurus data might be structured to use a similar approach.  Hopefully next week I’ll migrate some of the HT data to RDF and experiment with it.

Also this week I spent quite a bit of time speaking to IT Services about the state of the servers that Arts hosts, and migrating the Cullen Project website to a new server, as the server it is currently on badly needs upgrades and there is currently no-one to manage this.  Migrating the Cullen Project website took the best part of a day to complete, as all database queries in the code needed to be upgraded.  This took some investigation as it turns out ‘mysqli_’ requires a connection to be passed to it in many of its functions where ‘mysql_’ doesn’t, and where ‘mysql_’ does accept a connection, ‘mysqli_’ takes the connection and the query string in the opposite order.  There were also some character encoding issues cropping up.  It turned out that these were caused by the database not being UTF-8, and the database connection script needed to set the character set to ‘latin1’ for the characters to display properly.  Luca also helped with the migration, dealing with the XML and eXistDB side of things, and by the end of the week we had a fully operational version of the site running at a temporary URL on a new server.  We put in a request to have the DNS for the project’s domain switched to the new server and once this takes effect we’ll be able to switch the old server off.
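To illustrate the differences mentioned above, here is a small sketch with placeholder credentials and a made-up query rather than the actual Cullen Project code:

<?php
// Old style: no connection needed, and where one was accepted it came after
// the query string:
//   $result = mysql_query('SELECT * FROM letters');
//   $result = mysql_query('SELECT * FROM letters', $conn);

// mysqli_ requires the connection, and it comes first:
$conn = mysqli_connect('localhost', 'user', 'password', 'cullen');

// The database is not UTF-8, so the connection character set has to be set
// to latin1 for characters to display properly.
mysqli_set_charset($conn, 'latin1');

$result = mysqli_query($conn, 'SELECT * FROM letters');
while ($row = mysqli_fetch_assoc($result)) {
    echo $row['id'] . "\n";
}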

Also this week I fixed a couple of minor issues with a couple of the place-names resources, participated in an interview panel for a new role at college level, duplicated a section of the Seeing Speech website on the Dynamic Dialects website at the request of Eleanor Lawson and had discussions about moving out of my office due to work being carried out in the building.

Week Beginning 21st November 2022

I participated in the UCU strike action on Thursday and Friday this week, so it was a three-day week for me.  I spent much of this time researching RDF, SKOS and OWL semantic web technologies in an attempt to understand them better in the hope that they might be of some use for future thesaurus projects.  I’ll be giving a paper about this at a workshop in January so I’m not going to say too much about my investigations here.  There is a lot to learn about, however, and I can see me spending quite a lot more time on this in the coming weeks.

Other than this I returned to working on the Anglo-Norman Dictionary.  I added in a line of text that had somehow been omitted from one of the Textbase XML files and added in a facility to enable project staff to delete an entry from the dictionary.  In reality this just deactivates the entry, removing it from the front-end but still keeping a record of it in the database in case the entry needs to be reinstated.  I also spoke to the editor about some proposed changes to the dictionary management system and begin to think about how these new features will function and how they will be developed.

For the Books and Borrowing project I had a chat with IT Services at Stirling about setting up an Apache Solr system for the project.  It’s looking like we will be able to proceed with this option, which will be great.  I also had a chat with Jennifer Smith about the new Speak For Yersel project areas.  It looks like I’ll be creating the new resources around February next year.  I also fixed an issue with the Place-names of Iona data export tool and discussed a new label that will be applied to data for the ‘About’ box for entries in the Dictionaries of the Scots Language.

I also prepared for next week’s interview panel and engaged in a discussion with IT Services about the future of the servers that are hosted in the College of Arts.

Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow.  Facetted searching would still require several other database queries to be executed, as would extracting all of the fields that would be necessary to display the search results and it seemed inadvisable to try and create all of this functionality myself when an existing package like Solr could already do it all.

Solr is considerably faster than using the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously which would be hopelessly slow if I attempted to create something comparable directly with the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

"London",2211, "Edinburgh",119, "Dublin",100, "Paris",30, "Edinburgh; London",16, "Cambridge",4, "Eton",3, "Oxford",3, "The Hague",3, "Naples",2, "Rome",2, "Berlin",1, "Glasgow",1, "Lausanne",1, "Venice",1, "York",1

Format:

"8vo",842, "octavo",577, "4to",448, "quarto",433, "4to.",88, "8vo., plates, port., maps.",88, "folio",76, "duodecimo",67, "Folio",33, "12mo",19, "8vo.",17, "8vo., plates: maps.",16

Borrower gender:

"Male",3128, "Unknown",109, "Female",64, "Unclear",2

These would then allow me to build in the options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database it is likely that such an approach would be much slower and would take me time to develop.  It seems much more sensible to use an existing solution if this is going to be possible.
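To give a flavour of how the facet counts above are produced, here is a minimal sketch of a facetted Solr query.  The core name is an assumption, and the field names match the simplified JSON structure shown below:

<?php
// Ask Solr for facet breakdowns rather than documents. Each facet field is
// passed as a repeated 'facet.field' parameter.
$query = 'q=' . urlencode('standardisedtitle:Roman')
       . '&rows=0'                 // counts only; no documents needed
       . '&facet=true'
       . '&facet.field=pubplaces'
       . '&facet.field=formats'
       . '&facet.field=bgenders'
       . '&wt=json';
$response = json_decode(file_get_contents('http://localhost:8983/solr/bnb/select?' . $query), true);

// Each facet field comes back as an array of alternating values and counts.
print_r($response['facet_counts']['facet_fields']);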

In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes and includes a lot of data that’s not really needed either, as the returned data contains everything that’s needed to display the full borrowing record.  I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could be used in a simplified search results page.  Here’s an example:

{
  "bnid": 1379,
  "lid": 6,
  "slug": "glasgow-university",
  "lname": "Glasgow University Library",
  "rid": 2,
  "rname": "3",
  "syear": 1760,
  "eyear": 1765,
  "rtype": "Student",
  "pid": 107,
  "fnum": "4r",
  "transcription": "Euseb: Eclesiastical History",
  "bday": 17,
  "bmonth": 9,
  "byear": 1760,
  "rday": 1,
  "rmonth": 10,
  "ryear": 1760,
  "borrowed": "1760-09-17",
  "returned": "1760-10-01",
  "bdayofweek": "Wednesday",
  "rdayofweek": "Wednesday",
  "originaltitle": "",
  "standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius.",
  "brids": ["1"],
  "bfnames": ["Charles"],
  "bsnames": ["Wilson"],
  "bfullnames": ["Charles Wilson"],
  "boccs": ["University Student", "Education"],
  "bgenders": ["Male"],
  "aids": ["74"],
  "asnames": ["Eusebius of Caesarea"],
  "afullnames": [" Eusebius of Caesarea"],
  "beids": ["88"],
  "edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius."],
  "estcs": ["R21513"],
  "langs": ["English"],
  "pubplaces": ["London"],
  "formats": ["folio"]
}

I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server.  I then created a Solr Core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations, some fields are text strings, some are integers, some are dates). I then ran the Solr script that ingests the data.
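For illustration, one way of ingesting the exported JSON (not necessarily exactly how my ingest process works) is simply to post it to the core’s update handler over HTTP; the core name and export path below are assumptions:

<?php
// Post the exported JSON documents to Solr's update handler and commit.
// In practice the files would be posted in batches rather than all at once.
$docs = [];
foreach (glob('/path/to/export/*.json') as $file) {
    $docs[] = json_decode(file_get_contents($file), true);
}

$context = stream_context_create(['http' => [
    'method'  => 'POST',
    'header'  => "Content-Type: application/json\r\n",
    'content' => json_encode($docs), // json_encode also takes care of escaping
                                     // quotes within titles and names
]]);
echo file_get_contents('http://localhost:8983/solr/bnb/update?commit=true', false, $context);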

It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected.  These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.

Some issues encountered include:

  1. Empty fields in the data resulting in no data for the corresponding JSON field (e.g. "bday": <nothing here>), which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included.
  2. Solr’s date structure requiring a full date (e.g. 1792-02-16), meaning partial dates (e.g. 1792) failed. I ended up reverting to an integer field for returned dates, as these are generally much more vague, and having to generate placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed and it took some time to investigate why my nice integer ID was failing.
  4. Realising more fields should be added to the JSON output as I went on and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors).
  5. Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quot;’ as their appearance in the JSON caused the structure to be invalid.  I therefore updated the transcription, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names.

However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well.  And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9 https://solr.apache.org/downloads.html) to be installed on a server.  This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and as far as I know it fires up a Java based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions about doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API) so we could limit access to the IP address of this server,  or if Solr is going to be installed on the same server then limiting access to localhost could work.

In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud).  Once Solr is running we’d need a Core to be created.  I have the schema file the core would require and can give instructions about setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.

One downside to using Solr is it is a separate system to the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently as exporting the data takes at least an hour, then transferring the files to the server for ingest will take a long time (uploading hundreds of thousands of small files to a server can take hours.  Zipping them up then uploading the zip file and extracting the file also takes a long time).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.

Other than the above B&B work I did spend a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week.  I also began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when an advanced ‘quotes only’ search is performed.  So for example in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.
Sic dreich wark. . . . For lang I tholed an’ fendit.
Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.
And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.
It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.
Driche and sair yer pain.
And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.
See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.
</field>

The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize variable, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”.  We don’t currently override this default so 100 characters is what we use per snippet (roughly – it can extend more than this to ensure complete words are displayed).

The hl.snippets variable specifies the maximum number of highlighted snippets that are returned per entry and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed and this is because the maximum number of snippets has been reached.  ‘Dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
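To make the above concrete, here is a sketch of the sort of highlighting request involved.  The core name is an assumption; the field name is the one shown in the block above and the hl parameters are the ones just discussed:

<?php
// Run a quotations search with highlighting enabled and print the snippets.
$params = http_build_query([
    'q'           => 'searchtext_onlyquotes:dreich',
    'hl'          => 'true',
    'hl.fl'       => 'searchtext_onlyquotes',
    'hl.fragsize' => 100,  // approximate snippet size in characters (the default)
    'hl.snippets' => 10,   // maximum number of snippets returned per entry
    'wt'          => 'json',
]);
$response = json_decode(file_get_contents('http://localhost:8983/solr/dsl/select?' . $params), true);

// Snippets are returned per matching document, keyed by document ID.
print_r($response['highlighting']);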

As the quotations block of text is just one massive block and isn’t split into individual quotations the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich Adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

Which actually comprises the text from almost the entire first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”

 

Includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but ‘dreigher’ which is not highlighted).

So essentially, while the snippets may look like they correspond to individual quotes this is absolutely not the case: the highlighted word is generally positioned around the middle of roughly 100 characters of text that can include several quotations.  It also means that it is not currently possible to limit a search to two terms that appear within one single quotation, because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on such as part of speech.  This will ensure snippets will in future only feature text from the quotation in question and that Boolean searches will be limited to text within individual quotations.  However, it is a major change and it will require some time and experimentation to get working correctly, and it may introduce other unforeseen issues.

I will need to change the way the search data is stored in Solr and I will need to change how the data is generated for ingest into Solr.  The display of the search results will need to be reworked as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry.  I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.
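One option worth noting here is Solr’s built-in result grouping, which could keep quotation-level documents organised by entry and cap how many quotations are shown per entry.  The sketch below is speculative: the core name and field names are assumptions about the proposed new structure, not anything that currently exists:

<?php
// Search quotation-level documents, filter by a date range and group the
// results by the entry they belong to.
$params = http_build_query([
    'q'           => 'quotetext:dreich',
    'fq'          => 'quotedate:[1800 TO 1900]', // date filter on quotations
    'group'       => 'true',
    'group.field' => 'entryid',
    'group.limit' => 5,  // cap the quotations returned per entry
    'wt'          => 'json',
]);
$response = json_decode(file_get_contents('http://localhost:8983/solr/dsl_quotes/select?' . $params), true);

foreach ($response['grouped']['entryid']['groups'] as $group) {
    echo $group['groupValue'] . ': ' . $group['doclist']['numFound'] . " quotations\n";
}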

The DSL team has proposed that a date search could be provided as a filter on the search results page and we would certainly be able to do this, and incorporate other filters such as POS in future.  This is something called ‘facetted searching’ and it’s the kind of thing you see in online shops:  you view the search results then you see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results the filter applies to.  The good news is that Solr has these kind of faceting options built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.

Week Beginning 7th November 2022

I participated in an event about Digital Humanities in the College of Arts that Luca had organised on Monday, at which I discussed the Books and Borrowing project.  It was a good event and I hope there will be more like it in future.  Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/) which is an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project.  I will be very interested to hear how Curious Travellers gets on with the tool in future.  Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before that allows you to upload a text and then access many analysis and visualisation tools.  It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.

Also this week I had to prepare for and participate in a candidate shortlisting session for a new systems developer post in the College of Arts, and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host.  We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily.  As of yet we’ve still not heard anything further from IT Services, so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of, or whether they relate to the code or the underlying server infrastructure.  Hopefully we’ll hear more soon.

The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project.  One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that.  Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.

I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box.  This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page.  The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover.  For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records.  Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.
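For illustration, below is a minimal sketch of the kind of pattern matching involved in handling a date search such as ‘1790/02-1792/09’; the real search supports a wider range of patterns and this is not the actual code:

<?php
// Turn an input such as '1790/02-1792/09' into a numeric year/month range.
function parseDateRange(string $input): ?array
{
    // Matches 1790-1792, 1790/02-1792, 1790-1792/09 and 1790/02-1792/09.
    if (!preg_match('/^(\d{4})(?:\/(\d{2}))?-(\d{4})(?:\/(\d{2}))?$/', $input, $m)) {
        return null;
    }
    $startMonth = (isset($m[2]) && $m[2] !== '') ? $m[2] : '01';
    $endMonth   = (isset($m[4]) && $m[4] !== '') ? $m[4] : '12';
    return [
        'start' => (int) ($m[1] . $startMonth), // e.g. 179002
        'end'   => (int) ($m[3] . $endMonth),   // e.g. 179209
    ];
}

print_r(parseDateRange('1790/02-1792/09'));

Each borrowing record’s borrowed year and month can then be combined in the same way ((byear * 100) + bmonth) and tested against this range.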

The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page.  The only difference is that additional information about the library, register and page the borrowing record appears on can be found at the top of the record.  These appear as links, and if you press on the page link this will open the page centred on the selected borrowing record.  For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:

The non-date search also works, but is currently a bit too slow.  For example a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long.  Currently non-date quick searches do a very simple find and replace to highlight the matched text in all relevant fields.  This currently makes the matched text upper case, but I don’t intend to leave it like this.  You can also search for things like the ESTC too.

However, there are several things I’m not especially happy about:

  1. The speed issue: the current approach is just too slow
  2. Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed or by borrower name) many more fields in addition to borrowing ID would need to be returned, potentially for thousands of records, and this is going to be too slow with the current data structure.
  3. The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s a bit hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this.  Should we have a more compact view of results?  If so what data should be displayed?  The difficulty is if we omit a field that is the only field that includes the user’s search term it’s potentially going to be very confusing
  4. This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores:  You see the search results and then there are checkboxes that allow you to narrow down the results.  For example, we could have checkboxes containing all occupations in the results allowing the user to select one or more.  Or we have checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
  5. Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing occupations or gender of the borrowings in the search results etc.  I feel that we need some sort of summary information as the results themselves are just too detailed to easily get an overall picture of.

I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon) and it does a lot of the things I’d like to implement (graphs, facetted search results), all very speedily and with a pleasing interface, so I think we could learn a lot from this.

Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/), which is a free search platform that is much faster than a traditional relational database and provides options for facetted searching.  We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it.  Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data.  But if we are potentially going to use Solr then we would need to install it on a server at Stirling.  Stirling’s IT people might be ok with this (they did allow us to set up an IIIF server for our images, after all) but we’d need to check.  I should have a better idea as to whether Solr is what we need by the end of next week, all being well.

Also this week I spent some time working on the Speech Star project.  I updated the database to highlight key segments in the ‘target’ field, which are now marked in the data by surrounding the segment with bar characters (these segments had been highlighted in the original spreadsheet version of the data).  I’d suggested this approach as all Excel formatting such as bold text is lost when exporting data from Excel to a CSV file, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field.  This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier, but I got there in the end.  After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:

I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.

Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.