Week Beginning 20th April 2020

This was the fifth week of lockdown and my first full week back after the Easter holidays, which, as with previous weeks, I needed to split between working and home-schooling my son.  There was an issue this week with the database on the server that powers many of the project websites, which meant all of those websites stopped working.  I had to spend some time liaising with Arts IT Support to get the issue sorted (as I don’t have the necessary server-level access to fix such matters) and replying to the many emails from the PIs of projects who understandably wanted to know why their website was offline.  The server was unstable for about 24 hours, but has thankfully been working without issue since then.

Alison Wiggins got in touch with me this week to discuss the content management system for the Mary, Queen of Scots Letters sub-project, which I set up for her last year as part of her Archives and Writing Lives project.  There is now lots of data in the system and Alison is editing it, and she wanted me to make some changes to the interface to make the process a little swifter.  I changed how sorting works on the ‘browse documents’ and ‘browse parties’ pages.  The pages are paginated and previously the sorting only affected the subset of records found on the current page rather than the whole dataset.  I updated this so that sorting now reorders everything, and I also updated the ‘date sorting’ column so that it now uses the ‘sort_date’ field rather than the ‘display_date’ field.  Alison had also noticed that the ‘edit party’ page wasn’t working, and I discovered a bug on this page that was preventing updates from being saved to the database, which I fixed.  I also created a new ‘Browse Collections’ page and added it to the top menu.  This is a pretty simple page that lists the distinct collections alphabetically and for each lists their associated archives and documents, each with a link through to the relevant ‘edit’ page.  Finally, I gave Alison some advice on editing the free-text fields, which use the TinyMCE editor, to strip out unwanted HTML that has been pasted into them, and I thought about how we might present this data to the public.
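As a note for future reference, the fix for the sorting issue was simply to apply the ordering to the full query before pagination, rather than to the records already fetched for the current page.  Below is a minimal sketch of that approach; the ‘documents’ table name and the use of SQLite are purely illustrative (the CMS doesn’t use this exact code), although the ‘sort_date’ and ‘display_date’ fields are real.

----
import sqlite3

# Sort the whole dataset first, then slice out the requested page, so every
# page reflects the same global ordering.  In real code the order_by value
# should be checked against a whitelist of column names before being used.
def fetch_page(conn, page, per_page=50, order_by="sort_date"):
    offset = (page - 1) * per_page
    sql = ("SELECT id, title, display_date FROM documents "
           "ORDER BY " + order_by + " LIMIT ? OFFSET ?")
    return conn.execute(sql, (per_page, offset)).fetchall()
----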

I also responded to a query from Matthew Creasy about the website for a new project he is working on.  I set up a placeholder website for this project a couple of months ago and Matthew is now getting close to the point where he wants the website to go live.  Gerry McKeever also got in touch with me to ask whether I would write some sections about the technology behind the interactive map I created for his Regional Romanticism project, for a paper he is putting together.  We talked a little about the structure of this and I’ll write the required sections when he needs them.

Other than these issues I spent the bulk of the week working on the Books and Borrowing project.  Katie and Matt got back with some feedback on the data description document that I completed and sent to them before Easter, and I spent some time going through this feedback and making an updated version of the document.  After sending the document off I started working on a description of the content management system.  This required a lot of thought and planning, as I needed to consider how all of the data as defined in the document would be added, edited and deleted in the most efficient and easy-to-use manner.  By the end of the week I’d written some 2,500 words about the various features of the CMS, but there is still a lot to do.  I’m hoping to have a version completed and sent off to Katie and Matt early next week.

My other big task of the week was to work with the data for the Anglo-Norman Dictionary again.  As mentioned previously, this project’s data and systems are in a real mess and I’m trying to get it all sorted along with the project’s Editor, Heather Pagan.  Previously we’d figured out that there was a version of the AND data in a file called ‘all.xml’, but that it did not contain the updates to the online data from the project’s data management system and we instead needed to somehow extract the data relating to entries from the online database.

Back in February, when looking through the documentation again, I discovered where the entry data was located: in a file called ‘entry_hash’ within the directory ‘/var/data’.  I stated that the data was stored in a Berkeley DB and gave Heather a link to a place where the database files could be downloaded (https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html).

I spent some time this week trying to access the data.  It turns out that the download is not for a nice, standalone database program but is instead a collection of files that only seem to do anything when they are called from your own program written in something like C or Java.  There was a command called ‘db_dump’ that would supposedly take the binary hash file and export it as plain text.  This did work, in that the file could then be read in a text editor, but it was unfortunately still a hash file – just a series of massively long lines of numbers.

What I needed was just some way to view the contents of the file, and thankfully I came across this answer: https://stackoverflow.com/a/19793412, which suggests using the Python programming language to export the contents of a hash file.  However, Python dropped support for the ‘dbhash’ module years ago, so I had to track down and install a version of Python from 2010 for this to work.  Also, I’ve not really used Python before, so that took some getting used to.  Thankfully, with a bit of tweaking I was able to write a short Python script that appears to read through the ‘entry_hash’ file and output each record as a line of plain text.  I’m including the script here for future reference:

----

# This needs a Python 2 interpreter: the 'dbhash' module was removed in Python 3.
import dbhash

# Open the Berkeley DB hash file and write each stored value out as a line of text.
f = open("testoutput.txt", "w+")
for k, v in dbhash.open("entry_hash").iteritems():
    f.write(v + "\n")
f.close()

----

The resulting file includes XML entries and is 3,969,350 lines long.  I sent this to Heather for her to have a look at, but Heather reckoned that some updates to the data were not present in this output.  I wrote a little script that counts the number of <entry> elements in each of the XML files (all.xml and entry_hash.xml): the former has 50,426 entries while the latter has 53,945, so the data extracted from the ‘entry_hash’ file is definitely not the same.  The former file is about 111MB in size while the latter is 133MB, so there’s definitely a lot more content, which I think is encouraging.  Further investigation showed that the ‘all.xml’ file was actually generated in 2015, while the data I’ve exported is the ‘live’ data as it is now, which is good news.  However, it would appear that data in the data management system that has not yet been published is stored somewhere else.  As this represents two years of work it is data that we really need to track down.
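For future reference, the counting check was along these lines (a rough sketch rather than the exact script): it streams through each file and tallies opening <entry> tags, so the 100MB+ files never need to be held in memory at once.

----
import re

# Matches an opening <entry> tag, with or without attributes.
ENTRY_TAG = re.compile(r"<entry[\s>]")

def count_entries(filename):
    total = 0
    with open(filename) as f:
        for line in f:
            total += len(ENTRY_TAG.findall(line))
    return total

for name in ("all.xml", "entry_hash.xml"):
    print(name, count_entries(name))
----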

I went back through the documentation of the old system, which really is pretty horrible to read and unnecessarily complicated.  Plus there are multiple versions of the documentation without any version control stating which is the most up-to-date version.  I have a version in Word and a PDF containing images of scanned pages of a printout.  The two differ massively without it being clear which supersedes which.  It turns out the scanned version is likely to be the most up to date, but of course, being badly scanned images that are all wonky, it’s not possible to search the text, and attempting OCR didn’t work.  After a lengthy email conversation with Heather we realised we would need to get back into the server to try and figure out where the data for the DMS was located.  Heather needed to be in her office at work to do this, and on Friday she managed to get access; via a Zoom call I was able to see the server and discuss the potential data locations with her.  It looks like all of the data that has been worked on in the DMS but has yet to be integrated with the public site is located in a directory called ‘commit-out’.  This contains more than 13,000 XML files, which I now have access to.  If we can combine this with the data from ‘entry_hash’ and the data Heather and her colleague Geert have been working on for the letters R and S then we should have the complete dataset.  Of course it’s not quite so simple, as whilst looking through the server we realised that there are many different locations where a file called ‘entry_hash’ is found, and no clear way of knowing which is the current version of the public data and which is just some old version that is not in use.  What a mess.  Anyway, progress has been made, and our next step is to check that the ‘commit-out’ files do actually represent all of the changes made in the DMS and that the version of ‘entry_hash’ that I have so far extracted is the most up-to-date version.

Week Beginning 13th April 2020

I was on holiday for all of last week and Monday and Tuesday this week.  My son and I were supposed to be visiting my parents for Easter, but we were unable to do so due to the lockdown and instead had to find things to amuse ourselves with around the house.  I answered a few work emails during this time, including alerting Arts IT Support to some issues with the WordPress server and responding to a query from Ann Fergusson at the DSL.  I returned to work (from home, of course) on Wednesday and spent the three days working on various projects.

For the Books and Borrowing project I spent some time downloading and looking through the digitised and transcribed borrowing registers of St. Andrews.  They have made three registers from the second half of the 18th century available via a Wiki interface (see https://arts.st-andrews.ac.uk/transcribe/index.php?title=Main_Page) and we were given access to all of these materials, which had been extracted and processed by Patrick McCann, who I used to work very closely with back when we were both based at HATII and worked for the Digital Curation Centre.  Having looked through the materials it’s clear that we will be able to use the transcriptions, which will be a big help.  The dates will probably need to be manually normalised, though, and we will need access to higher resolution images than the ones we have been given in order to make a zoom and pan interface using them.

I also updated the introductory text for Gerry McKeever’s interactive map of the novel Paul Jones, and I think this feature is now ready to go live, once Gerry wants to launch it.  I also fixed an issue with the Editing Robert Burns website that was preventing the site editors (namely Craig Lamont) from editing blog posts.  I also created a further new version of the Burns Supper map for Paul Malgrati.  This version incorporates updated data, which has greatly increased the number of Suppers that appear on the map, and I also changed the way videos work.  Previously, if an entry had a link to a video then a button was added to the entry that linked through to the externally hosted video site (which could be YouTube, Facebook, Twitter or some other site).  Instead, the code now identifies the origin of the video, and I’ve managed to embed players from YouTube, Facebook and Twitter.  These now open the videos in the same drop-down overlay as the images.  The YouTube and Facebook players are centre aligned but unfortunately Twitter’s player displays to the left and can’t be altered.  Also, the YouTube and Facebook players expect the width and height of the player to be specified.  I’ve taken these from the available videos, but ideally the desired height and width should be stored as separate columns in the spreadsheet so these can be applied to each video as required.  Currently all YouTube and all Facebook videos have the same width and height, which can mean landscape-oriented Facebook videos appear rather small, for example.  Also, some videos can’t be embedded due to their settings (e.g. the Singapore Facebook video).  However, I’ve added a ‘watch video’ button underneath the player so people can always click through to the original posting.
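The origin detection is just straightforward host matching on each entry’s video URL.  The map itself runs in the browser, but here is a minimal Python sketch of the logic; the function name and the ‘other’ fallback are purely illustrative.

----
from urllib.parse import urlparse

# Classify a video link by its host so the right embedded player can be built.
def video_source(url):
    host = urlparse(url).netloc.lower()
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    if "facebook.com" in host:
        return "facebook"
    if "twitter.com" in host:
        return "twitter"
    return "other"  # no embed available; just show the 'watch video' button
----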

I also responded to a query from Rhona Alcorn about how DSL data exported from their new editing system will be incorporated into the live DSL site, responded to a query from Thomas Clancy about making updates to the Place-names of Kirkcudbrightshire website and responded to a query from Kirsteen McCue about an AHRC proposal she’s putting together.

I returned to looking at the ‘guess the category’ quiz that I’d created for the Historical Thesaurus before the Easter holidays and updated the way it worked.  I reworked the way the database is queried so as to make things more efficient, to ensure the same category isn’t picked as more than one of the four options, and to ensure that the selected word isn’t also found in one of the three ‘wrong’ category choices.  I also decided to update the category table to include two new columns, one that holds a count of the number of lexemes that have a ‘wordoed’ and the other that holds a count of the number of lexemes that have a ‘wordoe’ in each category.  I then ran a script that generated these figures for all 250,000 or so categories.  This is really just caching information that can be gleaned from a query anyway, but it makes querying a lot faster and makes it easier to pinpoint categories of a particular size, and I think these columns will be useful for tasks beyond the quiz (e.g. show me the 10 largest Aj categories).  I then created a new script that queries the database using these columns and returns data for the quiz.

This script is much more streamlined and considerably less prone to getting stuck in loops of finding nothing but unsuitable categories.  Currently the script is set to only bring back categories that have at least two OED words in them, but this could easily be changed to target larger categories only (which would presumably make the quiz more of a challenge).  I could also add in a check to exclude any words that are also found in the category name to increase the challenge further.  The actual quiz page itself was pretty much unaltered during these updates, but I did add in a ‘loading’ spinner, which helps the transition between questions.
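For reference, the count-caching step mentioned above boils down to a one-off update along these lines.  This is a rough sketch only: the ‘oed_count’ and ‘oe_count’ column names, the table layout and the use of SQLite are illustrative, although the ‘wordoed’ and ‘wordoe’ lexeme fields are real.

----
import sqlite3

conn = sqlite3.connect("ht.db")  # illustrative database file
# For each category, cache how many of its lexemes have an OED word form and
# how many have an OE word form, so size-based filtering is cheap at query time.
conn.execute("""
    UPDATE category SET
        oed_count = (SELECT COUNT(*) FROM lexeme
                     WHERE lexeme.catid = category.id
                       AND lexeme.wordoed IS NOT NULL AND lexeme.wordoed != ''),
        oe_count  = (SELECT COUNT(*) FROM lexeme
                     WHERE lexeme.catid = category.id
                       AND lexeme.wordoe IS NOT NULL AND lexeme.wordoe != '')
""")
conn.commit()
----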

I’ve also created an Old English version of the quiz which works in the same way except the date of the word isn’t displayed and the ‘wordoe’ column is used.  Getting 5/5 on this one is definitely more of a challenge!  Here’s an example question:

I spent the rest of the week upgrading all of the WordPress sites I manage to the latest WordPress release.  This took quite a bit of time as I had to track down the credentials for each site, many of which I didn’t already have a note of at home.  There were also some issues with some of the sites that I needed to get Arts IT Support to sort out (e.g. broken SSL certificates, sites with the login page blocked even when using the VPN).  By the end of the week all of the sites were sorted.

Week Beginning 30th March 2020

This was the second week of the Coronavirus lockdown and I followed a similar arrangement to last week, managing to get a pretty decent amount of work done in between home-schooling sessions for my son.  I spent most of my time working for the Books and Borrowing project.  I had a useful conference call with the PI Katie Halsey and Co-I Matt Sangster last week, and the main outcome of that meeting for me was that I’d further expand upon the data design document I’d previously started in order to bring it into line with our understanding of the project’s requirements.  This involved some major reworking of the entity-relationship diagram I had previously designed based on my work with the sample datasets, with the database structure increasing from 11 related tables to 21, incorporating a new system to trace books and their authors across different libraries, including borrower cross-references and greatly increasing the data recorded about libraries.  I engaged in many email conversations with Katie and Matt over the course of the week as I worked on the document, and on Friday I sent them a finalised version consisting of 34 pages and more than 7,000 words.  This is still an ‘in progress’ version and will no doubt need further tweaks based on feedback and also as I build the system, but I’d say it’s a pretty solid starting point.  My next step will be to add a new section to the document that describes the various features of the content management system that will connect to the database and enable the project’s RAs to add and edit data in a streamlined and efficient way.

Also this week I did some further work for the DSL people, who have noticed some inconsistencies with the way their data is stored in their own records compared to how it appears in the new editing system that they are using.  I wasn’t directly involved in the process of getting their data into the new editing system but spent some time going through old emails, looking at the data and trying to figure out what might have happened.  I also had a conference call with Marc Alexander and the Anglo-Norman Dictionary people to discuss the redevelopment of their website.  It looks like this will be going ahead and I will be doing the redevelopment work.  I’ll try to start on this after Easter, with my first task being the creation of a design document that will map out exactly what features the new site will include and how these relate to the existing site.  I also need to help the AND people to try and export the most recent version of their data from the server as the version they have access to is more than a year old.  We’re going to aim to relaunch the site in November, all being well.

I also had a chat with Fraser Dallachy about the new quiz I’m developing for the Historical Thesaurus.  Fraser had a couple of good ideas about the quiz (e.g. making versions for Old and Middle English) that I’ll need to see about implementing in the coming weeks.  I also had an email conversation with the other developers in the College of Arts about documenting the technologies that we use or have used in the past for projects and made a couple of further tweaks to the Burns Supper map based on feedback from Paul Malgrati.

I’m going to be on holiday next week and won’t be back to work until Wednesday the 15th of April so there won’t be any further updates from me for a while.