Week Beginning 20th April 2020

This was the fifth week of lockdown and my first full week back after the Easter holidays, which as with previous weeks I needed to split between working and home-schooling my son.  There was an issue this week with the database on the server that powers many of the project websites, which meant all of those websites stopped working.  I had to spend some time liaising with Arts IT Support to get the problem sorted (as I don’t have the necessary server-level access to fix such matters) and replying to the many emails from project PIs who understandably wanted to know why their websites were offline.  The server was unstable for about 24 hours, but has thankfully been working without issue since then.

Alison Wiggins got in touch with me this week to discuss the content management system for the Mary, Queen of Scots Letters sub-project, which I set up for her last year as part of her Archives and Writing Lives project.  There is now lots of data in the system and Alison is editing it, and she wanted me to make some changes to the interface to make the process a little swifter.  I changed how sorting works on the ‘browse documents’ and ‘browse parties’ pages.  The pages are paginated and previously sorting only affected the subset of records found on the current page rather than the whole dataset.  I updated this so that sorting now reorders everything, and I also updated the ‘date sorting’ column so that it now uses the ‘sort_date’ field rather than the ‘display_date’ field.  Alison had also noticed that the ‘edit party’ page wasn’t working, and I discovered a bug on this page that was preventing updates from being saved to the database, which I fixed.  I also created a new ‘Browse Collections’ page and added it to the top menu.  This is a pretty simple page that lists the distinct collections alphabetically and, for each, lists their associated archives and documents, each with a link through to the relevant ‘edit’ page.  Finally, I gave Alison some advice on editing the free-text fields, which use the TinyMCE editor, to strip out unwanted HTML that has been pasted into them, and I thought about how we might present this data to the public.
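To illustrate the sorting change (purely as a sketch: the CMS itself isn’t written in Python, and the records below are made up), the difference is that the whole dataset is now ordered by the ‘sort_date’ field before the page slice is taken, rather than only the records on the current page being reordered:

----

# Hypothetical records; in the CMS these would come from the database.
documents = [
    {"title": "Letter A", "display_date": "c. 1570", "sort_date": "1570-01-01"},
    {"title": "Letter B", "display_date": "12 March 1565", "sort_date": "1565-03-12"},
    {"title": "Letter C", "display_date": "1568?", "sort_date": "1568-06-01"},
]

PAGE_SIZE = 2

def get_page(records, page, sort_field="sort_date"):
    """Sort the whole dataset first, then slice out the requested page."""
    ordered = sorted(records, key=lambda r: r[sort_field])
    start = (page - 1) * PAGE_SIZE
    return ordered[start:start + PAGE_SIZE]

# Page 1 now reflects the ordering of the entire dataset, not just
# whichever subset of records happened to be shown on that page.
print(get_page(documents, 1))

----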

I also responded to a query from Matthew Creasy about the website for a new project he is working on.  I set up a placeholder website for this project a couple of months ago and Matthew is now getting close to the point where he wants the website to go live.  Gerry McKeever also got in touch with me to ask whether I would write some sections for a paper he is putting together about the technology behind the interactive map I created for his Regional Romanticism project.  We talked a little about the structure of this and I’ll write the required sections when he needs them.

Other than these issues I spent the bulk of the week working on the Books and Borrowing project.  Katie and Matt got back to me with some feedback on the data description document that I completed and sent to them before Easter, and I spent some time going through this feedback and producing an updated version of the document.  After sending the document off I started working on a description of the content management system.  This required a lot of thought and planning, as I needed to consider how all of the data defined in the document would be added, edited and deleted in the most efficient and easy-to-use manner.  By the end of the week I’d written some 2,500 words about the various features of the CMS, but there is still a lot to do.  I’m hoping to have a version completed and sent off to Katie and Matt early next week.

My other big task of the week was to work with the data for the Anglo-Norman Dictionary again.  As mentioned previously, this project’s data and systems are in a real mess and I’m trying to get it all sorted along with the project’s Editor, Heather Pagan.  Previously we’d figured out that there was a version of the AND data in a file called ‘all.xml’, but that it did not contain the updates to the online data from the project’s data management system and we instead needed to somehow extract the data relating to entries from the online database.

Back in February, when looking through the documentation again, I discovered that the entry data was located in a file called ‘entry_hash’ within the directory ‘/var/data’.  I stated that the data was stored in a Berkeley DB and gave Heather a link to a page where the database software can be downloaded (https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html).

I spent some time this week trying to access the data.  It turns out that the download is not for a nice, standalone database program but is instead a collection of files that only seem to do anything when they are called from your own program written in something like C or Java.  There was a command called ‘db_dump’ that would supposedly take the binary hash file and export it as plain text.  This did work, in that the file could then be read in a text editor, but it was unfortunately still a hash file – just a series of massively long lines of numbers.

What I needed was simply a way to view the contents of the file, and thankfully I came across this answer: https://stackoverflow.com/a/19793412, which suggests using the Python programming language to export the contents of a hash file.  However, Python dropped support for the ‘dbhash’ module years ago, so I had to track down and install a version of Python from 2010 for this to work.  Also, I’ve not really used Python before, so that took some getting used to.  Thankfully, with a bit of tweaking I was able to write a short Python script that appears to read through the ‘entry_hash’ file and output each entry as a line of plain text.  I’m including the script here for future reference:

----

import dbhash

# Open the plain-text output file.
f = open("testoutput.txt", "w+")

# Loop through every key/value pair stored in the Berkeley DB hash file
# and write each value (the XML for one entry) out on its own line.
for k, v in dbhash.open("entry_hash").iteritems():
    f.write(v + "\n")

f.close()

----

The resulting file includes XML entries and is 3,969,350 lines long.  I sent this to Heather for her to have a look at, but she reckoned that some updates to the data were not present in this output.  I wrote a little script that counts the number of &lt;entry&gt; elements in each of the XML files (all.xml and entry_hash.xml): the former has 50,426 entries while the latter has 53,945, so the data extracted from the ‘entry_hash’ file is definitely not the same.  The former file is about 111MB in size while the latter is 133MB, so there is definitely a lot more content, which I think is encouraging.  Further investigation showed that the ‘all.xml’ file was actually generated in 2015, while the data I’ve exported is the ‘live’ data as it is now, which is good news.  However, it would appear that data in the data management system that has not yet been published is stored somewhere else.  As this represents two years of work, it is data that we really need to track down.
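The counting script was nothing sophisticated; roughly speaking it just looks for &lt;entry&gt; opening tags in each file, something along these lines:

----

import re

# Count opening <entry> tags in each file; a rough total is all that is
# needed to compare the two versions of the data.
entry_tag = re.compile(r"<entry[\s>]")

for filename in ("all.xml", "entry_hash.xml"):
    total = 0
    with open(filename) as xml_file:
        for line in xml_file:
            total += len(entry_tag.findall(line))
    print("%s: %d entries" % (filename, total))

----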

I went back through the documentation of the old system, which really is pretty horrible to read and unnecessarily complicated.  Plus there are multiple versions of the documentation without any version control stating which is the most up to date.  I have a version in Word and a PDF containing images of scanned pages of a printout, and the two differ massively without it being clear which supersedes which.  It turns out the scanned version is likely to be the most up to date, but of course being badly scanned images that are all wonky it’s not possible to search the text, and attempting OCR didn’t work.  After a lengthy email conversation with Heather we realised we would need to get back into the server to try and figure out where the data for the DMS was located.  Heather needed to be in her office at work to do this, and on Friday she managed to get access and via a Zoom call I was able to see the server and discuss the potential data locations with her.  It looks like all of the data that has been worked on in the DMS but has yet to be integrated with the public site is located in a directory called ‘commit-out’.  This contains more than 13,000 XML files, which I now have access to.  If we can combine this with the data from ‘entry_hash’ and the data Heather and her colleague Geert have been working on for the letters R and S then we should have the complete dataset.  Of course, it’s not quite so simple: whilst looking through the server we realised that there are many different locations where a file called ‘entry_hash’ is found, and no clear way of knowing which is the current version of the public data and which is just some old version that is no longer in use.  What a mess.  Anyway, progress has been made, and our next step is to check that the ‘commit-out’ files do actually represent all of the changes made in the DMS and that the version of ‘entry_hash’ I have so far extracted is the most up to date one.
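One possible first pass at that check (just a sketch, and it assumes that each file in ‘commit-out’ holds the complete XML for a single entry, which is something we still need to confirm) would be to test whether the content of each commit-out file already appears verbatim in the data extracted from ‘entry_hash’; any file that doesn’t appear presumably represents an unpublished change:

----

import glob
import os

# Load the full text of the data extracted from 'entry_hash'.
with open("entry_hash.xml") as f:
    live_text = f.read()

missing = []
# Assumes each XML file in 'commit-out' contains a single entry;
# this still needs to be confirmed against the DMS documentation.
for path in glob.glob(os.path.join("commit-out", "*.xml")):
    with open(path) as f:
        content = f.read().strip()
    if content and content not in live_text:
        missing.append(path)

print("%d commit-out files not found verbatim in the extracted data" % len(missing))

----

Verbatim matching is obviously crude (differences in whitespace or encoding would make an entry look ‘missing’ when it isn’t), so this would only give a rough first idea of how much unpublished material is sitting in ‘commit-out’.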