Week Beginning 25th May 2020

We’ve now reached week 10 of Lockdown, and I spent it in much the same way as previous weeks, dividing my time between work and homeschooling my son.  This week I continued to focus on the development of the content management system for the Books and Borrowing project.  On Tuesday I had a Zoom meeting to demonstrate the system as it currently stands to the project PI Katie Halsey and Co-I Matt Sangster.  Monday was a bank holiday, but I decided to work it and take the day off at a later date in order to prepare a walkthrough and undertake detailed testing of the system, which uncovered a number of bugs that I then tracked down and fixed.  My walkthrough covered all of the features that are so far in place: creating, editing and deleting libraries; viewing libraries; adding ledgers and additional fields to libraries; viewing, editing and deleting these ledgers and additional fields; adding, editing and deleting pages within ledgers; viewing a page; the automated approach to constructing navigation between pages; viewing records on pages; and then the big thing: adding and editing borrowing records.  This latter process can involve adding data about the borrowing itself (e.g. lending date), one or more borrowers (which may be new borrowers or ones already in the system) and a new or existing book holding, which may consist of one or more book items (e.g. volumes 1 and 3 of a book) and may be connected to one or more new or existing project-wide book edition records, which may in turn have a new or existing top-level book work record.

The walkthrough via Zoom went well, with me sharing my screen with Katie and Matt so they could follow my actions as I used the CMS.  I was a bit worried they would think the add / edit borrowing record form too complicated, but although it does look rather intimidating, most of the information is optional and many parts of it will be automatically populated by linking to existing records via autocomplete drop-downs.  Once there is a critical mass of existing data in the system (e.g. existing book and borrower records), the process of adding new borrowing records will be much quicker and easier.

The only major change that I needed to make following the walkthrough was to add a new ‘publication end date’ field to book edition and book work records, as some books are published in parts over multiple years (especially books comprised of multiple volumes).  I implemented this after the meeting and then spent most of the remainder of the week continuing to implement further aspects of the CMS.  I made a start on the facility to view a list of all book holding records that have been created for a library, through which the project team will be able to bring up a list of all borrowing records that involve a given book.  I got as far as putting a table listing the book holdings in place, but as the project team will be starting next week I figured it would make more sense to try and tackle the last major part of the system that still needed to be implemented: creating and associating author records with the four levels of book record.

A book may have any number of authors, and their associations with a book record cascade down through the levels.  For example, if an author is associated with a book via its top-level ‘book work’ record then the author will automatically be associated with a related ‘book edition’ record, any ‘book holding’ records this edition is connected to and any ‘book item’ records belonging to the book holding.  But we need to be able to associate an author not just with ‘book works’ but with any level of book record, as a book may have a different author at one of these levels (e.g. a particular volume may be attributed to a different author) or the same author may be referred to by a different alias in a particular edition.  Therefore I had to update the already complicated add / edit borrowing record form to enable authors to be created, associated and disassociated at any book level.  I also needed to add in an autocomplete facility to enable authors already in the system to be attached to records, and to ensure that the author sections clear and reset themselves if the user removes the book from the borrowing record.  It took a long time to implement this system, but by the end of the week I’d got an initial version working.  It will need a lot of testing and no doubt some fixing next week, but it’s a relief to get this major part of the system in place.  I also added in a little feature that keeps the user’s CMS session going for as long as the browser is on a page of the CMS, which is very important as the complicated forms may take a long time to complete and it would be horrible if the session timed out before the user was able to submit the form.
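
The keep-alive itself is very simple.  The sketch below is illustrative rather than the project’s actual code (‘keep-alive.php’ is a hypothetical endpoint name): the page just pings a lightweight server-side script at a regular interval so the session never sits idle long enough to expire.

----

// Minimal sketch of a session keep-alive: ping a lightweight endpoint every
// few minutes while a CMS page is open so the server-side session is
// refreshed. 'keep-alive.php' is a hypothetical endpoint name.
var KEEP_ALIVE_INTERVAL = 5 * 60 * 1000; // five minutes

setInterval(function () {
  fetch('keep-alive.php', { credentials: 'same-origin' })
    .catch(function (err) {
      // A failed ping isn't fatal; the next interval will simply try again.
      console.warn('Keep-alive request failed', err);
    });
}, KEEP_ALIVE_INTERVAL);

----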

I didn’t have time to do much else this week.  I was supposed to have a Zoom call about the Historical Thesaurus on Friday but this has been postponed as we’re all pretty busy with other things.  One of the servers that hosts a lot of project websites has been experiencing difficulties this week, so I had to deal with emails from staff about this and contact Arts IT Support to ask them to fix things, as it’s not something I have access to myself.  The server appears to be down again as I’m writing this, unfortunately.

The interactive map I’d created for Gerry McKeever’s Regional Romanticism project was launched this week, and can now be accessed here: https://regionalromanticism.glasgow.ac.uk/paul-jones/ but be aware that this is one of the sites currently affected by the server issue, so the map, or parts of the site in general, may be unavailable.

Next week the project team for the Books and Borrowing project start work and I will be giving them a demonstration of the CMS on Tuesday, so no doubt I will be spending a lot of time continuing to work on this then.

Week Beginning 18th May 2020

I spent week 9 of Lockdown continuing to implement the content management system for the Books and Borrowing project.  I was originally hoping to have completed an initial version of the system by the end of this week, but this was unfortunately not possible due to having to juggle work and home-schooling, commitments to other projects and the complexity of the project’s data.  It took several days to complete the scripts for uploading a new borrowing record due to the interrelated nature of the data structure.  A borrowing record can be associated with one or more borrowers, and each of these may be a new borrower record or an existing one, meaning data needs to be pulled in via an autocomplete to prepopulate that section of the form.  Books can likewise be new or existing records, can have one or more new or existing book item records (as a book may have multiple volumes) and may be linked to one or more project-wide book edition records, which may already exist or may need to be created as part of the upload process, and each of these may in turn be associated with a new or existing top-level book work record.  Therefore the script for uploading a new borrowing record needs to incorporate the ‘add’ and ‘edit’ functionality for a lot of associated data as well.  However, as I have implemented all of these aspects of the system now, it will be quicker and easier to develop the dedicated pages for adding and editing borrowers and the various book levels once I move onto this.  I still haven’t worked on the facilities to add in book authors, genres or borrower occupations, which I intend to move onto once the main parts of the system are in place.

After completing the scripts for processing the display of the ‘add borrowing’ form and the storing of all of the uploaded data, I moved onto the script for viewing all of the borrowing records on a page.  Due to the huge number of potential fields I’ve had to experiment with various layouts, but I think I’ve got one that works pretty well, which displays all of the data about each record in a table split into four main columns (Borrowing, Borrower, Book Holding / Items, Book Edition / Works).  I’ve also added in a facility to delete a record from the page.  I then moved on to the facility to edit a borrowing record, which I’ve added to the ‘view’ page rather than linking out to a separate page.  When the ‘edit’ button is pressed for a record, its row in the table is replaced with the ‘edit’ form, which is identical in style and functionality to the ‘add’ form, but is prepopulated with all of the record’s data.  As with the ‘add’ form, it’s possible to associate multiple borrowers, book items and editions, and also to manage the existing associations using this script.  The processing of the form uses the same logic as the ‘add’ script so thankfully didn’t require much time to implement.

What I still need to do is add authors and borrower occupations to the ‘view page’, ‘add record’ and ‘edit record’ facilities, add the options to view / edit / add / delete a library’s book holdings and borrowers independently of the borrowing records, plus facilities to manage book editions / works, authors, genres and occupations at the top level as opposed to when working on a record.  I also still need to add in the facilities to view / zoom / pan a page image and add in facilities to manage borrower cross-references.  This is clearly quite a lot, but the core facilities of adding, editing and deleting borrowing, borrower and book records are now in place, which I’m happy about.  Next week I’ll continue to work on the system ahead of the project’s official start date at the beginning of June.

Also this week I made a few tweaks to the interface for the Place-names of Mull and Ulva project, spoke to Matthew Creasy some more about the website for his new project, spoke to Jennifer Smith about the follow-on funding proposal for the SCOSYA project and investigated an issue that was affecting the server that hosts several project websites (basically it turned out that the server had run out of disk space).

I also spent some time working on scripts to process data from the OED for the Historical Thesaurus.  Fraser is working on incorporating new dates from the OED and needs to work out which dates in the HT data we want to replace and which should be retained.  The script makes groups of all of the distinct lexemes in the OED data.  If a group has two or more lexemes it then checks that at least one of them is revised.  It then makes subgroups of all of the lexemes that have the same date (so, for example, all the ‘Strike’ words with the same ‘sortdate’ and ‘lastdate’ are grouped together).  If one word in the whole group is ‘revised’ and at least two words have the same date then the words with the same dates are displayed in the table.  The script also checks for matches in the HT lexemes (based on catid, refentry, refid and lemmaid fields), and if there is a match this data is also displayed.  I then further refined the output based on feedback from Fraser, firstly highlighting in green those rows where at least two of the HT dates match, and secondly splitting the table into three separate tables: one with the green rows, one containing all other OED lexemes that have a matching HT lexeme and a third containing OED lexemes that (as yet) do not have a matching HT lexeme.
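
To make the grouping logic a little more concrete, here is a rough sketch of it.  This is illustrative only (the real script runs server-side against the database; here it is expressed in JavaScript over an assumed ‘oedLexemes’ array, and ‘displayRows’ is a hypothetical output function):

----

// Group the OED data by distinct lexeme (e.g. all the 'Strike' words together).
var groups = {};
oedLexemes.forEach(function (lex) {
  (groups[lex.lexeme] = groups[lex.lexeme] || []).push(lex);
});

Object.keys(groups).forEach(function (word) {
  var group = groups[word];
  if (group.length < 2) return;                                  // need two or more lexemes
  if (!group.some(function (l) { return l.revised; })) return;   // at least one must be revised

  // Subgroup the lexemes by identical sortdate / lastdate pairs.
  var byDate = {};
  group.forEach(function (l) {
    var key = l.sortdate + '-' + l.lastdate;
    (byDate[key] = byDate[key] || []).push(l);
  });

  // Any subgroup of two or more same-dated words is displayed in the table;
  // HT matches are looked up separately on catid, refentry, refid and lemmaid.
  Object.keys(byDate).forEach(function (key) {
    if (byDate[key].length >= 2) {
      displayRows(byDate[key]);
    }
  });
});

----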

Week Beginning 11th May 2020

This was week 8 of Lockdown and I spent the majority of it working on the content management system for the Books and Borrowing project.  The project is due to begin at the start of June and I’m hoping to have the CMS completed and ready to use by the project team by then, although there is an awful lot to try and get into place.  I can’t really go into too much detail about the CMS, but I have completed the pages to add a library and to browse a list of libraries with the option of deleting a library if it doesn’t have any ledgers.  I’ve also done quite a lot with the ‘View library’ page.  It’s possible to edit a library record, add a ledger and add / edit / delete additional fields for a library.  You can also list all of the ledgers in a library with options to edit the ledger, delete it (if it contains no pages) and add a new page to it.  You can also display a list of pages in a ledger, with options to edit the page or delete it (if it contains no records).  You can also open a page in the ledger and browse through the next and previous pages.

I’ve been trying a new approach with the CMS for this project, involving more in-page editing.  For example, the list of ledgers is table-based, with fields for things like the number of pages, the ledger name and its start and end dates.  When the ‘edit’ button is pressed, rather than taking the user away to a separate page, the row in the table becomes editable.  This approach is rather more complicated to develop and relies a lot more on JavaScript, but it seems to be working pretty well.  It was further complicated by having textareas that use the TinyMCE text editing tool, which then needs to be reinitialised when the editable boxes load in.  Also, you can’t have multiple forms within a table in HTML, meaning there can be only one form wrapped around the whole table.  Initially I was thinking that when the row became editable the JavaScript would add form tags into the row too, but this approach doesn’t work properly, so instead I’ve had to implement a single form whose type is controlled by hidden inputs that change when a row is selected.  The situation is complicated further because it’s not just the ledger record that needs to be edited from within the table: there are also facilities to add and edit ledger pages, which need to use the same form.
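
As a rough illustration of this single-form approach (not the actual CMS code; the element IDs, class names and field names are assumptions, and jQuery is assumed to be available):

----

// When an 'edit' button in the ledger table is clicked, make that row
// editable in place and point the single wrapping form at the right record.
$('.edit-ledger').on('click', function () {
  var row = $(this).closest('tr');

  // Hidden inputs tell the single form what it is now editing.
  $('#form-type').val('ledger');
  $('#record-id').val(row.data('id'));

  // Swap each display cell for a text input holding the current value.
  row.find('.editable').each(function () {
    var input = $('<input type="text">')
      .attr('name', $(this).data('field'))
      .val($(this).text());
    $(this).empty().append(input);
  });

  // The notes cell uses TinyMCE, which has to be removed and re-initialised
  // whenever its textarea is re-inserted into the DOM.
  tinymce.remove('#ledger-notes');
  row.find('.notes').html('<textarea id="ledger-notes" name="notes"></textarea>');
  tinymce.init({ selector: '#ledger-notes' });
});

----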

At the moment I’m in the middle of creating the facility to add a new borrowing record to the page.  This is the most complex part of the system as a record may have multiple borrowers, each of which may have multiple occupations, and multiple books, each of which may be associated with higher level book records.  Plus the additional fields for the library need to be taken into consideration too.  By the end of the week I was at the point of adding in an auto-complete to select an existing borrower record and I’ll continue with this on Monday.
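
The auto-complete itself is fairly standard; a minimal sketch using jQuery UI’s autocomplete widget (the endpoint and field names here are assumptions, not the project’s actual ones) would look something like this:

----

// Look up existing borrowers as the RA types a surname, and prepopulate the
// borrower section of the form when one is selected.
$('#borrower-surname').autocomplete({
  minLength: 2,
  // Expected to return JSON such as [{"label": "Smith, John", "value": "Smith", "id": 123, "forename": "John"}, ...]
  source: 'get-borrowers.php',
  select: function (event, ui) {
    // Store the selected borrower's ID in a hidden field and fill in the
    // rest of the borrower fields so they don't need to be retyped.
    $('#borrower-id').val(ui.item.id);
    $('#borrower-forename').val(ui.item.forename || '');
  }
});

----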

In addition to the B&B project I did some work for other projects as well.  For Thomas Clancy’s Place-names of Kirkcudbrightshire project (now renamed Place-names of the Galloway Glens) I had a few tweaks and updates to put in place before Thomas launched the site on Tuesday.  I added a ‘Search place-names’ box to the right-hand column of every non-place-names page which takes you to the quick search results page and I added a ‘Place-names’ menu item to the site menu, so users can access the place-names part of the site. Every place-names page now features a sub-menu with access to the place-names pages (Browse, element glossary, advanced search, API, quick search).  To return to the place-name introductory page you can press on the ‘Place-names’ link in the main menu bar.  I had unfortunately introduced a bug to the ‘edit place-name’ page in the CMS when I changed the ordering of parishes to make KCB parishes appear first.  This was preventing any place-names in BMC from having their cross references, feature type and parishes saved when the form was submitted.  This has now been fixed.  I also added Google Analytics to the site.  The virtual launch on Tuesday went well and the site can now be accessed here: https://kcb-placenames.glasgow.ac.uk/.

I also added in links to the DSL’s email and Instagram accounts to the footer of the DSL site and added some new fields to the database and CMS of the Place-names of Mull and Ulva site.  I also created a new version of the Burns Supper map for Paul Malgrati that included more data and a new field for video dimensions that the video overlay now uses.  I also replied to Matthew Creasy about a query regarding the website for his new Scottish Cosmopolitanism project and a query from Jane Roberts about the Thesaurus of Old English and made a small tweak to the data of Gerry McKeever’s interactive map for Regional Romanticism.

Week Beginning 4th May 2020

Week seven of lockdown continued in much the same fashion as the preceding weeks, the only difference being Friday was a holiday to mark the 75th anniversary of VE day.  I spent much of the four working days on the development of the content management system for the Books and Borrowing project.  The project RAs will start using the system in June and I’m aiming to get everything up and running before then so this is my main focus at the moment.  I also had a Zoom meeting with project PI Katie Halsey and Co-I Matt Sangster on Tuesday to discuss the requirements document I’d completed last week and the underlying data structures I’d defined in the weeks before.  Both Katie and Matt were very happy with the document, although Matt had a few changes he wanted made to the underlying data structures and the CMS.  I made the necessary changes to the data design / requirements document and the project’s database that I’d set up last week.  The changes were:

- Borrowing spans have now been removed from libraries; these will instead be automatically inferred from the start and end dates of the ledger records held in each library.
- Ledgers now have a new ‘ledger type’ field, which currently allows a choice of ‘Professorial’, ‘Student’ or ‘Town’.  This field will allow borrowing spans for libraries to be altered based on a selected ledger type.
- The way occupations for borrowers are recorded has been updated to enable both the original occupations from the records and a normalised list of occupations to be recorded.  Borrowers may not have an original occupation but might still have a standardised one, so I’ve decided to use the occupations table as previously designed to hold information about standardised occupations, and a borrower may have multiple standardised occupations.  I have also added a new ‘original occupation’ field to the borrower record, where any number of occupations found for the borrower in the original documentation (e.g. river watcher) can be added if necessary.
- The book edition table now has an ‘other authority URL’ field and an ‘other authority type’ field, which can be used if ESTC is not appropriate.  The ‘type’ currently features ‘Worldcat’, ‘CERL’ and ‘Other’.
- ‘Language’ has been moved from Holding to Edition.
- Finally, in Book Holding the short title is now ‘original title’ and the long title is now ‘standardised title’, while the place and date of publication fields have been removed as the comparable fields at Edition level will be sufficient.

In terms of the development of the CMS, I created a Bootstrap-based interface for the system, which currently just uses the colour scheme I used for Matt’s pilot 18th Century Borrowing project.  I created the user authentication scripts and the menu structure and then started to create the actual pages.  So far I’ve created a page to add a new library record and all of the information associated with a library, such as any number of sources.  I then created the facility to browse and delete libraries and the main ‘view library’ page, which will act as a hub through which all book and borrowing records associated with the library will be managed.  This page has a further tab-based menu with options to allow the RA to view / add ledgers, additional fields, books and borrowers, plus the option to edit the main library information.  So far I’ve completed the page to edit the library information and have started work on the page to add a ledger.  I’m making pretty good progress with the CMS, but there is still a lot left to do.  Here’s a screenshot of the CMS if you’re interested in how it looks:

Also this week I had a Zoom meeting with Marc Alexander and Fraser Dallachy to discuss updates to the Historical Thesaurus as we head towards a second edition.  This will include adding in new words from the OED and new dates for existing words.  My new date structure will also go live, so there will need to be changes to how the timelines work.  Marc is hoping to go live with the new updates in August.  We also discussed the ‘guess the category’ quiz, with Marc and Fraser having some ideas about limiting the quiz to certain categories, or excluding categories that might feature inappropriate content.  We may also introduce a difficulty level based on date, with an ‘easy’ version only containing words that were in use for a decent span of time in the past 200 years.

Other work I did this week included making some tweaks to the data for Gerry McKeever’s interactive map, fixing an issue with videos continuing to play after the video overlay was closed for Paul Malgrati’s Burns Supper map, replying to a query from Alasdair Whyte about his Place-names of Mull and Ulva project and looking into an issue for Fraser’s Scots Thesaurus project, which unfortunately I can’t do anything about as the scripts I’d created for this (which need to be left running for several days) are on the computer in my office.  If this lockdown ever ends I’ll need to tackle this issue then.

Week Beginning 27th April 2020

The sixth week of lockdown continued in much the same manner as the previous ones, dividing my time between working and home-schooling my son.  I spent the majority of the week continuing to work on the requirements document for the Books and Borrowing project.  As I worked through this I returned to the database design and made some changes as my understanding of the system increased.  This included adding in a new field for the original transcription and a new ‘order on page’ field to the borrowing table as I realised that without such a column it wouldn’t be possible for an RA to add a new record anywhere other than after the last record on a page.  It’s quite likely that an RA will accidentally skip a record and might need to reinstate it, or an RA might intentionally want to leave out some records to return to later.  The ‘order on page’ column (which will be automatically generated but can be manually edited) will ensure these situations can be handled.

As I worked through the requirements I began to realise that the amount of data the RAs may have to compile for each borrowing record is possibly going to be somewhat overwhelming.  Much of it is optional, but completing all the information could take a long time:  creating a new Book Holding and Item record, linking it to an Edition and Work or creating new records for these, associating authors or creating new authors, adding in genre information, creating a new borrower record or associating an existing one, adding in occupations, adding in cross references to other borrowers, writing out a diplomatic transcription, filling in all of the core fields and additional fields.  That’s a huge amount to do for each record and we may need to consider what is going to be possible for the RAs to do in the available time.

By the end of Tuesday I had finished working on a first version of the requirements document, weighing in at more than 11,000 words, and sent it on to Katie and Matt for feedback.  We have agreed to meet (via Zoom) next Tuesday to discuss any changes to the document.  During the rest of the week I began to develop the systems for the project.  This included implementing the database (creating each of the 23 tables that will be needed to store the project’s data) and installing and configuring WordPress, which will be used to power the simple parts of the project website.

I also continued to work with the data for the Anglo-Norman Dictionary this week.  I managed to download all of the files from the server over the weekend.  I now have two versions of the ‘entry_hash’ database file plus many thousands of XML files that were in the ‘commit-out’ directory.  I extracted the data from this second version of the ‘entry_hash’ table using the method I figured out last week and discovered that it was somewhat larger than the version I had previously been working with, containing 4,556,011 lines as opposed to 3,969,350 and 54,025 entries as opposed to 53,945.  I sent this extracted file to Heather for her to have a look at.

I then decided to update the test website I’d made a few months ago.  This version used the ‘all.xml’ file as a data source and allowed a user to browse the dictionary entries using a list of all the headwords and to view each entry (well, certain aspects of the entry that I’d formatted from the XML).  Thankfully I managed to locate the scripts I’d used to extract the XML and migrate it to a database on the server and I ran both versions of the ‘entry_hash’ output through this script, resulting in three different dictionary data sources.  I then updated the simple website to add in a switcher to swap between data sources.  Extracting the data also led to me realising that there was a ‘deleted’ record type in the entry table and if I disregarded records of this type the older entry_hash data had 53,925 entries and the newer one had 54,002, so a difference of 77.  The old ‘all.xml’ data from 2015 had 50,403 entries that aren’t set to ‘deleted’.  In looking through the files from the server I had also managed to track down the XSLT file used to transform the XML into HTML for the existing website, so I added this to my test website, together with the CSS file from the existing website.  This meant the full entries could now be displayed in a fully formatted manner, which is useful.

Heather had a chance to look through and compare the three test website versions and discovered that the new ‘entry_hash’ version contained all of the data that was in their data management system but had yet to be published on the public website.  This was really good news as it meant that we now have a complete dataset without needing to integrate individual XML files.  With a full dataset secured I am now in a position to move on to the requirements for the new dictionary website.

Also this week I made some further tweaks to the ‘guess the category’ quiz for the Historical Thesaurus.  The layout now works better on a phone in portrait mode (the choices now take up the full width of the area).  I also fixed the line-height of the quiz word, which was previously overlapping if the word ran over more than one line.  I updated things so that when pressing ‘next’ or restarting the quiz the page automatically scrolls so that the quiz word is in view.  I fixed the layout to ensure that there should always be enough space for the ticks and crosses now (they should no longer end up dropping down below the box if the category text is long).  Also, any very long words in the category that previously ended up breaking out of their boxes are now cut off when they reach the edge of the box.  I could auto-hyphenate long words to split them over multiple lines, and I might investigate this next week.  I also fixed an issue with the ‘Next’ button: when restarting a quiz after reaching the summary the ‘Next’ button was still labelled ‘Summary’.  Finally, I’ve added a timer to the quiz so you can see how long it took you and try to beat your fastest time.  When you reach the summary page your time is now displayed in the top bar along with your score.  I’ve also added some text above the ‘Try again’ button.  If you get less than full marks it says “Can you do better next time?” and if you did get full marks it says “You got them all right, but can you beat your time of x”.

Finally this week I helped Roslyn Potter of the DSL to get a random image loading into a new page.  This page asks people to record themselves saying a selection of Scots words and there are 10 different images featuring words.  The random feature displays one image and provides a button to load another random image, plus a feature to view all of the images.  The page can be viewed here: https://dsl.ac.uk/scotsvoices/
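
The random selection itself is very simple; as a minimal sketch (with hypothetical file names and element IDs):

----

// Pick one of the ten word images at random and display it; the button
// simply picks again.
var images = ['words-01.jpg', 'words-02.jpg', 'words-03.jpg', /* ... */ 'words-10.jpg'];

function showRandomImage() {
  var choice = images[Math.floor(Math.random() * images.length)];
  document.getElementById('scots-word-image').src = choice;
}

document.getElementById('another-image').addEventListener('click', showRandomImage);
showRandomImage(); // show an image on page load

----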

Week Beginning 20th April 2020

This was the fifth week of lockdown and my first full week back after the Easter holidays, which as with previous weeks I needed to split between working and home-schooling my son.  There was some issue with the database on the server that powers many of the project websites this week, meaning all of those websites stopped working.  I had to spend some time liaising with Arts IT Support to get the issue sorted (as I don’t have the necessary server-level access to fix such matters) and replying to the many emails from the PIs of projects who understandably wanted to know why their website was offline.  The server was unstable for about 24 hours, but has thankfully been working without issue since then.

Alison Wiggins got in touch with me this week to discuss the content management system for the Mary, Queen of Scots Letters sub-project, which I set up for her last year as part of her Archives and Writing Lives project.  There is now lots of data in the system and Alison is editing it and wanted me to make some changes to the interface to make the process a little swifter.  I changed how sorting works on the ‘browse documents’ and ‘browse parties’ pages.  The pages are paginated and previously the sorting only affected the subset of records found on the current page rather than reordering the whole dataset.  I updated this so that sorting now reorganises everything, and I also updated the ‘date sorting’ column so that it now uses the ‘sort_date’ field rather than the ‘display_date’ field.  Alison had also noticed that the ‘edit party’ page wasn’t working and I discovered that there was a bug on this page that was preventing updates from being saved in the database, which I fixed.  I also created a new ‘Browse Collections’ page and added it to the top menu.  This is a pretty simple page that lists the distinct collections alphabetically and for each lists their associated archives and documents, each with a link through to the relevant ‘edit’ page.  Finally, I gave Alison some advice on editing the free-text fields, which use the TinyMCE editor, to strip out unwanted HTML that has been pasted into them, and thought about how we might present this data to the public.

I also responded to a query from Matthew Creasy about the website for a new project he is working on.  I set up a placeholder website for this project a couple of months ago and Matthew is now getting close to the point where he wants the website to go live.  Gerry McKeever also got in touch with me to ask whether I would write some sections about the technology behind the interactive map I created for his Regional Romanticism project for a paper he is putting together.  We talked a little about the structure of this and I’ll write the required sections when he needs them.

Other than these issues I spent the bulk of the week working on the Books and Borrowing project.  Katie and Matt got back with some feedback on the data description document that I completed and sent to them before Easter and I spent some time going through this feedback and making an updated version of the document.  After sending the document off I started working on a description of the content management system.  This required a lot of thought and planning as I needed to consider how all of the data as defined in the document would be added, edited and deleted in the most efficient and easy to use manner.  By the end of the week I’d written some 2,500 words about the various features of the CMS, but there is still a lot to do.  I’m hoping to have a version completed and sent off to Katie and Matt early next week.

My other big task of the week was to work with the data for the Anglo-Norman Dictionary again.  As mentioned previously, this project’s data and systems are in a real mess and I’m trying to get it all sorted along with the project’s Editor, Heather Pagan.  Previously we’d figured out that there was a version of the AND data in a file called ‘all.xml’, but that it did not contain the updates to the online data from the project’s data management system and we instead needed to somehow extract the data relating to entries from the online database.

Back in February, when looking through the documentation again, I discovered that the entry data was located in a file called ‘entry_hash’ within the directory ‘/var/data’.  I noted that the data was stored in a Berkeley DB and gave Heather a link to a page where the database software could be downloaded (https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html).

I spent some time this week trying to access the data.  It turns out that the download is not for a nice, standalone database program but is instead a collection of files that only seem to do anything when they are called from your own program written in something like C or Java.  There was a command called ‘db_dump’ that would supposedly take the binary hash file and export it as plain text.  This did work, in that the file could then be read in a text editor, but it was unfortunately still a hash file – just a series of massively long lines of numbers.

What I needed was just some way to view the contents of the file, and thankfully I came across this answer: https://stackoverflow.com/a/19793412 which suggests using the Python programming language to export the contents of a hash file.  However, Python dropped support for the ‘dbhash’ module years ago, so I had to track down and install a version of Python from 2010 for this to work.  Also, I’ve not really used Python before, so that took some getting used to.  Thankfully, with a bit of tweaking, I was able to write a short Python script that appears to read through the ‘entry_hash’ file and output each entry as a line of plain text.  I’m including the script here for future reference:

----

# Requires Python 2.x with the (long-deprecated) 'dbhash' module available.
import dbhash

f = open("testoutput.txt", "w+")

# Iterate over every key/value pair in the Berkeley DB hash file and write
# each value (the XML of an entry) out on its own line.
for k, v in dbhash.open("entry_hash").iteritems():
    f.write(v + "\n")

f.close()

----

The resulting file includes XML entries and is 3,969,350 lines long.  I sent this to Heather for her to have a look at, but Heather reckoned that some updates to the data were not present in this output.  I wrote a little script that counts the number of <entry> elements in each of the XML files (all.xml and entry_hash.xml) and the former has 50426 entries while the latter has 53945.  So the data extracted from the ‘entry_hash’ file is definitely not the same.  The former file is about 111Mb in size while the latter is 133Mb so there’s definitely a lot more content, which I think is encouraging.  Further investigation showed that the ‘all.xml’ file was actually generated in 2015 while the data I’ve exported is the ‘live’ data as it is now, which is good news.  However, it would appear that data in the data management system that has not yet been published is stored somewhere else.  As this represents two years of work it is data that we really need to track down.

I went back through the documentation of the old system, which really is pretty horrible to read and unnecessarily complicated.  Plus there are multiple versions of the documentation without any version control stating which is the most up-to-date version.  I have a version in Word and a PDF containing images of scanned pages of a printout.  Both differ massively without it being clear which superseded which.  It turns out the scanned version is likely to be the most up to date, but of course, being badly scanned images that are all wonky, it’s not possible to search the text, and attempting OCR didn’t work.  After a lengthy email conversation with Heather we realised we would need to get back into the server to try and figure out where the data for the DMS was located.  Heather needed to be in her office at work to do this and on Friday she managed to get access, and via a Zoom call I was able to see the server and discuss the potential data locations with her.  It looks like all of the data that has been worked on in the DMS but has yet to be integrated with the public site is located in a directory called ‘commit-out’.  This contains more than 13,000 XML files, which I now have access to.  If we can combine this with the data from ‘entry_hash’ and the data Heather and her colleague Geert have been working on for the letters R and S then we should have the complete dataset.  Of course it’s not quite so simple, as whilst looking through the server we realised that there are many different locations where a file called ‘entry_hash’ is found and no clear way of knowing which is the current version of the public data and which is just some old version that is not in use.  What a mess.  Anyway, progress has been made and our next step is to check that the ‘commit-out’ files do actually represent all of the changes made in the DMS and that the version of ‘entry_hash’ that I have so far extracted is the most up-to-date version.

Week Beginning 13th April 2020

I was on holiday for all of last week and Monday and Tuesday this week.  My son and I were supposed to be visiting my parents for Easter, but we were unable to do so due to the lockdown and instead had to find things to amuse ourselves with around the house.  I answered a few work emails during this time, including alerting Arts IT Support to some issues with the WordPress server and responding to a query from Ann Fergusson at the DSL.  I returned to work (from home, of course) on Wednesday and spent the three days working on various projects.

For the Books and Borrowers project I spent some time downloading and looking through the digitised and transcribed borrowing registers of St. Andrews.  They have made three registers from the second half of the 18th century available via a Wiki interface (see https://arts.st-andrews.ac.uk/transcribe/index.php?title=Main_Page) and we were given access to all of these materials that had been extracted and processed by Patrick McCann, who I used to work very closely with back when we were both based at HATII and worked for the Digital Curation Centre.  Having looked through the materials it’s clear that we will be able to use the transcriptions, which will be a big help.  The dates will probably need to be manually normalised, though, and we will need access to higher resolution images than the ones we have been given in order to make a zoom and pan interface using them.

I also updated the introductory text for Gerry McKeever’s interactive map of the novel Paul Jones, and I think this feature is now ready to go live once Gerry wants to launch it.  I also fixed an issue with the Editing Robert Burns website that was preventing the site editors (namely Craig Lamont) from editing blog posts.  I also created a further new version of the Burns Supper map for Paul Malgrati.  This version incorporates updated data, which has greatly increased the number of Suppers that appear on the map, and I also changed the way videos work.  Previously, if an entry had a link to a video then a button was added to the entry that linked through to the externally hosted video site (which could be YouTube, Facebook, Twitter or some other site).  Instead, the code now identifies the origin of the video and I’ve managed to embed players from YouTube, Facebook and Twitter.  These now open the videos in the same drop-down overlay as the images (a rough sketch of the origin-detection approach is included below).  The YouTube and Facebook players are centre aligned but unfortunately Twitter’s player displays to the left and can’t be altered.  Also, the YouTube and Facebook players expect the width and height of the player to be specified.  I’ve taken these from the available videos, but ideally the desired height and width should be stored as separate columns in the spreadsheet so these can be applied to each video as required.  Currently all YouTube and all Facebook videos have the same width and height, which can mean landscape-oriented Facebook videos appear rather small, for example.  Also, some videos can’t be embedded due to their settings (e.g. the Singapore Facebook video).  However, I’ve added a ‘watch video’ button underneath the player so people can always click through to the original posting.
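
The sketch below illustrates the general idea only and is not the map’s actual code (the Twitter player is left out for brevity and falls back to a plain link, and the function name is an assumption):

----

// Work out which service hosts a video and return the HTML for an embedded
// player, falling back to a plain link for anything unrecognised.
function buildVideoEmbed(url, width, height) {
  if (url.indexOf('youtube.com') > -1 || url.indexOf('youtu.be') > -1) {
    // YouTube: extract the video ID and use the iframe player.
    var id = url.split(/v=|youtu\.be\//)[1].split(/[?&]/)[0];
    return '<iframe width="' + width + '" height="' + height + '" src="https://www.youtube.com/embed/' + id + '" frameborder="0" allowfullscreen></iframe>';
  }
  if (url.indexOf('facebook.com') > -1) {
    // Facebook: its video plugin takes the original URL as a parameter.
    return '<iframe width="' + width + '" height="' + height + '" src="https://www.facebook.com/plugins/video.php?href=' + encodeURIComponent(url) + '" frameborder="0" allowfullscreen></iframe>';
  }
  // Anything else (e.g. Twitter): fall back to a 'watch video' link.
  return '<p><a href="' + url + '" target="_blank">Watch video</a></p>';
}

----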

I also responded to a query from Rhona Alcorn about how DSL data exported from their new editing system will be incorporated into the live DSL site, responded to a query from Thomas Clancy about making updates to the Place-names of Kirkcudbrightshire website and responded to a query from Kirsteen McCue about an AHRC proposal she’s putting together.

I returned to looking at the ‘guess the category’ quiz that I’d created for the Historical Thesaurus before the Easter holidays and updated the way it worked.  I reworked the way the database is queried so as to make things more efficient, to ensure the same category isn’t picked as more than one of the four options and to ensure that the selected word isn’t also found in one of the three ‘wrong’ category choices.  I also decided to update the category table to include two new columns, one that holds a count of the number of lexemes that have a ‘wordoed’ and the other that holds a count of the number of lexemes that have a ‘wordoe’ in each category.  I then ran a script that generated these figures for all 250,000 or so categories.  This is really just caching information that can be gleaned from a query anyway, but it makes querying a lot faster and makes it easier to pinpoint categories of a particular size, and I think these columns will be useful for tasks beyond the quiz (e.g. show me the 10 largest Aj categories).  I then created a new script that queries the database using these columns and returns data for the quiz.

This script is much more streamlined and considerably less prone to getting stuck in loops of finding nothing but unsuitable categories.  Currently the script is set to only bring back categories that have at least two OED words in them, but this could easily be changed to target larger categories only (which would presumably make the quiz more of a challenge).  I could also add in a check to exclude any words that are also found in the category name to increase the challenge further.  The actual quiz page itself was pretty much unaltered during these updates, but I did add in a ‘loading’ spinner, which helps the transition between questions.

I’ve also created an Old English version of the quiz which works in the same way except the date of the word isn’t displayed and the ‘wordoe’ column is used.  Getting 5/5 on this one is definitely more of a challenge!  Here’s an example question:

I spent the rest of the week upgrading all of the WordPress sites I manage to the latest WordPress release.  This took quite a bit of time as I had to track down the credentials for each site, many of which I didn’t already have a note of at home.  There were also some issues with some of the sites that I needed to get Arts IT Support to sort out (e.g. broken SSL certificates, sites with the login page blocked even when using the VPN).  By the end of the week all of the sites were sorted.

Week Beginning 30th March 2020

This was the second week of the Coronavirus lockdown and I followed a similar arrangement to last week, managing to get a pretty decent amount of work done in between home-schooling sessions for my son.  I spent most of my time working for the Books and Borrowing project.  I had a useful conference call with the PI Katie Halsey and Co-I Matt Sangster last week, and the main outcome of that meeting for me was that I’d further expand upon the data design document I’d previously started in order to bring it into line with our understanding of the project’s requirements.  This involved some major reworking of the entity-relationship diagram I had previously designed based on my work with the sample datasets, with the database structure increasing from 11 related tables to 21, incorporating a new system to trace books and their authors across different libraries, to include borrower cross-references and to greatly increase the data recorded about libraries.  I engaged in many email conversations with Katie and Matt over the course of the week as I worked on the document, and on Friday I sent them a finalised version consisting of 34 pages and more than 7,000 words.  This is still an ‘in progress’ version and will no doubt need further tweaks based on feedback and also as I build the system, but I’d say it’s a pretty solid starting point.  My next step will be to add a new section to the document that describes the various features of the content management system that will connect to the database and enable the project’s RAs to add and edit data in a streamlined and efficient way.

Also this week I did some further work for the DSL people, who have noticed some inconsistencies with the way their data is stored in their own records compared to how it appears in the new editing system that they are using.  I wasn’t directly involved in the process of getting their data into the new editing system but spent some time going through old emails, looking at the data and trying to figure out what might have happened.  I also had a conference call with Marc Alexander and the Anglo-Norman Dictionary people to discuss the redevelopment of their website.  It looks like this will be going ahead and I will be doing the redevelopment work.  I’ll try to start on this after Easter, with my first task being the creation of a design document that will map out exactly what features the new site will include and how these relate to the existing site.  I also need to help the AND people to try and export the most recent version of their data from the server as the version they have access to is more than a year old.  We’re going to aim to relaunch the site in November, all being well.

I also had a chat with Fraser Dallachy about the new quiz I’m developing for the Historical Thesaurus.  Fraser had a couple of good ideas about the quiz (e.g. making versions for Old and Middle English) that I’ll need to see about implementing in the coming weeks.  I also had an email conversation with the other developers in the College of Arts about documenting the technologies that we use or have used in the past for projects and made a couple of further tweaks to the Burns Supper map based on feedback from Paul Malgrati.

I’m going to be on holiday next week and won’t be back to work until Wednesday the 15th of April so there won’t be any further updates from me for a while.

Week Beginning 23rd March 2020

This was the first full week of the Coronavirus lockdown and as such I was working from home and also having to look after my nine year-old son who is also at home on lockdown.  My wife and I have arranged to split the days into morning and afternoon shifts, with one of us home-schooling our son while the other works during each shift and extra work squeezed in before and after these shifts.  The arrangement has worked pretty well for all of us this week and I’ve managed to get a fair amount of work done.

This included spotting and requesting fixes for a number of other sites that had started to display scary warnings about their SSL certificates, working on an updated version of the Data Management Plan for the SCOSYA follow-on proposal, fixing some log-in and account related issues for the DSL people and helping Carolyn Jess-Cooke in English Literature with some technical issues relating to a WordPress blog she has set up for a ‘Stay at home’ literary festival (https://stayathomefest.wordpress.com/). I also had a conference call with Katie Halsey and Matt Sangster about the Books and Borrowers project, which is due to start at the beginning of June.  It was my first time using the Zoom videoconferencing software and it worked very well, other than my cat trying to participate several times.  We had a good call and made some plans for the coming weeks and months.  I’m going to try and get an initial version of the content management system and database for the project in place before the official start of the project so that the RAs will be able to use this straight away.  This is of even greater importance now as they are likely to be limited in the kinds of research activities they can do at the start of the project because of travel restrictions and will need to work with digital materials.

Other than these issues I divided my time between three projects.  The first was the Burns Supper map for Paul Malgrati in Scottish Literature.  Paul had sent me some images that are to be used in the map and I spent some time integrating these.  The image appears as a thumbnail with credit text (if available) appearing underneath.  If there is a link to the place the image was taken from, the credit text appears as a link.  Clicking on the image thumbnail opens the full image in a new tab.  I also added links to the videos where applicable, but I decided not to embed the videos in the page as I think these would be too small and there would be just too much going on for locations that have both videos and an image.  Paul also wanted clusters to be limited by areas (e.g. a cluster for Scotland rather than these just being amalgamated into a big cluster for Europe when zooming out) and I investigated this.  I discovered that it is possible to create groups of locations: for example, a new column named ‘cluster’ could be added to the spreadsheet, with all the Scottish locations given ‘Scotland’ and all the South American ones given ‘South America’.  These will then be the top-level clusters and they will not be further amalgamated on zoom out (a sketch of how this per-region clustering can work is included below).  Once Paul gets back to me with the clusters he would like for the data I’ll update things further.  Below is an image of the map with the photos embedded:
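
As mentioned above, per-region clustering can be handled by creating one cluster group per named region.  A minimal sketch of this, assuming Leaflet with the Leaflet.markercluster plugin (‘map’, ‘suppers’ and the ‘cluster’ field are illustrative names, not the map’s actual code):

----

// Create one MarkerCluster group per named region so that markers in
// different regions are never amalgamated into a single cluster on zoom out.
var clusterGroups = {};

suppers.forEach(function (supper) {
  var regionName = supper.cluster || 'Other';

  if (!clusterGroups[regionName]) {
    clusterGroups[regionName] = L.markerClusterGroup();
    map.addLayer(clusterGroups[regionName]);
  }

  var marker = L.marker([supper.lat, supper.lng]).bindPopup(supper.title);
  clusterGroups[regionName].addLayer(marker);
});

----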

The second major project I worked on was the interactive map for Gerry McKeever’s Regional Romanticism project.  Gerry had got back to me with a new version of the data he’d been working on and some feedback from other people he’d sent the map to.  I created a new version of the map featuring the new data and incorporated some changes to how the map worked based on feedback, namely I moved the navigation buttons to the top of the story pane and have made them bigger, with a new white dividing line between the buttons and the rest of the pane.  This hopefully makes them more obvious to people and means the buttons are immediately visible rather than people potentially having to scroll to see them.  I’ve also replaced the directional arrows with thicker chevron icons and have changed the ‘Return to start’ button to ‘Restart’.  I’ve also made the ‘Next’ button on both the overview and the first slide blink every few seconds, at Gerry’s request.  Hopefully this won’t be too annoying for people.  Finally I made the slide number bigger too.  Here’s a screenshot of how things currently look:

My final project of the week was the Historical Thesaurus.  Marc had previously come up with a very good idea of making a nice interactive ‘guess the category’ quiz that would present a word and four possible categories, and the user would have to select the correct category.  I decided to make a start on this during the week.  The first task was to make a script that grabbed a random category from the database (ensuring it was one that contained at least one non-Old English word).  The script then checked the part of speech of this category and grabbed a further three random categories of the same part of speech.  These were all then exported as a JSON file.  I then worked on the quiz page itself, with all of the logic handled in JavaScript.  The page connects to the script to grab the JSON file, then extracts the word and displays it.  It then randomises the order of the four returned categories and displays these as buttons that a user can click on.  I worked through several iterations of the quiz, but eventually I made it so that upon clicking on a choice the script automatically gives a tick or cross and styles the background red or green.  If the guess was incorrect the user can guess again until they get the correct answer, as you can see below (a rough sketch of the client-side logic is also included at the end of this entry):

I then decided to chain several questions together to make the quiz more fun.  Once the correct answer is given a ‘Next’ button appears, leading to a new question.  I set up a ‘max questions’ variable that controls how many questions there are (e.g. 3, 5 or 10) and the questions keep coming until this number is reached.  When the number is reached the user can then view a summary that tells them which words and (correct) categories were included, provides links to the categories and gives the user an overall score.  I decided that if the user guesses correctly the first time they should get one star.  If they guess correctly a second time they get half a star and any more guesses get no stars.  The summary and star ratings for each question are also displayed as the following screenshot shows:

It’s shaping up pretty nicely, but I still need to work on the script that exports data from the database.  Identifying random categories that contain at least one non-OE word and are of the same part of speech as the first randomly chosen category currently means hundreds or even thousands of database calls before a suitable category is returned.  This is inefficient and occasionally the script was getting caught in a loop and timing out before it found a suitable category.  I managed to catch this by having some sample data that loads if a suitable category isn’t found after 1000 attempts, but it’s not ideal.  I’ll need to work on this some more over the next few weeks as time allows.
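
For reference, here is a rough sketch of the client-side answer checking described above.  It is illustrative rather than the actual quiz code: jQuery is assumed, the element IDs and the shape of the question object are made up, and the shuffle is a quick-and-dirty one.

----

// Display a question: show the word, shuffle the four categories into a
// random order and render them as buttons, then mark each guess.
function showQuestion(question) {
  $('#quiz-word').text(question.word);

  // Quick-and-dirty shuffle of the category choices.
  var choices = question.categories.slice().sort(function () {
    return Math.random() - 0.5;
  });

  var buttons = choices.map(function (cat) {
    return $('<button class="quiz-choice">').text(cat).on('click', function () {
      if (cat === question.answer) {
        $(this).addClass('correct').append(' ✓');  // styled green via CSS
        $('#next-question').show();
      } else {
        $(this).addClass('wrong').append(' ✗');    // styled red; the user can guess again
      }
    });
  });

  $('#quiz-choices').empty().append(buttons);
}

----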

Week Beginning 16th March 2020

Last week was a full five-day strike and the end of the current period of UCU strike action.  This week I returned to work, but the Coronavirus situation, which had been gradually getting worse over the past few weeks, ramped up considerably, with the University closed for teaching and many staff working from home.  I came into work from Monday to Wednesday, but the West End was deserted and there didn’t seem much point in using public transport to come into my office when there was no-one else around, so from Thursday onwards I began to work from home, as I will be doing for the foreseeable future.

Despite all of these upheavals, and also suffering from a pretty horrible cold, I managed to get a lot done this week.  Some of Monday was spent catching up with emails that had come in whilst I had been on strike last week, including a request from Rhona Alcorn of SLD to send her the data and sound files from the Scots School Dictionary and responding to Alan Riach from Scottish Literature about some web pages he wanted updated (these were on the main University site, which is not something I am involved with updating).  I also noticed that the version of this site that was being served up was the version on the old server, meaning my most recent blog posts were not appearing.  Thankfully Raymond Brasas in Arts IT Support was able to sort this out.  Raymond had also emailed me about some WordPress sites I manage that had out-of-date versions of the software installed.  There were a couple of sites that I’d forgotten about, a couple that were no longer operational and a couple that had legitimate reasons for being out of date, so I got back to him about those, and also updated my spreadsheet of WordPress sites I manage to ensure the ones I’d forgotten about would not be overlooked again.  I also became aware of SSL certificate errors on a couple of websites that were causing the sites to display scary warning messages before anyone could reach them, so I asked Raymond to fix these.  Finally, Fraser Dallachy, who is working on a pilot for a new Scots Thesaurus, contacted me to see if he could get access to the files that were used to put together the first version of the Concise Scots Dictionary.  We had previously established that any electronic files relating to the printed Scots Thesaurus have been lost and he was hoping that these old dictionary files may contain data that was used in this old thesaurus.  I managed to track the files down, but alas there appeared to be no semantic data in the entries found therein.  I also had a chat with Marc Alexander about a little quiz he would like to develop for the Historical Thesaurus.

I spoke to Jennifer Smith on Monday about the follow-on funding application for her SCOSYA project and spent a bit of time during the week writing a first draft of a Data Management Plan for the application, after reviewing all of the proposal materials she had sent me.  Writing the plan raised some questions and I will no doubt have to revise the plan before the proposal is finalised, but it was good to get a first version completed and sent off.

I also finished work on the interactive map for Gerry McKeever’s Regional Romanticism project this week.  Previously I’d started to use a new plugin to get nice curved lines between markers and all appeared to be working well.  This week I began to integrate the plugin with my map, but unfortunately I was still encountering unusable slowdown with the new plugin.  Everything works fine to begin with, but after a bit of scrolling and zooming, especially around an area with lots of lines, the page becomes unresponsive.  I wondered whether the issue might be related to the midpoint of the curve being dynamically generated from a function I took from another plugin, so I instead made a version that generated and then saved these midpoints so they could be used without needing to be calculated each time.  This would also have meant that we could have manually tweaked the curves to position them as desired, which would have been great as some lines were not ideally positioned (e.g. from Scotland to the US via the North Pole), but even this seemed to make little impact on the performance issues.  I even tried turning everything else off (e.g. icons, popups, the NLS map) to see if I could identify another cause of the slowdown, but nothing worked.  I unfortunately had to admit defeat and resort to using straight lines after all.  These are somewhat less visually appealing, but they result in no performance issues.  Here’s a screenshot of this new version:

With these updates in place I made a version of the map that would run directly on the desktop and sent Gerry some instructions on how to update the data, meaning he can continue to work on it and see how it looks.  But my work on this is now complete for the time being.

I was supposed to meet with Paul Malgrati from Scottish Literature on Wednesday to discuss an interactive map of Burns Suppers he would like me to create.  We decided to cancel our meeting due to the Coronavirus, but continued to communicate via email.  Paul had sent me a spreadsheet containing data relating to the Burns Suppers and I spent some time working on some initial versions of the map, reusing some of the code from the Regional Romanticism map, which in turn used code from the SCOSYA map.

I migrated the spreadsheet to an online database and then wrote a script that exports this data in the JSON format that can be easily read into the map.  The initial version uses OpenStreetMap.HOT as a basemap rather than the .DE one that Paul had selected as the latter displays all place-names in German where these are available (e.g. Großbritannien).  The .HOT map is fairly similar, although for some reason parts of South America look like they’re underwater.  We can easily change to an alternative basemap in future if required.  In my initial version all locations are marked with red icons displaying a knife and fork.  We can use other colours or icons to differentiate types if or when these are available.  The map is full screen with an introductory panel in the top right.  Hovering over an icon displays the title of the event while clicking on it replaces the introductory panel with a panel containing the information about the supper.  The content is generated dynamically and only displays fields that contain data (e.g. very few include ‘Dress Code’).  You can always return to the intro by clicking on the ‘Introduction’ button at the top.
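
A minimal sketch of this initial setup, assuming Leaflet (the tile URL is the standard Humanitarian OSM one, while the JSON endpoint, icon file and field names are illustrative assumptions rather than the map’s actual code):

----

// Full-screen map with the OpenStreetMap HOT basemap.
var map = L.map('burns-map').setView([55.86, -4.25], 2);

L.tileLayer('https://{s}.tile.openstreetmap.fr/hot/{z}/{x}/{y}.png', {
  attribution: '&copy; OpenStreetMap contributors; tiles courtesy of Humanitarian OpenStreetMap Team'
}).addTo(map);

// Red knife-and-fork icon used for every Burns Supper location.
var supperIcon = L.icon({ iconUrl: 'icons/knife-fork-red.png', iconSize: [30, 30] });

// Load the JSON exported from the database and add a marker per supper.
fetch('get-suppers.php')
  .then(function (response) { return response.json(); })
  .then(function (suppers) {
    suppers.forEach(function (supper) {
      L.marker([supper.lat, supper.lng], { icon: supperIcon })
        .addTo(map)
        .bindTooltip(supper.title)      // event title appears on hover
        .on('click', function () {
          showInfoPanel(supper);        // hypothetical function that fills the side panel
        });
    });
  });

----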

I spotted a few issues with the latitude and longitude of some locations that will need to be fixed.  E.g. St Petersburg has Russia as the country but it is positioned in St Petersburg in Florida, while the Bogota Burns night in Colombia is positioned in South Sudan.  I also realised that we might want to think about grouping icons, as when zoomed out it’s difficult to tell where there are multiple closely positioned icons – e.g. the two in Reykjavik and the two in Glasgow.  However, grouping may be tricky if different locations are assigned different icons / types.

After further email discussions with Paul (and being sent a new version of the spreadsheet) I created an updated version of my initial map.  This version incorporates the data from the spreadsheet and incorporates the new ‘Attendance’ field into the pop-up where applicable.  It is also now possible to zoom further out, and also to scroll past the international dateline and still see the data (in the previous version if you did this the data would not appear).  I also integrated the Leaflet plugin MarkerCluster (see https://github.com/Leaflet/Leaflet.markercluster), which very nicely handles clustering of markers.  In this new version of my map markers are now grouped into clusters that split apart as you zoom in.  I also added in an option to hide and show the pop-up area, as on small screens (e.g. mobile phones) the area takes up a lot of space, and if you click on a marker that is already highlighted this now deselects the marker and closes the popup.  Finally, I added a new ‘Filters’ section in the introduction that you can show or hide.  This contains options to filter the data by period.  The three periods are listed (all ‘on’ by default) and you can deselect or select any of them.  Doing so automatically updates the map to limit the markers to those that meet the criteria (a rough sketch of this filtering logic is included below).  This is ‘remembered’ as you click on other markers and you can update your criteria by returning to the introduction.  I did wonder about adding a summary of the selected filters to the popup of every marker, but I think this would just add too much clutter, especially when viewing the map on smaller screens (these days most people access websites on tablets or phones).  Here is an example of the map as it currently looks:
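
The filter sketch below is illustrative only; it assumes jQuery, Leaflet.markercluster, checkbox inputs with a hypothetical ‘period-filter’ class and a ‘suppers’ array with a ‘period’ field, none of which are the map’s actual names:

----

// Redraw the cluster layer so it only contains markers whose period is
// among the currently ticked checkboxes.
var clusters = L.markerClusterGroup().addTo(map);

function redrawMarkers() {
  var selected = $('.period-filter:checked').map(function () {
    return $(this).val();
  }).get();

  clusters.clearLayers();

  suppers.forEach(function (supper) {
    if (selected.indexOf(supper.period) > -1) {
      clusters.addLayer(L.marker([supper.lat, supper.lng]).bindPopup(supper.title));
    }
  });
}

$('.period-filter').on('change', redrawMarkers);
redrawMarkers();

----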

The main things left to do are adding more filters and adding in images and videos, but I’ll wait until Paul sends me more data before I do anything further.  That’s all for this week.  I’ll just need to see how work progresses over the next few weeks as, with the schools now shut, I’ll need to spend time looking after my son in addition to tackling my usual work.