Week Beginning 10th August 2020

I was back at work this week after spending two weeks on holiday, during which time we went to Skye, Crinan and Peebles.  It was really great to see some different places after being cooped up at home for 19 weeks and I feel much better for having been away.  Unfortunately during this week I developed a pretty severe toothache and had to make an emergency appointment with the dentist on Thursday morning.  It turns out I need root canal surgery and am now booked in to have this next Tuesday, but until then I need to try and just cope with the pain, which has been almost unbearable at times, despite regular doses of both ibuprofen and paracetamol.  This did affect my ability to work a bit on Thursday afternoon and Friday, but I managed to struggle through.

On my return to work from my holiday on Monday I spent some time catching up with emails that had accumulated whilst I was away, including replying to Roslyn Potter in Scottish Literature about a project website, replying to Jennifer Smith about giving a group of students access to the SCOSYA data and making changes to the Berwickshire Place-names website to make it more attractive to the REF reviewers based on feedback passed on by Jeremy Smith.  I also created a series of high-resolution screenshots of the resource for Carole Hough for a publication, and had an email chat with Luca Guariento about linked open data.

I also fixed some issues with the Galloway Glens project that Thomas Clancy had spotted, including an issue with the place-name element page, which was not ordering accented characters properly – all accented characters were being listed at the end rather than alongside their non-accented versions.  It turned out that while the underlying database orders accented characters correctly, for the elements list I need to get a list of elements used in place-names and a list of elements used in historical forms, and then I have to combine these lists and reorder the resulting single list.  This part of the process was not dealing with all accented characters, only a limited set that I’d created for Berwickshire that also dealt with ashes and thorns.  Instead, I added a function taken from WordPress that converts all accented characters to their unaccented equivalents for the purposes of ordering, and this ensured the order of the elements list was correct.
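The live fix uses WordPress’s PHP remove_accents() function, but a minimal Python sketch of the same ordering idea might look like this (the element lists here are hypothetical):

```python
import unicodedata

def strip_accents(text):
    # Decompose accented characters (e.g. 'è' -> 'e' plus a combining
    # grave accent) and drop the combining marks, so that accented
    # forms sort alongside their plain equivalents.  Note that letters
    # such as ash (æ) and thorn (þ) don't decompose this way and would
    # need an explicit mapping, as the earlier Berwickshire code had.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical lists of elements from place-names and historical forms
place_name_elements = ['bàrr', 'achadh', 'èileach']
historical_elements = ['baile', 'crìoch']

# Combine the two lists, remove duplicates and re-sort on the
# unaccented form - the re-sorting step is where the original
# code was only handling a limited set of characters.
combined = sorted(set(place_name_elements + historical_elements),
                  key=lambda e: strip_accents(e).lower())
print(combined)  # ['achadh', 'baile', 'bàrr', 'crìoch', 'èileach']
```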

The rest of my week was divided between three projects, the first of which was the Books and Borrowing project.  For this I spent some time working with some of the digitised images of the register pages.  We now have access to the images from Westerkirk library, and in these the records appear in a table that spreads across both recto and verso pages, but we have images of the individual pages.  The project RA who is transcribing the records is treating both recto and verso as a single ‘page’ in the system, which makes sense.  We therefore need to stitch the recto and verso images together into a single image to be associated with this ‘page’.  I downloaded all of the images and have found a way to automatically join two page images together.  However, there is rather a lot of overlap in the images, meaning the joined image appears to have two joins and some columns are repeated.  I could possibly try to automatically crop the images before joining them, but there is quite a bit of variation in the size of the overlap, so this is never going to be perfect and may result in some information getting lost.  The other alternative would be to manually crop and join the images, which I did some experimentation with.  It’s still not perfect due to the angle of the page changing between shots, but it’s a lot better.  The downside with this approach is that someone would have to do the task manually.  There are about 230 images, so about 115 joins, each one taking 2-3 minutes to create, so maybe about 5 or so hours of effort.  I’ve left it with the PI and Co-I to decide what to do about this.  I also downloaded the images for Volume 1 of the register for Innerpeffray library and created tilesets for these that will allow the images to be zoomed and panned.  I also fixed a bug relating to adding new book items to a record and responded to some feedback about the CMS.
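As a rough illustration of the automatic joining, here’s a minimal sketch using the Python Pillow library that simply pastes the two page photographs side by side (the filenames are hypothetical, and which page sits on the left depends on how the register was photographed):

```python
from PIL import Image

def join_pages(left_path, right_path, out_path):
    left = Image.open(left_path)
    right = Image.open(right_path)

    # Scale the right-hand image to the height of the left-hand one so
    # that the rows of the borrowing table line up as closely as possible.
    if right.height != left.height:
        new_width = round(right.width * left.height / right.height)
        right = right.resize((new_width, left.height))

    # Paste the two pages onto a single new canvas, side by side.
    joined = Image.new('RGB', (left.width + right.width, left.height))
    joined.paste(left, (0, 0))
    joined.paste(right, (left.width, 0))
    joined.save(out_path)

# Hypothetical filenames for one Westerkirk opening
join_pages('westerkirk_01v.jpg', 'westerkirk_02r.jpg', 'westerkirk_opening_01.jpg')
```

A naive paste like this is what produces the doubled-up columns mentioned above: each photograph captures some of the facing page, so a crop would be needed before pasting, and the size of that overlap varies too much between shots for a fixed crop to be reliable.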

My second major project of the week was the Anglo-Norman Dictionary.  This week I began writing a high-level requirements document for the new AND website that I will be developing.  This meant going through the existing site in detail and considering which features will be retained, how things might be handled better, and how I might develop the site.  I made good progress with the document, and by the end of the week I’d covered the main site.  Next week I need to consider the new API for accessing the data and the staff pages for uploading and publishing new or newly edited entries.  I also responded to a few questions from Heather Pagan of the AND about the searches, and read through and gave feedback on a completed draft of the AHRC proposal that the team are hoping to submit next month.

My final major project of the week was the Historical Thesaurus, for which I updated and re-executed my OED date extraction script based on feedback from Fraser and Marc.  It was a long and complicated process to update the script as there are literally millions of dates and some issues only appear a handful of times, so tracking them down and testing things is tricky.

However, I made the following changes.  I added a ‘sortdate_new’ column to the main OED lexeme table that holds the sortdate value from the new XML files, which may differ from the original value.  I’ve done some testing and, rather strangely, there are many occasions where the new sortdate differs from the old but the ‘revised’ flag is not set to ‘true’.  I also updated the new OED date table to include a new column that contains the full date text, as I thought this would be useful for tracing back issues: e.g. if the OED date is ‘?c1225’ this is stored here.  The actual numeric year in my table now comes from the ‘year’ attribute in the XML instead.  This always contains the numeric value in the OED date, e.g. <q year="1330"><date>c1330</date></q>.  New lexemes in the data are now getting added into the OED lexeme table and are also having their dates processed.  I’ve added a new column called ‘newaugust2020’ to track these new lexemes.  We’ll possibly have to try and match them up with existing HT lexemes at some point, unless we can consider them all to be ‘new’, meaning they’ll have no matches.  The script also now stores all of the various OE dates, rather than one single OE date of 650 being added for all.

I set the script running on Thursday and by Sunday it had finished executing, resulting in 3,912,109 dates being added and 4,061 new words.
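Purely as an illustration of the new date storage, here’s a minimal Python sketch of pulling the two values out of a quotation fragment of the shape shown above (the fragment is the example from this post, not the real OED schema):

```python
import xml.etree.ElementTree as ET

# A fragment in the shape described above: the <date> element holds the
# full display form (with markers like '?' and 'c'), while the 'year'
# attribute on <q> holds the plain numeric value.
fragment = '<q year="1330"><date>c1330</date></q>'

q = ET.fromstring(fragment)
numeric_year = int(q.get('year'))      # 1330 - stored as the numeric year
full_date_text = q.find('date').text   # 'c1330' - stored for tracing issues

print(numeric_year, full_date_text)    # 1330 c1330
```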