Week Beginning 8th June 2020

This was week 12 of Lockdown and on Monday I arranged to get access to my office at work in order to copy some files from my work PC.  There were some scripts that I needed for the Historical Thesaurus, Fraser’s Scots Thesaurus and the Books and Borrowing projects so I reckoned it was about time to get access.  It all went pretty smoothly, thankfully.  My train into Central was very quiet – I think there were only about five people in my carriage, and none of them were near me.  I walked to the West End and called security to let them know I’d arrived, then got into my office and spent about an hour and a half copying files and doing some work tasks.  It was a bit strange to be back in my office after so long, with my calendar still showing March.  Once the files were all copied I left the building, checked out with security and walked back through a still deserted town to Central.  My train carriage was completely empty on the way back home.

I spent most of the rest of the week continuing with my work on the Books and Borrowing project.  My main task was importing sample data into the content management system.  Matt had sent me the latest copy of the Glasgow Student data over the weekend, and once I had the data processing scripts from the PC at work I could process his spreadsheet and upload it to the pilot project database.  Processing the Glasgow Student data was not entirely straightforward, as the transcriber had used Microsoft Office formatting in the spreadsheet cells to replicate features such as superscript text and strikethroughs.  It is a bit of a pain to export an Excel spreadsheet as plain text while retaining such formatting, but thankfully I’d solved that issue previously: my script takes an Excel file that has been saved as HTML, picks out the formatting worth keeping and ditches all of the horrible HTML that Microsoft adds to Office files saved in that format.
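For anyone interested, the general approach can be sketched in a few lines of Python (a hypothetical illustration rather than the actual script): keep the handful of tags that represent genuine formatting and unwrap everything else.

```python
# A minimal sketch, assuming the spreadsheet has been saved from Excel as a
# 'Web Page' (HTML) file and that BeautifulSoup is available.
from bs4 import BeautifulSoup

KEEP_TAGS = {"sup", "s", "strike"}  # the formatting worth preserving

def clean_cell(cell_html):
    """Strip Office markup from one cell, keeping superscript/strikethrough."""
    soup = BeautifulSoup(cell_html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name in KEEP_TAGS:
            tag.attrs = {}   # keep the tag itself, drop Office attributes
        else:
            tag.unwrap()     # remove the tag but keep its text content
    return str(soup).strip()

print(clean_cell('<td class="xl65"><span style="mso-spacerun:yes">W<sup>m</sup> Cave</span></td>'))
# -> W<sup>m</sup> Cave
```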

Once the Glasgow Student data had been uploaded to the pilot project website I could then migrate it to the Books and Borrowing data structure.  It took the best part of a day to write a script that processed the data, dealing with issues like multiple book levels, additional fields and generating ledgers and pages.  After the migration there were 3 ledgers, 403 pages and 8191 borrowing records, with associations to 832 borrowers and 1080 books.  With this in place I then began to import sample data from a previous study of Innerpeffray library.  This was also in a spreadsheet, but it was structured very differently and I needed to write a separate data import script to process it.  There were some additional complications due to the character encoding the spreadsheet uses, which resulted in lots of hidden special characters being embedded in the text when the spreadsheet was converted to a plain text file for upload.  This really messed up the upload process and took some time to get to the bottom of.  There is also variation in page numbering (e.g. sometimes ‘3r’, sometimes ‘3 r’), which resulted in multiple pages being created for each variation before I spotted the issue.  In addition, the spreadsheet is not always listed in page order – there were records from earlier pages added in amongst later pages – which also messed up the upload process before I spotted it and updated my script to take it into consideration.  Finally, some data containing accented characters was failing to upload, but I think I’ve got to the bottom of that too.
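The two page-related fixes boil down to something like the following (a rough Python sketch with invented sample rows, not the actual import script):

```python
# Hypothetical sketch: collapse page-number variants, strip hidden control
# characters and re-order rows by page before page records are generated.
import re

def normalise_page(ref):
    """Treat '3r', '3 r' and '3 R' as the same page."""
    return re.sub(r"\s+", "", ref).lower()

def strip_hidden(text):
    """Remove control characters left behind by the encoding conversion."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)

rows = [
    {"page": "3 r", "entry": "Borrowed: Histories"},
    {"page": "2v",  "entry": "Borrowed: Sermons"},
]
for row in rows:
    row["page"] = normalise_page(strip_hidden(row["page"]))

# Records from earlier pages can appear amongst later ones, so sort by the
# numeric part of the page reference before creating page records.
rows.sort(key=lambda r: (int(re.match(r"\d+", r["page"]).group()), r["page"]))
print([r["page"] for r in rows])  # -> ['2v', '3r']
```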

As with the Glasgow data, I created editions from holdings.  I did add in a check to see whether any of the Glasgow editions matched the Innerpeffray titles, using the existing Glasgow edition where this situation arose, but due to the differences in transcription I don’t think any existing editions have been used.  This will need some manual correction at some point.  Similarly, some existing Glasgow authors might have been used rather than repeating the same information from Innerpeffray, but due to differences in transcription I don’t think this will have happened either.  As before, author data has for now just been uploaded into the ‘surname’ field and will need to be manually split up further, and some Glasgow and Innerpeffray authors will need to be merged.  For example, in the Glasgow data we have ‘Cave, William, 1637-1713.’ whereas in Innerpeffray we have ‘Cave, William, 1637-1713’.  Because of the full stop at the end of the Glasgow author these have ended up being inserted as separate authors.  After the upload process was complete there were 6550 borrowing records for Innerpeffray, split over 340 pages in one ledger.  A total of 1017 unique borrowers and 840 unique book holdings were added to the library.
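A simple normalisation at import time would have caught this particular near-duplicate, along these lines (a hypothetical sketch; the real author records obviously contain more than a single string):

```python
def author_key(name):
    """Match author strings regardless of trailing punctuation and case."""
    return name.strip().rstrip(".,;").lower()

print(author_key("Cave, William, 1637-1713.") == author_key("Cave, William, 1637-1713"))
# -> True
```

That said, normalised string matching can only go so far with variant transcriptions, which is why the remaining duplicates will need to be merged manually.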

Once the sample data for these two libraries was in place I created user accounts for the rest of the team so they could access the CMS and test things out.  The project PI, Katie Halsey, spotted an issue with the autocomplete for selecting an existing edition not working, so I spent some time investigating this.  It turns out that there are more character encoding issues with the data, which are causing the JSON file that is generated for the autocomplete to be invalid.  The same thing is happening with the AJAX script that populates the fields once an autocomplete option is selected.  I only investigated this on Friday afternoon and didn’t have time to fix it, but I’m hoping that if I fix the character encoding issues next week and ensure all line break characters are removed from the data then things will be ok.
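The planned fix amounts to a clean-up step like this before the JSON is generated (a Python sketch with made-up field names; the actual CMS code is different):

```python
import json

def sanitise(value):
    """Remove line breaks and any characters that won't survive as UTF-8."""
    value = value.replace("\r", " ").replace("\n", " ")
    return value.encode("utf-8", errors="replace").decode("utf-8")

editions = [{"title": sanitise("The Works of\r\nWilliam Cave")}]
print(json.dumps(editions, ensure_ascii=False))
# -> [{"title": "The Works of  William Cave"}]
```

Raw, unescaped line breaks inside string values and invalid byte sequences in the source data are the usual culprits when a generated JSON file fails to parse, so cleaning both should hopefully sort out the autocomplete and the AJAX script at the same time.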

Other than the Books and Borrowing project, I spoke to Rhona Alcorn of the DSL this week to discuss timescales for DSL developments.  I also fixed an issue with the Android version of the Scots School Dictionary app.  I gave some advice to Cris Sarg, who is managing the data for the Glasgow Medical Humanities project, and I made some further tweaks to the ‘export data for publication’ facilities for Carole Hough’s REELS project.

I rounded off the week by working on sorting out the new way of storing dates for the Historical Thesaurus.  Although we’d previously decided on a structure for the new dates system (which is much more rational and will allow labels to be associated with specific dates rather than the lexeme as a whole) I hadn’t generated the actual new date data.  My earlier script (which I retrieved from my office on Monday) instead iterated through each lexeme, generated the new date information and only outputted data if the generated full date did not match the original full date.  I’d saved this output as a spreadsheet and Fraser had gone through the rows and identified any that needed fixed, updating the spreadsheet as required.  I then wrote a script to fix the date columns that needed fixed in order for the new fulldate to be properly generated.
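For illustration, the checking logic boils down to something like this (a much-simplified Python sketch with invented data; the real fulldate generation has to handle ranges, Old English dates, brackets and so on):

```python
def build_fulldate(dates):
    """Rebuild a display date from individual date values (much simplified)."""
    return "-".join(str(d) for d in dates)

lexemes = [
    {"id": 1, "fulldate": "1500-1600", "dates": [1500, 1600]},
    {"id": 2, "fulldate": "1500-",     "dates": [1500, 1650]},  # mismatch
]

# Only output rows where the regenerated date disagrees with the original,
# leaving a manageable spreadsheet of rows for manual checking.
mismatches = [
    (lx["id"], lx["fulldate"], build_fulldate(lx["dates"]))
    for lx in lexemes
    if build_fulldate(lx["dates"]) != lx["fulldate"]
]
print(mismatches)  # -> [(2, '1500-', '1500-1650')]
```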

With that in place I then wrote a script to generate the new date information for each of the more than 700,000 lexemes in the system.  I tried running this on the server initially, but it quickly timed out, meaning I had to run the script locally.  The script took about 20 hours to run, but seems to have worked successfully, with almost 1.4 million date rows generated for the lexemes.  I will now be able to import this table into the online database.  Hopefully next week I’ll find the time to work on this some more.
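The shape of the job is roughly as follows (a sketch using SQLite as a stand-in for the project’s database; table and column names are invented): batched reads and commits, with no web-request time limit to hit.

```python
import sqlite3  # local stand-in for the real database

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE lexemes (id INTEGER, oldstart INTEGER, oldend INTEGER)")
cur.execute("CREATE TABLE dates (lexeme_id INTEGER, start INTEGER, end INTEGER)")
cur.executemany("INSERT INTO lexemes VALUES (?,?,?)", [(1, 1500, 1600), (2, 1650, 1710)])
conn.commit()

BATCH = 10000   # commit in batches; a web request would time out long before the end
offset = 0
while True:
    rows = cur.execute(
        "SELECT id, oldstart, oldend FROM lexemes LIMIT ? OFFSET ?", (BATCH, offset)
    ).fetchall()
    if not rows:
        break
    # The real date-generation logic goes here; one lexeme can yield several rows.
    cur.executemany("INSERT INTO dates VALUES (?,?,?)", rows)
    conn.commit()
    offset += BATCH

print(cur.execute("SELECT COUNT(*) FROM dates").fetchone()[0])  # -> 2
```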