Week Beginning 6th July 2020

Week 16 of Lockdown and still working from home.  I continued working on the data import for the Books and Borrowers project this week.  I wrote a script to import data from Haddington, which took some time due to the large number of additional fields in the data (15 across Borrowers, Holdings and Borrowings), but executing it resulted in a further 5,163 borrowing records across 2 ledgers and 494 pages being added, along with 1,399 book holding records and 717 borrowers.

I then moved onto the datasets from Leighton and Wigtown.  Leighton was a much smaller dataset, with just 193 borrowing records over 18 pages in one ledger, involving 18 borrowers and 71 books.  As before, I have just created book holding records for these (rather than project-wide edition records), although in this case there are authors for the books too, which I have also created.

Wigtown was another smaller dataset.  The spreadsheet has three sheets: the first is a list of borrowers, the second a list of borrowings and the third a list of books.  However, no unique identifiers are used to connect the borrowers and books to the information in the borrowings sheet and there’s no other field that matches across the sheets to allow the data to be automatically connected up.  For example, in the Books sheet there is the book ‘History of Edinburgh’ by author ‘Arnot, Hugo’, but in the borrowings sheet author surname and forename are split into different columns (so ‘Arnot’ and ‘Hugo’) and book titles don’t match (in this case the book appears as simply ‘Edinburgh’ in the borrowings).  Therefore I’ve not been able to automatically pull in the information from the books sheet.  However, as there are only 59 books in the books sheet it shouldn’t take too much time to manually add the necessary data when creating Edition records.  It’s a similar issue with Borrowers in the first sheet – they appear with the name in one column (e.g. ‘Douglas, Andrew’) but in the Borrowings sheet the names are split into separate forename and surname columns.  There are also instances of people with the same name (e.g. ‘Stewart, John’) but without unique identifiers there’s no way to differentiate these.  There are only 110 people listed in the Borrowers sheet, and only 43 in the actual borrowing data, so again, it’s probably better if any details that are required are added in manually.
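As a rough illustration of the matching problem, the sketch below shows the sort of check involved.  It is a minimal sketch only, assuming the two sheets have been exported to CSV; the filenames and column names (‘Name’, ‘Forename’, ‘Surname’) are hypothetical and don’t reflect the actual spreadsheet headings.  A borrowing can only be linked automatically when exactly one borrower matches the reassembled name, which fails when names don’t match across sheets or when several people share a name.

```python
import csv
from collections import defaultdict

# Build a lookup of borrowers keyed on their 'Surname, Forename' string.
# Filenames and column names here are hypothetical placeholders.
borrowers_by_name = defaultdict(list)
with open('wigtown_borrowers.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        borrowers_by_name[row['Name'].strip().lower()].append(row)

# Try to link each borrowing to exactly one borrower by reassembling the
# split forename/surname columns into the same 'Surname, Forename' form.
with open('wigtown_borrowings.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        key = f"{row['Surname'].strip()}, {row['Forename'].strip()}".lower()
        matches = borrowers_by_name.get(key, [])
        if len(matches) != 1:
            # No match, or several people share the name (e.g. 'Stewart, John'):
            # without unique identifiers these have to be dealt with manually.
            print(f"Cannot link automatically: {key} ({len(matches)} match(es))")
```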

I imported a total of 898 borrowing records for Wigtown.  As there is no page or ledger information in the data I just added these all to one page in a made-up ledger.  It does however mean that the page can take quite a while to load in the CMS.  There are 43 associated borrowers and 53 associated books, which again have been created as Holding records only and have associated authors.  However, there are multiple Book Items created for many of these 53 books – there are actually 224 book items.  This is because the spreadsheet contains a separate ‘Volume’ column and a book may be listed with the same title but a different volume.  In such cases a Holding record is made for the book (e.g. ‘Decline and Fall of Rome’) and an Item is made for each Volume that appears (in this case 12 items for the listed volumes 1-12 across the dataset).  With these datasets imported I have now processed all of the existing data I have access to, other than the Glasgow Professors borrowing records, but these are still being worked on.
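The Holding / Item split falls out of the data along the following lines.  This is a simplified sketch rather than the actual import script, and it assumes a CSV export of the borrowings with hypothetical ‘Title’ and ‘Volume’ columns: the distinct volumes are grouped under each title, giving one Holding per title and one Item per volume that appears.

```python
import csv
from collections import defaultdict

# Collect the distinct volumes that appear for each book title in the
# borrowings data.  Filename and column names are hypothetical.
volumes_per_title = defaultdict(set)
with open('wigtown_borrowings.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        volumes_per_title[row['Title'].strip()].add(row['Volume'].strip())

for title, volumes in sorted(volumes_per_title.items()):
    # One Holding record per distinct title...
    print(f"Holding: {title}")
    for volume in sorted(volumes):
        # ...and one Item record per volume listed against that title,
        # e.g. 12 items where volumes 1-12 appear across the dataset.
        print(f"  Item: volume {volume or '(none given)'}")
```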

I did some other tasks for the project this week as well, including reviewing the digitisation policy document for the project, which lists guidelines for the team to follow when they have to take photos of ledger pages themselves in libraries where no professional digitisation service is available.  I also discussed how borrower occupations will be handled in the system with Katie.

In addition to the Books and Borrowers project I found time to work on a number of other projects this week too.  I wrote a Data Management Plan for an AHRC Networking proposal that Carolyn Jess-Cooke in English Literature is putting together and I had an email conversation with Heather Pagan of the Anglo-Norman Dictionary about the Data Management Plan she wants me to write for a new AHRC proposal that Glasgow will be involved with.  I responded to a query about a place-names project from Thomas Clancy, a query about App certification from Brian McKenna in IT Services and a query about domain name registration from Eleanor Lawson at QMU.  Also (outside of work time) I’ve been helping my brother-in-law set up Beacon Genealogy, through which he offers genealogy and family history research services.

Also this week I worked with Jennifer Smith to make a number of changes to the content of the SCOSYA website (https://scotssyntaxatlas.ac.uk/) to provide more information about the project for REF purposes and I added a new dataset to the interactive map of Burns Suppers that I’m creating for Paul Malgrati in Scottish Literature.  I also went through all of the WordPress sites I manage and upgraded them to the most recent version of WordPress.

Finally, I spent some time writing scripts for the DSL people to help identify child entries in the DOST and SND datasets that haven’t been properly merged with main entries when exported from their editing software.  In such cases the child entries have been added to the main entries, but they haven’t then been removed as separate entries in the output data, meaning the child entries appear twice.  When attempting to process the SND data I discovered there were some errors in the XML file (mismatched tags) that prevented my script from processing the file, so I had to spend some time tracking these down and fixing them.  But once this had been done my script could go through the entire dataset, look for an ID that appeared as a URL in one entry and as the ID of another entry and, in such cases, pull out the IDs and the full XML of each entry and export them into an HTML table.  There were about 180 duplicate child entries in DOST but a lot more in SND (the DOST file is about 1.5MB, the SND one is about 50MB).  Hopefully once the DSL people have analysed the data we can then strip out the unnecessary child entries and have a better dataset to import into the new editing system the DSL is going to be using.
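The logic is roughly as follows.  This is a simplified sketch rather than the actual script: it assumes a single XML file of <entry> elements with ‘id’ attributes and cross-references held in ‘url’ attributes (the real element and attribute names in the DOST and SND exports differ), and it only pulls out the child entry’s XML for the report.

```python
import html
import xml.etree.ElementTree as ET

# Parse the full dataset and index every entry by its ID.  The element and
# attribute names ('entry', 'id', 'url') are placeholders for the real ones.
tree = ET.parse('snd.xml')
root = tree.getroot()
entries_by_id = {e.get('id'): e for e in root.iter('entry') if e.get('id')}

# Collect every ID that appears as a URL (cross-reference) anywhere in the file.
referenced_ids = set()
for el in root.iter():
    url = el.get('url')
    if url:
        referenced_ids.add(url.rstrip('/').split('/')[-1])

# An entry whose ID also appears as a URL in another entry is a candidate
# duplicate: its content has been merged into a main entry but the child
# entry still exists separately in the output data.
rows = []
for entry_id in sorted(referenced_ids & set(entries_by_id)):
    entry_xml = ET.tostring(entries_by_id[entry_id], encoding='unicode')
    rows.append(f"<tr><td>{entry_id}</td><td><pre>{html.escape(entry_xml)}</pre></td></tr>")

with open('duplicate_child_entries.html', 'w', encoding='utf-8') as out:
    out.write('<table>' + ''.join(rows) + '</table>')
```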