Week Beginning 24th October 2022

I returned to work this week after having a lovely week’s holiday in the Lake District.  I spent most of the week working for the Books and Borrowing project.  I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system.  This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS.  Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’.  Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS.  For example, in Volume 1 we have ‘1010199l’ followed by ‘1010203l’ followed by ‘1010205l’ and then ‘1010205r’.  This appears to be correct, as the missing pages don’t contain borrowing records.  However, I still needed to figure out how to match up images and page records.  As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page, or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image.  There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image.  This left joining up the page records in the CMS as the best option, and I wrote a script that joins each pair of page records, renames the result to remove the ‘L’ and ‘R’ suffixes, moves all borrowing records across, renumbers the page order and then deletes the now empty pages.  Thankfully it all seemed to work well.  I also uploaded the images for the final register from the Royal High School, which thankfully was a much more straightforward process as all image files matched references already stored in the CMS for each page.
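
In case it’s useful to see roughly what the script does, here’s a simplified sketch of the joining logic in Python.  The real script runs against the actual CMS database, so the table and column names here (‘pages’, ‘borrowings’, ‘folio_ref’ and so on) are made up purely for illustration:

```python
# Simplified sketch of the page-joining logic. Table and column names
# ('pages', 'borrowings', 'folio_ref', 'page_order') are illustrative only.
import re
import sqlite3

def join_page_pairs(conn: sqlite3.Connection, register_id: int) -> None:
    cur = conn.cursor()
    cur.execute(
        "SELECT id, folio_ref, page_order FROM pages "
        "WHERE register_id = ? ORDER BY page_order",
        (register_id,),
    )
    pages = cur.fetchall()

    # Group 'l' and 'r' page records that share the same base reference,
    # e.g. '1010205l' and '1010205r' both belong to '1010205'.
    groups = {}  # base reference -> list of (page_id, folio_ref, page_order)
    for page_id, folio_ref, page_order in pages:
        base = re.sub(r"[lr]$", "", folio_ref)
        groups.setdefault(base, []).append((page_id, folio_ref, page_order))

    new_order = 0
    for base, members in groups.items():
        new_order += 1
        keep_id = members[0][0]  # keep the first page record of the pair
        # Rename the kept record to the base reference (no 'l'/'r' suffix)
        # and renumber it sequentially.
        cur.execute(
            "UPDATE pages SET folio_ref = ?, page_order = ? WHERE id = ?",
            (base, new_order, keep_id),
        )
        # Move borrowing records across from the other page of the pair,
        # then delete the now empty page record.
        for page_id, _, _ in members[1:]:
            cur.execute(
                "UPDATE borrowings SET page_id = ? WHERE page_id = ?",
                (keep_id, page_id),
            )
            cur.execute("DELETE FROM pages WHERE id = ?", (page_id,))
    conn.commit()
```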

I then returned to the development of the front-end for the project.  When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load.  After a bit of investigation I realised that this was caused by newline characters appearing in the JSON data for the map, which was invalidating the file structure.  These had been added via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map.  I updated the script that generates the JSON data to ensure that newline characters are stripped out, and after that the map loaded again.
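
The fix itself is trivial.  A minimal sketch of the idea follows; the real script generates the map JSON from the CMS database, so the field names here are just for illustration:

```python
# Minimal sketch of stripping newlines before they end up in the map JSON.
# Field names such as 'name_variants' are illustrative only.
import json

def clean(value):
    # Collapse newlines, carriage returns and runs of whitespace into
    # single spaces so they can't invalidate the JSON used by the popups.
    return " ".join((value or "").split())

def library_feature(row):
    return {
        "type": "Feature",
        "geometry": {"type": "Point",
                     "coordinates": [row["longitude"], row["latitude"]]},
        "properties": {
            "name": clean(row["name"]),
            "name_variants": clean(row["name_variants"]),
        },
    }

def write_map_json(rows, out_path):
    features = [library_feature(r) for r in rows]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)
```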

Before I went on holiday I’d created a browse page for library books that splits a library’s books up based on the initial letter of their titles.  The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue as it contains many more books than the other libraries (more than 8,500).  Project Co-I Matt Sangster suggested that we should omit some registers from the data, as their contents (including book records) are not likely to be worked on during the course of the project.  However, I decided to leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code: the book data for a library is associated directly with the library record rather than with specific registers, so all the queries would need to be rewritten to check which registers a book appears in.  I reckon that if these registers are not going to be tackled by the project it might be better to delete them altogether, not just to make the system work better but to avoid confusing users with messy data, but for the moment I’ve left everything as it is.

This week I added two further ways of browsing books in a selected library: by author and by most borrowed.  A drop-down list featuring the three browse options now appears at the top of the ‘Books’ page, and I’ve added a title and explanatory paragraph about the list type.  The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a series of tabs for the initial letter of the author’s surname and a count of the number of books with an author whose surname begins with that letter.  Note that any books that don’t have an associated author do not appear in this list.  I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load.  Note also that if a book has multiple authors it will appear multiple times, once for each author.  Here’s a screenshot of how the interface currently looks:

The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books.  The ordering of the records is by author surname, then forename, then author ID, then book title.  This means two authors with the same name will still appear as separate headings, with their books ordered alphabetically.  However, this has also uncovered some issues with duplicate author records.

Getting this browse list working actually took a huge amount of effort due to the complex way we store authors.  In our system an author can be associated with any one of four levels of book record (work / edition / holding / item), and an author associated at a higher level needs to cascade down to lower-level book records.  Running queries directly on this structure proved to be too resource-intensive and slow, so instead I wrote a script to generate cached data about authors.  This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record.  It then stores a reference to the ID of the author, the ID of the holding record and the initial letter of the author’s surname in a new table that is much more efficient to query.  This table is then used to generate the letter tabs with their book counts and to work out which books to return when a letter is selected.
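
As a rough illustration of how the cache works, here’s a sketch in Python.  All of the table and column names (‘author_holding_cache’, ‘book_authors’ and so on) are hypothetical stand-ins for the real schema:

```python
# Rough sketch of the author cache build and one query that uses it.
# Table and column names are hypothetical stand-ins for the real schema.
import sqlite3

def build_author_cache(conn: sqlite3.Connection) -> None:
    cur = conn.cursor()
    cur.execute("DELETE FROM author_holding_cache")
    # Collect author connections at every level (work / edition / holding /
    # item) and cascade them down to the holding record, de-duplicating as
    # we go (UNION removes duplicate pairs).
    cur.execute("""
        SELECT h.id AS holding_id, ba.author_id AS author_id
          FROM holdings h JOIN book_authors ba ON ba.holding_id = h.id
        UNION
        SELECT i.holding_id, ba.author_id
          FROM items i JOIN book_authors ba ON ba.item_id = i.id
        UNION
        SELECT h.id, ba.author_id
          FROM holdings h JOIN book_authors ba ON ba.edition_id = h.edition_id
        UNION
        SELECT h.id, ba.author_id
          FROM holdings h
          JOIN editions e ON e.id = h.edition_id
          JOIN book_authors ba ON ba.work_id = e.work_id
    """)
    pairs = cur.fetchall()
    for holding_id, author_id in pairs:
        row = cur.execute(
            "SELECT surname FROM authors WHERE id = ?", (author_id,)
        ).fetchone()
        if not row or not row[0]:
            continue  # skip authors without a surname
        cur.execute(
            "INSERT INTO author_holding_cache (holding_id, author_id, letter) "
            "VALUES (?, ?, ?)",
            (holding_id, author_id, row[0][0].upper()),
        )
    conn.commit()

def author_letter_tabs(conn: sqlite3.Connection, library_id: int):
    # Counts behind the initial-letter tabs on the 'by author' browse page.
    return conn.execute("""
        SELECT c.letter, COUNT(*) AS num_books
          FROM author_holding_cache c
          JOIN holdings h ON h.id = c.holding_id
         WHERE h.library_id = ?
         GROUP BY c.letter
         ORDER BY c.letter
    """, (library_id,)).fetchall()
```

The per-letter book list is then just a join from this cache back to the authors and holdings tables, ordered by surname, forename, author ID and title as described above.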

However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes or additions to authors made in the CMS will not be directly reflected in the library books tab.  This is also true of the ‘browse books by title’ lists I previously created.  I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised it was because the title must have been changed in the CMS since I last generated the cached data.

The ‘most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed.  Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view.  I did consider whether to have tabs allowing you to view all of the books ordered by number of borrowings, but I wasn’t really sure how useful this would be.  In terms of the display, the ‘top 100’ books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see.  I’ve also added a number to the top-left of each book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:
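
As for the query behind this view, it’s about as simple as they come.  A sketch, again with illustrative names (I’m assuming here that the cached borrowing count lives in a ‘num_borrowings’ column on the holding record):

```python
# Sketch of the 'top 100 most borrowed' listing, assuming a cached
# 'num_borrowings' column on the holdings table (illustrative names only).
def top_borrowed(conn, library_id, limit=100):
    rows = conn.execute(
        "SELECT id, title, num_borrowings FROM holdings "
        "WHERE library_id = ? "
        "ORDER BY num_borrowings DESC, title ASC LIMIT ?",
        (library_id, limit),
    ).fetchall()
    # Number each record so its place in the 'hitlist' can be displayed
    # in the top-left of the book record.
    return [(rank, row) for rank, row in enumerate(rows, start=1)]
```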

I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot).  Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.

Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023.  As it’s the first ‘in person’ DH conference to be held in Europe since 2019, I suspect there will be a huge number of paper submissions, so we’ll just need to see whether it gets accepted.  Also for Speak For Yersel, I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas.  The biggest issue here would be generating the data about the areas: settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files.  It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region, which might be a couple of weeks of effort for each area.  It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface.  We’ll see how this develops.  I also wrote a script to export the survey data for further analysis.
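
The export script itself is nothing sophisticated; roughly along these lines, with an illustrative table name standing in for the real survey response tables:

```python
# Rough sketch of exporting survey responses to CSV for further analysis.
# The database file and 'responses' table name are illustrative only.
import csv
import sqlite3

def export_responses(conn, out_path):
    cur = conn.execute("SELECT * FROM responses ORDER BY id")
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)

# Example call (file names are made up for illustration):
export_responses(sqlite3.connect("speak_for_yersel.db"), "responses.csv")
```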

Another project I spent some time on this week was Speech Star.  For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me.  I imported all of the data into the same database as is used for the non-disordered speech database and added a flag that determines which content is displayed on which page.  I removed ‘accent’ as a filter option (as all speakers are from the same area) and added ‘error type’.  Currently the ‘age’ filter defaults to the 0-17 age group, as I wasn’t sure how this filter should work given that all of the speakers are children.

The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab; on the new page these tabs are ‘error type’ and ‘word’.  I also added ‘phonemic target’ and ‘phonetic production’ as new columns in the table, as I thought it would be useful to include these.  In addition, I updated the video pop-up for both the new page and the non-disordered page to bring it into line with the popup for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below.  I’ve ensured this is exactly the same for the ‘multiple video’ display too.  At the moment the metadata (other than speaker ID, sex and age) all appears on one long line so the full width of the popup is used, but we might change this to a two-column layout.

Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet, so I replaced the data.  However, doing so highlighted a possible issue with the way I structure the data.  I’d noticed a typo in the earlier spreadsheet (it contained both a ‘helicopter’ and a ‘helecopter’) and fixed it, but I forgot to apply the same fix to the newer file before uploading it.  Each prompt is only stored once in the database, even if it is used by multiple speakers, so my plan was to go into the database, remove the redundant ‘helecopter’ prompt row and point its speaker at the existing ‘helicopter’ prompt.  However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’.  I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database, where the sound is indeed ‘l’.  It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with, so I’m going to have to update the structure next week.
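
One possible restructuring, sketched below, would be to move the sound off the prompt record and into a separate prompt-sound link table, so the same word can carry different sounds in different databases.  All the names here are hypothetical; it’s just one option I’ll weigh up when I update the structure:

```python
# Hypothetical sketch of one way to allow a prompt to have more than one
# associated sound: store sounds in a separate link table rather than as a
# single column on the prompt itself. All names are illustrative.
import sqlite3

SCHEMA = """
CREATE TABLE prompts (
    id   INTEGER PRIMARY KEY,
    word TEXT UNIQUE              -- e.g. 'helicopter'
);
CREATE TABLE prompt_sounds (
    id        INTEGER PRIMARY KEY,
    prompt_id INTEGER REFERENCES prompts(id),
    sound     TEXT,               -- e.g. 'l' or 'k'
    dataset   TEXT                -- e.g. 'non-disordered' or 'child-error'
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO prompts (word) VALUES ('helicopter')")
conn.executemany(
    "INSERT INTO prompt_sounds (prompt_id, sound, dataset) VALUES (1, ?, ?)",
    [("l", "non-disordered"), ("k", "child-error")],
)
conn.commit()
```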

Also this week I responded to a request for advice from David Featherstone in Geography, who is putting together some sort of digitisation project.  I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary.  She wondered whether I had the original XML and I explained that there was no original XML: the original data was stored in an ancient FoxPro database that ran from a CD.  When I created the original School Dictionary app I managed to find a way to extract the data and saved it as two CSV files, one English-Scots and the other Scots-English.  I then ran a script to convert these into JSON, which is what the original app uses.  I gave Pauline a link to download all of the data for the app, including both the English and Scots JSON files and the sound files, and I also uploaded the English CSV file in case it would be more useful.
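
For reference, the CSV-to-JSON conversion was nothing fancy; roughly the following, with illustrative file names, and the column headings simply taken from the CSV header row:

```python
# Rough sketch of the CSV-to-JSON conversion used for the School Dictionary
# app data. File names are illustrative only.
import csv
import json

def csv_to_json(csv_path, json_path):
    with open(csv_path, newline="", encoding="utf-8") as f:
        entries = list(csv.DictReader(f))  # one dict per dictionary entry
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)

csv_to_json("english-scots.csv", "english-scots.json")
csv_to_json("scots-english.csv", "scots-english.json")
```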

That’s all for this week.  Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.