Week Beginning 10th October 2022

I spent quite a bit of time finishing things off for the Speak For Yersel project.  I created a stats page for the project team to access.  The page allows you to specify a ‘from’ and ‘to’ date (it defaults to showing stats from the end of May to the end of the current day).  If you want a specific day, you can enter the same date in both ‘from’ and ‘to’ (e.g. ‘2022-10-04’ will display stats for everyone who registered on the Tuesday after the launch).

The stats relate to users who registered in the selected period rather than answers submitted in the selected period. If a person registered in the selected period then all of their answers are included in the figures, whether they were submitted within the period or not. If a person registered outside of the selected period but submitted answers during it, these answers are not included.
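In outline the inclusion rule works something like the following sketch (the data structures and field names here are purely illustrative rather than the actual code):

```python
from datetime import date

# Hypothetical records: each user has a registration date and each
# answer a user id; field names are illustrative, not the real schema.
users = [
    {"id": 1, "registered": date(2022, 10, 4)},
    {"id": 2, "registered": date(2022, 9, 1)},
]
answers = [
    {"user_id": 1, "submitted": date(2022, 10, 5)},
    {"user_id": 2, "submitted": date(2022, 10, 5)},
]

def answers_for_period(users, answers, date_from, date_to):
    """All answers by users who registered within the period, however
    early or late the answers themselves were submitted."""
    in_period = {u["id"] for u in users
                 if date_from <= u["registered"] <= date_to}
    return [a for a in answers if a["user_id"] in in_period]

# The same date in 'from' and 'to' selects a single day's registrants:
# only user 1's answer is returned, as user 2 registered outside the range.
print(answers_for_period(users, answers, date(2022, 10, 4), date(2022, 10, 4)))
```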

The stats display the total number of users registered in the selected period, split into the number who chose a location in Scotland and those who selected elsewhere.  The total number of survey answers submitted by each of these two groups is then shown, divided into separate sections for the five surveys.  I may need to add more to the page at a later date.  For example, one thing that isn’t currently shown is the number of people who completed each survey, as opposed to only answering a few questions.  Also, I haven’t included stats about the quizzes or activities yet, but these could be added.
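The breakdown itself is essentially just a grouped count, along these lines (again with illustrative field and survey names rather than the real ones):

```python
from collections import Counter

# Count answers grouped by (location type, survey); 'in_scotland' and
# 'survey' are illustrative field names, not the real schema.
def survey_counts(users, answers):
    location = {u["id"]: "scotland" if u["in_scotland"] else "elsewhere"
                for u in users}
    counts = Counter()
    for a in answers:
        counts[(location[a["user_id"]], a["survey"])] += 1
    return counts

users = [{"id": 1, "in_scotland": True}, {"id": 2, "in_scotland": False}]
answers = [{"user_id": 1, "survey": "lexical"},
           {"user_id": 2, "survey": "lexical"},
           {"user_id": 1, "survey": "grammar"}]
print(survey_counts(users, answers))
# Counter({('scotland', 'lexical'): 1, ('elsewhere', 'lexical'): 1,
#          ('scotland', 'grammar'): 1})
```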

I also worked on an abstract about the project for the Digital Humanities 2023 conference.  In preparation for this I extracted all of the text relating to the project from this blog as a record of the development of the project.  It’s more than 21,000 words long and covers everything from our first team discussions about potential approaches in September last year through to the launch of the site last week.  I then went through this and pulled out some of the more interesting sections relating to the generation of the maps, the handling of user submissions and the automatic generation of quiz answers based on submitted data.  I sent this to Jennifer for feedback and then wrote a second version.  Hopefully it will be accepted, but even if it’s not I’d still like to attend, as the DH conference is always useful.

Also this week I attended a talk about a lemmatiser for Anglo-Norman that some researchers in France have developed using the Anglo-Norman dictionary.  It was a very interesting talk and included a demonstration of the corpus that had been constructed using the tool.  I’m probably going to be working with the team at some point later on, sending them some data from the underlying XML files of the Anglo-Norman Dictionary.

I also replaced the Seeing Speech videos with a new set that Eleanor Lawson had generated, mirrored to match the videos we’re producing for the Speech Star project, and investigated how I will get to Zurich for a thesaurus-related workshop in January.

I spent the rest of the week working on the Books and Borrowing project, focusing on the ‘books’ tab in the library page.  I’d started on the API endpoint for this last week, which returned all books for a library and then processed them.  This was required because books have two title fields (standardised and original title), either of which may be blank, so in order to sort the books by title the records first need to be returned to see which ‘title’ field to use.  Ordering by number of borrowings or by author likewise requires all books to be returned and processed.  This works fine for smaller libraries (e.g. Chambers has 961 books), but returning all books for a large library like St Andrews, which has more than 8,500 books, was taking a long time and resulting in a JSON file that was over 6MB in size.
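The reason everything has to be returned first is the fallback logic in the sort key, which in outline looks something like this (field names are assumptions):

```python
# Sort key that falls back from the standardised title to the original
# title, since either field may be blank; field names are assumptions.
def title_key(book):
    return (book.get("standardised_title") or book.get("original_title") or "").lower()

books = [
    {"standardised_title": "", "original_title": "The History of Tom Jones"},
    {"standardised_title": "Aeneid", "original_title": ""},
]
books.sort(key=title_key)  # 'Aeneid' sorts before 'The History...'
```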

I created an initial version of the ‘books’ page using this full dataset, with tabs across the top for each initial letter of the title (browsing by author and by number of borrowings is still to do) and a count of the books under each letter also displayed.  Book records are then displayed in a similar manner to how they appear in the ‘page’ view, but with some additional data, namely total counts of borrowings for the book holding record and counts of borrowings of individual items (if applicable).  These will eventually be linked to the search.

The page looked pretty good and worked well, but it was very inefficient, as the full JSON file needed to be generated and passed to the browser every time a new letter was selected.  Instead I updated the underlying database to add two new fields to the book holding table.  The first stores the initial letter of the title (standardised if present, original if not) and the second stores a count of the total number of borrowings for the holding record.  I wrote a couple of scripts to add this data in, and these will need to be run periodically to refresh the cached fields, as they do not otherwise get updated when changes are made in the CMS.  Having these fields in place means the API scripts will be able to pinpoint and return subsets of the books in the library at the database query level rather than returning all data and then processing it.  This makes things much more efficient as less data is being processed at any one time.
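In miniature the approach looks something like the following, with SQLite standing in for the real database and an assumed schema:

```python
import sqlite3

# A sketch of the cache-refresh idea, using sqlite3 (stdlib) as a
# stand-in for the real database; the schema here is an assumption.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE book_holding (
        id INTEGER PRIMARY KEY, library_id INTEGER,
        standardised_title TEXT, original_title TEXT,
        initial_letter TEXT, borrowing_count INTEGER);
    CREATE TABLE borrowing (id INTEGER PRIMARY KEY, holding_id INTEGER);
    INSERT INTO book_holding VALUES (1, 1, '', 'Tom Jones', NULL, NULL);
    INSERT INTO borrowing (holding_id) VALUES (1), (1);
""")

# Cache the initial letter of the title (standardised if present,
# otherwise original) and the total borrowings per holding record.
cur.execute("""
    UPDATE book_holding
    SET initial_letter = UPPER(SUBSTR(
            CASE WHEN standardised_title != '' THEN standardised_title
                 ELSE original_title END, 1, 1)),
        borrowing_count = (SELECT COUNT(*) FROM borrowing
                           WHERE borrowing.holding_id = book_holding.id)
""")

# With the cached fields in place, one letter's books can be selected
# and ordered entirely at the query level rather than in code:
cur.execute("""
    SELECT id, borrowing_count FROM book_holding
    WHERE library_id = ? AND initial_letter = ?
    ORDER BY borrowing_count DESC
""", (1, "T"))
print(cur.fetchall())  # [(1, 2)]
```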

I still need to add facilities to browse the books by the initial letter of the author’s surname and to list books by number of borrowings, but for now you can at least browse books alphabetically by title.  Unfortunately for large libraries there is still a lot of data to process even when only dealing with a specific initial letter.  For example, there are 1,063 books beginning with ‘T’ in St Andrews, so the returned data still takes quite a few seconds to load.

That’s all for this week.  I’ll be on holiday next week so there won’t be a further report until the week after that.