
Month: October 2022
Week Beginning 24th October 2022
I returned to work this week after having a lovely week’s holiday in the Lake District. I spent most of the week working for the Books and Borrowing project. I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system. This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS. Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’. Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS. For example in Volume 1 we have ‘1010199l’ followed by ‘1010203l’ followed by ‘1010205l’ and then ‘1010205r’. This seems to be quite correct as the missing pages don’t contain borrowing records. However, I still needed to figure out how to match up images and page records. As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image. There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image. This then left joining up the page records in the CMS as the best option and I wrote a script to join the page records, rename them to remove the ‘L’ and ‘R’ suffixes, move all borrowing records across, renumber the page order and then delete the now empty pages. Thankfully it all seemed to work well. I also uploaded the images for the final register from the Royal High School, which thankfully was a much more straightforward process as all image files matched references already stored in the CMS for each page.
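As a rough illustration of the joining step, here is a minimal Python sketch. It is not the script I actually ran: the field names ('ref', 'records', 'pageorder') and the in-memory list of pages are assumptions made for the example, and the real work happens against the CMS database.

```python
import re

def merge_double_pages(pages):
    """Join 'L' and 'R' page records that share a base reference.

    pages: list of dicts in page order, each with a 'ref' such as
    '1010205l' and a list of borrowing 'records' (hypothetical fields).
    Returns one merged page per base reference, renumbered from 1.
    """
    merged = {}  # dicts keep insertion order, so page order is preserved
    for page in pages:
        # Strip a trailing 'l' or 'r' to get the double-page reference
        base = re.sub(r"[lr]$", "", page["ref"], flags=re.IGNORECASE)
        merged.setdefault(base, []).extend(page["records"])
    return [
        {"ref": base, "pageorder": position, "records": records}
        for position, (base, records) in enumerate(merged.items(), start=1)
    ]
```

A page that only exists as ‘L’ (or only as ‘R’) simply becomes a merged page on its own, which matches the situation where the other side holds no borrowing records.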
I then returned to the development of the front-end for the project. When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load. After a bit of investigation I realised that this was caused by new line characters appearing in the JSON data for the map, which was invalidating the file structure. These had been added in via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map. I needed to update the script that generated the JSON data to ensure that new line characters were stripped out of the data, and after that the maps loaded again.
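The fix can be sketched along the following lines in Python. This is only an illustration under assumptions: the field names and the sample structure are invented, and the real generating script is not shown here. The idea is to collapse any line breaks in the text fields and to let a JSON encoder do the serialising, so stray control characters cannot invalidate the file.

```python
import json

def clean_text(value):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(value.split())

def build_map_json(libraries):
    """Build the JSON for the map popups from a list of library dicts.

    The keys used here ('name', 'variants', 'lat', 'lng') are hypothetical.
    """
    features = [
        {
            "name": clean_text(library["name"]),
            "variants": clean_text(library.get("variants", "")),
            "lat": library["lat"],
            "lng": library["lng"],
        }
        for library in libraries
    ]
    # json.dumps escapes anything that would otherwise break the file
    return json.dumps(features)
```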
Before I went on holiday I’d created a browse page for library books that split the library’s books up based on the initial letter of their titles. The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue due to it containing many more books than the other libraries (more than 8,500). Project Co-I Matt Sangster suggested that we should omit some registers from the data as their contents (including book records) are not likely to be worked on during the course of the project. However, I decided to just leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code. The book data for a library is associated directly with the library record and not the specific registers and all the queries would need to be rewritten to check which registers a book appears in. I reckon that if these registers are not going to be tackled by the project it might be better to just delete them, not just to make the system work better but to avoid confusing users with messy data, but I decided to leave everything as it is for now.
This week I added in two further ways of browsing books in a selected library: By author and by most borrowed. A drop-down list featuring the three browse options appears at the top of the ‘Books’ page now, and I’ve added in a title and explanatory paragraph about the list type. The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a list of initial letter tabs featuring the initial letter of the author’s surname and a count of the number of books that have an author with a surname beginning with this letter. Note that any books that don’t have an associated author do not appear in this list. I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load in. Note also that if a book has multiple authors then the book will appear multiple times – once for each author. Here’s a screenshot of how the interface currently looks:
The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books. The ordering of the records is by author surname then forename then author ID then book title. This means two authors with the same name will still appear as separate headings with their books ordered alphabetically. However, this has also uncovered some issues with duplicate author records.
Getting this browse list working actually took a huge amount of effort due to the complex way we store authors. In our system an author can be associated with any one of four levels of book record (work / edition / holding / item) and an author associated at a higher level needs to cascade down to lower level book records. Running queries directly on this structure proved to be too resource intensive and slow so instead I wrote a script to generate cached data about authors. This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record. It then stores a reference to the ID of the author, the holding record and the initial letter of the author’s surname in a new table that is much more efficient to reference. This then gets used to generate the letter tabs with the number of book counts and to work out which books to return when an author surname beginning with a letter is selected.
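To give an idea of the shape of the cached data, here is a small Python sketch. It assumes the authors for each holding have already been worked out from the four levels, and the data structures and function names are invented for the example rather than taken from the real script. Counting each holding once per letter is also an assumption about how the tab counts should behave.

```python
from collections import Counter

def build_author_cache(holding_authors, surnames):
    """Return unique (author_id, holding_id, initial_letter) rows.

    holding_authors: {holding_id: [author_id, ...]} (already resolved)
    surnames: {author_id: surname}
    """
    rows = set()
    for holding_id, author_ids in holding_authors.items():
        for author_id in author_ids:
            surname = surnames.get(author_id, "").strip()
            if surname:  # authors without a surname get no letter tab
                rows.add((author_id, holding_id, surname[0].upper()))
    return sorted(rows)

def letter_counts(rows):
    """Count the distinct holdings that appear under each initial letter."""
    pairs = {(letter, holding_id) for _author_id, holding_id, letter in rows}
    return dict(Counter(letter for letter, _holding_id in pairs))
```

Holdings with no authors produce no rows at all, which is why such books never appear in the ‘by author’ browse.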
However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes / additions to authors made in the CMS will not be immediately reflected in the library books tab. This is also true of the ‘browse books by title’ lists I previously created. I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised it’s because the title must have been changed in the CMS since I last generated the cached data.
The ’most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed. Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view. I did consider whether to have tabs allowing you to view all of the books by number of borrowings, but I wasn’t really sure how useful this would be. In terms of the display of the ‘top 100’ the books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see. I’ve also added in a number to the top-left of the book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:
I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot). Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.
Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023. As it’s the first ‘in person’ DH conference to be held in Europe since 2019 I suspect there will be a huge number of paper submissions, so we’ll just need to see if it gets accepted or not. Also for Speak For Yersel I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas. The biggest issue here would be generating the data about the areas: settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files. It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region. This might be a couple of weeks of effort for each area. It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface. We’ll see how this develops. I also wrote a script to export the survey data for further analysis.
Another project I spent some time on this week was Speech Star. For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me. I imported all of the data into the same database as is used for the non-disordered speech database and have added a flag that decides which content is displayed in which page. I removed ‘accent’ as a filter option (as all speakers are from the same area) and have added in ‘error type’. Currently the ‘age’ filter defaults to the age group 0-17 as I wasn’t sure how this filter should work, as all speakers are children.
The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab. In the new page these tabs are for ‘Error type’ and ‘word’. I also added in ‘phonemic target’ and ‘phonetic production’ as new columns in the table as I thought it would be useful to include these, and I updated the video pop-up for both the new page and the non-disordered page to bring it into line with the popup for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below. I’ve ensured this is exactly the same for the ‘multiple video’ display too. At the moment the metadata all just appears on one long line (other than speaker ID, sex and age) so the full width of the popup is used, but we might change this to a two-column layout.
Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet and I therefore replaced the data. However, I spotted something relating to the way I structure the data that might be an issue. I’d noticed a typo in the earlier spreadsheet (there is a ‘helicopter’ and a ‘helecopter’) and I fixed it, but I forgot to fix it before uploading the newer file. Each prompt is only stored once in the database, even if it is used by multiple speakers so I was going to go into the database and remove the ‘helecopter’ prompt row that didn’t need to be generated and point the speaker to the existing ‘helicopter’ prompt. However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’. I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database and here the sound is indeed ‘l’. It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with. I’m going to have to update the structure next week.
Also this week I responded to a request for advice from David Featherstone in Geography who is putting together some sort of digitisation project. I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary. She wondered whether I had the original XML and I explained that there was no original XML. The original data was stored in an ancient Foxpro database that ran from a CD. When I created the original School Dictionary app I managed to find a way to extract the data and I saved it as two CSV files – one English-Scots the other Scots-English. I then ran a script to convert this into JSON which is what the original app uses. I gave Pauline a link to download all of the data for the app, including both English and Scots JSON files and the sound files and I also uploaded the English CSV file in case this would be more useful.
That’s all for this week. Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.
Week Beginning 10th October 2022
I spent quite a bit of time finishing things off for the Speak For Yersel project. I created a stats page for the project team to access. The page allows you to specify a ‘from’ and ‘to’ date (it defaults to showing stats from the end of May to the end of the current day). If you want a specific day you can enter the same date in ‘from’ and ‘to’ (e.g. ‘2022-10-04’ will display stats for everyone who registered on the Tuesday after the launch).
The stats relate to users registered in the selected period rather than answers submitted in the selected period. If a person registered in the selected period then all of their answers are included in the figures, whether they were submitted in the period or not. If a person registered outside of the selected period but submitted answers during the selected period these are not included.
The stats display the total number of users registered in the selected period, split into the number who chose a location in Scotland and those who selected elsewhere. Then the total number of survey answers submitted by these two groups are shown, divided into separate sections for the five surveys. I might need to update the page to add more in at a later date. For example, one thing that isn’t shown is the number of people who completed each survey as opposed to only answering a few questions. Also, I haven’t included stats about the quizzes or activities yet, but these could be added.
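The registration-based rule can be summarised in a short Python sketch. The data structures are assumptions made for illustration (the real page queries the project database): users are selected by registration date, and every answer from a selected user counts regardless of when it was submitted.

```python
from collections import Counter
from datetime import date

def survey_stats(users, answers, start, end):
    """Count users and answers for users registered between start and end.

    users: {user_id: (registration_date, in_scotland)}
    answers: list of (user_id, survey, submitted_date) tuples
    """
    groups = {}
    for user_id, (registered, in_scotland) in users.items():
        if start <= registered <= end:
            groups[user_id] = "scotland" if in_scotland else "elsewhere"
    user_counts = Counter(groups.values())
    # The submission date is deliberately ignored: only registration matters
    answer_counts = Counter(
        (groups[user_id], survey)
        for user_id, survey, _submitted in answers
        if user_id in groups
    )
    return user_counts, answer_counts
```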
I also worked on an abstract about the project for the Digital Humanities 2023 conference. In preparation for this I extracted all of the text relating to the project from this blog as a record of the development of the project. It’s more than 21,000 words long and covers everything from our first team discussions about potential approaches in September last year through to the launch of the site last week. I then went through this and pulled out some of the more interesting sections relating to the generation of the maps, the handling of user submissions and the automatic generation of quiz answers based on submitted data. I sent this to Jennifer for feedback and then wrote a second version. Hopefully it will be accepted for the conference, but even if it’s not I’ll hopefully be able to go as the DH conference is always useful to attend.
Also this week I attended a talk about a lemmatiser for Anglo-Norman that some researchers in France have developed using the Anglo-Norman dictionary. It was a very interesting talk and included a demonstration of the corpus that had been constructed using the tool. I’m probably going to be working with the team at some point later on, sending them some data from the underlying XML files of the Anglo-Norman Dictionary.
I also replaced the Seeing Speech videos with a new set that Eleanor Lawson had generated, which were mirrored to match the videos we’re producing for the Speech Star project, and investigated how I will get to Zurich for a thesaurus related workshop in January.
I spent the rest of the week working on the Books and Borrowing project, working on the ‘books’ tab in the library page. I’d started on the API endpoint for this last week, which returned all books for a library and then processed them. This was required as books have two title fields (standardised and original title), either one of which may be blank, so in order to order the books by title the records first need to be returned to see which ‘title’ field to use. Also ordering by number of borrowings and by author requires all books to be returned and processed. This works fine for smaller libraries (e.g. Chambers has 961 books) but returning all books for a large library like St Andrews that has more than 8,500 books was taking a long time, and resulting in a JSON file that was over 6MB in size.
I created an initial version of the ‘books’ page using this full dataset, with tabs across the top for each initial letter of the title (browsing by author and number of borrowings is still to do) and a count of the number of books in each tab also displayed. Book records are then displayed in a similar manner to how they appear in the ‘page’ view, but with some additional data, namely total counts of the number of borrowings for the book holding record and counts of borrowings of individual items (if applicable). These will eventually be linked to the search.
The page looked pretty good and worked pretty well, but was very inefficient as the full JSON file needed to be generated and passed to the browser every time a new letter was selected. Instead I updated the underlying database to add two new fields to the book holding table. The first stores the initial letter of the title (standardised if present, original if not) and the second stores a count of the total number of borrowings for the holding record. I wrote a couple of scripts to add this data in, and these will need to be run periodically to refresh these cached fields as they do not otherwise get updated when changes are made in the CMS. Having these fields in place means the scripts will be able to pinpoint and return subsets of the books in the library at the database query level rather than returning all data and then subsequently processing it. This makes things much more efficient as less data is being processed at any one time.
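Here is a minimal sketch of the cached-field approach, using Python with SQLite purely for illustration. The table and column names (holding, borrowing, initial_letter, total_borrowings and so on) are hypothetical, not the project’s real schema.

```python
import sqlite3

def initial_letter(standardised, original):
    """Use the standardised title if present, otherwise the original one."""
    title = (standardised or "").strip() or (original or "").strip()
    return title[0].upper() if title else ""

def refresh_cache(conn):
    """Recalculate the two cached fields for every holding record."""
    rows = conn.execute(
        "SELECT id, standardised_title, original_title FROM holding"
    ).fetchall()
    for holding_id, standardised, original in rows:
        total = conn.execute(
            "SELECT COUNT(*) FROM borrowing WHERE holding_id = ?", (holding_id,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE holding SET initial_letter = ?, total_borrowings = ? WHERE id = ?",
            (initial_letter(standardised, original), total, holding_id),
        )
    conn.commit()

def books_for_letter(conn, library_id, letter):
    """Select only the holdings for one letter at the query level."""
    return conn.execute(
        "SELECT id FROM holding WHERE library_id = ? AND initial_letter = ? ORDER BY id",
        (library_id, letter),
    ).fetchall()
```

Because the letter is stored on the row, the query only ever touches the subset of books that will actually be displayed, at the cost of needing to rerun refresh_cache after edits.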
I still need to add in facilities to browse the books by initial letter of the author’s surname and also facilities to list books by the number of borrowings, but for now you can at least browse books alphabetically by title. Unfortunately for large libraries there is still a lot of data to process even when only dealing with specific initial letters. For example, there are 1063 books beginning with ‘T’ in St Andrews so the returned data still takes quite a few seconds to load in.
That’s all for this week. I’ll be on holiday next week so there won’t be a further report until the week after that.
Week Beginning 3rd October 2022
The Speak For Yersel project launched this week and is now available to use here: https://speakforyersel.ac.uk/. It’s been a pretty intense project to work on and has required much more work than I’d expected, but I’m very happy with the end result. We didn’t get as much media attention as we were hoping for, but social media worked out very well for the project and in the space of a week we’d had more than 5,000 registered users completing thousands of survey questions. I spent some time this week tweaking things after the launch. For example, I hadn’t added the metadata tags required by Twitter and Facebook / WhatsApp to nicely format links to the website (for example the information detailed here https://developers.facebook.com/docs/sharing/webmasters/) and it took a bit of time to add these in with the correct content.
I also gave some advice to Anja Kuschmann at Strathclyde about applying for a domain for the new VARICS project I’m involved with and investigated a replacement batch of videos that Eleanor had created for the Seeing Speech website. I’ll need to wait until she gets back to me with files that match the filenames used on the existing site before I can take this further, though. I also fixed an issue with the Berwickshire place-names website, which had lost its additional CSS, and investigated a problem with the domain for the Uist Saints website that has still unfortunately not been resolved.
Other than these tasks I spent the rest of the week continuing to develop the front-end for the Books and Borrowing project. I completed an initial version of the ‘page’ view, including all three views (image, text and image and text). I added in a ‘jump to page’ feature, allowing you (as you might expect) to jump directly to any page in the register when viewing a page. I also completed the ‘text’ view of the page, which now features all of the publicly accessible data relating to the records – borrowing records, borrowers, book holding and item records and any associated book editions and book works, plus associated authors. There’s an awful lot of data and it took quite a lot of time to think about how best to lay it all out (especially taking into consideration screens of different sizes), but I’m pretty happy with how this first version looks.
Currently the first thing you see for a record is the transcribed text, which is big and green. Then all fields relating to the borrowing appear under this. The record number as it appears on the page plus the record’s unique ID are displayed in the top right for reference (and citation). Then follows a section about the borrower, with the borrower’s name in green (I’ve used this green to make all of the most important bits of text stand out from the rest of the record but the colour may be changed in future). Then follows the information about the book holding and any specific volumes that were borrowed. If there is an associated site-wide book edition record (or records) these appear in a dark grey box, together with any associated book work record (although there aren’t many of these associations yet). If there is a link to a library record this appears as a button on the right of the record. Similarly, if there’s an ESTC and / or other authority link for the edition these appear to the right of the edition section.
Authors now cascade down through the data as we initially planned. If there’s an author associated with a work it is automatically associated with and displayed alongside the edition and holding. If there’s an author associated with an edition but not a work it is then associated with the holding. If a book at a specific level has an author specified then this replaces any cascading author from this point downwards in the sequence. One thing that isn’t in place yet is the set of links from information to search results, as I haven’t developed the search yet. But eventually things like borrower name, author, book title etc will be links allowing you to search directly for the items.
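The cascading rule can be expressed in a few lines of Python. This is a simplified sketch rather than the code used on the site: it assumes a single chain of work, edition, holding and item, with the authors at each level supplied as a plain list.

```python
LEVELS = ("work", "edition", "holding", "item")

def cascade_authors(authors_by_level):
    """Return the effective authors at each level of a book record.

    authors_by_level: {level: [author, ...]} for the levels that have
    authors attached directly. Authors set at one level replace anything
    inherited from above and then carry on down the sequence.
    """
    effective = {}
    inherited = []
    for level in LEVELS:
        own = authors_by_level.get(level) or []
        if own:
            inherited = list(own)  # override from this point downwards
        effective[level] = list(inherited)
    return effective
```

For example, an author attached only to the work ends up on every level, while an author attached to the holding replaces the work’s author for the holding and its items but leaves the edition untouched.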
One other thing I’ve added in is the option to highlight a record. Press anywhere in a record and it is highlighted in yellow. Press again to reset it. This can be quite useful as you’re scrolling through a page with lots of records on if there are certain records you’re interested in. You can highlight as many records as you want. It’s possible that we may add other functionality to this, e.g. the option to download the data for selected records. Here’s a screenshot of the text view of the page:
I also completed the ‘image and text’ view. This works best on a large screen (i.e. not a mobile phone, although it is just about possible to use it on one, as I did test this out). The image takes up about 60% of the screen width and the text takes up the remaining 40%. The height of the records section is fixed to the height of the image area and is scrollable, so you can scroll down the records whilst still viewing the image (rather than the whole page scrolling and the image disappearing off the screen). I think this view works really well and the records are still perfectly usable in the more confined area and it’s great to be able to compare the image and the text side by side. Here’s a screenshot of the same page when viewing both text and image:
I tested the new interface out with registers from all of our available libraries and everything is looking good to me. Some registers don’t have images yet, so I added in a check for this to ensure that the image views and page thumbnails don’t appear for such registers. After that I moved onto developing the interface to browse book holdings when viewing a library. I created an API endpoint for returning all of the data associated with holding records for a specified library. This includes all of the book holding data, information about each of the book items associated with the holding record (including the number of borrowing records for each), the total number of borrowing records for the holding, any associated book edition and book work records (and there may be multiple editions associated with each holding) plus any authors associated with the book. Authors cascade down through the record as they do when viewing borrowing records in the page. This is a gigantic amount of information, especially as libraries may have many thousands of book holding records. The API call loads pretty rapidly for smaller libraries (e.g. Chambers Library with 961 book holding records) but for larger ones (e.g. St Andrews with over 8,500 book holding records) the API call takes too long to return the data (in the latter case it takes about a minute and returns a JSON file that’s over 6MB in size). The problem is the data needs to be returned in full in order to do things like order it by largest number of borrowings. Clearly dynamically generating the data each time is going to be too slow so instead I am going to investigate caching the data. For example, that 6MB JSON file can just sit there as an actual file rather than being generated each time. I will then write a script to regenerate the cached files and I can run this whenever data gets updated (or maybe once a week whilst the project is still active). I’ll continue to work on this next week.
Week Beginning 26th September 2022
I spent most of my time this week getting back into the development of the front-end for the Books and Borrowing project. It’s been a long time since I was able to work on this due to commitments to other projects and also due to there being a lot more for me to do than I was expecting regarding processing images and generating associated data in the project’s content management system over the summer. However, I have been able to get back into the development of the front-end this week and managed to make some pretty good progress. The first thing I did was to make some changes to the ‘libraries’ page based on feedback I received ages ago from the project’s Co-I Matt Sangster. The map of libraries used clustering to group libraries that are close together when the map is zoomed out, but Matt didn’t like this. I therefore removed the clusters and turned the library locations back into regular individual markers. However, it is now rather difficult to distinguish the markers for a number of libraries. For example, the markers for Glasgow and the Hunterian libraries (back when the University was still on the High Street) are on top of each other and you have to zoom in a very long way before you can even tell there are two markers there.
I also updated the tabular view of libraries. Previously the library name was a button that when clicked on opened the library’s page. Now the name is text and there are two buttons underneath. The first one opens the library page while the second pans and zooms the map to the selected library, whilst also scrolling the page to the top of the map. This uses Leaflet’s ‘flyTo’ function which works pretty well, although the map tiles don’t quite load in fast enough for the automatic ‘zoom out, pan and zoom in’ to proceed as smoothly as it ought to.
After that I moved onto the library page, which previously just displayed the map and the library name. I updated the tabs for the various sections to display the number of registers, books and borrowers that are associated with the library. The Introduction page also now features the information recorded about the library that has been entered into the CMS. This includes location information, dates, links to the library etc. Beneath the summary info there is the map, and beneath this is a bar chart showing the number of borrowings per year at the library. Beneath the bar chart you can find the longer textual fields about the library such as descriptions and sources. Here’s a screenshot of the page for St Andrews:
I also worked on the ‘Registers’ tab, which now displays a tabular list of the selected library’s registers, and I also ensured that when you select one of the tabs other than ‘Introduction’ the page automatically scrolls down to the top of the tabs to avoid the need to manually scroll past the header image (but we still may make this narrower eventually). The tabular list of registers can be ordered by any of the columns and includes data on the number of pages, borrowers, books and borrowing records featured in each.
When you open a register the information about it is displayed (e.g. descriptions, dates, stats about the number of books etc referenced in the register) and large thumbnails of each page together with page numbers and the number of records on each page are displayed. The thumbnails are rather large and I could make them smaller, but doing so would mean that all the pages end up looking the same – beige rectangles. The thumbnails are generated on the fly by the IIIF server and the first time a register is loaded it can take a while for the thumbnails to load in. However, generated thumbnails are then cached on the server so subsequent page loads are a lot quicker. Here’s a screenshot of a register page for St Andrews:
One thing I also did was write a script to add in a new ‘pageorder’ field to the ‘page’ database table. I then wrote a script that generated the page order for every page in every register in the system. This picks out the page that has no preceding page and iterates through pages based on the ‘next page’ ID. Previously pages in lists were ordered by their auto-incrementing ID, but this meant that if new pages needed to be inserted for a register they ended up stuck at the end of the list, even though the ‘next’ and ‘previous’ links worked successfully. This new ‘pageorder’ field ensures lists of pages are displayed in the proper order. I’ve updated the CMS to ensure this new field is used when viewing a register, although I haven’t as of yet updated the CMS to regenerate the ‘pageorder’ for a register if new pages are added out of sequence. For now if this happens I’ll need to manually run my script again to update things.
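The ordering logic amounts to walking a linked list. Here is a minimal Python sketch of the idea for a single register, assuming each page is represented by a dict holding its previous and next page IDs (the structure and key names are invented for the example):

```python
def compute_page_order(pages):
    """Work out a 1-based page order by following 'next page' links.

    pages: {page_id: {"prev": page_id or None, "next": page_id or None}}
    Returns {page_id: pageorder}.
    """
    starts = [page_id for page_id, page in pages.items() if page["prev"] is None]
    if len(starts) != 1:
        raise ValueError("expected exactly one page with no preceding page")
    order = {}
    current = starts[0]
    position = 1
    while current is not None:
        if current in order:
            raise ValueError("the 'next page' links form a loop")
        order[current] = position
        position += 1
        current = pages[current]["next"]
    return order
```

A page inserted later gets a high auto-incrementing ID but still lands in the right place, because only the links decide the order.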
Anyway, back to the front-end: The new ‘pageorder’ is used in the list of pages mentioned above so the thumbnails get displayed in the correct order. I may add pagination to this page, as all of the thumbnails are currently on one page and it can take a while to load, although these days people seem to prefer having long pages rather than having data split over multiple pages.
The final section I worked on was the page for viewing an actual page of the register, and this is still very much in progress. You can open a register page by pressing on its thumbnail and currently you can navigate through the register using the ‘next’ and ‘previous’ buttons or return to the list of pages. I still need to add in a ‘jump to page’ feature here too. As discussed in the requirements document, there will be three views of the page: Text, Image and Text and Image side-by-side. Currently I have implemented the image view only. Pressing on the ‘Image view’ tab opens a zoomable / pannable interface through which the image of the register page can be viewed. You can also make this interface full screen by pressing on the button in the top right. Also, if you’re viewing the image and you use the ‘next’ and ‘previous’ navigation links you will stay on the ‘image’ tab when other pages load. Here’s a screenshot of the ‘image view’ of the page:
Also this week I wrote a three-page requirements document for the redevelopment of the front-ends for the various place-names projects I’ve created using the system originally developed for the Berwickshire place-names project which launched back in 2018. The requirements document proposes some major changes to the front-end, moving to an interface that operates almost entirely within the map and enabling users to search and browse all data from within the map view rather than having to navigate to other pages. I sent the document off to Thomas Clancy, for whom I’m currently developing the systems for two place-names projects (Ayr and Iona) and I’ll just need to wait to hear back from him before I take things further.
I also responded to a query from Marc Alexander about the number of categories in the Thesaurus of Old English, investigated a couple of server issues that were affecting the Glasgow Medical Humanities site, removed all existing place-name elements from the Iona place-names CMS so that the team can start afresh and responded to a query from Eleanor Lawson about the filenames of video files on the Seeing Speech site. I also made some further tweaks to the Speak For Yersel resource ahead of its launch next week. This included adding survey numbers to the survey page and updating the navigation links and writing a script that purges a user and all related data from the system. I ran this to remove all of my test data from the system. If we do need to delete a user in future (either because their data is clearly spam or a malicious attempt to skew the results, or because a user has asked us to remove their data) I can run this script again. I also ran through every single activity on the site to check everything was working correctly. The only thing I noticed is that I hadn’t updated the script to remove the flags for completed surveys when a user logs out, meaning after logging out and creating a new user the ticks for completed surveys were still displaying. I fixed this.
I also fixed a few issues with the Burns mini-site about Kozeluch, including updating the table sort options which had stopped working correctly when I added a new column to the table last week and fixing some typos with the introductory text. I also had a chat with the editor of the Anglo-Norman Dictionary about future developments and responded to a query from Ann Ferguson about the DSL bibliographies. Next week I will continue with the B&B developments.