I had a very busy week this week, working on several different projects. For the Books and Borrowing project I participated in the team Zoom call on Monday to discuss the upcoming development of the front-end and API for the project, which will include many different search and browse facilities, graphs and visualisations. I followed this up with a lengthy email to the PI and Co-I where I listed some previous work I’ve done and discussed some visualisation libraries we could use. In the coming weeks I’ll need to work with them to write a requirements document for the front-end. I also downloaded images from Orkney library, uploaded all of them to the server and generated the necessary register and page records. One register with 7 pages already existed in the system and I ensured that page images were associated with these and the remaining pages of the register fit in with the existing ones. I also processed the Wigtown data that Gerry McKeever had been working on, splitting the data associated with one register into two distinct registers, uploading page images and generating the necessary page records. This was a pretty complicated process, and I still need to complete the work on it next week, as there are several borrowing records listed as separate rows when in actual fact they are merely another volume of the same book borrowed at the same time. These records will need to be amalgamated.
For the Speak For Yersel project I had a meeting with the PI and RA on Monday to discuss updates to the interface I’ve been working on, new data for the ‘click’ exercise and a new type of exercise that will precede the ‘click’ exercise and will involve users listening to sound clips then dragging and dropping them onto areas of a map to see whether they can guess where the speaker is from. I spent some time later in the week making all of the required changes to the interface and the grammar exercise, including updating the style used for the interactive map and using different marker colours.
I also continued to work on the speech database for the Speech Star project based on feedback I received about the first version I completed last week. I added in some new introductory text and changed the order of the filter options. I also made the filter option section hidden by default as it takes up quite a lot of space, especially on narrow screens. There’s now a button to show / hide the filters, with the section sliding down or up. If a filter option is selected the section remains visible by default. I also changed the colour of the filter option section to a grey with a subtle gradient (it gets lighter towards the right) and added a similar gradient to the header, just to see how it looks.
The biggest update was to the filter options, which I overhauled so that instead of a drop-down list where one option in each filter type can be selected there are checkboxes for each filter option, allowing multiple items of any type to be selected. This was a fairly large change to implement as the way selected options are passed to the script and the way the database is queried needed to be completely changed. When an option is selected the page immediately reloads to display the results of the selection and this can also change the contents of the other filter option boxes – e.g. selecting ‘alveolar’ limits the options in the ‘sound’ section. I also removed the ‘All’ option and left all checkboxes unselected by default. This is how filters on clothes shopping sites do it – ‘all’ is the default and a limit is only applied if an option is ticked.
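The core of this change can be sketched as follows: each filter type now submits a list of ticked values rather than a single drop-down choice, and the query is assembled from `IN` clauses, with unticked filter types imposing no limit. The table and column names here are illustrative, not the project's actual schema.

```python
def build_query(filters):
    """Build a parameterised SQL query from multi-select checkbox filters.

    filters: dict mapping column name -> list of ticked values.
    An empty list means 'no limit' for that filter type, mirroring
    the removal of the 'All' option.
    """
    sql = "SELECT * FROM videos"
    clauses, params = [], []
    for column, values in filters.items():
        if values:  # unticked filter types are simply skipped
            placeholders = ", ".join(["?"] * len(values))
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(values)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

For example, ticking 'alveolar' and 'velar' under articulation while leaving every 'sound' box unticked would produce `SELECT * FROM videos WHERE articulation IN (?, ?)` with the two values passed separately as parameters.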
I also changed the ‘accent’ labels as requested, changed the ‘By Prompt’ header to ‘By Word’ and updated the order of items in the ‘position’ filter. I also fixed an issue where ‘cheap’ and ‘choose’ were appearing in a column instead of the real data. Finally, I made the overlay that appears when a video is clicked on darker so it’s more obvious that you can’t click on the buttons. I did investigate whether it was possible to have the popup open while other page elements were still accessible but this is not something that the Bootstrap interface framework that I’m using supports, at least not without a lot of hacking about with its source code. I don’t think it’s worth pursuing this as the popup will cover much of the screen on tablets / phones anyway, and when I add in the option to view multiple videos the popup will be even larger.
Also this week I made some minor tweaks to the Burns mini-project I was working on last week and had a chat with the DSL people about a few items, such as the data import process that we will be going through again in the next month or so and some of the outstanding tasks that I still need to tackle with the DSL’s interface.
I also did some work for the AND this week, investigating a weird timeout error that cropped up on the new server and discussing how best to tackle a major update to the AND’s data. The team have finished working on a major overhaul of the letter S and this is now ready to go live. We have decided that I will ask for a test instance of the AND to be set up so I can work with the new data, testing out how the DMS runs on the new server and how it will cope with such a large update.
The editor, Geert, had also spotted an issue with the textbase search, which didn’t seem to include one of the texts (Fabliaux) he was searching for. I investigated the issue and it looked like the script that extracted words from pages may have silently failed in some cases. There are 12,633 page records in the textbase, each of which has a word count. When the word count is greater than zero my script processes the contents of the page to generate the data for searching. However, there appear to be 1,889 pages in the system that have a word count of zero, including all of Fabliaux. Further investigation revealed that my scripts expect the XML to be structured with the main content in a <body> tag. This cuts out all of the front matter and back matter from the searches, which is what we’d agreed should happen, and thankfully accounts for many of the supposedly ‘blank’ pages listed above, as they’re not part of the actual body of the text.
However, Fabliaux doesn’t include the <body> tag in the standard way. In fact, the XML file consists of multiple individual texts, each of which has a separate <body> tag. As my script didn’t find a <body> in the expected place no content was processed. I ran a script to check the other texts and the following have the same issue: gaunt1372 (710 pages) and polsongs (111 pages), in addition to the 37 pages of Fabliaux. Having identified these I updated the script that generates search words and re-ran it for these texts, fixing the issue.
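A minimal sketch of the fix, assuming plain (non-namespaced) XML for simplicity: rather than looking for a single <body> in a fixed place, the word-extraction step can collect text from every <body> element, however many individual texts the file contains.

```python
import xml.etree.ElementTree as ET

def body_words(xml_string):
    """Collect the words from every <body> element in a textbase file.

    Files such as Fabliaux contain several texts, each with its own
    <body>, so we iterate over all of them rather than assuming one
    <body> in a fixed position. Front and back matter outside <body>
    is still excluded, as agreed.
    """
    root = ET.fromstring(xml_string)
    words = []
    for body in root.iter("body"):  # finds <body> at any depth
        words.extend("".join(body.itertext()).split())
    return words
```

A file with two texts, e.g. `<TEI><front>skip</front><text><body>one two</body></text><text><body>three</body></text></TEI>`, would now yield all three words instead of none.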
Also this week I attended a Zoom-based seminar on ‘Digitally Exhibiting Textual Heritage’ run by Information Studies. This featured four speakers from archives, libraries and museums discussing how digital versions of texts can be exhibited, both in galleries and online. Some really interesting projects were discussed, both past and present, including the BL’s ‘Turning the Pages’ system (http://www.bl.uk/turning-the-pages/) and some really cool transparent LCD display cases (https://crystal-display.com/transparent-displays-and-showcases/) that allow images to be projected onto clear glass while objects behind the panel remain visible. 3D representations of gallery spaces were discussed (e.g. https://www.lib.cam.ac.uk/ghostwords), as were ‘long form narrative scrolls’ such as https://www.nytimes.com/projects/2012/snow-fall/index.html#/?part=tunnel-creek, http://www.wolseymanuscripts.ac.uk/ and https://stories.durham.ac.uk/journeys-prologue/. There is a tool that can be used to create these: https://shorthand.com/. It was a very interesting session!
I divided my time mostly between three projects this week: Speech Star, Speak For Yersel and a Burns mini-project for Kirsteen McCue. For Speech Star I set up the project’s website, based on our mockup number 9 (which still needs work), and completed an initial version of the speech database. As with the Dynamic Dialects accent chart (https://www.dynamicdialects.ac.uk/accent-chart/), there are limiting options and any combination of these can be selected. The page refreshes after each selection is made and the contents of the other drop-down lists vary depending on the option that is selected. As requested, there are 6 limiting options (accent, sex, age range, sound, articulation and position).
I created two ‘views’ of the data that are available in different tabs on the page. The first is ‘By Accent’ which lists all data by region. Within each region there is a table for each speaker with columns for the word that’s spoken and its corresponding sound, articulation and position. Users can press on a column heading to order the table by that column. Pressing again reverses the order. Note that this only affects the current table and not those of other speakers. Users can also press on the button in the ‘Word’ column to open a popup containing the video, which automatically plays. Pressing any part of the browser window outside of the popup closes the popup and stops the video, as does pressing on the ‘X’ icon in the top-right of the popup.
The ‘By Prompt’ tab presents exactly the same data, but arranged by the word that’s spoken rather than by accent. This allows you to quickly access the videos for all speakers if you’re interested in hearing a particular sound. Note that the limit options apply equally to both tabs and are ‘remembered’ if you switch from one tab to the other.
The main reason I created the two-tab layout is to give users the bi-directional access to video clips that the Dynamic Dialects Accent Chart offers without ending up with a table that is far too long for most screens, especially mobile screens. One thing I haven’t included yet is the option to view multiple video clips side by side. I remember this was discussed as a possibility some time ago but I need to discuss this further with the rest of the team to understand how they would like it to function. Below is a screenshot of the database, but note that the interface is still just a mockup and all elements such as the logo, fonts and colours will likely change before the site launches:
For the Speak For Yersel project I also created an initial project website using our latest mockup template and I migrated both sample activities over to the new site. At the moment the ‘Home’ and ‘About’ pages just have some sample blocks of text I’ve taken from SCOSYA. The ‘Activities’ page provides links to the ‘grammar’ and ‘click’ exercises which mostly work in the same way as in the old mockups with a couple of differences that took some time to implement.
Firstly, the ‘grammar’ exercise now features actual interactive maps throughout the various stages. These are the sample maps I created previously that feature large numbers of randomly positioned markers and local authority boundaries. I also added a ‘fullscreen’ option to the bottom-right of each map (the same as SCOSYA) to give people the option of viewing a larger version of the map. Here’s an example of how the grammar exercise now looks:
Also this week I gave some further advice to the students who are migrating the IJOSTS journal, fixed an issue with some data in the Old English Thesaurus for Jane Roberts and responded to an enquiry about the English Language Twitter account.
I continued to work on the Books and Borrowing project for a lot of this week, completing some of the tasks I began last week and working on some others. We ran out of server space for digitised page images last week, and although I freed up some space by deleting a bunch of images that were no longer required we still have a lot of images to come. The team estimates that a further 11,575 images will be required. If the images we receive for these pages are comparable to the ones from the NLS, which average around 1.5Mb each, then 30Gb should give us plenty of space. However, after checking through the images we’ve received from other digitisation units it turns out that the NLS images are a bit of an outlier in terms of file size and generally 8-10Mb is more usual. If we use this as an estimate then we would maybe require 120Gb-130Gb of additional space. I did some experiments with resizing and changing the image quality of one of the larger images, managing to bring an 8.4Mb image down to 2.4Mb while still retaining its legibility. If we apply this approach to the tens of thousands of larger images we have then this would result in a considerable saving of storage. However, Stirling’s IT people very kindly offered to give us a further 150Gb of space for the images so this resampling process shouldn’t be needed for now at least.
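The back-of-envelope arithmetic behind these estimates looks roughly like this (the 120Gb-130Gb figure simply adds some headroom on top of the raw total):

```python
# Rough storage estimates for the remaining page images.
remaining_images = 11_575

def estimate_gb(average_mb):
    """Total size in Gb for the remaining images at a given average size."""
    return remaining_images * average_mb / 1024

low = round(estimate_gb(1.5))   # if every image matched the NLS average
high = round(estimate_gb(10))   # at the top of the more typical 8-10Mb range
```

At the NLS average this comes to around 17Gb, comfortably within 30Gb, while at 10Mb per image the raw total is already over 110Gb before any headroom is added.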
Another task for the project this week was to write a script to renumber the folio numbers for the 14 volumes from the Advocates Library that I noticed had irregular numbering. Each of the 14 volumes had different issues with their handwritten numbering, so I had to tailor my script to each volume in turn, and once the process was complete the folio numbers used to identify page images in the CMS (and eventually in the front-end) entirely matched the handwritten numbers for each volume.
My next task for the project was to import the records for several volumes from the Royal High School of Edinburgh but I ran into a bit of an issue. I had previously been intending to extract the ‘item’ column and create a book holding record and a single book item record for each distinct entry in the column. This would then be associated with all borrowing records in RHS that also feature this exact ‘item’. However, this is going to result in a lot of duplicate holding records due to the contents of the ‘item’ column including information about different volumes of a book and/or sometimes using different spellings.
For example, in SL137142 the book ‘Banier’s Mythology’ appears four times as follows (assuming ‘Banier’ and ‘Bannier’ are the same):
- Banier’s Mythology v. 1, 2
- Banier’s Mythology v. 1, 2
- Bannier’s Myth 4 vols
- Bannier’s Myth. Vol 3 & 4
My script would create one holding and item record for ‘Banier’s Mythology v. 1, 2’ and associate it with the first two borrowing records but the 3rd and 4th items above would end up generating two additional holding / item records which would then be associated with the 3rd and 4th borrowing records.
No script I can write (at least not without a huge amount of work) would be able to figure out that all four of these books are actually the same, or that there are actually 4 volumes for the one book, each requiring its own book item record, and that volumes 1 & 2 need to be associated with borrowing records 1 & 2 while all 4 volumes need to be associated with borrowing record 3 and volumes 3 & 4 need to be associated with borrowing record 4. I did wonder whether I might be able to automatically extract volume data from the ‘item’ column but there is just too much variation.
We’re going to have to tackle the normalisation of book holding names and the generation of all required book items for volumes at some point and this either needs to be done prior to ingest via the spreadsheets or after ingest via the CMS.
My feeling is that it might be simpler to do it via the spreadsheets before I import the data. If we were to do this then the ‘Item’ column would become the ‘original title’ and we’d need two further columns, one for the ‘standardised title’ and one listing the volumes, consisting of the number of each volume separated by commas. With the above examples we would end up with the following (with a | representing a column division):
- Banier’s Mythology v. 1, 2 | Banier’s Mythology | 1,2
- Banier’s Mythology v. 1, 2 | Banier’s Mythology | 1,2
- Bannier’s Myth 4 vols | Banier’s Mythology | 1,2,3,4
- Bannier’s Myth. Vol 3 & 4 | Banier’s Mythology | 3,4
If each sheet of the spreadsheet is ordered alphabetically by the ‘item’ column it might not take too long to add in this information. The additional fields could also be omitted where the ‘item’ column has no volumes or different spellings. E.g. ‘Hederici Lexicon’ may be fine as it is. If the ‘standardised title’ and ‘volumes’ columns are left blank in this case then when my script reaches the record it will know to use ‘Hederici Lexicon’ as both original and standardised titles and to generate one single unnumbered book item record for it. We agreed that normalising the data prior to ingest would be the best approach and I will therefore wait until I receive updated data before I proceed further with this.
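As a sketch of how the import script could interpret the extra columns (the function name and column handling are my own illustration of the plan, not the finished script): blank ‘standardised title’ and ‘volumes’ cells fall back to the ‘item’ value and a single unnumbered book item.

```python
def holding_from_row(item, standardised="", volumes=""):
    """Derive the standardised title and book-item volumes from one row.

    Blank extra columns mean the 'item' value is fine as it is and a
    single unnumbered book item should be generated - e.g. the
    'Hederici Lexicon' case. A volumes cell such as '1,2,3,4' yields
    one numbered book item per volume.
    """
    title = standardised.strip() or item.strip()
    if volumes.strip():
        vols = [int(v) for v in volumes.split(",")]
    else:
        vols = [None]  # one single unnumbered book item
    return title, vols
```

So the fourth example row would yield the normalised title ‘Banier’s Mythology’ with book items for volumes 3 and 4, while a row with both extra columns blank passes through unchanged.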
Also this week I generated a new version of a spreadsheet containing the records for one register for Gerry McKeever, who wanted borrowers, book items and book holding details to be included in addition to the main borrowing record. I also made a pretty major update to the CMS to enable books and borrower listings for a library to be filtered by year of borrowing in addition to filtering by register. Users can either limit the data by register or year (not both). They need to ensure the register drop-down is empty for the year filter to work, otherwise the selected register will be used as the filter. On either the ‘books’ or ‘borrowers’ tab in the year box they can add either a single year (e.g. 1774) or a range (e.g. 1770-1779). Then when ‘Go’ is pressed the data displayed is limited to the year or years entered. This also includes the figures in the ‘borrowing records’ and ‘Total borrowed items’ columns. Also, the borrowing records listed when a related pop-up is opened will only feature those in the selected years.
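Since the year box accepts either form, the filter essentially boils down to parsing the input into an inclusive range, roughly as follows (a simplified sketch, not the CMS code itself):

```python
def parse_year_filter(text):
    """Parse the year box: a single year ('1774') or a range ('1770-1779').

    Returns an inclusive (start, end) tuple, or None when the box is
    empty, in which case the register filter applies instead.
    """
    text = text.strip()
    if not text:
        return None
    if "-" in text:
        start, end = text.split("-", 1)
        return int(start), int(end)
    year = int(text)
    return year, year
```

A single year simply becomes a one-year range, so the borrowing-record counts and pop-up listings can be limited with the same start/end comparison in both cases.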
I also worked with Raymond in Arts IT Support and Geert, the editor of the Anglo-Norman Dictionary, to complete the process of migrating the AND website to the new server. The website (https://anglo-norman.net/) is now hosted on the new server and is considerably faster than it was previously. We also took the opportunity to launch the Anglo-Norman Textbase, which I had developed extensively a few months ago. Searching and browsing can be found here: https://anglo-norman.net/textbase/ and this marks the final major item in my overhaul of the AND resource.
My last major task of the week was to start work on a database of ultrasound video files for the Speech Star project. I received a spreadsheet of metadata and the video files from Eleanor this week and began processing everything. I wrote a script to export the metadata into a three-table related database (speakers, prompts and individual videos of speakers saying the prompts) and began work on the front-end through which this database and the associated video files will be accessed. I’ll be continuing with this next week.
In addition to the above I also gave some advice to the students who are migrating the IJOSTS journal over to WordPress, had a chat with the DSL people about when we’ll make the switch to the new API and data, set up a WordPress site for Joanna Kopaczyk for the International Conference on Middle English, upgraded all of the WordPress sites I manage to the latest version of WordPress, made a few tweaks to the 17th Century Symposium website for Roslyn Potter, spoke to Kate Simpson in Information Studies about speaking to her Digital Humanities students about what I do and arranged for server space to be set up for the Speak For Yersel project website and the Speech Star project website. I also helped launch the new Burns website: https://burnsc21-letters-poems.glasgow.ac.uk/ and updated the existing Burns website to link into it via new top-level tabs. So a pretty busy week!
This was my first week back after the Christmas holidays, and it was a three-day week. I spent the days almost exclusively on the Books and Borrowing project. We had received a further batch of images for 23 library registers from the NLS, which I needed to download from the NLS’s server and process. This involved renaming many thousands of images via a little script I’d written in order to give the images more meaningful filenames and stripping out several thousand images of blank pages that had been included but are not needed by the project. I then needed to upload the images to the project’s web server and generate all of the necessary register and page records in the CMS for each page image.
I also needed to update the way folio numbers were generated for the registers. For the previous batch of images from the NLS I had just assigned the numerical part of the image’s filename as the folio number, but it turns out that most of the images have a hand-written page number in the top-right which starts at 1 for the first actual page of borrowing records. There are usually a few pages before this, and these need to be given Roman numerals as folio numbers. I therefore had to write another script that would take into consideration the number of front-matter pages in each register, assign Roman numerals as folio numbers to them and then begin the numbering of borrowing record pages from 1 after that, incrementing through the rest of the volume.
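The numbering scheme can be sketched like this (the lower-case numerals and function names are my own illustration, not necessarily what the CMS stores):

```python
def to_roman(n):
    """Lower-case Roman numeral for a front-matter folio number."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, numeral in numerals:
        while n >= value:
            out += numeral
            n -= value
    return out

def folio_numbers(front_matter_pages, total_pages):
    """Roman numerals for the front matter, then 1, 2, 3... for the
    pages of borrowing records, incrementing to the end of the volume."""
    labels = [to_roman(i) for i in range(1, front_matter_pages + 1)]
    labels += [str(i) for i in range(1, total_pages - front_matter_pages + 1)]
    return labels
```

A register with three front-matter pages would therefore be numbered i, ii, iii, 1, 2, 3 and so on.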
I guess it was inevitable with data of this sort, but I ran into some difficulties whilst processing it. Firstly, there were some problems with the Jpeg images the NLS had sent for two of the volumes. These didn’t match the Tiff images for the volumes, with each volume having an incorrect number of files. Thankfully the NLS were able to quickly figure out what had gone wrong and were able to supply updated images.
The next issue to crop up occurred when I began to upload the images to the server. After uploading about 5Gb of images the upload terminated, and soon after that I received emails from the project team saying they were unable to log into the CMS. It turns out that the server had run out of storage. Each time someone logs into the CMS the server needs a tiny amount of space to store a session variable, but there wasn’t enough space to store this, meaning it was impossible to log in successfully. I emailed the IT people at Stirling (where the project server is located) to enquire about getting some further space allocated but I haven’t heard anything back yet. In the meantime I deleted the images from the partially uploaded volume which freed up enough space to enable the CMS to function again. I also figured out a way to free up some further space: the first batch of images from the NLS also included images of blank pages across 13 volumes – several thousand images. It was only after uploading these and generating page records that we had decided to remove the blank pages, but I only removed the CMS records for these pages – the image files were still stored on the server. I therefore wrote another script to identify and delete all of the blank page images from the first batch that was uploaded, which freed up 4-5Gb of space from the server, which was enough to complete the upload of the second batch of registers from the NLS. We will still need more space, though, as there are still many thousands of images left to add.
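The clean-up essentially boils down to a set difference between the image files on the server and the page records still in the CMS. This sketch assumes images are matched to page records by filename, which is a simplification of the real set-up:

```python
def orphaned_images(files_on_disk, pages_in_cms):
    """Return filenames present on the server that no longer match any
    page record in the CMS - e.g. blank-page images whose CMS records
    were deleted after the first upload. These are safe to remove."""
    return sorted(set(files_on_disk) - set(pages_in_cms))
```

Running this per volume gives a list of deletable blank-page images, which is how several Gb of space can be reclaimed without touching any image that a page record still points at.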
I also took the opportunity to update the folio numbers of the first batch of NLS registers to bring them into line with the updated method we’d decided on for the second batch (Roman numerals for front-matter and then incrementing page numbers from the first page of borrowing records). I wrote a script to renumber all of the required volumes, which was mostly a success.
However, I also noticed that the automatically generated folio numbers often became out of step with the hand-written folio numbers found in the top-right corner of the images. I decided to go through each of the volumes to identify all that became unaligned and to pinpoint exactly which page or pages the misalignment occurred on. This took some time as there were 32 volumes that needed to be checked, and each time an issue was spotted I needed to look back through the pages and associated images from the last page until I found the point where the page numbers correctly aligned. I discovered that there were numbering issues with 14 of the 32 volumes, mainly due to whoever wrote the numbers in getting muddled. There are occasions where a number is missed, or a number is repeated. In one volume the page numbers advance by 100 from one page to the next. It should be possible for me to write a script that will update the folio numbers to bring them into alignment with the erroneous handwritten numbers (for example where a number is repeated these will be given ‘a’ and ‘b’ suffixes). I didn’t have time to write the script this week but will do so next week.
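The planned renumbering can be sketched as follows, with repeated handwritten numbers picking up ‘a’, ‘b’, … suffixes so that each folio number stays unique while still matching what is written on the page (a simplified model of the real volumes):

```python
from collections import Counter
from string import ascii_lowercase

def align_folios(handwritten):
    """Turn a list of handwritten page numbers into unique folio numbers.

    Numbers that appear only once pass through unchanged; repeated
    numbers get 'a', 'b', ... suffixes in order of appearance. Missed
    or jumped numbers are left exactly as written.
    """
    seen = Counter()
    labels = []
    for number in handwritten:
        seen[number] += 1
        if handwritten.count(number) == 1:
            labels.append(str(number))
        else:
            labels.append(f"{number}{ascii_lowercase[seen[number] - 1]}")
    return labels
```

So a run of handwritten numbers 5, 6, 6, 7 becomes the folio numbers 5, 6a, 6b, 7, keeping the CMS unique while faithful to the volume.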
Also for the project this week I looked through the spreadsheet of borrowing records from the Royal High School of Edinburgh that one of the RAs has been preparing. I had a couple of questions about the spreadsheet, and I’m hoping to be able to process it next week. I also exported the records from one register for Gerry McKeever to work on, as these records now need to be split across two volumes rather than one.
Also this week I had an email conversation with Marc Alexander about a few issues, during which he noted that the Historical Thesaurus website was offline. Further investigation revealed that the entire server was offline, meaning several other websites were down too. I asked Arts IT Support to look into this, which took a little time as it was a physical issue with the hardware and they were all still working remotely. However, the following day they were able to investigate and address the issue, which they reckon was caused by a faulty network port.