I continued to work on the Books and Borrowing project for a lot of this week, completing some of the tasks I began last week and working on some others. We ran out of server space for digitised page images last week, and although I freed up some space by deleting a bunch of images that were no longer required we still have a lot of images to come. The team estimates that a further 11,575 images will be required. If the images we receive for these pages are comparable to the ones from the NLS, which average around 1.5MB each, then 30GB should give us plenty of space. However, after checking through the images we’ve received from other digitisation units it turns out that the NLS images are a bit of an outlier in terms of file size and generally 8-10MB is more usual. If we use this as an estimate then we would maybe require 120GB-130GB of additional space. I did some experiments with resizing and changing the image quality of one of the larger images, managing to bring an 8.4MB image down to 2.4MB while still retaining its legibility. If we apply this approach to the tens of thousands of larger images we have then this would result in a considerable saving of storage. However, Stirling’s IT people very kindly offered to give us a further 150GB of space for the images so this resampling process shouldn’t be needed for now at least.
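The arithmetic behind those estimates, for anyone curious, is straightforward (a quick sketch using the image count and average sizes mentioned above):

```python
# Rough storage estimates for the remaining page images, using the
# team's estimate of 11,575 further images and the average file sizes
# observed so far (1.5MB for NLS images, 8-10MB elsewhere).
images_remaining = 11_575

nls_avg_mb = 1.5                      # NLS images average ~1.5MB each
other_mb_low, other_mb_high = 8, 10   # other digitisation units: 8-10MB

nls_estimate_gb = images_remaining * nls_avg_mb / 1024
other_low_gb = images_remaining * other_mb_low / 1024
other_high_gb = images_remaining * other_mb_high / 1024

print(f"If NLS-sized: {nls_estimate_gb:.0f}GB")                      # ~17GB
print(f"If typically sized: {other_low_gb:.0f}-{other_high_gb:.0f}GB")  # ~90-113GB
```

So around 17GB if all the remaining images were NLS-sized, but more like 90-113GB at the more usual 8-10MB per image, which is why 120-130GB of additional space (with some headroom) seemed a sensible request.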
Another task for the project this week was to write a script to renumber the folio numbers for the 14 volumes from the Advocates Library that I noticed had irregular numbering. Each of the 14 volumes had different issues with their handwritten numbering, so I had to tailor my script to each volume in turn, and once the process was complete the folio numbers used to identify page images in the CMS (and eventually in the front-end) entirely matched the handwritten numbers for each volume.
My next task for the project was to import the records for several volumes from the Royal High School of Edinburgh but I ran into a bit of an issue. I had previously been intending to extract the ‘item’ column and create a book holding record and a single book item record for each distinct entry in the column. This would then be associated with all borrowing records in RHS that also feature this exact ‘item’. However, this approach would result in a lot of duplicate holding records, because entries in the ‘item’ column often include information about different volumes of a book and/or use different spellings.
For example, in SL137142 the book ‘Banier’s Mythology’ appears four times as follows (assuming ‘Banier’ and ‘Bannier’ are the same):
- Banier’s Mythology v. 1, 2
- Banier’s Mythology v. 1, 2
- Bannier’s Myth 4 vols
- Bannier’s Myth. Vol 3 & 4
My script would create one holding and item record for ‘Banier’s Mythology v. 1, 2’ and associate it with the first two borrowing records, but the third and fourth items above would end up generating two additional holding / item records, which would then be associated with the third and fourth borrowing records.
No script I can write (at least not without a huge amount of work) would be able to figure out that all four of these books are actually the same, or that there are actually 4 volumes for the one book, each requiring its own book item record, and that volumes 1 & 2 need to be associated with borrowing records 1 & 2, while all 4 volumes need to be associated with borrowing record 3 and volumes 3 & 4 need to be associated with borrowing record 4. I did wonder whether I might be able to automatically extract volume data from the ‘item’ column but there is just too much variation.
We’re going to have to tackle the normalisation of book holding names and the generation of all required book items for volumes at some point and this either needs to be done prior to ingest via the spreadsheets or after ingest via the CMS.
My feeling is that it might be simpler to do it via the spreadsheets before I import the data. If we were to do this then the ‘Item’ column would become the ‘original title’ and we’d need two further columns, one for the ‘standardised title’ and one listing the volumes as comma-separated volume numbers. With the above examples we would end up with the following (with a | representing a column division):
- Banier’s Mythology v. 1, 2 | Banier’s Mythology | 1,2
- Banier’s Mythology v. 1, 2 | Banier’s Mythology | 1,2
- Bannier’s Myth 4 vols | Banier’s Mythology | 1,2,3,4
- Bannier’s Myth. Vol 3 & 4 | Banier’s Mythology | 3,4
If each sheet of the spreadsheet is ordered alphabetically by the ‘item’ column it might not take too long to add in this information. The additional fields could also be omitted where the ‘item’ column has no volumes or different spellings. E.g. ‘Hederici Lexicon’ may be fine as it is. If the ‘standardised title’ and ‘volumes’ columns are left blank in this case then when my script reaches the record it will know to use ‘Hederici Lexicon’ as both original and standardised titles and to generate one single unnumbered book item record for it. We agreed that normalising the data prior to ingest would be the best approach and I will therefore wait until I receive updated data before I proceed further with this.
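The logic my import script would then follow for each row is roughly this (a simplified sketch; the function and column handling are illustrative, not the actual script):

```python
def holding_from_row(original, standardised="", volumes=""):
    """Derive the holding title and the book item volumes for one row.
    If the 'standardised title' and 'volumes' columns are blank, the
    original title is used as-is and one unnumbered item is created."""
    title = standardised.strip() or original
    if volumes.strip():
        # e.g. "1,2,3,4" -> one book item record per volume number
        vols = [int(v) for v in volumes.split(",")]
    else:
        vols = [None]  # a single, unnumbered book item
    return title, vols

# The 'Bannier's Myth 4 vols' row from the example above:
print(holding_from_row("Bannier's Myth 4 vols", "Banier's Mythology", "1,2,3,4"))
# A row like 'Hederici Lexicon', with the extra columns left blank:
print(holding_from_row("Hederici Lexicon"))
```

The first call yields the standardised title plus four volume numbers (so four book item records), while the second falls back to the original title with one unnumbered item, exactly as described above.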
Also this week I generated a new version of a spreadsheet containing the records for one register for Gerry McKeever, who wanted borrowers, book items and book holding details to be included in addition to the main borrowing record. I also made a pretty major update to the CMS to enable books and borrower listings for a library to be filtered by year of borrowing in addition to filtering by register. Users can limit the data by either register or year (not both); the register drop-down needs to be empty for the year filter to work, otherwise the selected register will be used as the filter. On either the ‘books’ or ‘borrowers’ tab in the year box they can add either a single year (e.g. 1774) or a range (e.g. 1770-1779). Then when ‘Go’ is pressed the data displayed is limited to the year or years entered. This also includes the figures in the ‘borrowing records’ and ‘Total borrowed items’ columns. Also, the borrowing records listed when a related pop-up is opened will only feature those in the selected years.
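Parsing the year box is simple enough; something along these lines (a sketch of the approach, not the CMS code itself, which runs server-side):

```python
def parse_year_filter(text):
    """Parse the year box: a single year ('1774') or a range
    ('1770-1779'). Returns an inclusive (start, end) pair, or None
    if the box is empty (i.e. no year filter applied)."""
    text = text.strip()
    if not text:
        return None
    if "-" in text:
        start, end = (int(part) for part in text.split("-", 1))
    else:
        start = end = int(text)
    return (start, end)

print(parse_year_filter("1774"))        # single year
print(parse_year_filter("1770-1779"))   # range
```

The resulting (start, end) pair can then be applied to the borrowing records query, and the same limit is used for the counts in the ‘borrowing records’ and ‘Total borrowed items’ columns.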
I also worked with Raymond in Arts IT Support and Geert, the editor of the Anglo-Norman Dictionary, to complete the process of migrating the AND website to the new server. The website (https://anglo-norman.net/) is now hosted on the new server and is considerably faster than it was previously. We also took the opportunity to launch the Anglo-Norman Textbase, which I had developed extensively a few months ago. Searching and browsing can be found here: https://anglo-norman.net/textbase/ and this marks the final major item in my overhaul of the AND resource.
My last major task of the week was to start work on a database of ultrasound video files for the Speech Star project. I received a spreadsheet of metadata and the video files from Eleanor this week and began processing everything. I wrote a script to export the metadata into a three-table related database (speakers, prompts and individual videos of speakers saying the prompts) and began work on the front-end through which this database and the associated video files will be accessed. I’ll be continuing with this next week.
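The three-table structure looks roughly like this (a minimal sketch using SQLite; the table and column names here are illustrative, not the project’s actual schema):

```python
import sqlite3

# Speakers and prompts are each stored once; each video row links a
# speaker to a prompt they are recorded saying.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speakers (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE prompts (
    id INTEGER PRIMARY KEY,
    prompt TEXT NOT NULL
);
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    speaker_id INTEGER NOT NULL REFERENCES speakers(id),
    prompt_id INTEGER NOT NULL REFERENCES prompts(id),
    filename TEXT NOT NULL
);
""")

conn.execute("INSERT INTO speakers VALUES (1, 'Speaker A')")
conn.execute("INSERT INTO prompts VALUES (1, 'pah')")
conn.execute("INSERT INTO videos VALUES (1, 1, 1, 'spkA_pah.mp4')")

# The front-end would join the three tables to list videos per prompt:
row = conn.execute("""
    SELECT s.label, p.prompt, v.filename
    FROM videos v
    JOIN speakers s ON s.id = v.speaker_id
    JOIN prompts p ON p.id = v.prompt_id
""").fetchone()
print(row)
```

Keeping speakers and prompts in their own tables means the front-end can filter videos by either without duplicating metadata on every video row.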
In addition to the above I also gave some advice to the students who are migrating the IJOSTS journal over to WordPress, had a chat with the DSL people about when we’ll make the switch to the new API and data, set up a WordPress site for Joanna Kopaczyk for the International Conference on Middle English, upgraded all of the WordPress sites I manage to the latest version of WordPress, made a few tweaks to the 17th Century Symposium website for Roslyn Potter, spoke to Kate Simpson in Information Studies about speaking to her Digital Humanities students about what I do and arranged server space to be set up for the Speak For Yersel project website and the Speech Star project website. I also helped launch the new Burns website: https://burnsc21-letters-poems.glasgow.ac.uk/ and updated the existing Burns website to link into it via new top-level tabs. So a pretty busy week!
This was my first week back after the Christmas holidays, and it was a three-day week. I spent the days almost exclusively on the Books and Borrowing project. We had received a further batch of images for 23 library registers from the NLS, which I needed to download from the NLS’s server and process. This involved renaming many thousands of images via a little script I’d written in order to give the images more meaningful filenames and stripping out several thousand images of blank pages that had been included but are not needed by the project. I then needed to upload the images to the project’s web server and generate all of the necessary register and page records in the CMS for each page image.
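The renaming script’s core logic is simple enough; something along these lines (a heavily simplified sketch, and the `regNN_` naming pattern here is purely illustrative, not the project’s actual scheme):

```python
from pathlib import Path

def meaningful_name(register_id, image_path):
    """Build a more meaningful filename from a register ID and the
    numeric part of the supplied filename, e.g. '0042.jpg' for
    register 12 becomes 'reg12_042.jpg' (illustrative pattern only)."""
    digits = "".join(ch for ch in image_path.stem if ch.isdigit())
    return f"reg{register_id}_{int(digits):03d}{image_path.suffix}"

print(meaningful_name(12, Path("0042.jpg")))
```

In the real script this would be run over every file in a register’s directory (via `Path.rename`), with the blank-page images skipped rather than renamed.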
I also needed to update the way folio numbers were generated for the registers. For the previous batch of images from the NLS I had just assigned the numerical part of the image’s filename as the folio number, but it turns out that most of the images have a hand-written page number in the top-right which starts at 1 for the first actual page of borrowing records. There are usually a few pages before this, and these need to be given Roman numerals as folio numbers. I therefore had to write another script that would take into consideration the number of front-matter pages in each register, assign Roman numerals as folio numbers to them and then begin the numbering of borrowing record pages from 1 after that, incrementing through the rest of the volume.
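The numbering scheme can be sketched as follows (a simplified illustration of the approach; the Roman numeral helper only covers the small counts front-matter actually needs):

```python
def folio_numbers(total_pages, front_matter):
    """Roman numerals for the front-matter pages, then 1..n for the
    borrowing record pages that follow."""
    def roman(n):
        # Lower-case Roman numerals; sufficient for small front-matter counts.
        out = ""
        for value, symbol in [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]:
            while n >= value:
                out += symbol
                n -= value
        return out

    labels = [roman(i) for i in range(1, front_matter + 1)]
    labels += [str(i) for i in range(1, total_pages - front_matter + 1)]
    return labels

# A register with three front-matter pages:
print(folio_numbers(6, 3))
```

This way the CMS folio numbers line up with the hand-written page numbers that start at 1 on the first page of borrowing records.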
I guess it was inevitable with data of this sort, but I ran into some difficulties whilst processing it. Firstly, there were some problems with the Jpeg images the NLS had sent for two of the volumes. These didn’t match the Tiff images for the volumes, with each volume having an incorrect number of files. Thankfully the NLS were able to quickly figure out what had gone wrong and were able to supply updated images.
The next issue to crop up occurred when I began to upload the images to the server. After uploading about 5GB of images the upload terminated, and soon after that I received emails from the project team saying they were unable to log into the CMS. It turns out that the server had run out of storage. Each time someone logs into the CMS the server needs a tiny amount of space to store a session variable, but there wasn’t enough space to store this, meaning it was impossible to log in successfully. I emailed the IT people at Stirling (where the project server is located) to enquire about getting some further space allocated but I haven’t heard anything back yet. In the meantime I deleted the images from the partially uploaded volume which freed up enough space to enable the CMS to function again. I also figured out a way to free up some further space: the first batch of images from the NLS also included images of blank pages across 13 volumes – several thousand images. It was only after uploading these and generating page records that we had decided to remove the blank pages, but I only removed the CMS records for these pages – the image files were still stored on the server. I therefore wrote another script to identify and delete all of the blank page images from the first batch that was uploaded, which freed up 4-5GB of space from the server, which was enough to complete the upload of the second batch of registers from the NLS. We will still need more space, though, as there are still many thousands of images left to add.
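The clean-up script essentially compared the files on disk against the page records remaining in the CMS; a simplified sketch of that idea (function and data shapes are illustrative):

```python
def orphaned_blank_images(files_on_server, cms_page_images):
    """Find image files that no longer have a matching CMS page record
    (the blank pages whose records were deleted) and report how much
    space removing them would free. files_on_server maps filename ->
    size in bytes; cms_page_images is the set of filenames still
    referenced by page records."""
    keep = set(cms_page_images)
    orphans = {f: size for f, size in files_on_server.items() if f not in keep}
    freed_gb = sum(orphans.values()) / (1024 ** 3)
    return sorted(orphans), freed_gb

names, freed = orphaned_blank_images(
    {"p1.jpg": 100, "p2.jpg": 200, "p3.jpg": 300},
    ["p1.jpg", "p3.jpg"],
)
print(names)
```

The real script then deleted the orphaned files, which is where the 4-5GB saving came from.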
I also took the opportunity to update the folio numbers of the first batch of NLS registers to bring them into line with the updated method we’d decided on for the second batch (Roman numerals for front-matter and then incrementing page numbers from the first page of borrowing records). I wrote a script to renumber all of the required volumes, which was mostly a success.
However, I also noticed that the automatically generated folio numbers often became out of step with the hand-written folio numbers found in the top-right corner of the images. I decided to go through each of the volumes to identify all that became unaligned and to pinpoint on exactly which page or pages the misalignment occurred. This took some time as there were 32 volumes that needed to be checked, and each time an issue was spotted I needed to look back through the pages and associated images from the last page until I found the point where the page numbers correctly aligned. I discovered that there were numbering issues with 14 of the 32 volumes, mainly due to whoever wrote the numbers in getting muddled. There are occasions where a number is missed, or a number is repeated. In one volume the page numbers advance by 100 from one page to the next. It should be possible for me to write a script that will update the folio numbers to bring them into alignment with the erroneous handwritten numbers (for example where a number is repeated these will be given ‘a’ and ‘b’ suffixes). I didn’t have time to write the script this week but will do so next week.
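The suffix logic for repeated numbers should be something along these lines (a sketch of the planned approach, not the final script, which will also need to handle the skipped and jumped numbers per volume):

```python
from collections import Counter
from string import ascii_lowercase

def align_folio_numbers(handwritten):
    """Produce folio labels that follow the (erroneous) handwritten
    numbers, giving repeated numbers 'a', 'b', ... suffixes so that
    every label remains unique."""
    totals = Counter(handwritten)
    seen = Counter()
    labels = []
    for n in handwritten:
        suffix = ascii_lowercase[seen[n]] if totals[n] > 1 else ""
        labels.append(f"{n}{suffix}")
        seen[n] += 1
    return labels

# A volume where the writer repeated 12 and skipped 14:
print(align_folio_numbers([11, 12, 12, 13, 15]))
```

Skipped numbers simply stay skipped, since the aim is for the CMS folio numbers to match whatever is handwritten on the page rather than to correct it.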
Also for the project this week I looked through the spreadsheet of borrowing records from the Royal High School of Edinburgh that one of the RAs has been preparing. I had a couple of questions about the spreadsheet, and I’m hoping to be able to process it next week. I also exported the records from one register for Gerry McKeever to work on, as these records now need to be split across two volumes rather than one.
Also this week I had an email conversation with Marc Alexander about a few issues, during which he noted that the Historical Thesaurus website was offline. Further investigation revealed that the entire server was offline, meaning several other websites were down too. I asked Arts IT Support to look into this, which took a little time as it was a physical issue with the hardware and they were all still working remotely. However, the following day they were able to investigate and address the issue, which they reckon was caused by a faulty network port.