Week Beginning 22nd May 2023

I’d taken Thursday and Friday off this week, ahead of next Monday’s public holiday.  I spent some time this week getting the website for Ophira Gamliel’s new project up and running.  This included updating the header image, updating the map and adding in some initial content to all of the pages.  The website is not publicly accessible quite yet, but it’s almost ready to launch.  I also updated the IPA charts on the Seeing Speech website to incorporate the new MRI 2 recordings, the new metadata and the description, which are now available at https://www.seeingspeech.ac.uk/ipa-charts/?chart=1&datatype=4&speaker=1 and updated the MRI 1 and animation metadata on both Seeing Speech and STAR.  I also helped with the migration of the Edinburgh Gazetteer website, which needed to be moved back to internal hosting due to its size and performed a few other tasks relating to the migration of sites.

For the DSL I updated the ‘back’ button that returns you to a dictionary entry from a bibliography entry.  Previously pressing on this took you to the top of the entry page but the editors wanted it to load the entry page at the point the user was at when they clicked the bib link, which is what the browser’s ‘back’ button does.  One reason this was rather complicated is that if someone opens the link in a new tab the browser’s ‘back’ button is disabled as there is no previous page in the new tab, but we still want people to potentially load up the entry in the new tab.  I’ve found a way to ensure this works, but it took a bit of time to get right.

The editors had also noticed a strange situation whereby a search for ‘gb’ was finding lots of occurrences of ‘gib’.  This was a strange one and also took some investigation.  It turns out that Solr allows you to set up synonyms, so that when (for example) someone searches for ‘TV’ they find occurrences of ‘television’.  What I didn’t know is that Solr comes with a default synonyms file for test purposes and this includes:

GB,gib,gigabyte,gigabytes

MB,mib,megabyte,megabytes

Television, Televisions, TV, TVs

And this explains why a search for ‘gb’ is finding ‘gib’ while ‘fb’ is not finding ‘fib’.  Thankfully we are in the middle of migrating the DSL’s Solr search platform and I was able to request that this synonyms file gets wiped in the new location.  Setting up the new Solr instance and re-ingesting the DSL data was pretty straightforward and currently the new instance is connected to a test version of the DSL website.  The DSL team have been testing it out and have noticed a couple of issues that I’ll need to investigate next week before we finalise the migration and turn the old Solr server off.  Also during this process we realised that the issue with Apache handling space characters that I discussed in my 8th of May post had also started to affect the DSL website, breaking all searches for terms that featured spaces.  Thankfully I knew how to fix this due to my investigations a couple of weeks ago.
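As a rough illustration of why those default lines produce this behaviour, here is a minimal Python sketch of equivalence-style synonym expansion. It is not Solr's implementation: the function names are hypothetical, and it assumes case-insensitive matching, whereas what Solr actually does depends on how the synonym filter is configured for the field.

```python
# Illustrative only: mimics how a comma-separated synonyms line can expand a search term.
SYNONYM_LINES = [
    "GB,gib,gigabyte,gigabytes",
    "MB,mib,megabyte,megabytes",
    "Television, Televisions, TV, TVs",
]


def build_synonym_map(lines):
    """Map each lower-cased term to the full set of terms listed on its line."""
    mapping = {}
    for line in lines:
        terms = {t.strip().lower() for t in line.split(",") if t.strip()}
        for term in terms:
            mapping.setdefault(term, set()).update(terms)
    return mapping


def expand(term, mapping):
    """Return every term a query for `term` would match (just itself if no synonyms)."""
    key = term.lower()
    return mapping.get(key, {key})


SYNONYMS = build_synonym_map(SYNONYM_LINES)
```

With this map, `expand("gb", SYNONYMS)` includes `gib`, while `expand("fb", SYNONYMS)` contains only `fb`, which matches the behaviour the editors saw.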

I spent the rest of the week working for the Books and Borrowing project.  The PI Katie had suggested that we add author gender as a new field and I engaged in an email discussion about how best to implement this.  It will take some time as although it’s simple enough to add the extra field to the database, the CMS, the API, the Solr data export, the front-end searches, browses and display of data will all need to be updated to incorporate the new option.  I generated a list of the current authors in the system as a spreadsheet and sent this to Katie, who is going to fill in a new ‘gender’ field and send it back to me to work with, probably next week.

One of the major issues I tackled this week was what to do about the number of times a book is borrowed (which appears in several places throughout the front-end) not matching the number of associated borrowing records (as found in the search results).  The issue arises because a book holding may consist of multiple volumes.  One borrowing record may involve any number of volumes, so although in the search results we have one borrowing record, in the counts of borrowed books the total number for a book holding may be very different.

I decided to update the system so that the number of borrowing records is stored in the cache and displayed on the site in addition to the number of times borrowed.  This meant updating the database and writing a new cache generation script, then updating the API to bring back the new data and the front-end to display it.  So for example the Co-I Matt noticed a couple of weeks ago that “the Buffon [book] is listed like this: ‘Number of borrowings: 159’.  However, if you click on the holding title, the resulting search has 108 results”.  After the update the number of borrowings for the holding record now states: “Volumes of this book were borrowed 159 times in 108 borrowing records”.

I’ve also updated book edition counts, for example: “Volumes associated with this edition were borrowed 316 times in 233 borrowing records”.  The ‘top 100’ list also now uses the number of borrowing records to order things, although this generally doesn’t change things too drastically.  It should be noted that discrepancies between number of times borrowed and number of associated borrowing records only happen when a book has multiple volumes.  Therefore if a book doesn’t have multiple volumes then the text just states the number of borrowings.
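The difference between the two figures can be shown with a minimal Python sketch. The data structure and function name are hypothetical, not the project's cache script, and the sketch assumes a holding counts as multi-volume when more than one distinct volume appears in its borrowing records.

```python
def holding_summary(borrowing_records):
    """Summarise a holding.

    `borrowing_records` is a list with one entry per borrowing record,
    each entry being the list of volume ids borrowed in that record.
    """
    # Each volume in each record counts as one "time borrowed".
    times_borrowed = sum(len(volumes) for volumes in borrowing_records)
    # Each record counts once, however many volumes it involves.
    record_count = len(borrowing_records)
    distinct_volumes = {v for volumes in borrowing_records for v in volumes}

    if len(distinct_volumes) <= 1:
        # Single-volume holdings: the two figures are identical.
        return f"Number of borrowings: {record_count}"
    return (
        f"Volumes of this book were borrowed {times_borrowed} times "
        f"in {record_count} borrowing records"
    )
```

For example, three records covering volumes `[v1, v2, v3]`, `[v1]` and `[v2, v3]` give six volume borrowings in three borrowing records.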

I have updated the book edition page to display the new borrowing record information too, and I’ve also updated the ‘top ten lists’ in the ‘facts’ page to use the borrowing record figures.  In addition I have fixed the issue with the numbers in the ‘top ten authors’ lists being wrong. Figures relating to borrowers rather than books should have been correct anyway so I haven’t needed to update these. Hopefully this will considerably cut down on the potential for confusion.

Also this week I wrote a little script to add in the normalised borrower occupation for borrowers from the Royal High School and returned to the visualisations on the library ‘facts and figures’ page.  I’ve implemented a further donut chart for borrower occupations, this time showing the number of borrowings per occupation.  I’ve also updated the text above the chart to show more information about how the figures are derived.  For example for Chambers it now says:

“Of the 311 library borrowers 101 have one or more identified occupations. The total number of occupations associated with borrowers at this library is 114 and these are represented in the following chart”.

Hopefully this makes it clearer that the numbers represent occupations and these are not necessarily present for all borrowers and some borrowers may have more than one occupation.

The chart of borrowings per occupation makes for an interesting comparison with the chart for borrower occupations, for example in Chambers the occupation ‘Wife/Spouse’ (shown in green) represents 9.65% of the total occupations in the library but 26.15% of the borrowing records, as you can see from the following screenshot:

Next week I’ll hopefully continue with the visualisations on this page.

 

Week Beginning 15th May 2023

I spent the majority of this week continuing to work for the Books and Borrowing project.  I added in images of both the Hunterian and Inverness Kirk Sessions library register pages, as these had not yet been integrated.  I then began working on the library ‘Facts and Figures’ page and added in the API endpoints and processing of library statistics, which now appear in a ‘summary’ section at the top of the page, as you can see for the Chambers library in the following screenshot:

We may need to include some further explanation of the above.  The number of borrowing records does not match the numbers split by gender, which may be caused by some borrowing records not having an associated borrower or others having more than one associated borrower.  Also as you can see a borrowing record in this library has an erroneous year, making it look like the borrowing records stretch into the distant future of 18292.

Beneath the summary are a series of ‘top ten’ lists, featuring the top ten borrowed books, authors and most prolific borrowers, both overall and further broken down by borrower gender, as you can see below:

These all link through to search results for the item in question.  Note that we once again face the issue of numbers here reflecting volumes borrowed while the search results count borrowing records, which can include any number of volumes.  I’m going to have a serious think about what we can do about this discrepancy as it is bound to cause a lot of confusion.  There is also something very wrong with the author figures that I need to investigate.  I think the issue is authors can be associated with any level of book record and I suspect authors in this list are getting counted multiple times.

There are also issues caused by the data still being worked on while the Solr cache and other cached data become outdated.  I spent quite a while on Tuesday trying to figure out why a book holding wasn’t appearing where it should or with the expected number of borrowings until I realised the data in the CMS had been updated (including the initial letter of the book title) while the cache hadn’t.

I then moved onto working on some visualisations.  I created a borrower occupations donut chart as described in the requirements document.  This chart shows the distribution of borrower occupations across library borrowers in a two-level pie chart with top-level occupations in the middle and these then subdivided into secondary level occupations in the outer ring.  Note that I haven’t further split ’Religion and Clergy’ > ‘Minister/Priest’ into its third level occupations as it is the only category that has three levels and it would have made the visualisation too complicated (both to develop and to comprehend in the available space).  Instead the individual ‘minister / priest’ categories are amalgamated into ‘minister / priest’.

The charts are a nice way to get an overall picture of the demographics of the selected library and also to compare the demographics of different libraries.  For example, Chambers has a very broad spread of occupations:

Whereas Advocates (as you’d expect) is more focussed:

You can hover over each segment to view the title of the occupation and the percentage of borrowers that have the occupation.  You can also click on a segment to perform a search for the occupation at the library in question.  You should bear in mind that a borrower can have multiple (or no) occupations so the number of borrowers and the number of occupations will be different.  The next thing I’ll do is create a further donut chart showing the number of borrowings per occupation, but I’ll leave that to next week.

I also thought some more about the thorny issue of number of borrowings versus number of borrowing records.  What I think I’ll try and do is ensure that the only number that is shown is the number of associated borrowing records.  I’ll need to experiment with how (or if) this might work, which I will leave until next week.

I also devoted some time to setting up the website for Ophira Gamliel’s new AHRC project.  I created several mockups of possible interface designs including a few different header images based on photographs of manuscripts that the project will analyse.  Ophira and Co-I Ines Weinrich picked the version they liked best and I then applied this to the live site.  I also worked on a static map of the area the project will focus on.  I created this by initially creating an interactive map using the lovely Stamen Watercolor basemap and then I used Photoshop to add other labels and to fade out areas the project is not focussing on.  The map still needs some work but here is an initial version:

I spent a bit of time working for the DSL, responding to a couple of queries from Pauline Graham about Google Analytics.  Pauline wondered whether we could find in GA the search terms that people had entered that found no results, but after some research I don’t think it would be possible as the page that GA will log is the same whether there are results or not.  We’d have to update the site structure to maybe load a different page (with the search terms in the URL) when there are no results and then we’d be able to isolate these page hits in GA.  However, we can limit the list of page stats in GA to a specific page and then order by number of views, which may give us some idea of which searches find nothing.  After selecting ‘Pages and screens’ in ‘Engagement’, pressing the blue plus in the results table header opens a pop-up, and selecting ‘Page / Screen’ then ‘Landing page + query string’ allows you to add a new column featuring the page URL.  Then above the table in the search area (with the magnifying glass) enter ‘/results/’ to limit the data to the search results page.  You can update the ‘rows per page’ and press the arrow next to ‘views’ to order by views from least to most.

I also spent some time on the Speech Star project this week, updating the ExtIPA charts to add in ‘sound name’ and ‘context’ to the metadata and updating the website to ensure this new information is displayed alongside the videos.  I also added in new metadata for the IPA charts, but for the moment this is only available for the MRI-2 videos.  I’ll need to add in metadata for the other video types later.

I also spent a bit more time working on the site migration, getting https://thepeoplesvoice.glasgow.ac.uk/ working again, fixing an issue with the place-name images upload for the Iona project and making updates to a number of other sites.

 

Week Beginning 8th May 2023

This was a three-day week for me as Monday was the coronation holiday and Tuesday was (thankfully) my final day of jury duty.  I spent some more time this week working on the migration of websites to external hosting and fixing a number of issues that have arisen during this process.  I also set up a website for Ophira Gamliel’s new project including engaging in an email discussion with the two project Co-Is about the interface and design.  I also made some further updates to the teaching version of the SpeechStar website, which has now launched (https://www.seeingspeech.ac.uk/speechstar/).  There are still some things to do for the website, such as adding in citation options for videos and adding more descriptive metadata to some of the videos, but the majority of content is now available.

I also investigated an issue the editor of the Anglo-Norman Dictionary was encountering with the proof-reader feature of the dictionary management system I’ve created.  This feature allows a ZIP file containing any number of dictionary entry XML files to be uploaded and these are then formatted in the same way as ‘live’ dictionary entries and displayed on one long page that can then be downloaded or copied into Word.  When Geert, the editor, was uploading a ZIP file the page was encountering an error and displaying a blank screen.

After some investigation I discovered this was an issue with the data.  The proof-reader is not an XML validator and expects valid XML files.  When it encounters a malformed XML file the proof-reader breaks.  It turned out there was one problematic XML file which includes a malformed tag:

<lemma>tresaiele/lemma>

This was causing a fatal error in the script, resulting in a blank page.  What I have done now is to update the proof-reader so it can handle problems with an XML file.  In such cases the file is now skipped over and a warning is displayed at the top of the page, such as  “WARNING: Malformed XML file skipped: test/tresaile.xml” while all other files get processed correctly.  This is a much better approach and will be very helpful in future.
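The skip-and-warn approach can be sketched in Python as below. This is an illustration under assumptions rather than the proof-reader's own code: the function names are hypothetical, and `make_zip` exists only to build a small test archive.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def parse_entries(zip_bytes):
    """Parse every .xml file in an uploaded ZIP, skipping malformed ones.

    Returns (parsed, warnings): a dict of filename -> parsed root element,
    and a list of warning messages for files that could not be parsed.
    """
    parsed, warnings = {}, []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                parsed[name] = ET.fromstring(archive.read(name))
            except ET.ParseError:
                # Record the problem and carry on with the remaining files.
                warnings.append(f"WARNING: Malformed XML file skipped: {name}")
    return parsed, warnings


def make_zip(files):
    """Helper for trying the sketch out: build an in-memory ZIP from a dict."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Catching the parse error per file rather than around the whole loop is what lets one bad entry produce a warning instead of a blank page.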

I spent the rest of my week working on the Books and Borrowing project.  I sorted out the page numbers for one of the new registers I added to the system last week (the other one didn’t need sorting).  Page numbering was a real mess in this one and it’s taken quite some time to get our page numbers aligned with what is written in the images.

I also spent some time investigating a couple of issues that Matt had reported.  The first one was empty, unlabelled book items sometimes appearing in book holding records.  I think there is a bug in the CMS that can result in a blank book item getting created in certain circumstances.  As we’re sort of nearing the end of the project and debugging the CMS would likely be a long and painful process I pointed out that I would rather run a script to strip out all unlabelled book items that have no associated borrowings as a quick fix instead.  We could then run this script multiple times as required.  I ran a query that identified all unlabelled book items that have no associated borrowings, and there are about 2,400 of them.  I could then run a script to delete all of these items, which would remove the various ‘Volume (0 borrowings)’ from the book holding records.  What it won’t do is fix the unlabelled volumes that do have borrowings (e.g. ‘Volume (1 borrowing)’ items).  However, I could run another script to output a list of all book items that have no label but have at least one borrowing if that might help to fix these.
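A clean-up of this kind might look something like the following Python sketch, run here against an in-memory SQLite database. The table and column names (`book_item`, `borrowing_item`, `label`, `item_id`) are invented for illustration and are not the project's actual schema.

```python
import sqlite3


def unlabelled_items_without_borrowings(conn):
    """Return ids of book items with no label and no associated borrowings."""
    query = """
        SELECT i.id
        FROM book_item AS i
        LEFT JOIN borrowing_item AS b ON b.item_id = i.id
        WHERE i.label IS NULL OR TRIM(i.label) = ''
        GROUP BY i.id
        HAVING COUNT(b.item_id) = 0
        ORDER BY i.id
    """
    return [row[0] for row in conn.execute(query)]


def delete_items(conn, item_ids):
    """Delete the given book items; safe to re-run as new blank items appear."""
    conn.executemany("DELETE FROM book_item WHERE id = ?", [(i,) for i in item_ids])
    conn.commit()
    return len(item_ids)


def demo_connection():
    """Build a tiny hypothetical dataset to try the functions against."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE book_item (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE borrowing_item (borrowing_id INTEGER, item_id INTEGER);
        INSERT INTO book_item VALUES (1, 'Volume 1'), (2, ''), (3, NULL), (4, '');
        INSERT INTO borrowing_item VALUES (10, 1), (11, 4);
    """)
    return conn
```

In the demo data, items 2 and 3 would be removed, while item 4 (unlabelled but borrowed once) is left alone, which corresponds to the ‘Volume (1 borrowing)’ case that a script like this would not fix.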

The second issue Matt pointed out is that the number of borrowing records listed for book holdings in the overall list of book editions does not match up with the number of search results found when you perform a search for the holding.  This is because the counts on the list of books page are the total number of times each book item for each book holding has been borrowed (so in Matt’s example adding up all of the times each book item was borrowed gives 159).  The search results display a count of borrowing records, and a single borrowing record can include multiple book items.  For example one borrowing record may involve three book items but only counts as one borrowing record.  In Matt’s example the 108 search results are borrowing records, not book items.

This is a bit confusing and I wonder what we could do to clarify this.   I did wonder whether the search results could also tally up the number of book items in the results and display this alongside the ‘your search matched x borrowing records’, but unfortunately we only have this data for a subset of results (up to 100 on each page of the results).  Perhaps we need to add some explanatory text somewhere, or perhaps the number of borrowings on the ‘Books’ page should also count the number of associated borrowing records rather than the total number of times each item was borrowed.  Or I could update the Solr index to cache the total number of book items for each borrowing record.

I also updated the overall list of borrowers in the front-end, adding in a ‘limit by library’ option.  I had hoped that this would only take a couple of hours to implement but it ended up being rather complicated to get working as I needed to update several API endpoints to incorporate the limit by library, update the way the URLs work on the borrowers page and update the JavaScript that processes the changing of the view and the selection of filter items, in addition to incorporating the list of libraries into the page.  However, I managed to get the feature completed by the end of the week.

One thing to note is that there is often a discrepancy between the number of borrowers listed in the occupations section and the number in the tabs.  For example, when limiting the list to female borrowers with an ‘Arts and Letters’ occupation the ‘Arts and Letters’ occupation shows 7 borrowers but if you count the borrowers in the tabs there are only 5.  The reason is a borrower can have multiple occupations.  For example, Anne Grant is both poet and author so counts as one borrower in the ‘G’ tab but one each for ‘Poet’ and ‘Author’ in the occupation counts.  I also wonder whether I should add in counts of borrowers at each library to the ‘limit by library’ page.  Something to consider, anyway.
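A small Python sketch shows how this discrepancy arises. Apart from Anne Grant being both poet and author, the data is invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def occupation_counts(borrowers):
    """Count borrowers per occupation.

    `borrowers` maps a borrower name to a list of occupations; a borrower
    with several occupations is counted once under each of them.
    """
    counts = Counter()
    for occupations in borrowers.values():
        counts.update(set(occupations))
    return counts


# Hypothetical sample: only Anne Grant's two occupations come from the post.
SAMPLE_BORROWERS = {
    "Anne Grant": ["Poet", "Author"],
    "Borrower B": ["Author"],
    "Borrower C": [],
}
```

Here the occupation counts add up to three (`Author` twice, `Poet` once) even though only two borrowers have any occupation at all, so totals in the occupations section can exceed the number of borrowers listed in the tabs.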

I also encountered some weirdness with the site today – specifically any searches for things with spaces in them (e.g. occupation ‘Arts and Letters’) just displayed a 403 Forbidden page.  This was before I’d made any updates to the code so it wasn’t that I’d inadvertently broken something.  It turned out that a recent update to the Apache server software was breaking any URL that had a space in it.  If you really want to find out more you can read here:

https://stackoverflow.com/questions/75684314/ah10411-error-managing-spaces-and-20-in-apache-mod-rewrite

I had to spend some time figuring out how this change has affected the site and making updates to avoid the 403 error pages.  I think I’ve sorted most things out when using the front-end, although accessing the API directly is still causing problems.  I’ll need to sort this out next week.  I’ll also deal with the Hunterian images, which it turns out had not yet been added to the system, and will hopefully move on to the ‘Facts and Figures’ pages and their visualisations next week too.

Week Beginning 1st May 2023

Monday was a holiday this week, so ordinarily this would have been a four-day week for me.  However, I was unfortunately picked for jury duty and I was obliged to attend court on Wednesday and Friday, making it a two-day week.  I’m also going to have to attend court on Tuesday next week as well (after next Monday’s coronation holiday) but hopefully that will be an end to the disruption.

On Tuesday this week I spent a bit of time working on the migration of sites to external hosting and spent the remainder of the day adding the new MRI 2 recordings to the IPA chart on the Speech Star website. I uploaded all of the videos and added in a new ‘MRI 2’ video type.  I then uploaded and integrated all of the metadata.  It took quite a long time to get all of this working (pretty much all day), adding the data to all four of the IPA charts, but I got it all done.  I will need to update the charts on the Seeing Speech website too once everyone is happy with how the charts look.

On Thursday I made some further tweaks to the Edinburgh’s Enlightenment map and migrated three further sites to external hosting.  I also spent some time updating the shared spreadsheet we’re using to keep track of the Arts websites, adding in contact details for all of the sites I’m responsible for and making a note of the sites I’ve migrated.

I also made some tweaks to the Speech Star feedback pages I’d created last week, populated a few pages of the Speech Star website with content from Seeing Speech, added content to the ‘contact us’ page, fixed some broken links that people had spotted in the site, swapped a couple of video files around that needed fixed in the charts and added explanatory text to the extIPA chart page.  I also added in some new symbols to the IPA charts for sounds that were not present on the original versions but we now have videos for in the MRI 2 data.

I also investigated a strange issue that Jane Roberts had encountered when adding works to the Old English Thesaurus using the CMS.  Certain combinations of characters in the ‘notes’ field were getting blocked by Apache, and once I’d figured this out we were able to address the issue.

I also spent a bit of time on the Books and Borrowing project, running a query and generating data about all of the book holding records that currently have no associated book edition record in the system (there are about 10,000 such records).  We had also received the images for the final two registers in the Advocates Library from the NLS digitisation unit and I spent some time downloading these, processing the images to remove blank pages and update the filenames, uploading the images to our server and then running a script to generate register and page records for each page in both registers.  These should be the last registers that need to get added to the system so it’s something of a milestone.

Week Beginning 24th April 2023

This week I continued to fix and reinstate some of the websites that had been taken offline due to the College of Arts server compromise.  This took up at least two full days of my time.  I also began to set up the website for Ophira Gamliel’s new AHRC funded project and fixed an accessibility issue with the Edinburgh’s Enlightenment map.  I responded to a query about QR codes from Alison Wiggins and spoke to Jennifer Smith about the follow-on project for Speak For Yersel, which will expand the site to new areas beyond Scotland.

Other than these tasks I spent the remainder of the week working for the Speech Star project.  I added a new database of speech videos to both Speech Star sites.  This is the Edinburgh MRI modelled speech database, consisting of around 50 MRI videos, divided into several categories.  I retained the folder structure, with expandable / collapsible sections for the three main folders and also for the subfolders contained in each.  The display of the video information in the page is similar to the Central Scottish phonetic features page, with boxes for each video containing a ‘Play’ button and the title and lexical items for each video.  You can then press ‘Play’ to open the usual video overlay that contains the video, the metadata and the playback speed option.  I also added in the introductory text and the logos.  Below is a screenshot of how the new database looks with one section open:

I also updated the list of databases, replacing the list of buttons with more pleasing images and text, as you can see below:

I then moved onto a major update to the extIPA chart page.  Previously one chart was shown, but an updated extIPA chart was released in 2015 that has many more symbols on it.  Eleanor had arranged for a new set of MRI recordings to be made, including many of these newly added sounds.  I therefore had to add in a new ‘post-2015’ chart and also add in a secondary set of MRI recordings, in addition to the first set that still needed to be accessible.  In addition, a new selection of animations had been created and I needed to add these in.

This page now has tabs for ‘Pre-2015’ and ‘Post-2015’.  The ‘Pre-2015’ tab displays the table as it was originally with video type ‘MRI 1’ selected by default.  You can then change to ‘MRI 2’ to view the new video clips.  More sounds have videos available in ‘MRI 2’ so fewer symbols are greyed out.  The ‘Animation’ tab also now includes links to a lot more videos.

The ‘Post-2015’ tab features all of the rows, columns and symbols from the new chart.  It’s been rather tricky getting all of these symbols to display, but I’ve got them all working now.  The ‘MRI 1’ table contains all of the original videos, with lots of symbols greyed out as there are no videos for them.  ‘MRI 2’ features links to all of the new videos, with a few symbols greyed out as the sounds haven’t been recorded.  The ‘animation’ tab has links to the same videos as the ‘Pre-2015’ table, but presented on the new chart.  It has been very time-consuming getting this up and running, but I’m sure it will be worth it.  Below is a screenshot of the ‘post-2015’ table with all of its symbols and the new ‘MRI 2’ videos populating it.

I also added in a new feedback page to both Speech Star sites which links through to a survey hosted on Qualtrics.  I added in a pop-up that prompts people to take the survey using the same code that I created when we added a survey to Seeing Speech and Dynamic Dialects a few years ago.

Monday next week is a holiday, and I’ve also been called up for jury duty, meaning I have to go to court on Tuesday.  I really hope I won’t get picked to serve as I have an awful lot to do in the coming weeks, especially for Books and Borrowing – a project I didn’t have any time to work on this week.