Week Beginning 15th April 2024

The Books and Borrowing project has its official launch next week, and I spent some time this week preparing for it.  This included fixing a speed issue with the site-wide facts page (https://borrowing.stir.ac.uk/facts/) that was taking far too long to load its default view, something that had somehow got worse since I originally created the feature.  Some investigation revealed that the main sticking point was loading the data for the two ‘Borrowings through time’ visualisations, which took a long time to calculate across all libraries.  I therefore created cached files of this data for all libraries and set the API to use these rather than querying the database whenever specific libraries are not requested.  This has greatly sped up the loading of the page, which (for me at least) is now practically instantaneous, while before it was taking up to 30 seconds.

I also went through all of my blog posts and extracted the text relating to the project, going back four years, to create a single document.  It’s 130 pages long and contains more than 56,000 words, covering all of the work I did for the project.  I then spent some time preparing a talk I’ll give at the launch about the development of the resource.  I haven’t finished working on this yet and will continue with it next week.  I also met with Matt to discuss the British Association for Romantic Studies journal (https://www.bars.ac.uk/review/index.php/barsreview), which needs an overhaul.  I’m going to work on this for Matt and hopefully get an updated version in place before the end of May.

Also this week I continued to work for the Speech Star project, adding another batch of videos to the website.  This took a fair amount of time to implement as the videos needed to be added to a variety of different pages.

I then returned to working on the Speak For Yersel follow-on project, focussing on Wales this time.  I needed to generate a top-level ‘regions’ GeoJSON file that amalgamated the area polygons into 22 larger regions.  I did this in QGIS, manually selecting and merging the areas.  With the regions file in place I could then set up the Wales survey using my survey setup script.  This all went pretty smoothly, although there were some inconsistencies between the area GeoJSON file and the settlements spreadsheet that I needed to fix before I could get the survey to set up successfully.

I’d also received feedback about the Republic of Ireland and Northern Ireland surveys, including many changes that needed to be made to the survey questions.  I decided that the easiest way to handle this would be to delete the surveys (which were still only test versions with no real data) and start again with updated question / answer spreadsheets.  There were some other updates that had been suggested that would apply to all surveys and I updated all three sites to incorporate these.  I then spent a bit of time investigating QGIS and whether it could be used to create simpler versions of the region polygons in order to generate a logo for each of our new sites that would be analogous to the Scottish survey.  After discussing this with Jennifer and Mary we agreed that Mary would take this forward, so we’ll hopefully see the outcome next week.

Week Beginning 8th April 2024

I’d taken a day off this week, so only worked four days.  I began the week continuing to work on the Anglo-Norman Dictionary, making a tweak to the publications scripts I was working on last week and then planning a new search of language tags that the editor wanted to be added.  Language tags are at entry level (i.e. they apply to the whole entry) and are used to denote loanwords.

There are only 2,660 entries that currently feature the language tag (and 24,762 that don’t), so the search is going to be fairly limited.  I explored two possible developments.  Firstly, we could have a separate tab on the ‘Advanced Search’ page for ‘Language’, as we do for ‘Semantic & Usage Labels’.  The new tab would work in a similar way to this (see https://anglo-norman.net/search/), with a list of languages and a count of the number of associated entries.  We could either make a search run as soon as a language is clicked on, or we could allow multiple languages to be selected and then joined with Booleans as with the ‘Label’ search.  The latter would be more consistent, but I’m not sure how useful it would be as there aren’t many entries that have multiple language tags (so ‘AND’ and ‘NOT’ would not be so helpful).  I guess ‘OR’ would be more useful, but the user could just perform separate searches as there won’t be a huge number of results anyway.

Secondly, we could add a language selector to the ‘Headwords & Forms’ search tab, underneath ‘Citation date’.  We could provide a list of languages (including a note explaining how the languages are used and that they are not widely applied), with each language appearing as a checkbox (checking multiple will act as ‘OR’).  The language search could then be used on its own (leaving ‘Headword’ and ‘Citation date’ blank) or in conjunction with the other search options.

So, for example, a search could:

  1. Retrieve all of the words of Scandinavian origin
  2. Retrieve all of the words of Scandinavian origin that have a headword / form beginning ‘sc’
  3. Retrieve all of the words of Scandinavian origin that have a headword / form beginning ‘sc’ whose entries feature a citation with a date between 1400 and 1450

After further consultation with the editor, Geert, we decided that I’d start by developing the separate tab option, and we may expand the headword search to incorporate language at a later date, if it’s still considered necessary.

Incorporating a language search is going to mean updating the database and the entry publication scripts (both in the management system and my batch scripts) to extract language data from the entries when they are edited or created. I’ll also need to update the ‘view entry’ page in the DMS so the language data is listed.

My plan of action is to do the following:

  1. Create a new database table that will hold entry IDs, language IDs and whether the entry is a compound.  Where an entry has multiple languages it will have multiple rows in this table.
  2. Write a script that will iterate through the entries, will extract language data and will populate this table
  3. Incorporate the script into the publication workflow and ensure an entry’s language listing is cleared when an entry is deleted prior to a major batch upload.
  4. Update the ‘entry search’ facility in the site’s API to add a new search type for language.  This will accept similar arguments to the existing ‘label’ search type: one or more language IDs, Booleans to be used between the language IDs and  whether the search should be limited to compound words
  5. Add a further endpoint that will return a list of all languages together with a count of the number of entries that feature each language
  6. Update the advanced search page to add in a new ‘Language’ tab.  This will have a similar structure to the ‘Labels’ tab and will feature a list of languages together with counts of associated entries in a scrollable area on the left.  It will be possible to click on a language to add or remove it from a further section on the right of the page where selected languages will be listed.  If multiple languages are selected a drop-down list of Boolean options will appear between each language.  Pressing on the ‘Search’ button after selecting one or more languages will perform a language search.  This will list all corresponding entries in the same way as a Headword search.

I managed to complete the first two tasks, extracting 3,097 language tags and adding these to the database.
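As a rough illustration of the extraction step (the real AND entry XML schema may well differ; the <language> element and its ‘lang’ attribute are assumptions here), a script along these lines would pull out one database row per language tag:

```javascript
// Extract language tags from an entry's XML, producing one row per tag
// to match the new entry-ID / language-ID table design. The element and
// attribute names are hypothetical, not the dictionary's actual schema.
function extractLanguageRows(entryId, entryXml) {
  const rows = [];
  const re = /<language\s+lang="([^"]+)"\s*\/?>/g;
  let m;
  while ((m = re.exec(entryXml)) !== null) {
    rows.push({ entryId: entryId, language: m[1] });
  }
  return rows;
}
```

A batch version would simply loop this over every entry and insert the rows, which is essentially step 2 of the plan above.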

Also this week I had discussions with the Books and Borrowing people about the official launch of the resource that’s taking place in a couple of weeks.  I’m going to be speaking at the launch so I needed to figure out what I should be talking about.  I also returned to the ‘Browse book editions’ page on the website (https://borrowing.stir.ac.uk/books/), which was at this point taking a long time to load.  This is because the page defaults to displaying all book editions in the system that have a title beginning with ‘A’: almost 3,000 books.  I did consider adding pagination to the facility, but I personally find it easier to scroll through a long page rather than flicking between many smaller pages, plus it means a user can use ‘Find’ in their browser to search the listing.  Another option I considered was to limit the default display to a particular genre of book rather than all genres, but I decided that this might confuse people if they don’t notice the limit has been applied.  Instead I set the page not to load a specific letter tab by default.  The tabs load, but to view the content of one of them the user actually has to select it.  This means the page now loads instantaneously and people get to choose what they want to view without a long wait.
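The deferred-tab approach can be sketched as a small client-side loader that only fetches a letter’s content the first time its tab is selected, then keeps it in a cache; this is illustrative, not the site’s actual code:

```javascript
// Build a tab loader around a fetch function (e.g. an AJAX call that
// returns the HTML listing for one letter). Content is only fetched on
// first selection and cached afterwards, so the initial page load stays
// instant no matter how large the listings are.
function makeTabLoader(fetchTabContent) {
  const cache = new Map();
  return function loadTab(letter) {
    if (!cache.has(letter)) {
      cache.set(letter, fetchTabContent(letter));
    }
    return cache.get(letter);
  };
}
```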

Also this week I made some further updates to the Speech Star websites, adding in new ExtIPA animation videos to both the ‘pre’ and ‘post’ 2015 charts.  This was a bit fiddly and took some time but we now have most of the animations in place.  I also exported all of the Historical Thesaurus data for Fraser, as a project needs an up to date copy of it.


Week Beginning 1st April 2024

It was Easter Monday this week, and I’d taken the rest of the week except Friday off as holiday.  I’ll be off for some of next week as well.  After writing my blog post and catching up with emails on Friday I spent the rest of the day working for the Anglo-Norman Dictionary, as the editor was ready to launch a major update to the content – replacing every entry for the letter ‘T’, plus a number of updates to entries in other letters.  I have a documented process that I created for such updates, so thankfully I could follow these instructions (which of course include making backups before changing anything), but it’s still a little scary when thousands of entries are getting deleted and replacement data is generated.

After backing everything up and making all necessary preparations I set the update scripts in motion and thankfully all went smoothly.   The update replaced or created 4,472 dictionary entries, of which 1,805 were ‘main’ entries, with the rest being cross references.  The ‘main’ entries included 3,310 main senses, 1,286 subsenses, 2,958 locutions, 2,060 locution senses and 328 locution subsenses.  Overall the entries included some 15,380 citations.

However, during the afternoon one of the editors encountered a problem when updating one of the new entries via the Dictionary’s online management system.  The script that publishes updates quit midway through with an SQL error, after which the entry could no longer be found on the live site by searching for the headword or variants, and duplicates of the entry appeared in the ‘Browse’ pane.

This was clearly rather concerning, so all work on the dictionary stopped whilst I investigated.  Thankfully I managed to figure out what was going on.  When the entry was subsequently edited some UTF-8 characters had ended up in the attestation dates, but the corresponding columns in the database weren’t set up to store UTF-8.  When the system attempted to insert the characters the database complained and the publication script stopped.  This meant the remaining parts of the script didn’t get a chance to execute, and it was in these parts that older versions of the entry were archived, search terms were generated and everything was tidied up.  As these parts didn’t run, duplicate entries crept in and search terms weren’t working properly.  I therefore updated the database so all columns could handle UTF-8 data and removed the duplicate entries, and thankfully all was working fine again afterwards.
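The failure mode can be illustrated with a small check: a single-byte (latin1-style) column can only hold code points up to U+00FF, so a date string containing, say, an en dash is enough to make the INSERT fail. (In MySQL terms, the fix amounts to converting the affected columns to a UTF-8 character set.)

```javascript
// Returns true if every character in the text fits in a single-byte
// latin1 column (code points up to U+00FF); characters beyond that,
// such as typographic dashes, would cause the database insert to fail.
function fitsInLatin1(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 0xFF) return false;
  }
  return true;
}
```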

Week Beginning 25th March 2024

This was a four-day week as Friday was the Good Friday holiday.  I’ll also be on holiday for all of next week other than Friday, giving me a nice Easter break.  I spent a lot of this week continuing to overhaul the Speech Star website (the version for speech therapists that hasn’t been publicly launched yet).  I’d made a start with this last week but still had many updates to implement.  This included incorporating the IPA and extIPA charts into the website, but ensuring that only the animation videos rather than the MRI scans were included.  I had also been given a selection of new animations that needed to be incorporated into the IPA and extIPA charts on other websites such as Seeing Speech, which meant updating various databases, creating thumbnails and other such tasks.

The new website now features the IPA / extIPA charts and their animation videos, a selection of speech sound videos, such as animations showing the consonants of English and a page featuring ultrasound therapy videos, as the following screenshot shows.  It’s a very different website to how it was a few weeks ago.

Also this week I was in communication with Susan Rennie regarding the old pilot Scots Thesaurus website, which launched almost ten years ago now.  The domain for this website is due to expire soon and Susan is going to point the domain at a different website, with the original website being archived.  This all required some communication with Susan, the University’s Hostmaster and the Research Data Management team, but I think we’re getting there now.

My final major task of this four-day week was to write an instruction manual for setting up a Speak For Yersel linguistic survey website using the tool I recently created.  The 13-page, 3,500-word document provides step-by-step instructions for creating the data structures the tool requires, the setup facility, customisation and exporting data.   It’s hardly the most riveting of reads, but it should be useful for anyone using the tool in future (including myself!).

Week Beginning 18th March 2024

I was off sick from Tuesday to Thursday this week having caught some sort of virus that laid me low.  However, on Monday I managed to complete the migration of all 156 poems in the Anthology of 16th and Early 17th Century Scots Poetry from ancient HTML to TEI XML.  It’s something I’ve been working on since the New Year and it’s great to have finally completed it.  The site is not yet live, though, as I need to wait until I receive feedback from the project PI.  I’ll probably need to update and expand the information in the TEI header of each poem and I’d also like to make the XML files of the poems available as a downloadable ZIP file from the website, ideally under a Creative Commons license to enable researchers to reuse the data.  I’ll also need to put in redirects from the old URLs, plus the PI has some updates to the poems and their glosses that I will need to incorporate.  But until I hear back from him I can’t do anything else for the resource.

When I returned to work on Friday I had an update to the Anglo-Norman Dictionary waiting for me to process.  The editor Geert had updated all entries beginning with ‘Y’, plus a handful of other related entries and I needed to run my scripts to delete the existing entries and replace them with the new versions.  Thankfully all went smoothly.

I then moved onto a significant reworking of the Speech Star website (one of two websites for the project and the one that has not yet been officially launched).  Previously this website featured access to several databases of ultrasound video files, but the new update is going to shift the emphasis to a more focussed set of animation video files, plus incorporating the IPA and extIPA charts and associated videos.  There’s a lot to get through in this update and by the end of the day I had implemented only a part of it, so I’ll continue with this next week.

Week Beginning 11th March 2024

I spent some time this week further tweaking the Speak For Yersel survey tool I’ve been working on recently.  I completed an initial version of the tool last week, using it to publish a test version of the linguistic survey for the Republic of Ireland (not yet publicly available) and this week I ran the data for Northern Ireland through the tool.  As I did so I began to think about the instructions that would be needed at each stage and I also reworked the final stages of the tool.

The final stage previously involved importing the area GeoJSON file and the settlement CSV file in order to associate actual settlements with larger geographical areas and to generate the areas within which survey responses will be randomly assigned a location.  This was actually a two-step process and I therefore decided to split the stage in two: first ensuring the GeoJSON file is successfully parsed and imported, and only then importing the settlement CSV file.  I also created a final ‘setup complete’ stage, as previously the tool didn’t give any feedback that the process had completed successfully.
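The first of the two steps might validate along these lines, assuming standard GeoJSON; the function name and return shape are illustrative:

```javascript
// Parse the uploaded area file and confirm it is a GeoJSON
// FeatureCollection whose features all carry a geometry, before
// anything is written to the database.
function validateAreaFile(jsonText) {
  let doc;
  try {
    doc = JSON.parse(jsonText);
  } catch (e) {
    return { ok: false, error: 'Not valid JSON' };
  }
  if (doc.type !== 'FeatureCollection' || !Array.isArray(doc.features)) {
    return { ok: false, error: 'Not a GeoJSON FeatureCollection' };
  }
  const missing = doc.features.filter(f => !f.geometry);
  if (missing.length > 0) {
    return { ok: false, error: missing.length + ' feature(s) lack geometry' };
  }
  return { ok: true, count: doc.features.length };
}
```

Failing fast here means the settlement CSV step only ever runs against a known-good set of areas.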

With the updates in place, creating the Northern Ireland survey using the tool was a pretty straightforward process, taking a matter of minutes.  I then moved on to creating our third and final survey for Wales, but unfortunately I soon realised that we didn’t have a top-level ‘regions’ GeoJSON file for this survey area.  The ‘regions’ file provides the broadest level of geographical areas, which are visible on the maps when you hover over them.  For example, in the original SFY resource for Scotland there are 14 top-level regions, such as ‘Fife’ or ‘Borders’, with labels that are visible when using the map, such as the ones here: https://speakforyersel.ac.uk/explore-maps/lexical/.

Initially I tried creating my own regions in QGIS, using the area GeoJSON file to group areas by the region given in the settlement CSV files (e.g. ‘Anglesey’).  However, this resulted in around 22 regions, which I think is too many for a survey area the size of Wales: for the Republic of Ireland we have 8 and for Northern Ireland we have 11.  I asked the team about this and they are going to do some investigation, so until I hear back from them Wales is on hold.

I also spent quite a bit of time this week continuing to migrate the Anthology of 16th and Early 17th Century Scots Poetry from ancient HTML to TEI XML.  Previously the poems I’ve been migrating have varied from 14-line sonnets to poems up to around 200 lines in length.  I’ve been manually copying each line into the Oxygen XML editor, as I needed to check and replace ‘3’ characters that had been used to represent yoghs, add in the glosses and check for other issues.  This week I reached the King’s Quair, which compared to the other poems is a bit of an epic, weighing in at over 1,300 lines.  I realised manually pasting in each line wasn’t an option if I wanted to keep my sanity, and therefore I wrote a little jQuery script that extracted the text from the HTML table cells and generated the necessary XML line syntax.  I was then able to run the script, make a few further tweaks to line groups and then paste the whole poem into Oxygen.  This was significantly quicker than manual migration, but I did still need to add in the glosses, of which there were over 200, so it still took some time.  I continued to import other poems using my new method and I really feel like I’ve broken the back of the anthology now: by the end of the week I had completed the migration of 114 poems.  Hopefully I’ll be able to launch the new site before Easter.
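Reduced to a pure function, the conversion looks something like the sketch below; in the browser the lines came from jQuery (e.g. something like $('td').text().split('\n')), and note that blindly swapping every ‘3’ for a yogh is a simplification of what the script did, since in practice each replacement needed checking:

```javascript
// Wrap each non-empty line of plain text in a TEI <l> element,
// substituting '3' (the old HTML's stand-in for yogh) with the real
// character U+021D. Glosses still had to be added by hand afterwards.
function linesToTei(lines) {
  return lines
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => '<l>' + line.replace(/3/g, '\u021D') + '</l>')
    .join('\n');
}
```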

Also this week I began investigating the WCAG accessibility guidelines (https://www.w3.org/WAI/WCAG22/quickref/?versions=2.1) after we received a request about an accessibility statement for the Anglo-Norman Dictionary.  I spoke to a few people in the University who have used accessibility tools to validate websites and managed to perform an initial check of the AND website, which held up pretty well.  I’m intending to look through the guidelines and tools in greater detail and hopefully update the sites I manage to make them more accessible after Easter.

Also this week I spoke to Susan Rennie about transferring ownership of the Scots Thesaurus domain to her after the DNS registration expires in April, added some statements to a couple of pages of the Books and Borrowing website referencing the project’s API and giving some information about it, and spoke to B&B project PI Katie Halsey about creating a preservation dataset for the project and depositing it with a research repository.

Week Beginning 4th March 2024

I continued to work on the new Speak For Yersel survey creation tool this week, using the Republic of Ireland as my test area.  I managed to complete the ‘maps’ pages, which let users view all of the survey data on interactive maps.  We’re still testing the new system so there’s not much data to actually view, but below is a screenshot showing one of the maps:

I also tweaked things to make it possible to cite / share / bookmark a specific map.  This does mean that each time you select a different map the entire page needs to reload (as opposed to replacing the map only) but I think it’s handy to be able to link to specific maps and reloading the page is not a massive issue.  The ‘attribution and copyright’ link in the bottom right of the map also now works, with the content of this being set in the config file so it’s easy to change.
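The citable-map idea amounts to encoding the current selection in the URL’s query string; the parameter names here are illustrative, not necessarily what the site uses:

```javascript
// Build a shareable URL for a specific map, and read one back.
// Using the query string means a citation / bookmark reproduces the
// exact map, at the cost of a full page reload when switching maps.
function mapUrl(baseUrl, surveyId, questionId) {
  const params = new URLSearchParams({ survey: surveyId, question: questionId });
  return baseUrl + '?' + params.toString();
}

function mapFromUrl(search) {
  const params = new URLSearchParams(search);
  return { survey: params.get('survey'), question: params.get('question') };
}
```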

I then moved onto the stats and data download page for project staff.  As with SFY, you can specify a period to which the stats and data are limited, and the number of users and answers for each survey (within and beyond the area of study) are listed.  Unlike SFY, there are also options to download this data as CSV files.  You can download the users, or you can download the answers for a chosen survey, with options for downloading data for users within the area of study or outside it.  Answer downloads also include fields about the user (and the question).  Here’s a screenshot of how the interface looks:

That’s the tool pretty much completed now and I just need to see if there’s any feedback from the project team before I create similar resources for Northern Ireland and Wales using the tool.

I spent much of Monday going through the content management system for the Saints Places website, fixing errors that had appeared as a result of the migration of the site to a new server earlier in the year.  I didn’t create this site or its CMS, but have inherited responsibility for it.  The site (https://saintsplaces.gla.ac.uk/) launched more than ten years ago, with the project officially ending in 2013, but a researcher is still making updates to it so I agreed to get things working again.  It took somewhat longer than I’d expected but at least it’s done now.  Having said that, it’s possible more errors lurk in places I’ve not been able to fully test, so we’ll just need to see.

On Tuesday I participated in a meeting with Calum McMillan about ‘Change management’ in IT services at the University.  This is about tracking what changes are made to IT systems, ensuring people are informed and strategies are in place in the event of changes failing.  It was very interesting to hear more about this and I hope Arts developers will be kept informed of any changes that will be made to the servers on which our sites are hosted.

I had a further meeting on Friday with Matthew Creasy and his RA Jennifer Alexander about the James Joyce conference website (https://ijjf2024.glasgow.ac.uk/).  I helped out with a couple of technical issues, gave some advice on how to present some materials and made a couple of minor tweaks to the website interface.  I also updated the Seeing Speech (https://www.seeingspeech.ac.uk/), Dynamic Dialects (https://www.dynamicdialects.ac.uk/)  and Speech Star (https://www.seeingspeech.ac.uk/speechstar/) websites to add in licensing and copyright statements.

I also returned to the new data and sparkline facilities for the Dictionaries of the Scots Language.  I’d developed these features last year and had been waiting for feedback from the DSL people, which finally came last week.  One thing they had agreed on was that we should limit the start date of the sparklines to 1375.  When I refamiliarised myself with the work I’d previously done on the sparklines several months ago I realised I still needed to regenerate the data to make the DOST sparklines begin in 1375, and I also realised that the data displayed on our test server was not the current version (featuring the 50-year cut-off and SND extended to 2005) and that this data only existed on my laptop.  I then had a moment of panic when I realised I’d deleted data from my laptop since I completed work on the date features last year, but thankfully I managed to reinstate it.

I also realised that the way dates outside the sparkline’s range are handled will need to be changed.  Currently, any dates beyond the scope of the sparkline result in the sparkline being ‘on’ at its start or end date, to show that the entry is attested beyond the sparkline’s range.  Dates before / after the sparkline’s start / end date then become the start / end date and are not individually displayed in the sparkline text, and the new start date is treated as the actual start date for the purposes of the 50-year rule (which generates blocks of attestation where individual dates fall within 50 years of each other).

This currently happens on our test server for an SND entry where the earliest citation date is 1568 and the second earliest is 1721.  When the sparkline data is generated 1568 becomes 1700 (the start year for SND) and, as the gap between this and the next citation is less than the 50-year threshold, the sparkline displays a block from 1700-1721.  The ‘dates of attestation’ hover-over and in-page text then display ‘1700-1721’, which is not at all accurate.

We need to have some kind of line on the sparkline at the start / end to demonstrate an entry’s dates of attestation continue beyond the scope, so for the above example there must be a line at 1700, even though this actually represents the year 1568.  However, such a line needs to be flagged in the system as to be ignored for the purposes of building the blocks so that when the system finds the next date (1721) it compares this to the original date (1568) and not the line created at 1700.  The system would therefore not generate a block from 1700-1721 but instead would create an individual line at 1721 (with an individual line at 1700 as well).

We will also need to ensure that the sparkline text includes all attestations before the sparkline’s start year rather than bundling these all up as ‘1700’.  Actually, I guess there are two options here:  We could just bundle them all up and display ‘<1700’ or we could display them all.  It depends how verbose we want to be and how important it is to list all of the dates beyond the scope of the sparkline.  For the example above the text would either start ‘<1700, 1721, 1773-1825’ or it would start ‘1568, 1721, 1773-1825’.  The same process would also need to happen at the end of the sparkline too.  This needs further discussion with the team before I proceed further.
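The proposed fix can be sketched as follows: blocks are built from the original citation dates (so the 50-year rule never sees the clamped boundary value), and clamping to the sparkline’s range only happens when deciding where each segment is drawn. The names and output structure are illustrative.

```javascript
// Group citation years into sparkline segments. Runs of years within
// maxGap of each other (using the REAL years) become blocks; isolated
// years become single lines. Years outside [rangeStart, rangeEnd] are
// clamped to the boundary only for display, so an out-of-range date
// shows as a line at the boundary rather than merging into a block.
function buildSegments(dates, rangeStart, rangeEnd, maxGap = 50) {
  const sorted = [...dates].sort((a, b) => a - b);
  const clamp = y => Math.min(Math.max(y, rangeStart), rangeEnd);
  const segments = [];
  let runStart = null, runEnd = null;
  const flush = () => {
    if (runStart === null) return;
    const s = clamp(runStart), e = clamp(runEnd);
    segments.push(s === e ? { type: 'line', year: s }
                          : { type: 'block', start: s, end: e });
  };
  for (const year of sorted) {
    if (runStart === null) {
      runStart = runEnd = year;
    } else if (year - runEnd <= maxGap) {
      runEnd = year; // extend the current run using the real dates
    } else {
      flush();
      runStart = runEnd = year;
    }
  }
  flush();
  return segments;
}
```

With the SND example, 1568 is drawn as a line at 1700 but, because the grouping uses 1568 itself, the 1721 citation is no longer merged into a spurious 1700-1721 block.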

Despite this issue needing further consideration, I did manage to make a number of other updates to the test site based on the team’s feedback.  This included updating the CSV download on the search results page to add in the search criteria, changing the wording of the advanced search results page, removing date filter options from the search results pages in many situations (e.g. quotation searches) and changing how the filter options are displayed, moving from an always visible approach to a drop-down section that appears only when a user chooses to open it.

Finally this week I continued to work through the Anthology of 16th and Early 17th Century Scots Poetry and I’m now slightly over half-way through the migration of the ancient HTML to TEI XML.

Week Beginning 26th February 2024

I spent much of Monday this week continuing to migrate the Anthology of 16th and Early 17th Century Scots Poetry to TEI XML.  I’ve now migrated 43 poems, so I think I’m about a third of the way there.  It’s still going to take quite some time, but I’ll just keep tackling a few when I find the time, until they’re all done.

I spent the rest of the week continuing to work on the new Speak For Yersel regions.  I’m starting with the Republic of Ireland as my test area for the development of a more generic tool that can then be used to create similar surveys in other geographical areas.  Last week my data import script spotted a few errors with the geographical data for ROI and thankfully the team was able to address these and get an updated dataset to me this week.  When I ran my setup script again the areas (used for deciding where a user’s answer marker should appear on the maps) imported successfully and I could then move onto developing the front-end.

Developing the front-end has involved rationalising the code I’d originally written for the Speak For Yersel website (https://speakforyersel.ac.uk/).  This website has a broader focus than the new surveys will have, involving two additional survey types, plus quizzes and additional activities.  It was also very much an active research project and the code I wrote needed to be updated significantly as the project developed and our ideas and understanding changed.  This meant the code ended up a bit tangled and messy.

My task in developing the front-end for the new survey areas was to extract only those parts of the code that were relevant to the three survey types that will be included and to rationalise it – straightening it out and making it less tangled.  The new code also had to work with configuration variables set during the creation of the resource (e.g. database table prefixes, site subdirectories).

I began developing the scripts to handle the layout and the connections to the config options and the database, generating placeholder pages for things like the homepage and the ‘About’ pages.  I then developed the user registration facility, which I connected to the ROI geographical data, enabling a user to begin typing the name of their settlement and see possible matches in a drop-down list.  Users are then saved in the database and stored using HTML5 Local Storage, including the GeoJSON shape associated with their chosen location, enabling markers to be generated within this area at random locations each time the user answers a survey question.  The screenshot below shows the user details registered for a test user I created:
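The ‘random location within the user’s area’ idea can be sketched with rejection sampling over the polygon’s bounding box plus a ray-casting point-in-polygon test; this is an illustration, not the site’s actual code, and it handles a single GeoJSON-style ring of [x, y] pairs:

```javascript
// Standard ray-casting test: is the point (x, y) inside the ring?
function pointInRing(x, y, ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i], [xj, yj] = ring[j];
    if ((yi > y) !== (yj > y) &&
        x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Pick a random point inside the ring: sample uniformly from the
// bounding box and retry until the sample falls inside the polygon.
function randomPointInRing(ring) {
  const xs = ring.map(p => p[0]), ys = ring.map(p => p[1]);
  const minX = Math.min(...xs), maxX = Math.max(...xs);
  const minY = Math.min(...ys), maxY = Math.max(...ys);
  for (;;) {
    const x = minX + Math.random() * (maxX - minX);
    const y = minY + Math.random() * (maxY - minY);
    if (pointInRing(x, y, ring)) return [x, y];
  }
}
```

Running this each time a question is answered gives every answer its own marker position while keeping the user’s actual settlement private.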

I then began working on the display of the surveys, completing the scripts that list the available surveys, display the intro text for a selected survey, load and display a question for each survey type and handle the progress bar.  I also ensured that images can be displayed for questions in each survey type (if supplied) and have ensured that any associated audio files will play.

The screenshot below shows a ‘phonology’ question, with an associated image, just for test purposes (it’s of no relevance to the question).  I also decided to move the ‘question number’ to the top right of the main pane, which I think helps declutter things a bit (for SFY it was below the progress bar and above the question).  I also made the audio ‘play’ button a fixed width as previously the width was slightly different depending on whether ‘play’ or ‘stop’ was displayed, which made the adjacent button jump slightly during playback.

With this in place I then moved onto the processing of submitted answers.  This has included saving all answers (including where multiple answer options are allowed), displaying the maps, dealing with map filters (e.g. respondent age group) and loading the next question, as you can see in the following screenshot:

I also created the ‘survey complete’ page and have ensured that the system logs which surveys a user has completed (including adding a tick to the survey button of completed surveys on the survey index page).  I still need to create the maps page, add in the map attribution popup and develop the staff page with CSV download options, which I will start on next week.

Week Beginning 19th February 2024

On Monday this week I addressed a couple of issues with the Books and Borrowing search that had been identified last Friday.  Multi-word searches were not working as intended and were returning far too many results.  The reason, as mentioned last week, was that a search for ‘Lord Byron’ (without quotes) searched the specified field for ‘Lord’ and then all fields for ‘Byron’.  It was rather tricky to think through this issue as multi-word searches surrounded by quotes need to be treated differently, as do multi-word searches that contain a Boolean.  We don’t actually mention Booleans in the search help, but AND, OR and NOT (which must be upper-case) can be used in the search fields.

I wrote a new function that hopefully sorts out the search strings as required, but note that search strings containing multiple sets of quotes are not supported as this would be much more complicated to sort out and it seemed like a bit of an edge case.  This new function has been applied to all free-text search fields other than the quick search, which is set to search all fields anyway.  After running several tests I made the update live, and now searching author surnames for ‘Lord Byron’ finds no results, which is as it should be.
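The gist of the new function can be sketched hypothetically in Python (the site itself is PHP, and the field syntax here is illustrative of Solr-style queries rather than copied from the project code): an unquoted multi-word search becomes an explicit AND of the individual words, all scoped to the one field, while quoted phrases and upper-case Booleans pass through.

```python
def build_field_query(field, search):
    """Turn a user search string into a Solr-style query against a single field.

    A single quoted phrase and upper-case Booleans (AND, OR, NOT) pass through;
    a plain multi-word search becomes an AND of the individual words, all scoped
    to the given field.  Multiple sets of quotes are not supported.
    """
    search = search.strip()
    if search.startswith('"') and search.endswith('"'):
        # A quoted phrase: search the field for the exact phrase.
        return f'{field}:{search}'
    words = search.split()
    if any(w in ('AND', 'OR', 'NOT') for w in words):
        # Boolean search: scope each non-operator word to the field.
        parts = [w if w in ('AND', 'OR', 'NOT') else f'{field}:{w}' for w in words]
        return ' '.join(parts)
    # Plain multi-word search: require every word in the field.
    return ' AND '.join(f'{field}:{w}' for w in words)
```

This is why a surname search for ‘Lord Byron’ now correctly finds nothing: both words must match the surname field.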

Here are some examples that do return content.


  1. If you search book titles for ‘rome’ you currently find 2046 records:

  2. If you search book titles for ‘rome popes’ you currently find 100 records (as this is the equivalent of searching book titles for ‘rome’ AND ‘popes’):

  3. Using the Boolean ‘AND’ gives the same results:

  4. A search for ‘rome OR popes’ currently returns 2046 records, presumably because all book titles containing ‘popes’ also contain ‘rome’ (at least I hope that’s the case):

  5. A search for ‘rome NOT popes’ currently brings back 1946 records:

  6. And searches for a full quoted string also work as intended, for example a search for “see of rome”:

With this update in place I then slightly changed the structure of the Solr index to add a new ‘copy’ field that stores publication place as a string, rather than text.  This is then used in the facets, ensuring the full text of the place is displayed rather than being split into tokens.  I then regenerated the cache and asked the helpful IT people in Stirling to update this on the project’s server.  Once the update had been made everything then worked as it should.
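For reference, a Solr ‘copy’ field of this kind is declared in the schema roughly as follows (the field names here are illustrative, not the project’s actual ones): the tokenised `text` field remains searchable, while the untokenised `string` copy keeps the full place name intact for faceting.

```xml
<!-- Hypothetical field names: a 'string' copy of the tokenised place field,
     so facets display the full place name rather than individual tokens. -->
<field name="place_of_publication" type="text_general" indexed="true" stored="true"/>
<field name="place_of_publication_str" type="string" indexed="true" stored="true"/>
<copyField source="place_of_publication" dest="place_of_publication_str"/>
```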

Also this week I exported all of the dictionary entries beginning with ‘Y’ for the Anglo-Norman Dictionary as these are now being overhauled.  I also fixed an issue with the Saints Places website – a site I didn’t develop but I’m responsible for now.  A broken query was causing various errors to appear.  The strange thing is the broken query must have been present for years, but presumably it was previously failing silently, while the server on which the site now resides must be stricter.

I spent the rest of the week developing the tool for publishing linguistic surveys for the Speak For Yersel project.  I’d spent a lot of time previously working with the data for the three new linguistic areas, ensuring it was consistently stored so that a tool could be generated to publish the data for all three areas (and more in future).  This week I began developing the tool.  I spent the week developing a ‘setup’ script that would run in a web browser and allow someone to create a new survey website – specifying one or more survey types (e.g. Phonology, Morphology) and uploading sets of questions and answer options for each survey type.  The setup script then provides facilities to integrate the places that the survey area will use to ascertain where a respondent is from and where a marker corresponding to their answer should be located.  This data includes both GeoJSON data and CSV data, both of which need to be analysed by the script and brought together in a relational database.  It took most of the week to create all of the logic for processing all of the above, as you can see in the screenshot of one of the stages below:

As I developed the script I realised further tweaks needed to be made to the data, including the addition of a ‘map title’ field that will appear on the map page.  Other fields such as the full question were too verbose to be used here.  I therefore had to update the spreadsheets for all three survey types in all three areas to add this field in.  Similarly, one survey allowed more than one answer to be selected for a few questions, with differing numbers of answers being permitted.  I therefore had to update the spreadsheets to add in a new ‘max answers’ column that the system will then use to ascertain how the answer options should be processed.

I also needed to generate new regions for the Republic of Ireland survey.  Last week I’d created regions using QGIS based on the ‘county’ stored in the GeoJSON data.  This resulted in 26 regions, which is rather more than we used for Scotland.  The team reckoned this was too many and they decided to use a different grouping called NUTS regions (see https://en.wikipedia.org/wiki/NUTS_statistical_regions_of_Ireland).  A member of the team updated the spreadsheet to include these regions for all locations and I was then able to generate GeoJSON data for these new regions using QGIS following the same method as I documented last week.

When processing the data for the Republic of Ireland (which I’m using as my first test area) my setup script spotted some inconsistencies between the data as found in the GeoJSON files and the spreadsheets.  I’ve passed these on to the team who are going to investigate next week.  I’ll also continue to develop the tool next week, moving onto the development of the front-end now that the script for importing the data is more-or-less complete.
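The kind of cross-checking the setup script performs when bringing the GeoJSON and CSV data together can be sketched in Python (a hypothetical simplification: the real tool writes to a relational database, and the field names here are illustrative).  Each settlement row is matched to its area polygon, and mismatches on either side are flagged for the team.

```python
import csv
import json

def link_areas(geojson_path, csv_path, key='Area'):
    """Match settlement rows from a CSV against area polygons in a GeoJSON file,
    returning the linked records and any inconsistencies found on either side."""
    with open(geojson_path) as f:
        features = json.load(f)['features']
    areas = {feat['properties'][key]: feat['geometry'] for feat in features}
    linked, missing = [], []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            if row[key] in areas:
                linked.append({**row, 'geometry': areas[row[key]]})
            else:
                missing.append(row[key])  # settlement refers to an unknown area
    unused = set(areas) - {r[key] for r in linked}  # areas with no settlements
    return linked, missing, sorted(unused)
```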

Week Beginning 12th February 2024

I’d taken Monday off this week and on Tuesday I continued to work on the Speak For Yersel follow-on projects.  Last week I started working with the data and discovered that it wasn’t stored in a particularly consistent manner.  I overhauled the ‘lexis’ data and this week I performed a similar task for the ‘morphology’ and ‘phonology’ data.  I also engaged in email conversations with Jennifer and Mary about the data and how it will eventually be accessed by researchers, in addition to the general public.

I then moved on to looking at the GeoJSON data that will be used to ascertain where a user is located and which area the marker representing their answer should be randomly positioned in.  Wales was missing its area data, but thankfully Mary was able to track it down.

For Speak For Yersel we had three levels of locations:

Location:  The individual places that people can select when they register (e.g. ‘Hillhead, Glasgow’).

Area: The wider area that the location is found in.  We store GeoJSON coordinates for these areas and they are then used as the boundaries for placing a random marker to represent the answer of a person who selected a specific location in the area when they registered.  So for example we have a GeoJSON shape for ‘Glasgow Kelvin’ that Hillhead is located in.  Note that these shapes are never displayed on any maps.

Region: The broader geographical region that the area is located in.  These are the areas that appear on the maps (e.g. ‘Glasgow’ or ‘Fife’) and they are stored as GeoJSON files.
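For illustration, the three-level hierarchy could be modelled as follows (a hypothetical Python sketch; the project stores these in a database rather than as classes):

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """Broad region shown on the maps (e.g. 'Glasgow'); stored as GeoJSON."""
    name: str
    geojson: dict = field(default_factory=dict)

@dataclass
class Area:
    """Wider area used only as a boundary for random marker placement; never drawn."""
    name: str
    region: Region
    geojson: dict = field(default_factory=dict)

@dataclass
class Location:
    """Individual place a user can pick at registration (e.g. 'Hillhead, Glasgow')."""
    name: str
    area: Area
```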

For the new areas we didn’t have the ‘region’ data.  I therefore did some experimenting with the QGIS package and I found a way of merging areas to form regions, as the following screenshot demonstrates:

I was therefore able to create the necessary region shapes myself using the following method:

  1. I opened the GeoJSON file in QGIS via the file browser and added the OpenStreetMap XYZ layer in ‘XYZ Tiles’, ensuring this was the bottom layer in the layer browser.
  2. In the layer styling right-hand panel I selected the ‘ABC’ labels icon and chose ‘County’ as the value, meaning the county names are displayed on the map.
  3. In the top row of icons I selected the ‘Select Features by area or single click’ icon (the 23rd icon along in my version of QGIS).
  4. I could then ‘Ctrl+click’ to select multiple areas.
  5. I then selected the ‘Vector’ menu, then ‘Geoprocessing’, then ‘Dissolve’.
  6. In the dialog box I had to press the green ‘reload’ icon to make the ‘Selected features only’ checkbox clickable, then I clicked it.
  7. I then pressed ‘Run’, which created a new, merged shape.
  8. The layer then needed to be saved using the layer browser in the left panel.
  9. This gave me separate GeoJSON files for each region, but I was then able to merge them into one file by opening the ‘Toolbox’ via the cog icon in the top menu bar, searching for ‘merge’, opening ‘Vector general’ -> ‘Merge vector layers’, selecting the input layers, ensuring the destination CRS is WGS84, then entering a filename and running the script to merge all layers.
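Step 9’s merge can also be done outside QGIS: combining GeoJSON files into a single FeatureCollection is straightforward in plain Python (a sketch, assuming the inputs are valid GeoJSON already in the same CRS):

```python
import json

def merge_geojson(paths, out_path):
    """Combine several single-region GeoJSON files into one FeatureCollection,
    the pure-Python equivalent of QGIS's 'Merge vector layers' step."""
    merged = {'type': 'FeatureCollection', 'features': []}
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        # Accept either a full FeatureCollection or a single Feature.
        if data.get('type') == 'FeatureCollection':
            merged['features'].extend(data['features'])
        else:
            merged['features'].append(data)
    with open(out_path, 'w') as f:
        json.dump(merged, f)
    return merged
```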

I was then able to edit / create / delete attributes for each region area by pressing on the ‘open attribute table’ icon in the top menu bar.  It’s been a good opportunity to learn more about QGIS and next week I’ll begin updating the code to import the data and set up the systems.

Also this week I created an entry for the Books and Borrowing project on this site (see https://digital-humanities.glasgow.ac.uk/project/?id=160).  On Friday afternoon I also investigated a couple of issues with the search that Matt Sangster had spotted.  He noticed that an author surname search for ‘Byron’ wasn’t finding Lord Byron, and entering ‘Lord Byron’ into the surname search was bringing back lots of results that didn’t have this text in the author surname.

It turned out that Byron hadn’t been entered into the system correctly and was in as forename ‘George’, surname ‘Gordon’ with ‘Lord Byron’ as ‘othername’.  I’ll need to regenerate the data once this error has been fixed.  But the second issue, whereby an author surname search for ‘Lord Byron’ was returning lots of records, is a strange one.  This would appear to be an issue with searches for multiple words and unfortunately it’s something that will need a major reworking.  I hadn’t noticed previously, but if you search for multiple words without surrounding them with quotes, Solr searches the first word against the specified field and the remaining words against all fields.  So a surname search for ‘Lord Byron’ becomes “surname ‘Lord’ OR any field ‘Byron’”, whereas what the query should be doing is “surname ‘Lord’ AND surname ‘Byron’”.  This will probably affect all free-text fields.  I’m going to have to update the search to ensure multi-word searches without quotes are processed correctly, which will take some time; I’ll try to tackle it next week.  I also need to create a ‘copy’ field for place of publication as this is being tokenised in the search facet options.  So much for thinking my work on this project was at an end!

Also this week I spent many hours going through the Iona map site to compile a spreadsheet listing all of the text that appears in English in order to make the site multilingual.  There is a Gaelic column in the spreadsheet and the plan is that someone will supply the appropriate forms.  There are 157 separate bits of text, with some being individual words and others being somewhat longer.  By far the longest is the content of the copyright and attribution popup, although we might also want to change this as it references the API which might not be made public.  We might also want to change some of the other English text, such as the ‘grid ref’ tooltip that gives as an example a grid reference that isn’t relevant to Iona.  I’ll hold off on developing the multilingual interface until I’m sure the team definitely want to proceed with this.

Finally this week I continued to migrate some of the poems from the Anthology of 16th and Early 17th Century Scots Poetry to TEI XML.  It’s going to take a long time to get through all of them, but progress is being made.