Week Beginning 19th June 2023

I continued to work for the Books and Borrowing project this week, switching the search facilities over to use a new Solr index that includes author gender.  It is now possible to incorporate author gender into searches, for example bringing back all borrowing records involving books written by women.  This will be a hugely useful feature.  I also fixed an issue with a couple of page images of a register at Leighton library that weren’t displaying.

The rest of my time this week was spent developing a new Bootstrap powered interface for the project’s website, which is now live (https://borrowing.stir.ac.uk/).  You’d struggle to notice any difference between this new version and the old one as the point of creating this new theme was not to change the look of the website but to make Bootstrap (https://getbootstrap.com/) layout options available to the dev site.  This will allow me to make improvements to the layout of things like the advanced search forms.  I haven’t made any such updates yet, but that is what I’ll focus on next.

It has taken quite a bit of time to get the new theme working properly – blog posts with ‘featured images’ that replace the site’s header image proved to be particularly troublesome to get working – but I think all is functioning as it should be now.  There are a few minor differences between the new theme and the old one.  The new theme has a ‘Top’ button that appears in the bottom right when you scroll down a long page, which is something I find useful.  The drop-down menus in the navbar look a bit different, as does the compact navbar shown on narrow screens.  All pages now feature the sidebar whereas previously some (e.g. https://borrowing.stir.ac.uk/libraries/) weren’t showing it.  Slightly more text is shown in the snippets on the https://borrowing.stir.ac.uk/project-news/ and other blog index pages.  Our title font is now used for more titles throughout the site.  I’ve also added in a ‘favicon’ for the site, which appears in the browser tab.  It’s the head of the woman second from the right in the site banner, although it is a bit indistinct.  My first attempt was the book held by the woman in the middle of the banner but this just ended up as a beige blob.

Next week I’ll update the layout of the dev site pages to use Bootstrap.  I’m going to be on holiday the week after next and at a conference the week after that so this might be a good time to share the URL for feedback, as other than adding in book genre when this is available everything else should pretty much be finished.

For the Anglo-Norman Dictionary this week I participated in a conference call to discuss collaborative XML editing environments.  The team are wanting to work together directly on XML files and to have a live preview of how these changes appear.  The team are investigating https://www.fontoxml.com/ and also https://paligo.net/ and https://www.xpublisher.com/en/xeditor.  However, none of these solutions give any mention whatsoever of pricing on their websites, which is incredibly frustrating and off-putting.  I also mentioned the DPS system that the DSL uses (https://www.idmgroup.com/content-management/dps-info.html).  We’ll need to give this some further thought.

I also spent a bit of time writing a script to extract language tags from the data.  The script goes through each ‘active’ entry in the online database, picks out all of the language tags from the live entry XML and stores each language and the number of times it appears in the entry (across all senses, subsenses and locutions).  It does the same for the ‘dms_entry_old’ XML data (i.e. the data that was originally stored in the current system before any transformations or edits were made) for each of these ‘active’ entries (if the XML data exists) and similarly stores each language and frequency as ‘old’ languages.  In addition, the script goes through each of the ‘R’ XML files and picks out all language tags contained in them, adding these to the list of ‘old’ languages.  For each ‘active’ entry that has at least one ‘live’ or ‘old’ language the script exports the slug and the ‘live’ and ‘old’ languages, consisting of each language found and the number of times it was found in the entry.  This data is then saved in a spreadsheet.
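
To illustrate the counting step described above, here is a minimal Python sketch.  It is not the actual script: the tag name ‘language’, the ‘lang’ attribute and the sample entry are all assumptions made purely for illustration, and the real entry XML may store languages differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_languages(entry_xml, tag="language", attr="lang"):
    """Count how often each language occurs anywhere in one entry's XML."""
    root = ET.fromstring(entry_xml)
    # iter() walks the whole tree, so senses, subsenses and locutions are all covered.
    return Counter(el.get(attr, "").strip() for el in root.iter(tag))


# Hypothetical entry XML, for illustration only.
sample = (
    '<entry><sense><language lang="M.E."/></sense>'
    '<sense><subsense><language lang="Lat."/></subsense>'
    '<language lang="M.E."/></sense></entry>'
)
live_counts = count_languages(sample)
```

Running the same function over the ‘old’ XML for an entry would give a second set of counts that could be exported alongside the ‘live’ ones.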

There are 1908 entries that will need to be updated and this update will consist of removing all language tags from each sense / subsense in each listed entry, adding a new language tag at entry level (probably below the <pos> tag) for each distinct language found, updating the DTD to make the newly positioned tags valid and updating the XSLT to ensure the new tags get displayed properly in the web pages.
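
The restructuring described above could be sketched roughly as follows in Python.  This is only an illustration under several assumptions: that language tags are elements named ‘language’ with a hypothetical ‘lang’ attribute, that <pos> is a direct child of the entry element, and that the removed tags carry no trailing text that needs to be kept.

```python
import xml.etree.ElementTree as ET


def hoist_languages(entry_xml, tag="language", attr="lang"):
    """Remove nested language tags and re-add one per distinct language after <pos>."""
    root = ET.fromstring(entry_xml)
    found = []
    # Materialise the element list first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                lang = child.get(attr, "")
                if lang not in found:
                    found.append(lang)
                parent.remove(child)
    # Locate <pos> among the entry's direct children (-1 if absent, so the
    # new tags are then inserted at the start of the entry).
    children = list(root)
    pos_index = next((i for i, c in enumerate(children) if c.tag == "pos"), -1)
    for offset, lang in enumerate(found, start=1):
        root.insert(pos_index + offset, ET.Element(tag, {attr: lang}))
    return ET.tostring(root, encoding="unicode")


# Hypothetical entry XML, for illustration only.
sample = (
    '<entry><pos>s.</pos>'
    '<sense><language lang="M.E."/><def>a</def></sense>'
    '<sense><language lang="Lat."/><language lang="M.E."/></sense></entry>'
)
hoisted = hoist_languages(sample)
```

The DTD and XSLT changes mentioned above would still be needed separately, as the sketch only touches the entry XML itself.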

I also began to think about how I’ll implement date / part of speech searches and sparklines in the Dictionaries of the Scots Language and have started writing a requirements document for the new features.  We had previously discussed adding the date search and filter options to quotations searches only, but subsequent emails from the editor suggest that these would be offered for other searches too and that in addition we would add in a ‘first attested’ search / filter option.

The quotation search will look for a term in the quotations, with each quotation having an associated date or date range.  Filtering the results by date will then narrow the results to only those quotations that have a date within the period specified by the filter.  For example, a quotation search limited to SND for the term ‘dreich’ will find 113 quotations.  Someone could then use the date filter to enter the years 1900-1950 which will then limit the quotations to 26 (those that have dates in this period).
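
As a rough sketch of the date filter just described (in Python, and not the eventual implementation, which is likely to be handled by the search index), a quotation can be kept when its date or date range overlaps the chosen period.  Treating a date range as matching on overlap rather than full containment is an assumption, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass
class Quotation:
    text: str
    year_from: int
    year_to: int  # equal to year_from when the quotation has a single date


def filter_by_date(quotations, start, end):
    """Keep quotations whose date (or date range) overlaps start..end inclusive."""
    return [q for q in quotations if q.year_from <= end and q.year_to >= start]


# Placeholder quotations, for illustration only.
quotes = [
    Quotation("a", 1721, 1721),
    Quotation("b", 1925, 1925),
    Quotation("c", 1940, 1960),
    Quotation("d", 1986, 1986),
]
filtered = filter_by_date(quotes, 1900, 1950)
```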

At the moment I’m not sure how a ‘first attested’ filter would work for a quotation search.  A ‘first attested’ date is something that would be stored at entry rather than quotation level and would presumably be the start date of the earliest quotation.  So for example the SND entry for ‘Dreich’ has an earliest quotation date of 1721 and we would therefore store this as the ‘first attested’ date for this entry.

This could be a very useful filter for entry searches and although it could perhaps be useful in a quotation search it might just confuse users.  E.g. the above search for the term ‘Dreich’ in SND finds 113 quotations.  A ‘first attested’ filter would then be used to limit these to quotations associated with entries that have a first attested date in the period selected by the user.  So for example if the user enters 1700-1750 in the ‘Dreich’ results then the 113 quotations would be limited to those belonging to entries that were first attested in this period, which would include the entry ‘Dreich’.  But the listed quotations would still include all of those for the entry ‘Dreich’ that include the search term ‘Dreich’, not just those from 1700-1750, because the limit was placed on entries with a first attested date in that period and not on quotations found in that period.

In addition, the term searched for would not necessarily appear in the quotation that gave the entry its first attested date.  An entry can only have one first attested date and (in the case of a quotation search) the results will only display quotations that feature the search term, which will quite possibly not include the earliest quotation.  A search for quotations featuring ‘dreich’ in SND will not return the earliest quotation for the entry SND ‘Dreich’ as the form in this quotation is actually ‘dreigh’.
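
The potential for confusion can be shown with a small Python sketch.  The data structure and function names are hypothetical, and apart from the 1721 date and the form ‘dreigh’ mentioned above the quotations and years are made-up placeholders.

```python
def first_attested(entry):
    """Earliest quotation year for an entry (stored at entry level)."""
    return min(year for year, _ in entry["quotations"])


def quotation_search(entries, term, attested_from=None, attested_to=None):
    """Return quotations containing term, optionally limited by the entry's first attested date."""
    results = []
    for entry in entries:
        if attested_from is not None and not (
            attested_from <= first_attested(entry) <= attested_to
        ):
            continue
        # The filter acts on the entry, so every matching quotation is kept,
        # whatever its own date.
        results += [
            (entry["headword"], year, text)
            for year, text in entry["quotations"]
            if term in text.lower()
        ]
    return results


# Placeholder data: only the 1721 'dreigh' quotation reflects the text above.
entries = [
    {
        "headword": "Dreich",
        "quotations": [
            (1721, "... dreigh ..."),
            (1825, "... dreich ..."),
            (1930, "... dreich ..."),
        ],
    }
]
hits = quotation_search(entries, "dreich", 1700, 1750)
```

Here the entry passes the 1700-1750 ‘first attested’ filter, yet neither of the returned quotations dates from that period and the 1721 quotation that supplied the date is not in the results at all.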

If we do want to offer date searching / filtering for all entry searches and not just quotation searches we would also have to consider whether we would then just store the dates of the earliest and last quotations to denote the ‘active’ period for the entry or whether we would need to take into account any gaps in this period as will be demonstrated by the sparklines.  If it’s the former then the ‘active’ period for SND ‘Dreich’ would be 1721-2000, so someone searching the full text of entries for the term ‘dreich’ and then entering ‘1960-1980’ as a ‘use’ filter will then still find this entry.  If it’s the latter then this filter would not find the entry as we don’t have any quotations between 1954 and 1986.
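
The difference between the two approaches can be sketched in Python as follows.  The function names are hypothetical and the list of years is abbreviated to the dates mentioned above, with the quotations in between left out.

```python
def active_span_matches(years, start, end):
    """Former approach: only the earliest and latest quotation years are stored."""
    return min(years) <= end and max(years) >= start


def attested_between(years, start, end):
    """Latter approach: at least one quotation must actually fall within the period."""
    return any(start <= year <= end for year in years)


# Abbreviated quotation years for SND 'Dreich', for illustration only.
dreich_years = [1721, 1954, 1986, 2000]

span_result = active_span_matches(dreich_years, 1960, 1980)  # True: 1960-1980 lies within 1721-2000
gap_result = attested_between(dreich_years, 1960, 1980)      # False: no quotation in 1960-1980
```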

Also this week I had to spend a bit of time fixing a number of sites after a server upgrade stopped a number of scripts working.  It took a bit of time to track all of these down and fix them.  I also responded to a couple of questions from Dimitra Fimi of the Centre for Fantasy and the Fantastic regarding WordPress stats and mailing list software and discussed a new conference website with Matthew Creasy in English Literature.

Week Beginning 12th June 2023

For the Books and Borrowing project this week I completed an initial version of the site-wide facts and figures page.  The page functions in a very similar way to the library-specific facts and figures page as discussed in previous weeks, but here you can view the data for any or all libraries rather than just a single one.  Libraries are selectable from a series of checkboxes as in the advanced search, allowing you to group together and compare library data, for example the data for the three University libraries or data from libraries in southern Scotland.  However, the page can take about 30 seconds to load as it’s processing an awful lot of data for the default view (which is data for every library).  I’m going to create a cached version of the data for this particular view and possibly others, but I haven’t got round to it yet.

The ‘Summary’ section provides an overview of the amalgamated data for your chosen libraries and the ‘top ten lists’ are also amalgamated.  The book holding and prolific borrowers lists include the relevant library in square brackets after the number of associated borrowing records as these lists contain items that are related to specific libraries (borrowers and book holdings ‘belong’ to a specific library).  I’ve also added in the top ten borrowed book editions, as these are site-wide, as are authors.  The links through to the search results for these lists include your selected libraries to ensure the search results match up.  For example, Sir Walter Scott is the most borrowed author when viewing data for Dumfries, Selkirk and Wigtown and when you press on the author’s name your search is limited to these libraries.  The occupations and borrowings through time visualisations contain data for all selected libraries and rather than a ‘book holding frequency of borrowing’ chart at the bottom of the page there is a ‘book edition’ version as editions are site-wide.  As with other data, the editions and borrowings over time returned here are limited to your chosen libraries and links through to the search results also incorporate these.

I still need to address a couple of things, though.  Firstly, book edition titles have not currently been cut off after 50 characters as I realised doing so for titles with non-Latin characters (e.g. Greek) broke the page.  This is because characters such as Greek take up multiple bytes while Latin characters take up one byte each.  The default substring method for chopping up strings is not ‘multibyte safe’, meaning non-Latin characters can get split in the middle of the data, which results in an error.  There is a handy multibyte version of the substring method but unfortunately multibyte functions have not been installed on the server so I can’t use it.  Once I’ve managed to get this installed I’ll add in the limit.  Secondly, I’ve noticed that the visualisations don’t work very well on touchscreens so I’m going to have to rework them to try and improve matters.  For example, the donut charts allow you to click on a section to perform a search for the section’s occupation, but as touchscreens have no hover state this means on a touchscreen it’s not possible to view the tooltip that includes the name of the occupation and the exact percentage.  Also, the click through to a year on the ‘borrowings through time’ chart doesn’t seem to work on my iPad and the book edition autocomplete doesn’t seem to be firing either.
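
The multibyte problem can be demonstrated with a short Python sketch.  Python strings are indexed by character rather than by byte, so the byte-based cut is simulated here by encoding to UTF-8 first.  This is an illustration only, the site’s own code is not written in Python, and the function names are hypothetical.

```python
def truncate_title(title, limit=50, ellipsis="…"):
    """Character-based (multibyte safe) truncation with a trailing ellipsis."""
    if len(title) <= limit:
        return title
    return title[:limit] + ellipsis


def byte_cut_is_valid(text, byte_limit):
    """Simulate a byte-based substring and report whether the result is still valid UTF-8."""
    try:
        text.encode("utf-8")[:byte_limit].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin_ok = byte_cut_is_valid("abc", 2)   # each Latin letter is one byte
greek_ok = byte_cut_is_valid("αβγ", 3)   # each Greek letter is two bytes, so byte 3 splits 'β'
safe = truncate_title("αβγδ", limit=3)   # cuts between characters instead
```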

Also for the project this week I assigned some borrowing records for a borrower in Wigtown to another borrower, as it turned out these were the same person.  This also involved updating the borrower’s total number of borrowings and active borrowing period.  I then began to look into migrating the website’s theme to Bootstrap.  I’m going to be working on this on a local version of the site running on my laptop and I began getting things set up.

Also this week I uploaded a new set of videos and an updated version of the database for the Child Speech Error Database as part of the Speech Star project and was involved in the migration of Seeing Speech and Dynamic Dialects to a new server.  I was also involved in the migration of two further WordPress sites to our external provider and sorted out some issues with a couple of other WordPress sites that had been moved to our new in-house server.

I also spent a bit of time working for the DSL, engaging in an email conversation about how the new part of speech search will work and replacing the ‘Share’ buttons on the entry page with an alternative as the ‘AddThis’ service that we previously used has been shut down by parent company Oracle (See https://www.addtoany.com/blog/replace-addthis-with-addtoany-the-addthis-alternative/).  This involved investigating an issue with one member of staff who was just seeing a blank space rather than the new options.  It turned out the Chrome plugin ‘DuckDuckGo privacy essentials’ was blocking the service so we’ll have to watch out for this.

Also this week I spent a bit of time working for the Anglo-Norman Dictionary.  I had an email discussion about collaborative XML tools with one of the project team and we will have a Zoom call next week to go over this in more detail.  I also wrote a little script to export a set of XML files from the live database, in this case all entries beginning with the letter ’U’.  After doing this the editor happened to notice that the ‘language’ tags seem to have disappeared from entries.

This then led onto some detailed investigation into this very worrying situation.  It would appear that some process that modified the XML has at some point removed the language tag.  This appears to have happened up to and including the import of R, but whatever was causing the issue had been rectified by the time we imported S.  Looking at specific examples I noticed that the language tag was present in the XML for ‘rother’ prior to upload, but is not present in the XML in the online database.  On a version of the data on my local PC I deleted ‘rother’ and ran it through the upload process that was used for ‘S’ and the language tag remained intact.  I can’t be certain but I presume the error was caused by the script that assigns IDs and numbers to <sense> and <senseInfo> tags as it is this script that edits the XML.  We currently have 493 entries that feature a language tag.  Heather asked me to export counts of languages found in entries in March last year and there were a similar number then.  We still have a column holding the initial data that was first imported into the new system and in the current data there are 1759 ‘active’ entries that feature a language tag.  This does not include any R data as it was added after this point.  Looking back through old versions of the database, the version of the data I have from 26/11/20 does not have the ‘old’ column and features 1867 entries with language tags.  The import of R must have happened sometime shortly after I created this version.  The version of the data I have from 26/03/21 features the R data and is also the first version I have with the ‘old’ column.  In this version the live data now only has 47 entries with language tags while ‘old’ has 1868 entries with the language tag.

In order to fix this issue I’m going to have to write a script to look at ‘old’ for language tags and then attempt to extract these.  I’ll also need to do this for the R data too, directly from the XML files that I was sent prior to upload.  Thankfully after further discussion with Geert it turns out that he thinks the language tag should be at ‘entry’ level rather than ‘sense’ / ‘subsense’ level.  Moving the language tag out of senses and placing it at entry level should actually make the data retrieval much more straightforward.  The biggest issue would have been figuring out which sense each language tag needs to be added to, as the old data does not have sense IDs or any other easy way of identifying which sense is which.  If instead a script just needs to pull out all language tags in the entry’s senses / subsenses and then add them to a new section directly in the entry that should be much simpler.

I tested out the DMS (downloading and then uploading the entry ‘slepewrt’ which features a language tag) and it is retaining the tag, so it is thankfully safe to continue to use the DMS.  Next week I’ll begin working on a script for reinstating the missing tags.

Week Beginning 5th June 2023

I continued to work on the Books and Borrowing project for most of this week.  One of my main tasks was to fully integrate author gender into the project’s various systems.  I had created the required database field and had imported the author gender data last week, but I still needed to spend some time adding author gender to the CMS, API, Solr and the front-end.  It is now possible for the team to add / edit author genders through the CMS wherever the author add / edit options are available and all endpoints involving authors in the API now bring back author gender.  I have also updated the front-end to display gender, adding it to the section in brackets before the dates of birth and death.

I have also added author gender to the advanced search forms but a search for author gender will not work until the Solr instance is updated.  I prepared the new data and have emailed our IT contact in Stirling to ask him to update it, but unfortunately he’s on holiday for the next two weeks so we’ll have to wait until he gets back.  I have tested the author gender search on a local instance on my laptop, however, and all seems to be working.  Currently author gender is not one of the search results filter options (the things that appear down the left of the search results).  I did consider adding it in (as with borrower gender) but there are so few female authors compared to male that I’m not sure a filter option would be all that useful most of the time – if you’re interested in female authors you’d be better off performing a search involving this field rather than filtering an existing search.

I then moved on to developing a further (and final) visualisation for the library facts and figures page.  This is a stacked column chart showing borrowings over time divided into sections for borrower occupation (a further version with divisions by book genre will be added once this data is available).  I’ve only included the top level occupation categories (e.g. ‘Education’) as otherwise there would be too many categories; as it is the chart for some libraries gets very cluttered.

It took quite some time to process the data for the visualisation but I completed an initial version after about two days of work.  This initial version wasn’t completely finished – I still needed to add in the option to press on a year, which will then load a further chart with the data split over the months in the chosen year.  Below is a screenshot of the visualisation for Haddington library:

With the above example you’ll see what I mean about the chart getting rather cluttered.  However, you can switch off occupations by pressing on them in the legend, allowing you to focus on the occupations you’re interested in.  For some libraries the visualisation is a little less useful, for example Chambers only has data for three years, as you can see below:

I also had to place a hard limit on the start and end years for the data because Chambers still has some dodgy data with a borrowing year thousands of years in the future which caused the script to crash my browser as it tried to generate the chart.  Note that the colours for occupations match those in the donut charts displayed higher up the page to help with cross-referencing.
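
A hard limit of this kind can be sketched in Python as below.  The limits and the sample counts are hypothetical, as the actual start and end years used for the chart are not stated here.

```python
def limit_years(counts_by_year, min_year, max_year):
    """Drop any years outside the permitted range before building the chart."""
    return {
        year: count
        for year, count in counts_by_year.items()
        if min_year <= year <= max_year
    }


# Made-up counts, including one record with an erroneous far-future year.
raw_counts = {1828: 10, 1829: 12, 20829: 1}
chart_counts = limit_years(raw_counts, 1750, 1850)
```

Without the limit, a chart that draws one column per year between the earliest and latest values would have to generate many thousands of empty columns.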

Later in the week I completed work on the ‘drilldown’ into an individual year for the visualisation.  With this feature you can press on one of the year columns and the chart will reload with the data for that specific year, split across the twelve months.  This drilldown is pretty useful for Chambers as there are so many borrowings in each year.  It’s useful for other libraries too, of course.  You can press on the ‘Return to full period’ button above the chart to get back to the main view and choose a different year.  Below is an example of a ‘drilldown’ view of the Chambers library data for 1829:

As I developed the drilldown view I realised that the colours for the sections of the bar charts were different to the top-level view when the drilldown is loaded, as the colours are assigned based on the number of occupation categories that are present, and in the drilldown there may only be a subset of occupations.  This can mean ‘Religion’ may be purple in the main view but green in the drilldown view, for example.  Thankfully I managed to fix this, meaning the use of colour is consistent across all visualisations.
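
The cause of the mismatch, and one possible way of avoiding it, can be sketched in Python.  This is not necessarily how the fix was made in the site’s charting code: the palette, the ‘Law’ category and the function names are assumptions for illustration.

```python
PALETTE = ["#4e79a7", "#f28e2b", "#e15759", "#76b7b2", "#59a14f", "#edc948"]


def colours_by_position(categories):
    """Assign colours in the order categories happen to appear (the problematic approach)."""
    return {cat: PALETTE[i % len(PALETTE)] for i, cat in enumerate(categories)}


def build_colour_map(all_categories):
    """Fix one colour per category up front, based on the full category list."""
    return colours_by_position(sorted(all_categories))


def colours_for(categories, colour_map):
    """Look up the fixed colours for whichever subset a chart is showing."""
    return {cat: colour_map[cat] for cat in categories}


main_view = ["Education", "Law", "Religion"]
drilldown = ["Religion"]  # only a subset of occupations is present in this year

colour_map = build_colour_map(main_view)
consistent = colours_for(drilldown, colour_map)["Religion"] == colour_map["Religion"]
inconsistent = (
    colours_by_position(drilldown)["Religion"]
    != colours_by_position(main_view)["Religion"]
)
```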

The next item on my agenda is to create the site-wide ‘Facts and Figures’ page, which will work in the same way as the library specific page, but will present the data for all libraries and also allow you to select / deselect libraries.  I spent some further time updating the API endpoints I’d developed for the visualisations so that multiple library IDs can be passed.  Next week I’ll create the visualisations and library selection options for the site-wide ‘Facts’ page.  After that I will probably move onto the redesign of the front-end to use Bootstrap, during which I will also redesign many of the front-end pages to make them a bit easier to use and navigate.  For example, the facts and figures page will probably be split into separate tabs as it’s a bit long at the moment and the search form needs to be reworked to make it more usable.

Also this week I spent about a day working for the DSL.  This involved having one of our six-monthly Zoom calls to discuss developments and my work for the months ahead.  As usual, it was a useful meeting.  We discussed the new ‘search quotations by year’ facility that I will be working on over the coming months, plus sparkline visualisations that I’ll be adding into the site, in addition to the ongoing work on the parts of speech and how a POS search could be incorporated.

I was also told about a video of the new OED interface that will launch soon (see https://www.youtube.com/watch?v=X-_NIqT3i7M).  It looks like a really great interface and access is going to be opened up.  They are also doing a lot more with the Historical Thesaurus data and are integrating it with the main dictionary data more effectively.  It makes me question how the Glasgow hosted HT site is going to continue, as I fear it may become rather superfluous once the new OED website launches.  We’ll see how things develop.

After the call I began planning how to implement the quotation date search, which will require a major change to the way the data is indexed by Solr.  I also instructed our IT people to remove the old Solr instance now the live site is successfully connected to the new instance without issue.

Also this week, after a bit more tweaking, Ophira Gamliel’s HiMuJe Malabar site went live and it can now be viewed at https://himuje-malabar.glasgow.ac.uk/.   I also had a meeting with Stephen Rawles of the Glasgow Emblems project.  I worked with Stephen to create two websites that launched almost 17 years ago now (e.g. https://www.emblems.arts.gla.ac.uk/french/) and Stephen wanted me to make a few updates to some of the site text, which I did.  I also fixed a few issues that had cropped up with other sites that had been migrated to external hosting.

Week Beginning 29th May 2023

Monday was a holiday this week and I spent most of the four working days on the Books and Borrowing project, although unfortunately my access to the Stirling University VPN stopped working on Wednesday and access wasn’t restored until late on Thursday afternoon.  As I am unable to access the project’s server and database without VPN access this limited what I could do, although I did manage to work on some code ‘blind’ (i.e. without uploading it to the server and testing it out) for much of Wednesday.

For the project this week I generated a list of Haddington book holdings that don’t have any borrowings so the team could test some things.  I also added in author gender as a field and wrote a script to import author gender from a spreadsheet, together with tweaks made to all of the author name fields.  I still need to fully integrate author gender into the CMS, API and front-end, which I will focus on next week.

I spent the rest of my time on the project this week developing a new visualisation for the library facts and figures page.  It is a line chart for plotting the number of borrowing records for book holdings over time and it features an autocomplete box where you can enter the name of a book holding.  As you type, any matching titles appear, with the total number of borrowings in brackets after the title.  If the title is longer than 50 characters it is cropped and ‘…’ is added.  Once you select a book title a line chart is generated with the book’s borrowings plotted.  You can repeat this process to add as many books as you want, and you can also remove a book by pressing on the ‘delete’ icon in the list of selected books above the chart.  You can also press on the book’s title here to perform a search to view all of the associated borrowing records.

Hopefully this chart will be of some use, but its usefulness will really depend on the library and the range and number of borrowings.  So for example the image below for Wigtown shows a comparison of the borrowings for four popular journals, which I think could be pretty useful.  A similar comparison at Chambers is less useful, though, due to the limited date range of borrowing records.

I also spent some time working for the DSL this week.  This included investigating an issue with one of the DSL’s websites that I’m not involved with (https://www.macwordle.co.uk/) which was no longer generating any Google Analytics stats.  It turned out that the site had been updated recently and the GA code had not been carried over.  I also fixed another couple of occurrences of the ‘Forbidden’ error that had cropped up due to a change in the way Apache handles space characters.

The rest of my time was spent on the new Solr instance that we’d set up for the project last week.   The DSL team had been testing out the new instance, which I had connected to our test version of the DSL site, and had spotted some inconsistencies.  With the new Solr instance, when a search is limited to a particular source dictionary it fails to return any snippets (the sections of the search results with the term highlighted) while removing the ‘source’ part of the query works fine.  I was unable to find anything online about why this is happening, but by moving the ‘source’ part from the ‘q’ variable to ‘fq’ the highlighting and snippets are returned successfully.

There was also an issue with the length of the snippets being returned, and some snippets being amalgamated and featuring multiple terms rather than being treated separately.  It would appear that the new version of Solr uses a new highlighting method called ‘unified’ and this does not seem to be paying attention to the ‘fragsize’ variable that should set the desired size of the snippet.  I’d set this to 100 but some of the returned snippets were thousands of characters long.  In addition, it amalgamates snippets when the highlighted term is found multiple times in close proximity.  I have now figured out how to revert to the ‘original’ highlighting method and this seems to have got things working in the same way as the old Solr instance (and also addressed the ‘source’ issue mentioned above).  With this identified and fixed the results for the new Solr instance displayed identically to the old Solr instance and on Friday I made the switch to make the live DSL site use the new Solr instance.
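
The two changes described above (moving the ‘source’ limit into ‘fq’ and reverting to the ‘original’ highlighter) can be sketched as a set of Solr request parameters built in Python.  The parameters ‘q’, ‘fq’, ‘hl’, ‘hl.fl’, ‘hl.method’ and ‘hl.fragsize’ are standard Solr ones, but the field names ‘fulltext’ and ‘source’ and the value ‘snd’ are assumptions, as the DSL’s actual schema is not given here.

```python
from urllib.parse import parse_qs, urlencode


def build_search_params(term, source=None, fragsize=100):
    """Build a Solr query string with highlighting and an optional source filter."""
    params = [
        ("q", f"fulltext:{term}"),        # hypothetical field name
        ("hl", "true"),
        ("hl.fl", "fulltext"),
        ("hl.method", "original"),        # revert from the 'unified' highlighter
        ("hl.fragsize", str(fragsize)),   # desired snippet length in characters
    ]
    if source:
        # The source limit goes in a filter query rather than in 'q'.
        params.append(("fq", f"source:{source}"))
    return urlencode(params)


query_string = build_search_params("dreich", source="snd")
example = parse_qs(query_string)
```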

Also this week I made some further updates to the STAR website, fixing a number of typos and changing the wording of a few sections.  I also added in a ‘spoiler’ effect to the phonetic transcriptions as found here: https://www.seeingspeech.ac.uk/speechstar/child-speech-error-database/ which blanks out the transcriptions until pressed on.  This is to facilitate teaching, enabling students to write their own transcriptions and then check theirs against the definitive version.  I have also finally added citation information for all videos on the live STAR site.  Video popups should all now contain citation information and links to the videos.  You can also now share URLs to specific videos, for example https://seeingspeech.ac.uk/speechstar/disordered-child-speech-sentences-database/#location=1 and this also works for multiple videos where this option is given, for example https://seeingspeech.ac.uk/speechstar/speech-database/#location=617|549.

Finally this week I made a number of changes to Ophira Gamliel’s new project website, which is now ready to launch.