Week Beginning 2nd October 2023

This was a week of many different projects.  On Monday I completed work on a new project website for Petra Poncarova in Scottish Literature, and it is now publicly accessible (see https://erskine.glasgow.ac.uk/).  I also added a blog page to Ophira Gamliel’s project website, created a page for their first blog post (now available here: https://himuje-malabar.glasgow.ac.uk/reconnecting-the-split-moon/) and updated the site to include a link to the blog in the site menu.  This required shifting a few things around to make room for the new menu item.  I also investigated an issue Luca was having in migrating one of Graeme Cannon’s old websites which was similarly structured to the House of Fraser Archive site and managed to find the section of code that was causing the problem (a flag in a regular expression that has since been deprecated).

On Tuesday I completed my work on the CSV endpoints for the Books and Borrowing project, ensuring all nested arrays are ‘flattened’ when producing the two-dimensional CSV file.  This has been a lengthy and tedious task, but it’s good that it’s done, and it should mean that future researchers will be able to extract and reuse the data in a relatively straightforward manner.
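
The flattening idea can be sketched roughly as follows. This is an illustrative Python sketch rather than the project's actual code: the dotted column naming scheme, the function names and the sample record are all assumptions made for the example.

```python
import csv
import io


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a single-level dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # A scalar: drop the trailing dot and store the value under the path
        flat[prefix.rstrip(".")] = value
    return flat


def to_csv(records):
    """Write a list of nested records as a two-dimensional CSV string."""
    rows = [flatten(record) for record in records]
    # Collect every column that appears in any row, in first-seen order
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample record, purely for illustration
sample = {"id": 1, "book": {"title": "Sample Title", "authors": [{"name": "Sample Author"}]}}
print(to_csv([sample]))
```

Rows that lack a column present in other rows are padded with empty cells, which keeps the output rectangular.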

On Wednesday I met Luca and Stevie, two of my fellow College of Arts developers, to have a catch-up, which was hugely useful as always.  We’ll hopefully meet up again in the next couple of months.  I also responded to a request from Luca to help get some screenshots ready for print publication.  Screenshots are generally 72DPI but this is too low for print.  I’ve previously got around this using Photoshop by loading the image then going to image -> image size.  In the options you can then untick ‘Resample Image’ and then update the ‘resolution’ to whatever you want.  I’ve never actually printed the resulting images to check any difference, but I’ve never had anyone come back and ask for better versions.  I guess another option would be to take the screenshots on something like an iPad that natively runs at a higher DPI.
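
As a rough illustration of what changing the resolution without resampling does: the pixel count stays the same, so only the intended print size changes. The small Python sketch below shows the arithmetic; the screenshot dimensions are made up for the example.

```python
def print_size_inches(width_px, height_px, dpi):
    """Physical print size implied by a pixel size and a DPI value."""
    return width_px / dpi, height_px / dpi


# A hypothetical 1440 x 720 pixel screenshot
print(print_size_inches(1440, 720, 72))   # size when tagged as 72 DPI
print(print_size_inches(1440, 720, 300))  # same pixels tagged as 300 DPI
```

In other words, the same pixels are simply printed smaller and therefore more densely, so no extra detail is created.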

Also on Wednesday I spent some time on the DSL, investigating an issue with Google Analytics for Pauline Graham and then investigating a problem with phrase searching and highlighting that Pauline had also noticed on both the live and test sites.  When a phrase was searched for, each individual word in the phrase was being highlighted in the entry, and if you then returned to the search results and opened an entry from there, no highlighting worked at all.  Also, some search results were not featuring snippets.  This turned out to be three separate issues that needed to be investigated and fixed:

  1. Separate word highlighting: The default setting in the highlighting library I installed a few months ago highlighted each word in a string.  If there were multiple words separated by spaces then all matching words would be highlighted.  Thankfully the library (https://markjs.io/) has a setting that only matches the entire string and I’ve activated this now.  Now if you perform a search for ‘off or on’ or something and navigate to a result only the exact term will be highlighted.
  2. Losing the highlighting when navigating back to the results and then to an entry: This was a problem with spaces getting encoded between pages.  They were becoming the URL encoded equivalent ‘%20’ or ‘+’ and after that the string no longer matched.  I’ve sorted this.
  3. Lack of snippets: The issue was down to the length of the entry.  In Solr, the snippet generation is a separate process to the search matching.  While the search checks the entire entry the snippet generation by default only looks at the first 51,200 characters.  An entry such as ‘Mak’ is a long entry and if the search term only matches text quite far down the entry a snippet doesn’t get created.  After discovering this I’ve updated the setting so that 100,000 characters are analysed instead and this has fixed the issue.  More information about this can be found at https://stackoverflow.com/questions/52511154/solr-empty-highlight-entry-on-match.
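
The three fixes can be sketched in Python as below. This is only an illustration of the ideas, not the code used on the DSL site (which uses the mark.js library in the browser): the function names are hypothetical, and the only real setting named is Solr's `hl.maxAnalyzedChars` highlighting parameter.

```python
import re
from urllib.parse import quote, quote_plus, unquote_plus


def decode_search_term(raw):
    """Turn '%20' or '+' back into spaces so the phrase matches again."""
    return unquote_plus(raw)


def highlight_phrase(text, phrase):
    """Wrap matches of the whole phrase (not each word) in <mark> tags."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return pattern.sub(lambda match: f"<mark>{match.group(0)}</mark>", text)


# Highlighting parameters that raise the number of characters Solr analyses
# when generating snippets (the default is 51,200)
SNIPPET_PARAMS = {"hl": "true", "hl.maxAnalyzedChars": "100000"}

phrase = decode_search_term(quote_plus("off or on"))
print(highlight_phrase("Turn it off or on, or off.", phrase))
```

Note that `unquote_plus` would also turn a literal ‘+’ in a search term into a space, so a real implementation would need to decide how such terms are encoded in the first place.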

This investigation took some of Thursday as well, after which I moved back to the Books and Borrowing project, for which I spent some time generating data relating to the Royal High School for checking purposes.  I also received some bid documentation for a proposal Gavin Miller is putting together.  Gavin wanted me to read through the documentation and add in some further sections relating to the data.  The data will consist of a directory of projects and resources which will be available to search and browse, plus will be visualised on an interactive map.  I added in some information and hopefully the proposal is a success.

On Friday I made some further updates to the Speech Star websites, adding in some new videos to the Edinburgh MRI Modelled Speech Corpus (https://www.seeingspeech.ac.uk/speechstar/edinburgh-mri-modelled-speech-corpus/) and arranging their layout a bit better.  I also replied to a request from Rhona Brown, who would like a website to be set up for a new project she’s starting work on soon.  I listed a few options we could pursue and I need to wait to hear more from her now.

I also spent quite a bit of time investigating some minor issues Ann Ferguson had spotted with the predictive search on the DSL website, most of which will thankfully be sorted when the new Solr based headword search goes live.

Finally, I had a meeting with the Placenames of Iona project to discuss the development of a new ‘map first’ interface for the data.  I met with Thomas, Sofia and Alasdair and it was really great to actually have an in person meeting with them, having never done so before.  We discussed many aspects of the interface and had some really useful discussions.  I’ll be starting on the development of the front-end in the coming weeks.

Week Beginning 5th June 2023

I continued to work on the Books and Borrowing project for most of this week.  One of my main tasks was to fully integrate author gender into the project’s various systems.  I had created the required database field and had imported the author gender data last week, but I still needed to spend some time adding author gender to the CMS, API, Solr and the front-end.  It is now possible for the team to add / edit author genders through the CMS wherever the author add / edit options are available and all endpoints involving authors in the API now bring back author gender.  I have also updated the front-end to display gender, adding it to the section in brackets before the dates of birth and death.

I have also added author gender to the advanced search forms but a search for author gender will not work until the Solr instance is updated.  I prepared the new data and have emailed our IT contact in Stirling to ask him to update it, but unfortunately he’s on holiday for the next two weeks so we’ll have to wait until he gets back.  I have tested the author gender search on a local instance on my laptop, however, and all seems to be working.  Currently author gender is not one of the search results filter options (the things that appear down the left of the search results).  I did consider adding it in (as with borrower gender) but there are so few female authors compared to male that I’m not sure a filter option would be all that useful most of the time – if you’re interested in female authors you’d be better off performing a search involving this field rather than filtering an existing search.

I then moved on to developing a further (and final) visualisation for the library facts and figures page.  This is a stacked column chart showing borrowings over time divided into sections for borrower occupation (a further version with divisions by book genre will be added once this data is available).  I’ve only included the top level occupation categories (e.g. ‘Education’) as otherwise there would be too many categories; as it is the chart for some libraries gets very cluttered.
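
The kind of data processing a stacked column chart needs can be sketched as follows. This is a minimal Python illustration under assumed inputs (a list of year / top-level occupation pairs), not the project's actual code; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def stacked_series(borrowings):
    """Group (year, occupation) pairs into one list of yearly counts per occupation.

    Returns the sorted list of years plus a dict mapping each occupation
    to its counts, aligned with that list of years.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for year, occupation in borrowings:
        counts[occupation][year] += 1
    years = sorted({year for year, _ in borrowings})
    series = {
        occupation: [by_year.get(year, 0) for year in years]
        for occupation, by_year in sorted(counts.items())
    }
    return years, series


# Made-up sample data
sample = [(1790, "Education"), (1790, "Law"), (1791, "Education"), (1790, "Education")]
years, series = stacked_series(sample)
print(years, series)
```

Each occupation gets a value for every year, including zeros, so the stacked sections line up across columns.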

It took quite some time to process the data for the visualisation but I completed an initial version after about two days of work.  This initial version wasn’t completely finished – I still needed to add in the option to press on a year, which will then load a further chart with the data split over the months in the chosen year.  Below is a screenshot of the visualisation for Haddington library:

With the above example you’ll see what I mean about the chart getting rather cluttered.  However, you can switch off occupations by pressing on them in the legend, allowing you to focus on the occupations you’re interested in.  For some libraries the visualisation is a little less useful, for example Chambers only has data for three years, as you can see below:

I also had to place a hard limit on the start and end years for the data because Chambers still has some dodgy data with a borrowing year thousands of years in the future which caused the script to crash my browser as it tried to generate the chart.  Note that the colours for occupations match those in the donut charts displayed higher up the page to help with cross-referencing.
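
A hard limit of this sort might look something like the Python sketch below. The bounds used here are hypothetical placeholders, not the values used on the site.

```python
MIN_YEAR = 1700  # hypothetical lower bound
MAX_YEAR = 1900  # hypothetical upper bound


def plausible_years(years, min_year=MIN_YEAR, max_year=MAX_YEAR):
    """Keep only the distinct years that fall inside the allowed range."""
    return sorted(y for y in set(years) if min_year <= y <= max_year)


def year_axis(years, min_year=MIN_YEAR, max_year=MAX_YEAR):
    """Build a continuous run of years for the chart's x-axis."""
    valid = plausible_years(years, min_year, max_year)
    if not valid:
        return []
    return list(range(valid[0], valid[-1] + 1))


# One erroneous year far in the future is dropped before the axis is built
print(year_axis([1828, 1830, 18292]))
```

Without the limit, a single year such as 18292 would stretch the axis to more than 16,000 columns, which is the sort of thing that can bring a browser to its knees.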

Later in the week I completed work on the ‘drilldown’ into an individual year for the visualisation.  With this feature you can press on one of the year columns and the chart will reload with the data for that specific year, split across the twelve months.  This drilldown is pretty useful for Chambers as there are so many borrowings in each year.  It’s useful for other libraries too, of course.  You can press on the ‘Return to full period’ button above the chart to get back to the main view and choose a different year.  Below is an example of a ‘drilldown’ view of the Chambers library data for 1829:

As I developed the drilldown view I realised that the colours for the sections of the bar charts were different to the top-level view when the drilldown is loaded, as the colours are assigned based on the number of occupation categories that are present, and in the drilldown there may only be a subset of occupations.  This can mean ‘Religion’ may be purple in the main view but green in the drilldown view, for example.  Thankfully I managed to fix this, meaning the use of colour is consistent across all visualisations.
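
The general idea behind a fix like this is to assign colours from the full list of categories once and then look colours up by name, rather than by position among whichever categories happen to be present. A rough Python sketch, with a hypothetical palette and category list:

```python
PALETTE = ["#4e79a7", "#f28e2b", "#59a14f", "#b07aa1"]  # hypothetical colours


def build_colour_map(all_categories, palette=PALETTE):
    """Assign each category a colour once, based on the full category list."""
    ordered = sorted(all_categories)
    return {name: palette[i % len(palette)] for i, name in enumerate(ordered)}


def colours_by_position(present, palette=PALETTE):
    """The problematic approach: colour depends on which categories are present."""
    return {name: palette[i % len(palette)] for i, name in enumerate(sorted(present))}


def colours_for(present, colour_map):
    """The consistent approach: look each present category up by name."""
    return {name: colour_map[name] for name in present}


colour_map = build_colour_map(["Education", "Law", "Religion"])
print(colours_by_position(["Religion"]))      # colour shifts in a subset
print(colours_for(["Religion"], colour_map))  # colour stays the same
```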

The next item on my agenda is to create the site-wide ‘Facts and Figures’ page, which will work in the same way as the library specific page, but will present the data for all libraries and also allow you to select / deselect libraries.  I spent some further time updating the API endpoints I’d developed for the visualisations so that multiple library IDs can be passed.  Next week I’ll create the visualisations and library selection options for the site-wide ‘Facts’ page.  After that I will probably move onto the redesign of the front-end to use Bootstrap, during which I will also redesign many of the front-end pages to make them a bit easier to use and navigate.  For example, the facts and figures page will probably be split into separate tabs as it’s a bit long at the moment and the search form needs to be reworked to make it more usable.

Also this week I spent about a day working for the DSL.  This involved having one of our six-monthly Zoom calls to discuss developments and my work for the months ahead.  As usual, it was a useful meeting.  We discussed the new ‘search quotations by year’ facility that I will be working on over the coming months, plus sparkline visualisations that I’ll be adding into the site, in addition to the ongoing work on the parts of speech and how a POS search could be incorporated.

I was also told about a video of the new OED interface that will launch soon (see https://www.youtube.com/watch?v=X-_NIqT3i7M).  It looks like a really great interface and access is going to be opened up.  They are also doing a lot more with the Historical Thesaurus data and are integrating it with the main dictionary data more effectively.  It makes me question how the Glasgow hosted HT site is going to continue, as I fear it may become rather superfluous once the new OED website launches.  We’ll see how things develop.

After the call I began planning how to implement the quotation date search, which will require a major change to the way the data is indexed by Solr.  I also instructed our IT people to remove the old Solr instance now the live site is successfully connected to the new instance without issue.

Also this week, after a bit more tweaking, Ophira Gamliel’s HiMuJe Malabar site went live and it can now be viewed at https://himuje-malabar.glasgow.ac.uk/.   I also had a meeting with Stephen Rawles of the Glasgow Emblems project.  I worked with Stephen to create two websites that launched almost 17 years ago now (e.g. https://www.emblems.arts.gla.ac.uk/french/) and Stephen wanted me to make a few updates to some of the site text, which I did.  I also fixed a few issues that had cropped up with other sites that had been migrated to external hosting.

Week Beginning 29th May 2023

Monday was a holiday this week and I spent most of the four working days on the Books and Borrowing project, although unfortunately my access to the Stirling University VPN stopped working on Wednesday and access wasn’t restored until late on Thursday afternoon.  As I am unable to access the project’s server and database without VPN access this limited what I could do, although I did manage to work on some code ‘blind’ (i.e. without uploading it to the server and testing it out) for much of Wednesday.

For the project this week I generated a list of Haddington book holdings that don’t have any borrowings so the team could test some things.  I also added in author gender as a field and wrote a script to import author gender from a spreadsheet, together with tweaks made to all of the author name fields.  I still need to fully integrate author gender into the CMS, API and front-end, which I will focus on next week.

I spent the rest of my time on the project this week developing a new visualisation for the library facts and figures page.  It is a line chart for plotting the number of borrowing records for book holdings over time and it features an autocomplete box where you can enter the name of a book holding.  As you type, any matching titles appear, with the total number of borrowings in brackets after the title.  If the title is longer than 50 characters it is cropped and ‘…’ is added.  Once you select a book title a line chart is generated with the book’s borrowings plotted.  You can repeat this process to add as many books as you want, and you can also remove a book by pressing on the ‘delete’ icon in the list of selected books above the chart.  You can also press on the book’s title here to perform a search to view all of the associated borrowing records.
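
The labelling of autocomplete matches can be sketched as below. This is an illustrative Python version only: the function names are hypothetical, and matching anywhere in the title (rather than only at the start) is an assumption.

```python
def autocomplete_label(title, borrowings, max_length=50):
    """Crop long titles to max_length characters and append the borrowing count."""
    shown = title if len(title) <= max_length else title[:max_length] + "…"
    return f"{shown} ({borrowings})"


def matching_titles(holdings, typed):
    """Return labels for holdings whose title contains the typed text."""
    needle = typed.casefold()
    return [
        autocomplete_label(title, count)
        for title, count in holdings
        if needle in title.casefold()
    ]


# Made-up holdings as (title, number of borrowings) pairs
holdings = [("Sample Magazine", 120), ("Another Sample Review", 45), ("Unrelated", 3)]
print(matching_titles(holdings, "sample"))
```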

Hopefully this chart will be of some use, but its usefulness will really depend on the library and the range and number of borrowings.  So for example the image below for Wigtown shows a comparison of the borrowings for four popular journals, which I think could be pretty useful.  A similar comparison at Chambers is less useful, though, due to the limited date range of borrowing records.

I also spent some time working for the DSL this week.  This included investigating an issue with one of the DSL’s websites that I’m not involved with (https://www.macwordle.co.uk/) which was no longer generating any Google Analytics stats.  It turned out that the site had been updated recently and the GA code had not been carried over.  I also fixed another couple of occurrences of the ‘Forbidden’ error that had cropped up due to a change in the way Apache handles space characters.

The rest of my time was spent on the new Solr instance that we’d set up for the project last week.   The DSL team had been testing out the new instance, which I had connected to our test version of the DSL site, and had spotted some inconsistencies.  With the new Solr instance, when a search is limited to a particular source dictionary it fails to return any snippets (the sections of the search results with the term highlighted) while removing the ‘source’ part of the query works fine.  I was unable to find anything online about why this is happening, but by moving the ‘source’ part from the ‘q’ variable to ‘fq’ the highlighting and snippets are returned successfully.

There was also an issue with the length of the snippets being returned, and some snippets being amalgamated and featuring multiple terms rather than being treated separately.  It would appear that the new version of Solr uses a new highlighting method called ‘unified’ and this does not seem to be paying attention to the ‘fragsize’ variable that should set the desired size of the snippet.  I’d set this to 100 but some of the returned snippets were thousands of characters long.  In addition, it amalgamates snippets when the highlighted term is found multiple times in close proximity.  I have now figured out how to revert to the ‘original’ highlighting method and this seems to have got things working in the same way as the old Solr instance (and also addressed the ‘source’ issue mentioned above).  With this identified and fixed the results for the new Solr instance displayed identically to the old Solr instance and on Friday I made the switch to make the live DSL site use the new Solr instance.
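
For reference, the query changes described above might be assembled along these lines. This is a hedged Python sketch, not the DSL site's code: `hl.method`, `hl.fragsize` and `fq` are standard Solr request parameters, but the field name `source`, the example field `fulltext` and the function name are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_highlight_query(term, source=None, fragsize=100):
    """Build a Solr query string that uses the 'original' highlighter.

    The source restriction goes into 'fq' (a filter query) rather than 'q'.
    """
    params = [
        ("q", term),
        ("hl", "true"),
        ("hl.method", "original"),       # rather than the newer 'unified' method
        ("hl.fragsize", str(fragsize)),  # desired snippet size in characters
    ]
    if source:
        params.append(("fq", f"source:{source}"))  # hypothetical field name
    return urlencode(params)


print(build_highlight_query("fulltext:heid", source="snd"))
```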

Also this week I made some further updates to the STAR website, fixing a number of typos and changing the wording of a few sections.  I also added in a ‘spoiler’ effect to the phonetic transcriptions as found here: https://www.seeingspeech.ac.uk/speechstar/child-speech-error-database/ which blanks out the transcriptions until pressed on.  This is to facilitate teaching, enabling students to write their own transcriptions and then check theirs against the definitive version.  I have also finally added citation information for all videos on the live STAR site.  Video popups should all now contain citation information and links to the videos.  You can also now share URLs to specific videos, for example https://seeingspeech.ac.uk/speechstar/disordered-child-speech-sentences-database/#location=1 and this also works for multiple videos where this option is given, for example https://seeingspeech.ac.uk/speechstar/speech-database/#location=617|549.
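
Shareable URLs of this form can be read back with something like the following Python sketch. The site itself handles this in the browser, so this is purely an illustration of the URL format; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def video_ids_from_url(url):
    """Extract the video IDs from a '#location=617|549' style fragment."""
    fragment = urlsplit(url).fragment  # the text after '#'
    key, _, value = fragment.partition("=")
    if key != "location" or not value:
        return []
    # Multiple videos are separated by '|'; anything non-numeric is ignored
    return [int(part) for part in value.split("|") if part.isdigit()]


print(video_ids_from_url("https://seeingspeech.ac.uk/speechstar/speech-database/#location=617|549"))
```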

Finally this week I made a number of changes to Ophira Gamliel’s new project website, which is now ready to launch.

Week Beginning 15th May 2023

I spent the majority of this week continuing to work for the Books and Borrowing project.  I added in images of both the Hunterian and Inverness Kirk Sessions library register pages, as these had not yet been integrated.  I then began working on the library ‘Facts and Figures’ page and added in the API endpoints and processing of library statistics, which now appear in a ‘summary’ section at the top of the page, as you can see for the Chambers library in the following screenshot:

We may need to include some further explanation of the above.  The number of borrowing records does not match the numbers split by gender, which may be caused by some borrowing records not having an associated borrower or others having more than one associated borrower.  Also as you can see a borrowing record in this library has an erroneous year, making it look like the borrowing records stretch into the distant future of 18292.

Beneath the summary are a series of ‘top ten’ lists, featuring the top ten borrowed books, authors and most prolific borrowers, both overall and further broken down by borrower gender, as you can see below:

These all link through to search results for the item in question.  Note that we once again face the issue of numbers here reflecting volumes borrowed while the search results count borrowing records, which can include any number of volumes.  I’m going to have a serious think about what we can do about this discrepancy as it is bound to cause a lot of confusion.  There is also something very wrong with the author figures that I need to investigate.  I think the issue is authors can be associated with any level of book record and I suspect authors in this list are getting counted multiple times.

There are also issues caused by the data still being worked on while the Solr cache and other cached data become outdated.  I spent quite a while on Tuesday trying to figure out why a book holding wasn’t appearing where it should or with the expected number of borrowings until I realised the data in the CMS had been updated (including the initial letter of the book title) while the cache hadn’t.

I then moved onto working on some visualisations.  I created a borrower occupations donut chart as described in the requirements document.  This chart shows the distribution of borrower occupations across library borrowers in a two-level pie chart with top-level occupations in the middle and these then subdivided into secondary level occupations in the outer ring.  Note that I haven’t further split ‘Religion and Clergy’ > ‘Minister/Priest’ into its third level occupations as it is the only category that has three levels and it would have made the visualisation too complicated (both to develop and to comprehend in the available space).  Instead the individual ‘minister / priest’ categories are amalgamated into ‘minister / priest’.
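
Amalgamating the third level into its parent can be sketched as below. This is a minimal Python illustration that assumes each occupation is stored as a path of one to three levels; the third-level and ‘Teacher’ names are placeholders invented for the example.

```python
from collections import Counter


def donut_levels(occupation_paths):
    """Count occupations for the inner ring (level 1) and outer ring (level 2).

    Any third level is ignored, so its entries are amalgamated into
    their second-level parent.
    """
    inner = Counter()
    outer = Counter()
    for path in occupation_paths:
        inner[path[0]] += 1
        if len(path) > 1:
            outer[(path[0], path[1])] += 1
    return inner, outer


# Made-up occupation paths; third-level names are placeholders
paths = [
    ("Religion and Clergy", "Minister/Priest", "Placeholder A"),
    ("Religion and Clergy", "Minister/Priest", "Placeholder B"),
    ("Education", "Teacher"),
]
inner, outer = donut_levels(paths)
print(inner, outer)
```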

The charts are a nice way to get an overall picture of the demographics of the selected library and also to compare the demographics of different libraries.  For example, Chambers has a very broad spread of occupations:

Whereas Advocates (as you’d expect) is more focussed:

You can hover over each segment to view the title of the occupation and the percentage of borrowers that have the occupation.  You can also click on a segment to perform a search for the occupation at the library in question.  You should bear in mind that a borrower can have multiple (or no) occupations so the number of borrowers and the number of occupations will be different.  The next thing I’ll do is create a further donut chart showing the number of borrowings per occupation, but I’ll leave that to next week.

I also thought some more about the thorny issue of number of borrowings versus number of borrowing records.  What I think I’ll try and do is ensure that the only number that is shown is the number of associated borrowing records.  I’ll need to experiment with how (or if) this might work, which I will leave until next week.

I also devoted some time to setting up the website for Ophira Gamliel’s new AHRC project.  I created several mockups of possible interface designs including a few different header images based on photographs of manuscripts that the project will analyse.  Ophira and Co-I Ines Weinrich picked the version they liked best and I then applied this to the live site.  I also worked on a static map of the area the project will focus on.  I created this by initially creating an interactive map using the lovely Stamen Watercolor basemap and then I used Photoshop to add other labels and to fade out areas the project is not focussing on.  The map still needs some work but here is an initial version:

I spent a bit of time working for the DSL, responding to a couple of queries from Pauline Graham about Google Analytics.  Pauline wondered whether we could find search terms that people had entered that found no results in GA, but after some research I don’t think it would be possible as the page that GA will log is the same whether there are results or not.  We’d have to update the site structure to maybe load a different page (with the search terms in the URL) when there are no results and then we’d be able to isolate these page hits in GA.  However, we can limit the list of page stats in GA to a specific page and then order by number of views, which may give us some idea of which searches find nothing.  After selecting ‘Pages and screens’ in ‘Engagement’, pressing the blue plus in the results table header opens a pop-up, and selecting ‘Page / Screen’ then ‘Landing page + query string’ allows you to add a new column featuring the page URL.  Then above the table in the search area (with the magnifying glass) enter ‘/results/’ to limit the data to the search results page.  You can update the ‘rows per page’ and press the arrow next to ‘views’ to order by views from least to most.

I also spent some time on the Speech Star project this week, updating the ExtIPA charts to add in ‘sound name’ and ‘context’ to the metadata and updating the website to ensure this new information is displayed alongside the videos.  I also added in new metadata for the IPA charts, but for the moment this is only available for the MRI-2 videos.  I’ll need to add in metadata for the other video types later.

I also spent a bit more time working on the site migration, getting https://thepeoplesvoice.glasgow.ac.uk/ working again, fixing an issue with the place-name images upload for the Iona project and making updates to a number of other sites.