I completed work on the integration of genre into the Books and Borrowing systems this week. It took a considerable portion of the week to finalise the updates but it’s really great to have it done, as it’s the last major update to the project.
My first task was to add genre selection to the top-level ‘Browse Editions’ page, which I’m sure will be very useful. As you can see in the following screenshot, genres now appear as checkboxes as with the search form, allowing users to select one or more they’re interested in. This can be done in combination with publication date too. The screenshot shows the book editions that are either ‘Fiction’ or ‘Travel’ that were published between 1625 and 1740. The selection is remembered when the user changes to a different view (i.e. authors or ‘top 100’) and when they select a different letter from the tabs.
It proved to be pretty tricky and time-consuming to implement. I realised that not only did the data that is displayed need to be updated to reflect the genre selection, but the counts in the letter tabs needed to be updated too. This may not seem like a big thing, but the queries behind it took a great deal of thought. I also realised whilst working on the book counts that the counts in the author tabs were wrong – they were only counting direct author associations at edition level rather than taking higher level associations from works into consideration. Thankfully this was not affecting the actual data that was displayed, just the counts in the tabs. I’ve sorted this too now, which also took some time.
With this in place I then added a similar option to the in-library ‘Book’ page. This works in the same way as the top-level ‘Editions’ page, allowing you to select one or more genres to limit the list of books that are displayed – for example only books in the genres of ‘Belles Lettres’ and ‘Fiction’ at Chambers, ordered by title, or the most popular ‘Travel’ books at Chambers. This did unfortunately take some time to implement, as Book Holdings are not exactly the same as Editions in terms of their structure and connections, so even though I could reuse much of the code that I’d written for Editions, many changes needed to be made.
The new Solr core was also created and populated at Stirling this week, after which I was able to migrate my development code from my laptop to the project server, meaning I could finally share my work with the rest of the team.
I then moved onto adding genre to the in-library ‘facts’ page and the top-level ‘facts’ page. Below is a very long screenshot of the entire ‘facts’ page for Haddington library and I’ll discuss the new additions below:
The number of genres found at the library is now mentioned in the ‘Summary’ section and there is now a ‘Most popular genres’ section, which is split by gender as with the other lists. I also added in pie charts showing book genres represented at the library and the percentage of borrowings of each genre. Unfortunately these can get a bit cluttered due to there being up to 20-odd genres present, so I’ve added in a legend showing which colour is which genre. You can hover over a slice to view the genre and its value, and you can click on a slice to perform a search for borrowing records featuring a book of the genre in the library. Despite being a bit cluttered I think the pies can be useful, especially when comparing the two charts – for example at Haddington ‘Theology’ books make up more than 36% of the library but only 8% of the borrowings.
Due to the somewhat cluttered nature of the pie charts I also experimented with a treemap view of Genre. I had stated we would include such a view in the requirements document, but at that time I had thought genre would be hierarchical, and a treemap would display the top-level genres and the division of lower level genres within these. Whilst developing the genre features I realised that without this hierarchy the treemap would merely replicate the pie chart and wouldn’t be worth including.
However, when the pie charts turned out to be so cluttered I decided to experiment with treemaps as an alternative. The results currently appear after the pie charts in the page. I initially liked how they looked – the big blocks look vaguely ‘bookish’ and having the labels in the blocks makes it easier to see what’s what. However, there are downsides. Firstly, it can be rather difficult to tell which genre is the biggest, due to the blocks having different dimensions – does a tall, thin block have a larger area than a shorter, fatter block, for example? It’s also much more difficult to compare two treemaps as the position of the genres changes depending on their relative size. Thankfully the colour stays the same, but it takes longer than it should to ascertain where a genre has moved to in the other treemap and how its size compares. I met with the team on Friday to discuss the new additions and we agreed that we could keep the treemaps, but that I’d add them to a separate tab, with only the pie charts visible by default.
I then added in the ‘borrowings over time by genre’ visualisation to the in-library and top level ‘facts’ pages. As you can see from the above screenshot, these divide the borrowings in a stacked bar chart per year (or per month if a year is clicked on) into genre, much in the same way as the preceding ‘occupations’ chart. Note however that the total numbers for each year are not the same as for the occupations through time visualisation, as books may have multiple genres and borrowers may have multiple occupations, and the counts reflect the number of times a genre / occupation is associated with a borrowing record each year (or month if you drill down into a year). We might need to explain this somewhere.
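To make the counting rule concrete, here is a minimal sketch (the field names are my own, not the project’s): a borrowing record tagged with two genres contributes two counts to its year, which is why the yearly totals differ between the genre and occupation charts.

```python
from collections import Counter

def genre_counts_per_year(borrowings):
    """Tally one count per genre association per borrowing record.

    A borrowing with two genres contributes two counts, so yearly
    totals can exceed the number of borrowing records.
    """
    counts = Counter()
    for rec in borrowings:
        for genre in rec["genres"]:
            counts[(rec["year"], genre)] += 1
    return counts

# Two borrowings in 1750, one tagged with two genres: three counts in total.
sample = [
    {"year": 1750, "genres": ["Fiction", "Travel"]},
    {"year": 1750, "genres": ["Theology"]},
]
totals = genre_counts_per_year(sample)
```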
We met on Friday to discuss the outstanding tasks. We’ll probably go live with the resource in January, but I will try to get as many of my outstanding tasks completed before Christmas as possible.
Also this week I fixed another couple of minor issues with the Dictionaries of the Scots Language. The WordPress part of the site had defaulted to using the new, horrible blocks interface for widgets after a recent update, meaning the widgets I’d created for the site no longer worked. Thankfully installing the ‘Classic Widgets’ plugin fixed the issue. I also needed to tweak the CSS for one of the pages where the layout was slightly wonky.
I also made a minor update to the Speech Star site and made a few more changes to the new Robert Fergusson site, which has now gone live (see https://robert-fergusson.glasgow.ac.uk/). I also had a chat with our IT people about a further server switch that is going to take place next week and responded to some feedback about the new interactive map of Iona placenames I’m developing.
Also this week I updated the links to one of the cognate reference websites (FEW) from entries in the Anglo-Norman Dictionary, as the website had changed its URL and site structure. After some initial investigation it appeared that the new FEW website made it impossible to link to a specific page, which is not great for an academic resource that people will want to bookmark and cite. Ideally the owners of the site should have placed redirects from the pages of the old resource to the corresponding page on the new resource (as I did for the AND).
The old links to the FEW as found in the AND (e.g. the FEW link that before the update was on this page: https://anglo-norman.net/entry/poer_1) were formatted like so: https://apps.atilf.fr/lecteurFEW/lire/volume/90/page/231, which now gives a ‘not found’ error. The above URL has the volume number (9, which for reasons unknown to me was specified as ‘90’) and the page number (231). The new resource is found here: https://lecteur-few.atilf.fr/ and it lets you select a volume (e.g. 9: Placabilis-Pyxis) and enter a page (e.g. 231), which then updates the data on the page (e.g. showing ‘posse’ as the original link from AND ‘poer 1’ used to do). But crucially, their system does not update the URL in the address bar, meaning no-one can cite or bookmark their updated view, and it looked like we couldn’t link to a specific view.
Thankfully Geert noticed that another cognate reference site (the DMF) had updated their links to use new URLs that are not documented on the FEW site, but do appear to work (e.g. https://lecteur-few.atilf.fr/lire/90/231). This was quite a relief to discover as otherwise we would not have been able to link to specific FEW pages. Once I knew this URL structure was available, updating the URLs across the site was quick.
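The rewrite itself is mechanical – something along these lines (a sketch of the approach rather than the actual script I ran, which works on entries in the database):

```python
import re

# Match the old apps.atilf.fr FEW reader links, capturing volume and page.
OLD_FEW = re.compile(
    r"https://apps\.atilf\.fr/lecteurFEW/lire/volume/(\d+)/page/(\d+)"
)

def rewrite_few_link(url):
    """Rewrite an old FEW link to the undocumented new URL structure.

    Anything that doesn't match the old pattern is left untouched.
    """
    m = OLD_FEW.match(url)
    if not m:
        return url
    volume, page = m.groups()
    return f"https://lecteur-few.atilf.fr/lire/{volume}/{page}"
```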
Finally this week, I had a meeting with Clara Cohen and Maria Dokovova to discuss a possible new project that they are putting together. This will involve developing a language game aimed at primary school kids and we discussed some possible options for this during our meeting. Afterwards I wrote up my notes and gave the matter some further thought.
I spent most of this week working towards adding genre to the Books and Borrowing front-end, working on a version running on my laptop. My initial task was to update the Solr index to add in additional fields for genre. With the new fields added I then had to update my script that generates the data for Solr to incorporate the fields. The Solr index is of borrowing records so as with authors, I needed to extract all genre associations at all book levels (work, edition, holding, item) for each book that was associated with a borrowing record, ensuring lower level associations replaced any higher level associations and removing any duplicates. This is all academic for now as all genre associations are at Work level, but this may not always be the case. It took a few attempts to get the data just right (e.g. after one export I realised it would be good to have genre IDs in the index as well as their names) and each run-through took about an hour or so to process, but all is looking good now. I’ll need to ask Stirling IT to create a new Solr core and ingest the new data on the server at Stirling as this is not something I have the access to do myself, and I’ll do this next week. The screenshot below shows one of the records in Solr with the new genre fields present.
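The cascade rule for genre can be sketched like so (this is an illustrative Python rendering of the logic, not the project’s actual PHP export script): associations are gathered from the highest book level down, any non-empty lower level replaces what came from above, and duplicates are removed before the values go into the Solr document.

```python
# Book levels from highest to lowest; a lower level's genre associations,
# where present, replace those inherited from above.
BOOK_LEVELS = ["work", "edition", "holding", "item"]

def genres_for_borrowing(associations):
    """associations maps a level name to a list of genre names.

    Returns the de-duplicated genres for the lowest level that has any.
    """
    genres = []
    for level in BOOK_LEVELS:
        level_genres = associations.get(level, [])
        if level_genres:
            genres = level_genres  # lower level overrides higher levels
    return sorted(set(genres))
```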
With Solr updated I then began updating the front-end, in a version of the site running on my laptop. This required making significant updates to the API that generates all of the data for the front-end by connecting to both Solr and the database as well as updating the actual output to ensure genre is displayed. I updated the search forms (both simple and advanced) to add in a list of genres from which you can select any you’re interested in (see the following two screenshots) and updated the search facilities to enable the selected genres to be searched, either on their own or in combination with the other search options.
On the search results page any genres associated with a matching record are displayed, with associations at higher book levels cascading down to lower book levels (unless the lower book level has its own genre records). Genres appear in the records as clickable items, allowing you to perform a search for a genre you’re interested in by clicking on it. I’ve also added in genre as a filter option down the left of the results page. Any genres present in the results are listed, together with a count of the number of associated records, and you can filter the results by pressing on a genre, as you can see in the following screenshot, which shows the results of a quick search for ‘Egypt’, displaying the genre filter options and showing the appearance of genre in the records.
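The genre filter list is the kind of thing Solr faceting handles; the following is a hedged sketch of how such a request might be assembled (the ‘genre’ field name and the exact parameters are assumptions, not necessarily what the project’s API sends):

```python
def genre_facet_params(query, selected_genres=()):
    """Build Solr parameters that return each genre present in the
    results together with its record count; a clicked filter becomes
    an additional filter query (fq)."""
    params = {
        "q": query,
        "facet": "true",
        "facet.field": "genre",   # assumed field name
        "facet.mincount": "1",    # only genres actually present in the results
    }
    params["fq"] = [f'genre:"{g}"' for g in selected_genres]
    return params
```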
Genre is displayed in a similar way wherever book records appear elsewhere in the site, for example the lists of books for a library, the top-level ‘book editions’ page and when viewing a specific page in a library register.
There is still more to be done with genre, which I’ll continue with next week. This includes adding in new visualisations for genre, adding in new ‘facts and figures’ relating to genre and adding in facilities to limit the ‘browse books’ pages to specific genres. I’ll keep you posted next week.
I also spent some time going through the API and front-end fixing any notifications and warnings given by the PHP scripting language. These are not errors as such, just messages that PHP logs when it thinks there might be an issue, for example if a variable is referenced without it being explicitly instantiated first. These messages get added to a log file and are never publicly displayed (unless the server is set to display them) but it’s better to address them to avoid cluttering up the log files so I’ve (hopefully) sorted them all now. Also for the project this week I generated a list of all book editions that currently have no associated book work. There are currently 2474 of these and they will need to be investigated by the team.
I also met with Luca Guariento and Stevie Barret to have a catch-up and also to compile a list of key responsibilities for a server administrator who would manage the Arts servers. We discovered this week that Arts IT Support is no longer continuing, with all support being moved to central IT Services. We still have our own servers and require someone to manage them so hopefully our list will be taken into consideration and we will be kept informed of any future developments.
Also this week I created a new blog for a project Gavin Miller is setting up, fixed an issue that took down every dictionary entry in the Anglo-Norman Dictionary (caused by one of the project staff adding an invalid ID to the system) and completed the migration of the old Arts server to our third-party supplier.
I also investigated an issue with the Place-names of Mull and Ulva CMS that was causing source details to be wiped. The script that populates the source fields when an existing source is selected from the autocomplete list was failing to load in data. This meant that all other fields for the source were left blank, so when the ‘Add’ button was pressed the script assumed the user wanted all of the other fields to be blank and therefore wiped them. This only happened very infrequently, and what I reckon happened is that the data for the source that failed included a character that is not permitted in JSON data (maybe a double quote or a tab), meaning that when the script tried to grab the data it failed to parse it and silently failed to populate the required fields. I therefore updated the script that returns the source fields so that double quotes and tab characters are stripped out of the fields before the data is returned. I also created a script based on this that outputs all sources as JSON data to check for errors and thankfully the output is valid JSON.
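The fix amounts to something like the following sketch (in Python rather than the CMS’s actual PHP; note that a proper JSON encoder escapes these characters anyway, which is why the checking script’s output validates):

```python
import json

def sanitise_source_fields(fields):
    """Strip characters that were breaking the hand-built JSON payload
    before it is returned to the autocomplete script."""
    return {
        key: value.replace('"', "").replace("\t", " ")
        if isinstance(value, str) else value
        for key, value in fields.items()
    }

# Example: a source title containing both problem characters.
payload = json.dumps(sanitise_source_fields({"title": 'A "quoted"\ttitle'}))
```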
I also made a couple of minor tweaks to the Dictionaries of the Scots Language site, fixing an issue with the display of the advanced search results that had been introduced when I updated the code prior to the site’s recent migration to a new server and updating the wording of the ‘About this entry’ box. I also had an email conversation with Craig Lamont about a potential new project and spoke to Clara Cohen about a project she’s putting together.
This week I began the major task of integrating book genre with the Books and Borrowing dataset. The team had been working on a spreadsheet that enabled them to assign top-level Book Work records to more than 13,000 Book Edition records and also assign up to three genres to each Work. I had to write a script to parse this data, which involved extracting and storing the distinct genres, creating Book Work records, assigning Book Work authors, adding in associations to Book Edition records, deleting any author associations at Edition level and creating associations between Works and genres. It took the best part of two days to create and test the script, running it on a local version of the data stored on my laptop. After final testing the number of active Book Works increased from 75 to 9,808 and the number of active Book Editions that have a Work association grew from 72 to 13,099. The number of genre connections for Works stood at 11,536 and the number of active Book Works that have at least one author association stood at 9,808, up from 70, while the number of active Book Editions with at least one direct author association decreased to 2,191 from 14,384, due to the author association being shifted up to Work (and it will cascade from there).
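The parsing stage can be sketched roughly as follows (the column headings here are placeholders I’ve invented for illustration, not the team’s actual spreadsheet headings, and the real script also handles authors and database writes):

```python
import csv
from io import StringIO

def parse_genre_spreadsheet(csv_text):
    """Extract distinct works, edition-to-work links and work-to-genre
    associations from rows of (edition, work, up to three genres)."""
    works, edition_to_work, work_genres = {}, {}, {}
    for row in csv.DictReader(StringIO(csv_text)):
        work = row["work_title"].strip()
        works.setdefault(work, len(works) + 1)      # assign a new work ID
        edition_to_work[row["edition_id"]] = works[work]
        genres = {row[col].strip()
                  for col in ("genre1", "genre2", "genre3")
                  if row[col].strip()}              # skip empty genre cells
        work_genres.setdefault(works[work], set()).update(genres)
    return works, edition_to_work, work_genres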
With the data import sorted I then moved onto updating the project’s content management system to incorporate all facilities to add, edit, browse and delete genres. This included creating facilities for associating genres with book records at any level (from Work down to Item) wherever books can be edited in the CMS. The ‘Browse Genres’ page works in a similar way to ‘Browse Authors’, giving you a list of genres and a count of the number of each book at each level that has an association, as the following screenshot shows:
Pressing on a number opens a pop-up containing a list of the associated books and you can connect through to each book record from this. As with authors, genre will cascade down from whichever level of book it is associated with to all lower levels. You only need to make an association at a lower level if it differs from the genre at a higher level. The counts in the ‘browse’ page show only the direct associations, so for now there are no editions or lower with any numbers listed. Wherever a book at any level can be edited in the CMS a new ‘Genre’ section has been added to the edit form. This consists of a list of genres with checkboxes beside them, as the following screenshot demonstrates:
You can tick as many checkboxes as are required and when updating the record the changes will be made. I tested out the new genre features in the CMS and all seem to be working well. I also imported all of the genre data so hopefully everything is now in place. Next week I will move onto the front-end, where there is much to do – not only making genre visible wherever books are viewed but updating the search facilities and adding in a number of new visualisations for genre as well. I also fixed a few issues with images of Registers from the Royal High School – a few that were missing I added in and the order of others needed to be updated.
Also this week I finalised the new project website for Rhona Brown’s new project. It’s not live yet, but my work on it is now complete. I was also involved in the migration of a number of my sites to a new server. As always seems to be the case, the DSL website migration did not go very smoothly, with the DNS update taking many hours to propagate, and in the meantime the domain was serving up the Anglo-Norman Dictionary, which was not good at all. This wasn’t something I had any direct control over, unfortunately, but thankfully the situation rectified itself the following day.
I also had to make a number of tweaks to the data in the Child speech error database for Speech Star, after many transcriptions were revised. I also updated the Mull / Ulva place-names CMS to add in a facility to export place-names for publication limited by one or more selected islands. In addition I began creating a new website for a project Gavin Miller is running and I created some new flat spreadsheet exports of the Historical Thesaurus for Fraser Dallachy and Marc Alexander to work with.
I spent the first two days of this week finishing off checking my locally hosted sites were compatible with PHP8, upgrading the version of jQuery they use and dealing with any warnings and notifications thrown by PHP. The dictionary sites (Anglo-Norman and the Dictionaries of the Scots Language) were especially time consuming to deal with, and the DSL sites needed to be upgraded from a very early version of jQuery and jQuery UI, which was a bit of a nightmare. However, I’d completed all of the required work on Tuesday, which was quite a relief as it’s been a long and tedious task. I discovered later in the week that the server these sites are hosted on, which we purchased less than two years ago, is now going to be decommissioned, and all of the sites will be moved to an entirely different server. This was a bit of a shock, and will require yet another round of migration and dealing with issues. I’m not directly involved with most of this, but I did have to help out with setting up Solr cores and importing data into Solr on the new server on Friday this week, which took some time to sort out when I was really needing to focus on other things, which was rather frustrating.
I spent most of the rest of the week continuing to develop the new map interface for the Iona place-names project, working on the Advanced Search, which is accessed by pressing the ‘Advanced search’ button in the ‘Search’ section of the map’s left-hand menu. This opens up a pop-up that contains all of the fields as found in previous place-names projects (e.g. https://berwickshire-placenames.glasgow.ac.uk/place-names/?p=search), but in a more compact layout. This pop-up is an ‘in map’ pop-up meaning that it still works when the map is in full-screen mode. It has been trickier than you might think to implement – it wasn’t just a case of taking the existing form and sticking it in the pop-up as fields such as parishes, codes and element languages are dynamically generated based on the available data and the existing form was generated on the server-side whereas the new form is generated on the client-side (i.e. by code running in the user’s browser). I therefore had to write a new series of scripts to generate the data on the server and make it available to be pulled into the form in the browser. This also applies to the ‘autocomplete’ fields (‘Source’ and ‘Element’), where you can type a few characters in and view a list of matching items.
Having the form in a pop-up in a map also made it tricky to get the tooltips (when you hover-over one of the ‘(?)’ icons) and the autocomplete selection lists working. The webpage is comprised of different elements, some of which sit on top of each other, and the order of these is controlled by something called the ‘z-index’. This tells the browser (for example) that when the pop-up opens, the map layer should be covered by a transparent grey layer and on top of this layer the popup should sit. But the tooltips and autocomplete use a different library that doesn’t expect there to be a map and then a popup sitting over the webpage, meaning these elements were appearing underneath the popup. It took a while to figure out these elements were actually working, but were being hidden.
However, I managed to deal with these issues and the upshot is that the advanced search form works. The following screenshot is an example of the form, with ‘Scottish Gaelic’ chosen as ‘Element Language’ and the autocomplete list visible for ‘Element’ after typing in the characters ‘du’.
There are still a number of tweaks I need to make to the search. Most importantly, your search form choices are not yet ‘remembered’ by the system and when you return to the search form any previously entered options are lost. I will sort this, along with the other things previously mentioned such as updating the URL to reflect the map contents, enabling bookmarks and citations, and also ensuring the map ‘remembers’ your display option choices too. I also implemented the ‘Attribution and Copyright’ pop-up this week (press on the link in the very bottom right to view this).
Also this week I sorted an issue that was preventing the bulk download of texts as a ZIP from the SCOTS corpus from working. Thankfully this was a simple permissions issue and was quick to resolve. I also received the data about Genre from Matt Sangster for the Books and Borrowing project and have arranged to focus on implementing this over the coming weeks. This will involve setting up the data structures to store genre data, writing and testing the scripts to import the data then running the scripts, updating the CMS to enable data to be managed, updating the Solr index and the scripts for generating Solr data for the front-end search to incorporate genre and updating the front-end to incorporate genre including the display, search and browse of genre. This is going to be a pretty major job.
I also spent some time adding new videos and metadata for American speakers to the Speech Star database (see https://www.seeingspeech.ac.uk/speechstar/speech-database/), spoke to Gavin Miller about the new project website I’m setting up for him, helped Matthew Creasy with an issue with his James Joyce conference website and created an interface for Rhona Brown’s new project website.
I came down with some sort of flu-like illness last Friday evening and was still unwell on Monday and unable to work. Thankfully I was well enough to work again on Tuesday, although getting through the day was hard work. I was also off on holiday on Friday this week so only ended up working three days. I’ll be on holiday all of next week as well as it’s the school half-term and we have a family holiday booked.
I was involved in the migration of the Historical Thesaurus website to a new server for a lot of this week. This required a lot of testing of the newly migrated site and a significant number of small updates to the code to ensure everything worked properly. Thankfully by Thursday all was working well and I was able to go on my holiday without worrying about the site.
Also this week I did some further work on the Books and Borrowing project, which included generating several different spreadsheets of book holdings that have no associated borrowing records and discussing the options of creating downloadable bundles of all data associated with each specific library.
I also did some work for the Dictionaries of the Scots Language, including investigating an issue with the new quotations search that is not yet live but is running on our test server. A phrase search for quotations was not working, but an identical phrase search using the full-text index was working fine. This was a bit of a strange one as it looks like the new Solr quotation search is not picking up the fact that a phrase search is being run. I tried running the search directly on the Solr instance I’d set up on my laptop and the same thing was happening: I gave it a phrase surrounded by double quotes but these were being ignored. An identical search on the fulltext Solr index picked up the presence of quotes and successfully performed a search for the phrase. The only difference between the two fields is that the fulltext field was set to ‘text_general’ while the quote search was set to ‘text_en’. I therefore set up a new version of the quote index with the field set to ‘text_general’ and this solved the problem. I’m still in the dark as to why, though, and I can’t find any information online about the issue.
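For reference, the comparison I was running amounted to issuing the same quoted phrase against the two fields, along these lines (a sketch only – the core and field names here are assumptions standing in for the real schema):

```python
from urllib.parse import urlencode

def phrase_query_url(field, phrase, core="dsl"):
    """Build a Solr select URL that searches one field for an exact
    phrase (the phrase is wrapped in double quotes)."""
    params = {"q": f'{field}:"{phrase}"', "wt": "json"}
    return f"http://localhost:8983/solr/{core}/select?{urlencode(params)}"

# The same phrase against the 'quote' and 'fulltext' fields behaved
# differently; only the field name in the query string changes.
```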
I also responded to a request from Craig Lamont in Scottish Literature about a new proposal he’s putting together. If it gets funded I’ll be involved with the project, making a website, an interactive map and a timeline. I also had a conversation with Rhona Brown about the website for her new project, which I’ll set up after I’m back from my holiday.
This was a week of many different projects. On Monday I completed work on a new project website for Petra Poncarova in Scottish Literature, and it is now publicly accessible (see https://erskine.glasgow.ac.uk/). I also added a blog page to Ophira Gamliel’s project website, created a page for their first blog post (now available here: https://himuje-malabar.glasgow.ac.uk/reconnecting-the-split-moon/) and updated the site to include a link to the blog in the site menu. This required shifting a few things around to make room for the new menu item. I also investigated an issue Luca was having in migrating one of Graeme Cannon’s old websites which was similarly structured to the House of Fraser Archive site and managed to find the section of code that was causing the problem (a flag in a regular expression that has since been deprecated).
On Tuesday I completed my work on the CSV endpoints for the Books and Borrowing project, ensuring all nested arrays are ‘flattened’ when producing the two-dimensional CSV file. This has been a lengthy and tedious task, but it’s good that it’s done, and it should mean that future researchers will be able to extract and reuse the data in a relatively straightforward manner.
On Wednesday I met Luca and Stevie, two of my fellow College of Arts developers, to have a catch-up, which was hugely useful as always. We’ll hopefully meet up again in the next couple of months. I also responded to a request from Luca to help get some screenshots ready for print publication. Screenshots are generally 72 DPI but this is too low for print. I’ve previously got around this in Photoshop by loading the image then going to Image > Image Size. In the options you can untick ‘Resample Image’ and then update the ‘Resolution’ to whatever you want. I’ve never actually printed the resulting images to check any difference, but I’ve never had anyone come back and ask for better versions. I guess another option would be to take the screenshots on something like an iPad that natively runs at a higher DPI.
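The reason the ‘no resampling’ trick works is simple arithmetic: the pixel data is untouched and only the declared print size changes, since print size is pixels divided by DPI. A quick sketch of that relationship:

```python
def print_size_inches(width_px, height_px, dpi):
    """Print dimensions implied by a DPI value for a fixed pixel count.

    Raising the DPI metadata shrinks the print size without touching
    or resampling any pixels - the same effect as unticking
    'Resample Image' in Photoshop.
    """
    return (width_px / dpi, height_px / dpi)

# A 1440x1080 screenshot: huge at 72 DPI, sensibly sized at 300 DPI.
```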
Also on Wednesday I spent some time on the DSL, investigating an issue with Google Analytics for Pauline Graham and then investigating a problem with phrase searching and highlighting that Pauline had also noticed on both the live and test sites. When a phrase was searched for each individual word in the phrase was being highlighted in the entry, and then if you returned to the search results and went back to an entry from there no highlighting worked. Also some search results were not featuring snippets. This turned out to be three separate issues that needed to be investigated and fixed:
- Separate word highlighting: The default setting in the highlighting library I installed a few months ago highlighted each word in a string. If there were multiple words separated by spaces then all matching words would be highlighted. Thankfully the library (https://markjs.io/) has a setting (‘separateWordSearch’, which I’ve now set to false) that only matches the entire string. Now if you perform a search for ‘off or on’ or something and navigate to a result only the exact term will be highlighted.
- Losing the highlighting when navigating back to the results and then to an entry: This was a problem with spaces getting encoded between pages. They were becoming the URL-encoded equivalent (‘%20’ or ‘+’) and after that the string no longer matched. I’ve sorted this.
- Lack of snippets: The issue was down to the length of the entry. In Solr, the snippet generation is a separate process to the search matching. While the search checks the entire entry the snippet generation by default only looks at the first 51,200 characters. An entry such as ‘Mak’ is a long entry and if the search term only matches text quite far down the entry a snippet doesn’t get created. After discovering this I’ve updated the setting so that 100,000 characters are analysed instead and this has fixed the issue. More information about this can be found at https://stackoverflow.com/questions/52511154/solr-empty-highlight-entry-on-match.
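For the snippet issue, the relevant highlighting parameters look roughly like this on the Solr request (the field name is an assumption; the 51,200-character figure is Solr’s documented default for hl.maxAnalyzedChars):

```python
def highlight_params(query, field="entry"):
    """Solr highlighting parameters with the analysed-character limit
    raised, so matches deep inside long entries still produce snippets."""
    return {
        "q": query,
        "hl": "true",
        "hl.fl": field,                    # field(s) to generate snippets from
        "hl.maxAnalyzedChars": "100000",   # raised from the 51,200 default
    }
```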
This investigation took some of Thursday as well, after which I moved back to the Books and Borrowing project, for which I spent some time generating data relating to the Royal High School for checking purposes. I also received some bid documentation for a proposal Gavin Miller is putting together. Gavin wanted me to read through the documentation and add in some further sections relating to the data. The data will consist of a directory of projects and resources which will be available to search and browse, plus will be visualised on an interactive map. I added in some information and hopefully the proposal is a success.
On Friday I made some further updates to the Speech Star websites, adding in some new videos to the Edinburgh MRI Modelled Speech Corpus (https://www.seeingspeech.ac.uk/speechstar/edinburgh-mri-modelled-speech-corpus/) and arranging their layout a bit better. I also replied to a request from Rhona Brown, who would like a website to be set up for a new project she’s starting work on soon. I listed a few options we could pursue and I need to wait to hear more from her now.
I also spent quite a bit of time investigating some minor issues Ann Ferguson had spotted with the predictive search on the DSL website, most of which will thankfully be sorted when the new Solr-based headword search goes live.
Finally, I had a meeting with the Placenames of Iona project to discuss the development of a new ‘map first’ interface for the data. I met with Thomas, Sofia and Alasdair and it was really great to actually have an in person meeting with them, having never done so before. We discussed many aspects of the interface and had some really useful discussions. I’ll be starting on the development of the front-end in the coming weeks.
I had my PDR session on Monday this week, which was all very positive. There was also one further UCU strike day on Wednesday this week, cutting my working days down to four. The project I devoted most of the available time to was Books and Borrowing. Last week I had begun reworking the API to make it more usable and this week I completed this task, adding in a few endpoints that I’d created but hadn’t added to the documentation. I then moved onto the task of adding ‘Download data’ links to the front-end. These links now appear as buttons beside the ‘Cite’ button on any page that displays data, as you can see in the following screenshot:
Pressing the button loads the API endpoint used to return the data found on the page, with ‘CSV’ rather than ‘JSON’ selected as the file type. This prompts the file to be downloaded by the browser rather than the data loading into the browser tab. It took a bit of time to add these links to every required page on the site, but I think I’ve got them all. However, the CSV downloads still needed quite a lot of work. When formatted as JSON, any data held in nested arrays is properly transformed and usable, but a CSV is a flat file consisting of columns and rows, and the data has a more complicated structure than this. For example, if we have one row in the CSV file for each borrowing record on a register page, the record may have multiple associated borrowers, each with any number of occupations consisting of multiple fields. The record’s book holding may have any number of book items and may be associated with multiple book editions, and there may be multiple authors associated with any level of book record (item, holding, edition and work). Representing this structure in a simple two-dimensional spreadsheet is very tricky and requires the data to be ‘flattened’. To do so, a script needs to work out the maximum number of each variable item that any record in the returned data has, in order to create the required columns (with heading labels), and then pad out any records that don’t have the maximum number of items with empty columns so that the columns of all records line up.
So, for example, when looking at borrowers: If borrowing row number 16 out of 20 has a borrower with five occupations then column headings need to be added for five sets of occupation columns and the data for the remaining 19 rows needs to be padded out with empty data to ensure any columns that appear after occupations continue to line up. As a borrowing may involve multiple borrowers this then becomes even more complicated.
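The padding step can be sketched like this, using a single invented ‘occupations’ field for illustration (the real data nests several levels deeper, but the flattening principle is the same):

```python
import csv
import io

def flatten(records, list_field):
    """Pad a variable-length nested list so every row has the same columns."""
    # Work out the maximum number of items any record has for this field.
    width = max(len(r.get(list_field, [])) for r in records)
    headers = ["id"] + [f"{list_field}_{i + 1}" for i in range(width)]
    rows = []
    for r in records:
        items = r.get(list_field, [])
        # Pad shorter records with empty cells so all columns line up.
        rows.append([r["id"]] + items + [""] * (width - len(items)))
    return headers, rows

records = [
    {"id": 1, "occupations": ["Student"]},
    {"id": 2, "occupations": ["Minister", "Librarian", "Poet"]},
]
headers, rows = flatten(records, "occupations")

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(headers)
writer.writerows(rows)
```

Record 1 ends up padded with two empty occupation cells so that any columns after occupations stay aligned with record 2’s.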
I managed to update the API to ensure nested arrays were flattened for several of the most complicated endpoints, such as a page of records and the search results. The resulting CSV files can become quite monstrously large, with over 200 columns of data a regular occurrence. However, with the data properly structured and labelled it should hopefully make it easier for users who are interested in the data to download the CSV and then delete the columns they are not interested in, resulting in a more manageable file. I still need to complete the ‘flattening’ of CSV data for a few other endpoints, which I hope to tackle next week.
Also this week I had an email discussion with Petra Poncarova, a researcher in Scottish Literature who is beginning a research project and requires a project website. I’ve arranged for hosting to be set up for this and by the end of the week we had the desired subdomain and WordPress installation. I spent a bit of time on Friday afternoon getting the structure and plugins in place and next week I’ll work on the interface for the website.
I also made a couple of further updates to the House of Fraser Archive this week. I’d completed most of the work last week but hadn’t managed to get the search facility working. After some suggestions from Luca I managed to figure out what the problem was (it turned out to be the date search part of the query that was broken) and the search is now operational. We even managed to get results highlighting in the records working again, which is something I wasn’t sure we’d be able to do.
The rest of my time was spent making updates to the Speech Star websites (and Seeing Speech). Eleanor had noticed some errors in the metadata for a couple of the videos in the IPA charts so I fixed these. There were also some better quality videos to add to the ExtIPA charts and some further updates to the metadata here too. Also for this project Jane Stuart-Smith contacted me to say that I had been erroneously categorised as ‘Directly Incurred’ rather than ‘Directly Allocated’ when the grant application had been processed, which is now causing some bother. I may have to create timesheets for my work on the project, but we’ll see what transpires.
On Monday and Tuesday this week I participated in the UCU strike action. On my return to work on Wednesday I focussed on writing a Data Management Plan for Jennifer Smith’s ESRC proposal that uses some of the SCOSYA data. After a few follow-up conversations I completed a version of the plan that Jennifer was happy with. I informed her that I’d be happy to help out with any further changes or discussions, but other than that my involvement is now complete.
I spent a fair bit of the remainder of the week trying to fix an old resource. I created the House of Fraser Archive site (https://housefraserarchive.ac.uk/) with Graeme Cannon more than twelve years ago, with Graeme doing the XML parts via an eXist-DB system and me doing the interface and all of the parts that processed and displayed data returned from eXist. Unfortunately the server the site was running on had to be taken offline and the system moved elsewhere. A newer version of eXist was required and the old libraries that were used to connect to the XML database would no longer work. I figured out a way to connect via an alternative method, but this then returned the data in a different structure. This meant I needed to update every page of the site that processed data to not only update the way the system was queried but also update the way the returned data was handled. This took quite a lot of time but I managed to get all of the ‘browse’ options plus the display of records, tags and images working. The only thing I couldn’t get to work was the search, as this seems to use further libraries that are no longer available. I think the issue is structuring the query to work with eXist, but I am not much of an expert with eXist and I’m not really sure how to untangle things. I’ve asked Luca if he could have a look at it, as he’s used eXist a lot more than I have. I’ve not heard back from him yet, but hopefully we’ll manage to get the search working; otherwise we may have to remove the search and get people to rely on the browse functions to access the data instead.
For the rest of the week I returned to working on the Books and Borrowing project. One thing on my ‘to do’ list is to sort out the API. There are a few endpoints that I haven’t documented yet, plus the existing documentation and structuring of the API could be improved. I spent some time adding in a license statement and a ‘table of contents’ that lists all endpoints. I’m currently in the middle of adding in the missing endpoint descriptions. After that I’ll need to ensure the examples given all work and make sense, and then I need to ensure the CSV output works properly for all data types. I’m fairly certain that some data held in arrays will not output properly as CSV at the moment and this definitely needs to be sorted.
I spent most of this week working for the Dictionaries of the Scots Language, working on the new quotation date search. I decided to work on the update on a version of the site and its data running on my laptop initially, as I have direct control over the Solr instance running on my laptop – something I don’t have on the server. My first task was to create a new Solr index for the quotations and to write a script to export data from the database in a format that Solr could then index. With over 700,000 quotations this took a bit of time, and I did encounter some issues, such as several tens of thousands of quotations not having date tags, meaning dates for the quotations could not be extracted. I had a lengthy email conversation with the DSL team about this and thankfully it looks like the issue is not something I need to deal with: data is being worked on in their editing system and the vast majority of the dating issues I’d encountered will be fixed the next time the data is exported for me to use. I also encountered some further issues that needed to be addressed as I worked with the data. For example, I realised I needed to add a count of the total number of quotes for an entry to each quote item in Solr to be able to work out the ranking algorithm for entries, and this meant updating the export script, the structure of the Solr index and then re-exporting all 700,000 quotations. Below is a screenshot of the Solr admin interface, showing a query of the new quotation index – a search for ‘barrow’.
With this in place I then needed to update the API that processes search requests, connects to Solr and spits out the search results in a suitable format for use on the website. This meant completely separating out and overhauling the quotation search, as it needed to connect to a different Solr index that featured data that had a very different structure. I needed to ensure quotations could be grouped by their entries and then subjected to the same ‘max results’ limitations as other searches. I also needed to create the ranking algorithm for entries based on the number of returned quotes vs the total number of quotes, sort the entries based on this and also ensure a maximum of 10 quotes per entry were displayed. I also had to add in a further search option for dates, as I’d already detailed in the requirements document I’d previously written. The screenshot below is of the new quotation endpoint in the API, showing a section of the results for ‘barrow’ in ‘snd’ between 1800 and 1900.
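The grouping and ranking steps can be sketched like this (the entry IDs and in-memory data structure are invented for illustration; the proportional ranking rule and the ten-quote cap are as described above):

```python
def rank_entries(grouped, max_quotes=10):
    """grouped maps entry id -> {'matched': [quote, ...], 'total': int}."""
    # Rank entries by the proportion of their quotes that matched the search.
    ranked = sorted(
        grouped.items(),
        key=lambda kv: len(kv[1]["matched"]) / kv[1]["total"],
        reverse=True,
    )
    # Cap the number of quotes displayed per entry.
    return [(entry, info["matched"][:max_quotes]) for entry, info in ranked]

entries = {
    "barrow_1": {"matched": ["q1", "q2"], "total": 4},  # 2 of 4 quotes match
    "barrow_2": {"matched": ["q3"], "total": 1},        # 1 of 1 quotes match
}
results = rank_entries(entries)
```

Under this rule the single-quote entry ranks first, which is exactly the behaviour questioned below.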
The next step was to update the front-end to add in the new ‘date’ drop-down when quotations are selected and then to ensure the new quotation search information could be properly extracted, formatted and passed to the API to return the relevant data. The following screenshot shows the search form. The explanatory text still needs some work as it currently doesn’t feel very elegant – I think there’s a ‘to’ missing somewhere.
The final step for the week was to deal with the actual results themselves, as they are rather different in structure to the previous results, as entries now potentially have multiple quotes, each of which contains information relating to the quote (e.g. dates, bib ID) and each of which may feature multiple snippets, if the term appears several times within a single quote. I’ve managed to get the results to display correctly and the screenshot below shows the results of a search for ‘barrow’ in snd between 1800 and 1900.
The new search also now lets you perform a Boolean search on the contents of individual quotations rather than across all quotations in an entry. So, for example, you can search for ‘Messages AND Wean’ in quotes from 1980-1999 and only find quotes that match both terms, whereas previously, if an entry featured one quote with ‘messages’ and another with ‘wean’, it would get returned. The screenshot below shows the new results.
There are a few things that I need to discuss with the team, though. The first is the ranking system. As previously agreed, entries are ranked based on the proportion of quotes that contain the search term. But this possibly ranks entries that only have one quote too highly. If there is only one quote and it features the term then 100% of quotes feature the term, so the entry is highly ranked, while longer, possibly more important entries are ranked lower because (for example) only 40 out of 50 quotes feature the term. We might want to look into weighting entries that have more quotes overall. For example, take an SND quotation search for ‘prince’ (see below). ‘Prince’ is ranked first, but results 2-6 appear next because they only have one quote, which happens to feature ‘prince’.
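As an illustration of one possible weighting (the logarithmic damping factor here is purely my own assumption, not something the team has agreed on), the proportion could be scaled by a factor that grows with the total number of quotes:

```python
import math

def weighted_score(matched, total):
    """Score an entry by match proportion, damped for entries with few quotes."""
    proportion = matched / total
    # log(total + 1) grows slowly, so large entries gain a modest boost
    # while single-quote entries are held back.
    return proportion * math.log(total + 1)

weighted_score(1, 1)    # a single matching quote no longer dominates
weighted_score(40, 50)  # 40 of 50 quotes now outranks 1 of 1
```

Any monotone damping function would do; the point is simply that single-quote entries stop automatically outranking larger ones.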
The second issue is that the new system cuts off quotations for entries after the tenth (as you can see for ‘Prince’, above). We’d agreed on this approach to stop entries with lots of quotes swamping the results, but currently nothing is displayed to say that the results have been snipped. We might want to add a note under the tenth quote.
The third issue is that the quote field in Solr is currently stemmed, meaning the stems of words are stored and Solr can then match alternative forms. This can work well – for example the ‘messages AND wean’ results include results for ‘message’ and ‘weans’ too. But it can also be a bit too broad. See for example the screenshot below, which shows a quotation search for ‘aggressive’. As you can see, it has returned quotations that feature ‘aggression’, ‘aggressively’ and ‘aggress’ in addition to ‘aggressive’. This might be useful, but it might cause confusion and we’ll need to discuss this further at some point.
Next week I’ll hopefully start work on the filtering of search results for all search types, which will involve a major change to the way headword searches work and more big changes to the Solr indexes.
Also this week I investigated applying OED DOIs to the OED lexemes we link to in the Historical Thesaurus. Each OED sense now has its own DOI that we can get access to, and I was sent a spreadsheet containing several thousand as an example. The idea is that links from the HT’s lexemes to the OED would be updated to use these DOIs rather than performing a search of the OED for the word, which is what currently happens.
After a few hours of research I reckoned it would be possible to apply the DOIs to the HT data, but there are some things that we’ll need to consider. The OED spreadsheet looks like it will contain every sense and the HT data does not, so much of the spreadsheet will likely not match anything in our system. I wrote a little script to check the spreadsheet against the HT’s OED lexeme table: 6186 rows in the spreadsheet match one (or more) lexeme in the database table while 7256 don’t. I also noted that the combination of entry_id and element_id (in our database called refentry and refid) is not necessarily unique in the HT’s OED lexeme table. This can happen if a word appears in multiple categories, plus there is a further ID called ‘lemmaid’ that was sometimes used to differentiate specific lexemes in combination with the other two IDs. In the spreadsheet there are 1180 rows that match multiple rows in the HT’s OED lexeme table. However, this isn’t a problem either: it usually just means a word appears in multiple categories, so the same DOI would simply apply to multiple lexemes.
What is potentially a problem is that we haven’t matched up all of the OED lexeme records with the HT lexeme records. While 6186 rows in the spreadsheet match one or more rows in the OED lexeme table, only 4425 rows in the spreadsheet match one or more rows in the HT’s lexeme table. We will not be able to update the links to switch to DOIs for any HT lexemes that aren’t matched to an OED lexeme. After checking I discovered that there are 87,713 non-OE lexemes in the HT lexeme table that are not linked to an OED lexeme. None of these will be able to have a DOI (and neither will the OE words, presumably).
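The matching check can be sketched like this (the column names are the real ones mentioned above – entry_id/element_id in the spreadsheet, refentry/refid in our table – but the in-memory data structures are invented for illustration; the real script queries the database):

```python
def match_counts(spreadsheet_rows, oed_lexemes):
    """Count spreadsheet rows that match at least one OED lexeme row."""
    # Build a lookup of (refentry, refid) pairs from the lexeme table.
    keys = {(lex["refentry"], lex["refid"]) for lex in oed_lexemes}
    matched = sum(
        1 for row in spreadsheet_rows
        if (row["entry_id"], row["element_id"]) in keys
    )
    return matched, len(spreadsheet_rows) - matched

rows = [
    {"entry_id": 22, "element_id": 16201412},  # matches a lexeme
    {"entry_id": 99, "element_id": 1},         # matches nothing
]
lexemes = [{"refentry": 22, "refid": 16201412}]
matched, unmatched = match_counts(rows, lexemes)
```

Because the key pairs are not unique in the lexeme table, a matched row may correspond to several lexemes, which is why the DOI can end up applied to more than one.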
Another potential problem is that the sense an HT lexeme is linked to is not necessarily the main sense for the OED lexeme. In such cases the DOI leads to a section of the OED entry that is only accessible to logged-in users of the OED site. An example from the spreadsheet is ‘aardvark’. Our HT lexeme links to entry_id 22, element_id 16201412, which has the DOI https://doi.org/10.1093/OED/1516256385; when you’re not logged in this displays a ‘Please purchase a subscription’ page. The other entry for ‘aardvark’ in the spreadsheet has entry_id 22 and element_id 16201390, which has the DOI https://doi.org/10.1093/OED/9531538482, which leads to the summary page, but the HT’s link will be the first DOI above and not the second. Note that currently we link to the search results on the OED site, which actually might be more useful for many people. ‘Aardvark’ as found here: https://ht.ac.uk/category/?type=search&qsearch=aardvark&page=1#id=39313 currently links to this OED page: https://www.oed.com/search/dictionary/?q=aard-vark
To summarise: I can update all lexemes in the HT’s OED lexeme table that match the entry_id and element_id columns in the spreadsheet to add in the relevant DOI. I can also then ensure that any HT lexeme records linked to these OED lexemes also feature the DOI, but this will apply to fewer lexemes due to there still being many HT lexemes that are not linked. I could then update the links through to the OED for these lexemes, but this might not actually work as well as the current link to search results due to many OED DOIs leading to restricted pages. I’ll need to hear back from the rest of the team before I can take this further.
Also this week I had a meeting with Pauline Mackay and Craig Lamont to discuss an interactive map of Burns’ correspondents. We’d discussed this about three years ago and they are now reaching a point where they would like to develop the map. We discussed various options for base maps, data categorisation and time sliders and I gave them a demonstration of the Books and Borrowing project’s Chambers Library map, which I’d previously developed (https://borrowing.stir.ac.uk/chambers-library-map/). They were pretty impressed with this and thought it would be a good model for their map. Pauline and Craig are now going to work on some sample data to get me started, and once I receive this I’ll be able to begin development. We had our meeting in the café of the new ARC building, which I’d never been to before, so it was a good opportunity to see the place.
Also this week I fixed some issues with images for one of the library registers for the Royal High School for the Books and Borrowing project. These had been assigned the wrong ID in the spreadsheet I’d initially used to generate the data and I needed to write a little script to rectify this.
Finally, I had a chat with Joanna Kopaczyk about a potential project she’s putting together. I can’t say much about it at this stage, but I’ll probably be able to use the systems I developed last year for the Anglo-Norman Dictionary’s Textbase (see https://anglo-norman.net/textbase-browse/ and https://anglo-norman.net/textbase-search/). I’m meeting with Joanna to discuss this further next week.
I was back at work this week after a lovely two-week holiday (although I did spend a couple of hours making updates to the Speech Star website whilst I was away). After catching up with emails, getting back up to speed with where I’d left off and making a helpful new ‘to do’ list I got stuck into fixing the language tags in the Anglo-Norman Dictionary.
In June the editor Geert noticed that language tags had disappeared from the XML files of many entries. Further investigation by me revealed that this probably happened during the import of data into the new AND system and had affected entries up to and including the import of R; entries that were part of the subsequent import of S had their language tags intact. It is likely that the issue was caused by the script that assigns IDs and numbers to <sense> and <senseInfo> tags as part of the import process, as this script edits the XML. Further testing revealed that the updated import workflow that was developed for S retained all language tags, as does the script that processes single and batch XML uploads as part of the DMS. This means the error has been rectified, but we still need to fix the entries that have lost their language tags.
I was able to retrieve a version of the data as it existed prior to batch updates being applied to entry senses and from this I was able to extract the missing language tags for these entries. I was also able to run this extraction process on the R data as it existed prior to upload. I then ran the process on the live database to extract language tags from entries that featured them, for example entries uploaded during the import of S. The script was also adapted to extract the ‘certainty’ attribute from the tags if present. This was represented in the output as the number 50, separated from the language by a bar character (e.g. ‘Arabic|50’). Where an entry featured multiple language tags these were separated by a comma (e.g. ‘Latin,Hebrew’).
Geert made the decision that language tags, which were previously associated with specific senses or subsenses, should instead be associated with entries as a whole. This structural change will greatly simplify the reinstatement of missing tags and it will also make it easier to add language tags to entries that do not already feature them.
The language data that I compiled was stored in a spreadsheet featuring three columns:
- Slug: the unique form of a headword used in entry URLs
- Live Langs: language tags extracted from the live database
- Old Langs: language tags extracted from the data prior to processing
A fourth column was also added where manual overrides to the preceding two columns could be added by Geert. This column could also be used to add entries that did not previously have a language tag but needed one.
Two further issues were addressed at this stage. The first related to compound words, where the language applied to only one part of the word. In the original data these were represented by combining the language with ‘A.F.’, for example ‘M.E._and_A.F.’. Continuing with this approach would make it more difficult to search for specific languages, so the decision was made to store only the non-A.F. language with a note that the word is a compound. This was encoded in the spreadsheet with a bar character followed by ‘Y’. To keep the data easily machine-readable, the compound flag is always the third part of the language data, whether or not certainty is present in the second part. For example, ‘M.E.|50|Y’ represents a word that is possibly from M.E. and is a compound, while ‘M.E.||Y’ represents a word that is definitely from M.E. and is a compound.
The second issue to be addressed was how to handle entries that featured languages but whose language tags were not needed. In such cases Geert added the characters ‘$$’ to the fourth column.
The spreadsheet was edited by Geert and currently features 2741 entries that are to be updated. Each entry in the spreadsheet will be edited using the following workflow:
- All existing language tags in the entry will be deleted. These generally occur in senses or subsenses, but some entries feature them in the <head> element.
- If the entry has ‘$$’ in column 4 then no further updates will be made
- If there is other data in column 4 this will be used
- If there is no data in column 4 then data from column 2 will be used
- If there is no data in columns 4 or 2 then data from column 3 will be used
- Where there are multiple languages separated by a comma these will be split and treated separately
- For each language the presence of a certainty value and / or a compound will be ascertained
- In the XML the new language tags will appear below the <head> tag.
- An entry will feature one language tag for each language specified
- The specific language will be stored in the ‘lang’ attribute
- Certainty (if present) will be stored in the ‘cert’ attribute which may only contain ‘50’ to represent ‘uncertain’.
- Compound (if present) will be stored in a new ‘compound’ attribute which may only contain ‘true’ to denote the word is a compound.
- For example, ‘Latin|50,Greek|50’ will be stored as two <language> tags beneath the <head> tag as follows: <language lang="Latin" cert="50" /><language lang="Greek" cert="50" /> while ‘M.E.||Y’ will be stored as: <language lang="M.E." compound="true" />
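The encoding rules above can be sketched as a small parsing function (the function name is invented; the bar-separated format and the ‘lang’, ‘cert’ and ‘compound’ attributes are those described in the workflow):

```python
def language_tags(spec):
    """Turn 'Lang|certainty|compound' spreadsheet data into <language/> tags."""
    tags = []
    for part in spec.split(","):          # multiple languages are comma-separated
        fields = part.split("|")
        lang = fields[0]
        # Second part: certainty ('50' for uncertain), may be empty.
        cert = fields[1] if len(fields) > 1 and fields[1] else None
        # Third part: 'Y' flags a compound, always in third position.
        compound = len(fields) > 2 and fields[2] == "Y"
        attrs = [f'lang="{lang}"']
        if cert:
            attrs.append(f'cert="{cert}"')
        if compound:
            attrs.append('compound="true"')
        tags.append(f"<language {' '.join(attrs)} />")
    return "".join(tags)

print(language_tags("Latin|50,Greek|50"))
```

Keeping the compound flag fixed in the third position is what makes this split unambiguous even when the certainty field is empty.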
I ran and tested the update on a local version of the data and the output was checked by Geert and me. After backing up the live database I then ran the update on it and all went well. The dictionary’s DTD also needed to be updated to ensure the new language tag can be positioned as an optional child element of the ‘main_entry’ element. The DTD was also updated to remove language as a child of ‘sense’, ‘subsense’ and ‘head’.
Previously the DTD had a limited list of languages that could appear in the ‘lang’ attribute, but I’m uncertain whether this ever worked, as the XML definitely included languages that were not in the list. Instead I created a ‘picklist’ for languages that pulls its data from a list of languages stored in the online database. We use this approach for other things such as semantic labels so it was pretty easy to set up. I also added in the new optional ‘compound’ attribute.
With all of this in place I then updated the XSLT and some of the CSS in order to display the new language tags, which now appear as italicised text above any part of speech. For example, an entry with multiple languages, one of which is uncertain: https://anglo-norman.net/entry/ris_3 and an entry that’s a compound with another language: https://anglo-norman.net/entry/rofgable. Eventually I will update the site further to enable searches for language tags, but this will come at a later date.
Also this week I spent a bit of time in email conversations with the Dictionaries of the Scots Language people, discussing updates to bibliographical entries, the new part of speech system, DOST citation dates that were later than 1700 and making further tweaks to my requirements document for the date and part of speech searches based on feedback received from the team. We’re all in agreement about how the new feature will work now, which means I’ll be able to get started on the development next week, all being well.
I also gave some advice to Gavin Miller about a new proposal he’s currently putting together, helped out Matthew Creasy with the website for his James Joyce Symposium, spoke to Craig Lamont about the Burns correspondents project and checked how the stats are working on sites that were moved to our newer server a while back (all thankfully seems to be working fine).
I spent the remainder of the week implementing a ‘cite this page’ feature for the Books and Borrowing project, and the feature now appears on every page that features data. A ‘Cite this page’ button appears in the right-hand corner of the page title. Pressing the button brings up a pop-up containing citation options in a variety of styles. I’ve taken this from other projects I’ve been involved with (e.g. the Historical Thesaurus) and we might want to tweak it, but at the moment something along the lines of the following is displayed (full URL crudely ‘redacted’ as the site isn’t live yet):
Developing this feature has taken a bit of time due to the huge variation in the text that describes the page. This can also make the citation rather long, for example:
Advanced search for ‘Borrower occupation: Arts and Letters, Borrower occupation: Author, Borrower occupation: Curator, Borrower occupation: Librarian, Borrower occupation: Musician, Borrower occupation: Painter/Limner, Borrower occupation: Poet, Borrower gender: Female, Author gender: Female’. 2023. In Books and Borrowing: An Analysis of Scottish Borrowers’ Registers, 1750-1830. University of Stirling. Retrieved 18 August 2023, from [very long URL goes here]
I haven’t included a description of selected filters and ‘order by’ options, but these are present in the URL. I may add filters and orders to the description, or we can just leave it as it is and let people tweak their citation text if they want.
The ‘cite this page’ button appears on all pages that feature data, not just the search results. For example register pages and the list of book editions. Hopefully the feature will be useful once the site goes live.