Week Beginning 21st August 2023

I spent most of this week working for the Dictionaries of the Scots Language on the new quotation date search.  I decided to work on the update using a version of the site and its data running on my laptop initially, as I have direct control over the Solr instance there – something I don’t have on the server.  My first task was to create a new Solr index for the quotations and to write a script to export data from the database in a format that Solr could then index.  With over 700,000 quotations this took a bit of time, and I did encounter some issues, such as several tens of thousands of quotations lacking date tags, meaning dates could not be extracted for them.  I had a lengthy email conversation with the DSL team about this and thankfully it looks like the issue is not something I need to deal with: the data is being worked on in their editing system and the vast majority of the dating issues I’d encountered will be fixed the next time the data is exported for me to use.  I also encountered some further issues that needed to be addressed as I worked with the data.  For example, I realised I needed to add a count of the total number of quotes for an entry to each quote item in Solr in order to work out the ranking algorithm for entries, and this meant updating the export script, changing the structure of the Solr index and then re-exporting all 700,000 quotations.  Below is a screenshot of the Solr admin interface, showing a query of the new quotation index – a search for ‘barrow’.
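In outline, the export script just walks the quotations table, attaches the per-entry totals and posts batches of JSON documents to Solr’s update handler.  Here’s a minimal Python sketch of the process – purely illustrative, as the table, field and core names are all hypothetical and the real script runs against the live database:

```python
import pymysql   # assumed driver for illustration; the live site uses its own stack
import requests

# Hypothetical Solr core name for the new quotation index
SOLR_UPDATE_URL = "http://localhost:8983/solr/dsl_quotes/update?commit=true"

conn = pymysql.connect(host="localhost", user="dsl", password="...", db="dsl",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # Table and column names here are guesses, not the real schema.
    cur.execute("""
        SELECT q.quote_id, q.entry_id, q.source, q.quote_text,
               q.year_from, q.year_to,
               (SELECT COUNT(*) FROM quotations q2
                 WHERE q2.entry_id = q.entry_id) AS entry_quote_count
          FROM quotations q
    """)
    batch = []
    for row in cur:
        batch.append({
            "id": row["quote_id"],
            "entry_id": row["entry_id"],
            "source": row["source"],              # 'snd' or 'dost'
            "quote": row["quote_text"],
            "year_from": row["year_from"],
            "year_to": row["year_to"],
            # total quotes per entry, needed later for the ranking algorithm
            "entry_quote_count": row["entry_quote_count"],
        })
        if len(batch) == 1000:                    # index in batches of 1,000
            requests.post(SOLR_UPDATE_URL, json=batch).raise_for_status()
            batch = []
    if batch:
        requests.post(SOLR_UPDATE_URL, json=batch).raise_for_status()
```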

With this in place I then needed to update the API that processes search requests, connects to Solr and spits out the search results in a suitable format for use on the website.  This meant completely separating out and overhauling the quotation search, as it needed to connect to a different Solr index featuring data with a very different structure.  I needed to ensure quotations could be grouped by their entries and then subjected to the same ‘max results’ limitations as other searches.  I also needed to create the ranking algorithm for entries based on the number of returned quotes vs the total number of quotes, sort the entries by this score and ensure a maximum of 10 quotes per entry were displayed.  I also had to add in a further search option for dates, as I’d already detailed in the requirements document I’d previously written.  The screenshot below is of the new quotation endpoint in the API, showing a section of the results for ‘barrow’ in ‘snd’ between 1800 and 1900.
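The grouping and capping can be pushed down to Solr itself via its result grouping feature, with the ranking then computed in the API.  A hedged sketch of the idea – field and core names are assumptions carried over from the export sketch above:

```python
import requests

SOLR_SELECT_URL = "http://localhost:8983/solr/dsl_quotes/select"  # hypothetical core

def quotation_search(term, source, year_from, year_to):
    """Sketch of the API's Solr call: group quote hits by entry, cap each
    entry at 10 quotes and rank entries by their matched/total quote ratio."""
    params = {
        "q": f"quote:({term})",
        "fq": [f"source:{source}",
               f"year_from:[{year_from} TO {year_to}]"],  # assumed date fields
        "group": "true",
        "group.field": "entry_id",
        "group.limit": 10,           # at most 10 quotes returned per entry
        "group.ngroups": "true",     # total entry count for 'max results' checks
        "rows": 400,
        "wt": "json",
    }
    data = requests.get(SOLR_SELECT_URL, params=params).json()
    entries = []
    for group in data["grouped"]["entry_id"]["groups"]:
        docs = group["doclist"]["docs"]
        matched = group["doclist"]["numFound"]     # matching quotes in this entry
        total = docs[0]["entry_quote_count"]       # total quotes in this entry
        entries.append({"entry_id": group["groupValue"],
                        "score": matched / total,  # the ranking ratio
                        "quotes": docs})
    # rank entries by the proportion of their quotes that matched
    return sorted(entries, key=lambda e: e["score"], reverse=True)
```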

The next step was to update the front-end to add in the new ‘date’ drop-down when quotations are selected and then to ensure the new quotation search information could be properly extracted, formatted and passed to the API to return the relevant data.  The following screenshot shows the search form.  The explanatory text still needs some work as it currently doesn’t feel very elegant – I think there’s a ‘to’ missing somewhere.

The final step for the week was to deal with the actual results themselves, which are rather different in structure to the previous results: entries now potentially have multiple quotes, each of which contains information relating to the quote (e.g. dates, bib ID) and may feature multiple snippets if the term appears several times within a single quote.  I’ve managed to get the results to display correctly and the screenshot below shows the results of a search for ‘barrow’ in SND between 1800 and 1900.
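For reference, the nested shape the front-end now has to render looks roughly like this – a hypothetical example, as the actual field names in the API output differ:

```python
# Hypothetical shape of one entry in the new results:
entry_result = {
    "entry_id": "snd12345",
    "headword": "barrow",
    "total_quotes": 12,        # all quotes for the entry
    "matched_quotes": 4,       # quotes containing the search term
    "quotes": [
        {
            "date": "1846",
            "bib_id": "bib9876",          # link to the bibliography record
            "snippets": [                 # one snippet per occurrence of the term
                "...wheeled his <em>barrow</em> up the brae...",
                "...left the <em>barrow</em> at the yett...",
            ],
        },
        # ...up to ten quotes per entry...
    ],
}
```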

The new search also now lets you perform a Boolean search on the contents of individual quotations rather than across all quotations in an entry.  So, for example, you can search for ‘Messages AND Wean’ in quotes from 1980-1999 and only find quotes that contain both terms, whereas previously an entry featuring one quote with ‘messages’ and a separate quote with ‘wean’ would also have been returned.  The screenshot below shows the new results.
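The difference comes down to where the Boolean is evaluated: in the new index each quotation is its own Solr document, so the whole expression must be satisfied within a single quote.  A sketch of the two behaviours, with the same assumed field names as above:

```python
# New behaviour: the Boolean is evaluated per quotation document, so both
# terms must appear within the same quote.
quote_level = {"q": "quote:(messages AND wean)",
               "fq": "year_from:[1980 TO 1999]"}   # assumed date field

# Old behaviour (conceptually): the terms only had to co-occur somewhere in
# the entry, so one quote with 'messages' plus a separate quote with 'wean'
# was enough for the entry to match.
```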

There are a few things that I need to discuss with the team, though.  Firstly, the ranking system.  As previously agreed upon, entries are ranked based on the proportion of quotes that contain the search term, but this possibly ranks entries that only have one quote too highly.  If there is only one quote and it features the term then 100% of quotes feature the term, so the entry is highly ranked, while longer, possibly more important entries are ranked lower because (for example) only 40 of their 50 quotes feature the term.  We might want to look into weighting entries that have more quotes overall.  Take, for example, an SND quotation search for ‘prince’ (see below): ‘Prince’ is ranked first, but results 2-6 appear next simply because each has only one quote, which happens to feature ‘prince’.
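One possible adjustment – just a suggestion at this stage, not something we’ve agreed – would be to smooth the ratio so that entries with very few quotes can’t hit the top spot on a single match:

```python
def smoothed_score(matched, total, k=5):
    """Downweight entries with few quotes: 1 matching quote out of 1 scores
    1/6 ≈ 0.17, while 40 out of 50 scores 40/55 ≈ 0.73, so the longer entry
    now outranks the single-quote one."""
    return matched / (total + k)
```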

The second issue is that the new system cuts off quotations for entries after the tenth (as you can see for ‘Prince’, above).  We’d agreed on this approach to stop entries with lots of quotes swamping the results, but currently nothing is displayed to say that the results have been snipped.  We might want to add a note under the tenth quote.

The third issue is that the quote field in Solr is currently stemmed, meaning the stems of words are stored and Solr can then match alternative forms.  This can work well – for example the ‘messages AND wean’ results include results for ‘message’ and ‘weans’ too.  But it can also be a bit too broad.  See for example the screenshot below, which shows a quotation search for ‘aggressive’.  As you can see, it has returned quotations that feature ‘aggression’, ‘aggressively’ and ‘aggress’ in addition to ‘aggressive’.  This might be useful, but it might cause confusion and we’ll need to discuss this further at some point.
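Solr’s English stemming behaves much like the standard Snowball stemmer, which makes it easy to demonstrate why these forms all match – they reduce to the same stem at index time.  A quick check using the NLTK implementation:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["aggressive", "aggression", "aggressively", "aggress"]:
    print(word, "->", stemmer.stem(word))   # every form prints '-> aggress'
```

If we decide this is too broad, one option would be to copy the quote text into a second, unstemmed field and let users choose between exact and stemmed matching.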

Next week I’ll hopefully start work on the filtering of search results for all search types, which will involve a major change to the way headword searches work and more big changes to the Solr indexes.

Also this week I investigated applying OED DOIs to the OED lexemes we link to in the Historical Thesaurus.  Each OED sense now has its own DOI that we can get access to, and I was sent a spreadsheet containing several thousand as an example.  The idea is that links from the HT’s lexemes to the OED would be updated to use these DOIs rather than performing a search of the OED for the word, which is what currently happens.

After a few hours of research I reckoned it would be possible to apply the DOIs to the HT data, but there are some things that we’ll need to consider.  The OED spreadsheet looks like it will contain every sense and the HT data does not, so much of the spreadsheet will likely not match anything in our system.  I wrote a little script to check the spreadsheet against the HT’s OED lexeme table: 6,186 rows in the spreadsheet match one (or more) lexeme in the database table while 7,256 don’t.  I also noted that the combination of entry_id and element_id (in our database called refentry and refid) is not necessarily unique in the HT’s OED lexeme table.  This can happen when a word appears in multiple categories, and there is a further ID called ‘lemmaid’ that was sometimes used in combination with the other two IDs to differentiate specific lexemes.  In the spreadsheet there are 1,180 rows that match multiple rows in the HT’s OED lexeme table.  However, this also isn’t a problem: it usually just means a word appears in multiple categories, so the same DOI would simply apply to multiple lexemes.
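The checking script itself is simple – for each spreadsheet row, count the lexemes sharing its entry_id/element_id pair.  A rough Python equivalent of what it does (the driver, table and column names are guesses):

```python
import csv

import pymysql  # assumed driver, as before

conn = pymysql.connect(host="localhost", user="ht", password="...", db="ht",
                       cursorclass=pymysql.cursors.DictCursor)

matched = unmatched = 0
with open("oed_dois.csv", newline="") as f, conn.cursor() as cur:
    for row in csv.DictReader(f):
        # The spreadsheet's entry_id/element_id correspond to refentry/refid
        # in the HT's OED lexeme table.
        cur.execute("SELECT COUNT(*) AS n FROM oed_lexemes "
                    "WHERE refentry = %s AND refid = %s",
                    (row["entry_id"], row["element_id"]))
        if cur.fetchone()["n"] > 0:
            matched += 1      # one or more lexemes would share this DOI
        else:
            unmatched += 1

print(matched, "rows match,", unmatched, "don't")
```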

What is potentially a problem is that we haven’t matched up all of the OED lexeme records with the HT lexeme records.  While 6,186 rows in the spreadsheet match one or more rows in the OED lexeme table, only 4,425 rows in the spreadsheet match one or more rows in the HT’s lexeme table.  We will not be able to update the links to use DOIs for any HT lexemes that aren’t matched to an OED lexeme.  After checking, I discovered that there are 87,713 non-OE lexemes in the HT lexeme table that are not linked to an OED lexeme.  None of these will be able to have a DOI (and neither will the OE words, presumably).

Another potential problem is that the sense an HT lexeme is linked to is not necessarily the main sense of the OED lexeme.  In such cases the DOI leads to a section of the OED entry that is only accessible to logged-in users of the OED site.  An example from the spreadsheet is ‘aardvark’.  Our HT lexeme links to entry_id 22, element_id 16201412, which has the DOI https://doi.org/10.1093/OED/1516256385; when you’re not logged in this displays a ‘Please purchase a subscription’ page.  The other entry for ‘aardvark’ in the spreadsheet has entry_id 22 and element_id 16201390, whose DOI https://doi.org/10.1093/OED/9531538482 leads to the summary page, but the HT’s link will be the first DOI above and not the second.  Note that currently we link to the search results on the OED site, which actually might be more useful for many people.  ‘Aardvark’, as found here: https://ht.ac.uk/category/?type=search&qsearch=aardvark&page=1#id=39313, currently links to this OED page: https://www.oed.com/search/dictionary/?q=aard-vark

To summarise: I can update all lexemes in the HT’s OED lexeme table that match the entry_id and element_id columns in the spreadsheet to add in the relevant DOI.  I can also then ensure that any HT lexeme records linked to these OED lexemes also feature the DOI, but this will apply to fewer lexemes as there are still many HT lexemes that are not linked.  I could then update the links through to the OED for these lexemes, but this might not actually work as well as the current link to search results, due to many OED DOIs leading to restricted pages.  I’ll need to hear back from the rest of the team before I can take this further.
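If we do go ahead, the first step would look something like the following – again a sketch, assuming a new ‘doi’ column is added to the HT’s OED lexeme table and that the spreadsheet is exported as CSV:

```python
import csv

import pymysql  # assumed driver, as before

conn = pymysql.connect(host="localhost", user="ht", password="...", db="ht")

# Write each spreadsheet row's DOI onto every lexeme sharing its ID pair;
# all table and column names are hypothetical.
with open("oed_dois.csv", newline="") as f, conn.cursor() as cur:
    for row in csv.DictReader(f):
        cur.execute("UPDATE oed_lexemes SET doi = %s "
                    "WHERE refentry = %s AND refid = %s",
                    (row["doi"], row["entry_id"], row["element_id"]))
conn.commit()
```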

Also this week I had a meeting with Pauline Mackay and Craig Lamont to discuss an interactive map of Burns’ correspondents.  We’d discussed this about three years ago and they are now reaching a point where they would like to develop the map.  We discussed various options for base maps, data categorisation and time sliders and I gave them a demonstration of the Books and Borrowing project’s Chambers Library map, which I’d previously developed (https://borrowing.stir.ac.uk/chambers-library-map/).  They were pretty impressed with this and thought it would be a good model for their map.  Pauline and Craig are now going to work on some sample data to get me started, and once I receive this I’ll be able to begin development.  We had our meeting in the café of the new ARC building, which I’d never been to before, so it was a good opportunity to see the place.

Also this week I fixed some issues with images for one of the library registers for the Royal High School for the Books and Borrowing project.  These had been assigned the wrong ID in the spreadsheet I’d initially used to generate the data and I needed to write a little script to rectify this.

Finally, I had a chat with Joanna Kopaczyk about a potential project she’s putting together.  I can’t say much about it at this stage, but I’ll probably be able to use the systems I developed last year for the Anglo-Norman Dictionary’s Textbase (see https://anglo-norman.net/textbase-browse/ and https://anglo-norman.net/textbase-search/).  I’m meeting with Joanna to discuss this further next week.