Week Beginning 1st July 2019

I’d taken Thursday and Friday off this week as I was getting married on the Friday, so I had to squeeze quite a lot into three days.  I dealt with a few queries from people regarding several projects, including Thomas Clancy’s Iona proposal, the Galloway Glens project, SCOSYA, the Mary Queen of Scots letters project, and a couple of new projects for Gavin Miller, but I spent the majority of my time on the DSL, with a little bit of time on Historical Thesaurus duties.

For the DSL I fixed some layout issues with the bibliography on the live site and engaged in an email conversation about handling updates to the data.  This also involved running the script Thomas Widmann had left to export all the recent data from the DSL’s in-house server, which we hoped would grab all updates made since the last export.  Unfortunately it looks like the export is identical to the previous one, so something is not working quite right somewhere.

The bulk of my DSL time this week was spent continuing to develop the new API, looking for the first time at full-text searches.  The existing API set up by Peter uses Apache Solr for full-text searches, but I wanted to explore handling these directly through the online database instead, as Arts IT Support are reluctant to support the existing set-up (an API powered by the Python-based Django framework, with full-text searches handled by Solr).

The full-text search actually requires three different versions of each entry, all with the XML tags removed: the full entry; the full entry without the quotations; and the quotations only.  My first task was to write a script to generate and store these versions.  So far I’ve focussed on just the first: I wrote a script to strip all XML tags and store the resulting plain text in the database for each of the 89,000 or so entries, which took some time to run, as you might expect.  I then added a ‘FULLTEXT’ index to the field containing this text, which allows full-text queries to be executed on it.  Unfortunately my experiments with running queries on this field have been disappointing.  The search is slow, the types of searching that are possible are surprisingly limited, and there is no way to bring back ‘snippets’ showing where the terms appear (at least not directly via a single database query).
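To make the approach concrete, here’s a minimal sketch in Python of the tag-stripping and indexing steps. The table, column and connection details are all placeholders for illustration, not the actual DSL schema:

```python
import xml.etree.ElementTree as ET
import pymysql

# Connection details are placeholders.
conn = pymysql.connect(host='localhost', user='dsl', password='secret', db='dsl')

with conn.cursor() as cur:
    cur.execute("SELECT id, xml FROM entries")
    rows = cur.fetchall()

with conn.cursor() as cur:
    for entry_id, xml in rows:
        # itertext() yields only the text nodes of the parsed entry,
        # effectively stripping every XML tag.
        plain = ''.join(ET.fromstring(xml).itertext())
        cur.execute(
            "UPDATE entries SET fulltext_plain = %s WHERE id = %s",
            (plain, entry_id),
        )
conn.commit()

# Once populated, the field gets a full-text index (run once in MySQL):
#   ALTER TABLE entries ADD FULLTEXT INDEX ft_plain (fulltext_plain);
```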

It is not possible to use single-character wildcards (e.g. ‘m_kill’), and it is not possible to use an asterisk wildcard at the start of a search term (e.g. ‘*kill’).  It also doesn’t ignore punctuation, so a search for ‘mekill’ will not find an entry that contains ‘mekill.’.  Finally, it only indexes words that are more than three characters long, so a search for ‘boece iv’ ignores the ‘iv’ part.  What it can do pretty well is Boolean searching (AND, OR and NOT), wildcard searching at the end of a term (e.g. ‘kill*’) and exact phrase searching (e.g. “fyftie aught”).
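For illustration, these are the kinds of Boolean-mode queries that do work, again using the hypothetical table and column names from the sketch above:

```python
import pymysql

conn = pymysql.connect(host='localhost', user='dsl', password='secret', db='dsl')
cur = conn.cursor()

# Query forms that MySQL's Boolean mode handles reasonably well:
for term in [
    '+boece -virgil',    # Boolean AND / NOT
    'kill*',             # wildcard at the end of a term
    '"fyftie aught"',    # exact phrase
]:
    cur.execute(
        "SELECT id FROM entries "
        "WHERE MATCH(fulltext_plain) AGAINST(%s IN BOOLEAN MODE)",
        (term,),
    )
    print(term, '->', len(cur.fetchall()), 'matches')
```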

I would personally not want to replace the current full-text search with something slower and more limited in functionality, so I then started working with Solr.  It’s not something I’ve had much experience with (at least not for about 10 years, back when it was horribly flaky and difficult to ingest data into), but it turns out that it is pretty easy to set up and work with these days.  I set up a test version on my PC, ingested a few sample entries and got to grips with the querying facilities.  I think it would be much better to continue to use Solr for the full-text searching, but this does mean getting Arts IT Support to agree to host it on a new server.  If we do get the go-ahead to install Solr on the server where the new API resides, it should then be fairly straightforward to set up fields for the full text, the full text minus quotes and the quotes only, and to write a script to generate the data to populate these fields.  I will document the process so that whenever we need to update the data in the online database we know how to update the data in Solr too.  The full-text search would then function in the same way as the current one in terms of Boolean and wildcard searches, but would also offer an exact phrase search, and it would return ‘snippets’ as the current full-text search does.  All this does rather depend on whether we can get Solr onto the server, though.
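As an indication of what Solr gives us, here’s a rough sketch of a phrase query with highlighting against my local test instance; the core name (‘dsl’) and field name (‘fulltext’) are just my test set-up, not a final schema:

```python
import requests

resp = requests.get(
    'http://localhost:8983/solr/dsl/select',
    params={
        'q': 'fulltext:"fyftie aught"',  # exact phrase search
        'hl': 'true',                    # ask Solr for highlighted snippets
        'hl.fl': 'fulltext',             # field to generate snippets from
        'hl.snippets': 3,                # up to three snippets per entry
        'wt': 'json',
    },
)
data = resp.json()
for entry_id, fields in data['highlighting'].items():
    for snippet in fields.get('fulltext', []):
        print(entry_id, snippet)
```

By default the snippets come back with the matched terms wrapped in ‘em’ tags, which is handy for replicating the way the current search results highlight terms.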

For the Historical Thesaurus I checked through a new batch of data that the OED people had sent to Fraser this week, which now includes quotation dates and labels.  It looks like it should be possible to grab this data and use it to replace the existing HT dates and labels, which is encouraging.  I also updated a stats script I’d previously prepared so that it links through to all of the words that meet certain criteria, and I worked on a new script to match HT and OED lexemes based on the search terms tables that we have (these split lexemes containing brackets, slashes and other such things into multiple forms).  I realised I hadn’t generated the search terms for the new OED lexeme data we’d been given, so first of all I had to process this.  The script took a long time to run, as almost a million variants needed to be generated.  Once they had been generated I could run my matching script, and the results are pretty promising, with 15,587 matches that could be ticked off (once checked).  A match is only listed if the HT lexeme ID is linked to exactly one OED lexeme ID (and vice-versa).  We currently have 85,454 unmatched OED lexemes in matched categories, so this is a fair chunk.
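To give an idea of the one-to-one matching rule, here’s a rough sketch of the logic (not the actual script), assuming hypothetical search terms tables ht_search_terms(ht_id, term) and oed_search_terms(oed_id, term), and ignoring the restriction to matched categories:

```python
import pymysql
from collections import defaultdict

conn = pymysql.connect(host='localhost', user='ht', password='secret', db='ht')
cur = conn.cursor()

# Join the two search terms tables on the generated variant forms.
cur.execute(
    "SELECT DISTINCT h.ht_id, o.oed_id "
    "FROM ht_search_terms h JOIN oed_search_terms o ON h.term = o.term"
)
ht_to_oed = defaultdict(set)
oed_to_ht = defaultdict(set)
for ht_id, oed_id in cur.fetchall():
    ht_to_oed[ht_id].add(oed_id)
    oed_to_ht[oed_id].add(ht_id)

# A match is only kept if the HT lexeme links to exactly one OED
# lexeme and that OED lexeme links back to exactly one HT lexeme.
matches = []
for ht_id, oed_ids in ht_to_oed.items():
    if len(oed_ids) == 1:
        oed_id = next(iter(oed_ids))
        if len(oed_to_ht[oed_id]) == 1:
            matches.append((ht_id, oed_id))
print(len(matches), 'one-to-one matches')
```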

I also spent a bit of time helping Fraser to get some up-to-date stats for our paper for the DH conference next week.  I’m going to be at the conference all next week and then on holiday the week after.