Week Beginning 28th August 2023

I spent pretty much the whole week working on the new date facilities for the Dictionaries of the Scots Language.  I have now migrated the headword search to Solr, which was a fairly major undertaking, but was necessary to allow headword searches to be filtered.  I decided to create a new Solr core for DSL entries that would be used by both the headword search and the fulltext / fulltext with no quotes searches.  This made sense because I would otherwise have needed to update the Solr core for fulltext to add the additional fields needed for filtering anyway.  With the new core in place I then updated the script I wrote to generate the Solr index to include the new fields (e.g. headword, forms, dates) and generated the data, which I then imported into Solr.

With the new Solr core populated with data I then updated the API to work with it, replacing the existing headword search, which queried the database, with a new search that instead connects to Solr.  As all of the fields that get returned by a search are now stored in Solr the database no longer needs to be queried at all, which should make things faster.  Previously the fulltext search queried the Solr index and then, once entries were returned, the database was queried for each of these to add in the other necessary fields, which was a bit inefficient.

With the API updated the website (still only the version running on my laptop) then automatically used the new Solr index for headword searches: the quick search, the predictive search and the headword search in the advanced search.  I did some tests comparing the site on my laptop to the live site and things are looking good, but I will need to tweak the default use of wildcards.  The new headword search matches exact terms by default, which is the equivalent on the live site of surrounding the term with quotes (something the quick search does by default anyway).  I can’t really tweak this easily until I move the new site to our test server, though, as Windows (which my laptop uses) can’t cope with asterisks and quotes in filenames, which means the website on my laptop breaks if a URL includes these characters.

With the new headword and fulltext index in place I then moved on to implementing the date filtering options.  In order to do so I realised I would also have to add the entry attestation dates (from and to) to the quotation index as well, as a ‘first attested’ filter on a quotation search will use these dates.  This meant updating the quotation Solr index structure, tweaking my script that generates the quotation data for Solr, running the script to output the data and then ingesting this into Solr, all of which took some time.

I then worked with the Solr admin interface to figure out how to perform a filter query for both first attestation and the ‘in use’ period.  ‘First attested’ was pretty straightforward as a single year is queried.  Say the first attested year is 1658: a filter would bring the entry back if the filter was a single year that matched (i.e. 1658) or a range that contained the year (e.g. 1650-1700).  The ‘in use’ filter was much more complex to figure out as the data to be queried is itself a range.  If the attestation period is 1658-1900 and a single year filter is given (e.g. 1700) then we need to check whether this year is within the range.  What is more complicated is when the filter is also a range.  E.g. 1600-1700 needs to return the entry even though 1600 is less than 1658, and 1850-2000 needs to return the entry even though 2000 is greater than 1900.  1600-2000 also needs to return the entry even though both ends extend beyond the period, and 1660-1670 needs to return the entry too, as both ends are entirely within the period.

The answer to this headache-inducing problem was to run a query that checked whether the start date of the filter range was less than the end date of the attestation range and the end date of the filter range was greater than the start date of the attestation range.  So for example the attestation range is 1658-1900.  Filter range 1 is 1600-1700.  1600 is less than 1900 and 1700 is greater than 1658 so the entry is returned.  Filter range 2 is 1850-2000.  1850 is less than 1900 and 2000 is greater than 1658 so the entry is returned.  Filter range 3 is 1600-2000.  1600 is less than 1900 and 2000 is greater than 1658 so the entry is returned.  Filter range 4 is 1660-1670.  1660 is less than 1900 and 1670 is greater than 1658 so the entry is returned.  Filter range 5 is 1600-1650.  1600 is less than 1900 but 1650 is not greater than 1658 so the entry is not returned.  Filter range 6 is 1901-1950.  1901 is not less than 1900 (even though 1950 is greater than 1658) so the entry is not returned.
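
To make the logic above easier to check, here is a minimal Python sketch of the overlap test, run against the six filter ranges from the example.  The function and variable names are made up for illustration and this is not the code the API actually uses; it simply restates the two comparisons described above.

```python
def in_use_overlaps(filter_from, filter_to, attested_from, attested_to):
    """Return True if the filter range overlaps the attestation range.

    Mirrors the rule described above: the filter start must be less than
    the attestation end AND the filter end must be greater than the
    attestation start.
    """
    return filter_from < attested_to and filter_to > attested_from


# Attestation period from the worked example.
ATTESTED = (1658, 1900)

# The six filter ranges from the worked example, in order.
FILTERS = [
    (1600, 1700),  # overlaps the start of the period
    (1850, 2000),  # overlaps the end of the period
    (1600, 2000),  # extends beyond both ends
    (1660, 1670),  # entirely within the period
    (1600, 1650),  # ends before the period starts
    (1901, 1950),  # starts after the period ends
]

results = [in_use_overlaps(f, t, *ATTESTED) for f, t in FILTERS]

for (f, t), matched in zip(FILTERS, results):
    print(f"{f}-{t}: {'returned' if matched else 'not returned'}")
```

A single year filter can be treated as a range where both ends are the same year, so 1700 becomes 1700-1700.  Note that because the comparisons are strict, a filter that only touches the period at its very first or last year (e.g. 1900-1950) would not match in this sketch; using "less than or equal to" and "greater than or equal to" would include those boundary cases if that is what is wanted.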

Having figured out how to implement the filter query in Solr I then needed to update the API to take filter query requests, process them, format the query and then pass this to Solr.  This was a pretty major update and took quite some time to implement, especially as the quotation search needed to be handled differently for the ‘in use’ search, which as agreed with the team was to query the dates of the individual quotations rather than the overall period of attestation for an entry.  I managed to get it all working, though, allowing me to pass searches to the API by changing variables in a URL and filter the results by passing further variables.
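
As a rough illustration of the "format the query" step, the sketch below builds a Solr filter query (fq) string for each filter type using Solr's standard range syntax, where square brackets are inclusive and curly brackets are exclusive.  The field names (first_attested, attested_from, attested_to), the mode values and the function itself are all assumptions made for the example and are not the names used in the real index or API.  It also only covers the entry-level dates, not the separate handling of individual quotation dates.

```python
def build_date_filter(mode, year_from, year_to,
                      first_field="first_attested",   # hypothetical field name
                      from_field="attested_from",     # hypothetical field name
                      to_field="attested_to"):        # hypothetical field name
    """Build a Solr fq string for a date filter, or None if no years are given.

    mode is "first" (first attested) or "inuse" (in use); both values are
    invented for this sketch.
    """
    if year_from is None and year_to is None:
        return None
    # Assumption: a single supplied year is treated as a one-year range.
    if year_from is None:
        year_from = year_to
    if year_to is None:
        year_to = year_from
    if year_from > year_to:
        year_from, year_to = year_to, year_from

    if mode == "first":
        # The first attested year must fall inside the filter range (inclusive).
        return "%s:[%d TO %d]" % (first_field, year_from, year_to)
    if mode == "inuse":
        # Overlap test: attestation end > filter start AND
        # attestation start < filter end (both exclusive, hence the curly brackets).
        return "%s:{%d TO *] AND %s:[* TO %d}" % (
            to_field, year_from, from_field, year_to)
    raise ValueError("Unknown filter mode: %r" % (mode,))


first_fq = build_date_filter("first", 1600, 1700)
inuse_fq = build_date_filter("inuse", 1600, 1700)
print(first_fq)
print(inuse_fq)
```

The resulting string would then be passed to Solr as an fq parameter alongside the main search query, so the filter narrows the results without changing the search itself.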

With this in place I could then update the front-end to add in the option of filtering the results.  I decided to add the option as a box above the search results.  Originally I was going to place it down the left-hand side, but space is rather limited due to the two-column layout of a search result that covers both SND and DOST.  The new ‘Filter the results’ box consists of buttons for choosing between ‘First attested’ and ‘In use’ and ‘from’ and ‘to’ boxes where years can be entered.  There is also an ‘Update’ button and a ‘clear’ button.  Supplying a filter and pressing ‘Update’ reloads the page with the results filtered based on your criteria.  It will be possible to bookmark or cite a filtered search as the filters are added to the page URL.  The filter box appears on all search results pages, including the quick search, and seems to be working as intended.

So for example the screenshot below shows a filtered fulltext search for ‘burn’.  Without the filter this brings back more results than we allow, but if you filter the results to only those that are first attested between 1600 and 1700 a more reasonable number is returned, as the screenshot shows:

The second screenshot shows the entries that were in use during this period rather than first attested, which as you can see gives a larger number of results:

As mentioned, the ‘in use’ filter works differently for quotations, limiting those that are displayed to ones in the filter period.  The screenshot below shows an ‘in use’ filter of 1650 to 1750 for a quotation search for ‘burn’:

The filter is ‘remembered’ when you navigate to an entry and then use the ‘back to search results’ button.  You can clear the filter by pressing on the ‘clear’ button or deleting the years in the ‘from’ and ‘to’ boxes and pressing ‘update’.  Previously if there was only one search result the results page would automatically redirect to the entry.  This was also happening when an applied filter only gave one result, which I found very annoying, so now if a filter is present and only one result is returned the results page is still displayed.  Next week I’ll work on adding the first attested dates to the search results and I’ll also begin to develop the sparklines.

Other than this I had a meeting with Joanna Kopaczyk to further discuss a project she’s putting together.  It looks like it will be a fairly small pilot project to begin with and I’ll only be involved in a limited capacity, but it has great potential.