Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow.  Facetted searching would still require several other database queries to be executed, as would extracting all of the fields necessary to display the search results, and it seemed inadvisable to try to build all of this functionality myself when an existing package like Solr already provides it.

Solr is considerably faster than the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously, something that would be hopelessly slow if I attempted to build it directly against the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

"London",2211,
"Edinburgh",119,
"Dublin",100,
"Paris",30,
"Edinburgh; London",16,
"Cambridge",4,
"Eton",3,
"Oxford",3,
"The Hague",3,
"Naples",2,
"Rome",2,
"Berlin",1,
"Glasgow",1,
"Lausanne",1,
"Venice",1,
"York",1

Format:

"8vo",842,
"octavo",577,
"4to",448,
"quarto",433,
"4to.",88,
"8vo., plates, port., maps.",88,
"folio",76,
"duodecimo",67,
"Folio",33,
"12mo",19,
"8vo.",17,
"8vo., plates: maps.",16

Borrower gender:

"Male",3128,
"Unknown",109,
"Female",64,
"Unclear",2

These would then allow me to build in options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database, it is likely that such an approach would be much slower and would take time to develop.  It seems much more sensible to use an existing solution if this is going to be possible.
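For illustration, this sort of breakdown comes back from a single request to Solr’s select handler with faceting switched on.  The sketch below (Python, using the requests library) is only a minimal example: it assumes a local core called ‘bnb’ and the field names from my test schema, not the final set-up.

import requests

# Search the local test core for borrowing records whose standardised title
# contains 'Roman' and request facet counts on publication place, format and
# borrower gender in the same query.
params = {
    "q": "standardisedtitle:Roman",
    "rows": 20,                              # number of full records to return
    "facet": "true",
    "facet.field": ["pubplaces", "formats", "bgenders"],
    "facet.mincount": 1,                     # omit values with no matches
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/bnb/select", params=params)
data = resp.json()

print(data["response"]["numFound"], "matching borrowing records")
# Facet counts come back as flat [value, count, value, count, ...] lists.
places = data["facet_counts"]["facet_fields"]["pubplaces"]
for place, count in zip(places[::2], places[1::2]):
    print(place, count)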

In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes, and it includes a lot of data that isn’t really needed, as the returned data contains everything required to display the full borrowing record.  I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could use in a simplified search results page.  Here’s an example:

{
  "bnid": 1379,
  "lid": 6,
  "slug": "glasgow-university",
  "lname": "Glasgow University Library",
  "rid": 2,
  "rname": "3",
  "syear": 1760,
  "eyear": 1765,
  "rtype": "Student",
  "pid": 107,
  "fnum": "4r",
  "transcription": "Euseb: Eclesiastical History",
  "bday": 17,
  "bmonth": 9,
  "byear": 1760,
  "rday": 1,
  "rmonth": 10,
  "ryear": 1760,
  "borrowed": "1760-09-17",
  "returned": "1760-10-01",
  "bdayofweek": "Wednesday",
  "rdayofweek": "Wednesday",
  "originaltitle": "",
  "standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius.",
  "brids": ["1"],
  "bfnames": ["Charles"],
  "bsnames": ["Wilson"],
  "bfullnames": ["Charles Wilson"],
  "boccs": ["University Student", "Education"],
  "bgenders": ["Male"],
  "aids": ["74"],
  "asnames": ["Eusebius of Caesarea"],
  "afullnames": ["Eusebius of Caesarea"],
  "beids": ["88"],
  "edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius."],
  "estcs": ["R21513"],
  "langs": ["English"],
  "pubplaces": ["London"],
  "formats": ["folio"]
}

I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server.  I then created a Solr core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations; some fields are text strings; some are integers; some are dates).  I then ran the Solr script that ingests the data.
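For anyone interested in the mechanics, fields can be added to a core’s managed schema via Solr’s Schema API rather than by hand-editing the schema file.  Below is a minimal sketch of the sort of definitions involved, using a handful of the fields from the JSON example above; the core name ‘bnb’ is a placeholder and this is not my exact schema.

import requests

# The core itself would be created beforehand with: bin/solr create -c bnb
SCHEMA_URL = "http://localhost:8983/solr/bnb/schema"

# Each field declares its type and whether it can hold multiple values.
fields = [
    {"name": "lname",         "type": "text_general", "stored": True},                       # library name
    {"name": "transcription", "type": "text_general", "stored": True},
    {"name": "syear",         "type": "pint",         "stored": True},                       # integer year
    {"name": "borrowed",      "type": "pdate",        "stored": True},                       # full date
    {"name": "boccs",         "type": "string",       "stored": True, "multiValued": True},  # borrower occupations
    {"name": "pubplaces",     "type": "string",       "stored": True, "multiValued": True},
    {"name": "formats",       "type": "string",       "stored": True, "multiValued": True},
]
for field in fields:
    requests.post(SCHEMA_URL, json={"add-field": field}).raise_for_status()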

It took a lot of time to get things working, as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest, when records were rejected.  These issues only affected a few records out of nearly 150,000, so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run, the whole process took several days to get right.

Some issues encountered include:

  1. Empty fields in the data resulting in no data for the corresponding JSON field (e.g. "bday": <nothing here>), which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included at all (see the sketch after this list).
  2. Solr’s date type requiring a full date (e.g. 1792-02-16), meaning partial dates (e.g. 1792) failed. I ended up reverting to an integer field for returned dates, as these are generally much vaguer, and generating placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed, and it took some time to investigate why my nice integer ID was failing.
  4. Realising more fields should be added to the JSON output as I went on and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors).
  5. Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quot;’ as their appearance in the JSON made the structure invalid.  I therefore updated the transcription, original title and standardised title fields, but the import then still failed as a few borrowers also have double quotes in their names.
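To give a flavour of points 1 and 5, here is a minimal sketch (not my actual export script) of how empty fields can be dropped and double quotes swapped for an entity when each record is written out.  A JSON library escapes quotes itself when serialising, but the entity replacement mirrors the fix described above; the sample record is purely hypothetical.

import json

def clean_record(record):
    """Prepare one borrowing record for export to a JSON file."""
    cleaned = {}
    for key, value in record.items():
        if value is None or value == "":
            continue                                  # point 1: skip empty fields entirely
        if isinstance(value, str):
            value = value.replace('"', "&quot;")      # point 5: neutralise double quotes
        cleaned[key] = value
    return cleaned

# Hypothetical record purely for illustration.
record = {"bnid": 1379, "transcription": 'Euseb: "Eclesiastical" History', "bday": None}
with open("1379.json", "w", encoding="utf-8") as f:
    json.dump(clean_record(record), f, ensure_ascii=False)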

However, once all of these issues were addressed I managed to successfully import all 141,335 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which run very quickly and will serve our needs very well.  And now that the data export script is working properly I’ll be able to re-run it and ingest new data very easily in future.
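For completeness, ingesting the exported files just means posting them to the core’s update handler (Solr’s bundled bin/post tool does much the same thing).  A rough sketch, again assuming a local core called ‘bnb’ and an export directory of one JSON file per record:

import glob
import requests

UPDATE_URL = "http://localhost:8983/solr/bnb/update/json/docs"

# Post each exported borrowing record to the core.
for path in glob.glob("export/*.json"):
    with open(path, "rb") as f:
        r = requests.post(UPDATE_URL, data=f, headers={"Content-Type": "application/json"})
        r.raise_for_status()

# Commit once at the end rather than per document, which is much faster for a bulk load.
requests.post("http://localhost:8983/solr/bnb/update", json={"commit": {}})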

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9, https://solr.apache.org/downloads.html) to be installed on a server, and this requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and, as far as I know, it fires up a Java-based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs, a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions on doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API), so we could limit access to the IP address of the server that hosts it, or, if Solr is going to be installed on the same server, limiting access to localhost could work.
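For reference, basic authentication is switched on by supplying Solr with a security.json file; the sketch below mirrors the structure shown in the Solr reference guide, written out from Python.  The username is arbitrary and the credentials value is a placeholder for the salted SHA-256 hash Solr expects; any IP or localhost restrictions would sit in front of this at the firewall or web-server level.

import json

# Minimal security.json enabling basic authentication, following the structure
# in the Solr reference guide. The credentials value must be a base64-encoded
# salted SHA-256 hash; the string below is a placeholder, not a usable hash.
security = {
    "authentication": {
        "class": "solr.BasicAuthPlugin",
        "blockUnknown": True,                        # reject requests without credentials
        "credentials": {"bnb_api": "<hashed-password> <salt>"},
    },
    "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin",
        "permissions": [{"name": "all", "role": "admin"}],
        "user-role": {"bnb_api": "admin"},
    },
}
with open("security.json", "w") as f:
    json.dump(security, f, indent=2)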

In terms of setting up the Solr instance, we would only need a single-node installation (not SolrCloud).  Once Solr is running, we’d need a core to be created.  I have the schema file the core would require and can give instructions for setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the core and ingesting the data each time we have a new update.

One downside to using Solr is that it is a separate system from the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently, as exporting the data takes at least an hour and transferring the files to the server for ingest also takes a long time (uploading hundreds of thousands of small files to a server can take hours, and zipping them up, uploading the zip file and extracting it is also slow).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off on doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue, so I need to get back into that.

Other than the above B&B work I did spend a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week, and I began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when an advanced ‘quotes only’ search is performed.  So, for example, in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.

Sic dreich wark. . . . For lang I tholed an’ fendit.

Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.

And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.

It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.

Driche and sair yer pain.

And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.

See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.

</field>

The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize parameter, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”  We don’t currently override this default, so roughly 100 characters is what we use per snippet (it can extend beyond this to ensure complete words are displayed).

The hl.snippets parameter specifies the maximum number of highlighted snippets that are returned per entry, and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed, and this is because the maximum number of snippets has been reached: ‘dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
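For context, these highlighting parameters sit on the quotation search request.  Here is a simplified sketch of the sort of query involved; the core name is a placeholder, while the field name is the one shown above:

import requests

# Quotation search for 'dreich', asking Solr to highlight matches within each
# entry's single searchtext_onlyquotes block.
params = {
    "q": "searchtext_onlyquotes:dreich",
    "hl": "true",
    "hl.fl": "searchtext_onlyquotes",
    "hl.snippets": 10,       # maximum highlighted snippets returned per entry
    "hl.fragsize": 100,      # approximate snippet size in characters (the default)
    "wt": "json",
}
data = requests.get("http://localhost:8983/solr/dsl/select", params=params).json()

# Snippets come back keyed by document ID, with matches wrapped in <em> tags.
for doc_id, fields in data["highlighting"].items():
    for snippet in fields.get("searchtext_onlyquotes", []):
        print(doc_id, snippet)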

As the quotations are stored as one massive block of text that isn’t split into individual quotations, the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich Adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

Which actually comprises the text from almost the entire first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”

Includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but rather ‘dreigher’, which is not highlighted).

So essentially, while the snippets may look like they correspond to individual quotes, this is absolutely not the case: the highlighted word is generally positioned around the middle of roughly 100 characters of text that can span several quotations.  It also means that it is not currently possible to limit a search to two terms that appear within one single quotation, because we don’t differentiate individual quotations: the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, to allow Boolean searches to be limited to the text of specific quotes rather than the entire block, and to enable quotation results to be refined by a date or date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on, such as part of speech.  This will ensure that snippets in future only feature text from the quotation in question and that Boolean searches are limited to text within individual quotations.  However, it is a major change: it will require some time and experimentation to get working correctly and it may introduce other unforeseen issues.

I will need to change the way the search data is stored in Solr and how the data is generated for ingest.  The display of the search results will also need to be reworked, as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries, and we’ll need to decide whether to limit the number of quotations that get displayed per entry: for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work, as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry, so I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.
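To make the proposed restructuring concrete, each quotation would become its own Solr document carrying its parent entry and date, and quotation hits could then be collapsed back into entries at query time using Solr’s result grouping.  The sketch below is purely illustrative: the field names and values are hypothetical rather than a final design.

# A hypothetical per-quotation document: one Solr document per quotation, so
# snippets, Boolean searches and date filters all operate within a single quote.
quotation_doc = {
    "id": "snd-dreich-q1",          # quotation-level ID (placeholder)
    "entryid": "snd/dreich",        # parent dictionary entry
    "headword": "Dreich adj.",
    "quotetext": "I think you will say yourself it is a dreich business.",
    "year": 1914,                   # illustrative date only
}

# Query-time grouping collapses matching quotations back into their entries,
# returning at most ten quotations per entry.
group_params = {
    "q": "quotetext:dreich",
    "group": "true",
    "group.field": "entryid",
    "group.limit": 10,
    "fq": "year:[1900 TO 1950]",    # example date-range refinement
}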

The DSL team has proposed that a date search could be provided as a filter on the search results page, and we would certainly be able to do this and incorporate other filters, such as part of speech, in future.  This is ‘facetted searching’ and it’s the kind of thing you see in online shops: you view the search results and then see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results each filter applies to.  The good news is that Solr has this kind of faceting built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project, as discussed at the start of this post, so I’ll be able to share my expertise between both projects.
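The date filter the team has in mind maps directly onto Solr’s range faceting.  A brief sketch of the parameters involved, again using the hypothetical per-quotation ‘year’ field from the previous example:

# Range faceting buckets quotation dates into decades alongside the results,
# providing the counts needed to drive a date filter on the results page.
facet_params = {
    "q": "quotetext:dreich",
    "facet": "true",
    "facet.range": "year",
    "facet.range.start": 1700,
    "facet.range.end": 2000,
    "facet.range.gap": 10,          # one bucket per decade
}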