Week Beginning 9th August 2021

I’d taken last week off as our final break of the summer, and we spent it on the Kintyre peninsula.  We had a great time and were exceptionally lucky with the weather.  The rains began as we headed home and I returned to a regular week of work.  My major task for the week was to begin work on the search facilities for the Anglo-Norman Dictionary’s textbase, a collection of almost 80 lengthy texts for which I had previously created browse and view facilities.  The editors wanted me to replicate the search options that were available through the old site, which enabled a user to select which texts to search (either individual texts or groups of texts arranged by genre) and enter a single term to search (either a full match or a partial match at the beginning or end of a word).  The user could then select a specific term from a list of possible matches and view each hit via a keyword in context (KWIC) interface, showing a specific number of words before and after the hit, with a link through to the full text opened at that specific point.

This is a pretty major development and I decided that I initially had two main tasks to tackle.  I’d have to categorise the texts by their genre and I’d have to research how best to handle full-text searching, including limiting searches to specific texts, generating and reordering the KWIC display, and linking through to specific pages with the results highlighted.  I reckoned it was potentially going to be tricky as I don’t have much experience with this kind of searching.  My initial thought was to see whether Apache Solr might be able to offer the required functionality.  I used this for the DSL’s advanced search, which searches the full text of the entries and returns snippets featuring the word, with the word highlighted in the snippet and then highlighted throughout the entry when an entry in the results is loaded (e.g. https://dsl.ac.uk/results/dreich/fulltext/withquotes/both/).  This isn’t exactly what is required here, but I hoped that there might be further options I could explore.  Failing that, I wondered whether I could repurpose the code for the Scottish Corpus of Texts and Speech.  I didn’t create this site, but I redeveloped it significantly a few years ago and might be able to borrow parts from the concordance search.  For example, go to https://scottishcorpus.ac.uk/advanced-search/, select ‘general’, then ‘word search’, then ‘word / phrase (concordance)’, search for ‘haggis’ and scroll down to the section under the map.  When opening a document you can then cycle through the matching terms, which are highlighted, e.g. https://scottishcorpus.ac.uk/document/?documentid=1572&highlight=haggis#match1.

After spending some further time with the old search facility and considering the issues I realised there are a lot of things to be considered regarding preparing the texts for search purposes.  I can’t just plug the entire texts in as only certain parts of them should be used for searching – no front or back matter, no notes, textual apparatus or references.  In addition, in order to properly ascertain which words follow on from each other all XML tags need to be removed too, and this introduces issues where no space has been entered between tags but a space needs to exist between the contents of the tags, e.g. ‘dEspayne</item><item>La charge’ would otherwise become ‘dEspayneLa charge’.
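
As a rough illustration of the tag-stripping step (assuming the page content is available as an XML string once notes and apparatus have been excluded; the function name and regex approach here are just a sketch, not the actual code):

```python
import re

def strip_tags_keep_spacing(xml_fragment: str) -> str:
    """Strip XML tags from a page fragment while preserving word boundaries,
    so that 'dEspayne</item><item>La charge' becomes 'dEspayne La charge'
    rather than 'dEspayneLa charge'."""
    # Replace each tag with a space instead of deleting it outright, so that
    # text separated only by markup keeps a word boundary.
    text = re.sub(r'<[^>]+>', ' ', xml_fragment)
    # Collapse runs of whitespace (including line breaks) into single spaces
    # so the page becomes one continuous block of text.
    return re.sub(r'\s+', ' ', text).strip()

print(strip_tags_keep_spacing('dEspayne</item><item>La charge'))
# -> 'dEspayne La charge'
```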

As I’d need to process the texts no matter which search facility I ended up using, I decided to focus on this first and set up some processing scripts and a database on my local PC to work with the texts.  Initially I managed to extract the page contents for each required page, remove notes etc. and strip the tags and line breaks so that each page’s content is one continuous block of text.

I realised that the old search seems to be case sensitive, which doesn’t seem very helpful.  E.g. search for ‘Leycestre’ and you find nothing – you need to enter ‘leycestre’, even though all 264 occurrences actually have a capital L.  I decided to make the new search case insensitive, so searching for ‘Leycestre’, ‘leycestre’ or ‘LEYCESTRE’ will bring back the same results.  Also, the old search limits the keyword in context display to pages.  E.g. the first ‘Leycestre’ hit has no text after it as it’s the last word on the page.  I’m intending to take the same approach as I’m processing text on a page-by-page basis.  I may be able to fill out the KWIC with text from the preceding / subsequent page if the editors consider this to be important, but it would be something I’d have to add in after the main work is completed.  The old search also limits the KWIC to text that’s on the same line, e.g. in a search for ‘arcevesque’ the result ‘L’arcevesque puis metre en grant confundei’ has no text before it because it’s on a different line (it also chops off the end of ‘confundeisun’ for some reason).  The new KWIC will ignore breaks in the text (other than page breaks) when displaying the context.  I also needed to decide what to do about words that have apostrophes in them.  The old search splits words on the apostrophe, so for example you can search for arcevesque but not l’arcevesque, and it retains both parts before and after the apostrophe as separate search terms, so in “qu’il” you can search for “qu” and “il” (but not “qu’il”).  I’m intending to do the same.
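
As a rough sketch of these tokenisation rules – splitting on apostrophes, stripping punctuation and lower-casing – something like the following would do it (the function name and the exact handling of accented characters are assumptions, not the real implementation):

```python
import re

def tokenise(page_text: str) -> list[str]:
    """Turn a block of page text into lower-case search terms: words are
    split on apostrophes (so "qu'il" yields "qu" and "il") and anything
    that isn't a word character is stripped out."""
    terms = []
    for word in page_text.split():
        for part in re.split(r"['’]", word):
            cleaned = re.sub(r"\W+", "", part.lower())  # \W is Unicode-aware in Python 3
            if cleaned:
                terms.append(cleaned)
    return terms

print(tokenise("Qu'il vint a Leycestre"))
# -> ['qu', 'il', 'vint', 'a', 'leycestre']
```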

After some discussions with the editor, I updated my system to include the textual apparatus, stored in a separate field to the main page text.  With all of the text extracted I decided that I’d initially just try to make my own system, to see whether it would be possible.  I therefore created a script that takes each word from the extracted page and textual apparatus fields and stores it in a separate table, ensuring that words with apostrophes in them are split into separate words and that, for search purposes, all non-alphanumeric characters are removed and the text is stored as lower-case.  I also needed to store the word as it actually appears in the text, the word order on the page and whether the word is a main page word or in the textual apparatus, because after finding a word I’ll need to extract those around it for the KWIC display.  After running my script I ended up with around 3.5 million rows in the ‘words’ table, and this is where I ran into some difficulties.
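
A minimal sketch of the kind of table and insert script described above (SQLite is used here purely for illustration – the real tables live in MySQL/MariaDB – and the column names are my assumptions rather than the actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE words (
        id           INTEGER PRIMARY KEY,
        page_id      INTEGER NOT NULL,  -- joins to the page table
        word_order   INTEGER NOT NULL,  -- position of the word on the page
        word_form    TEXT NOT NULL,     -- the word as it actually appears in the text
        search_form  TEXT NOT NULL,     -- lower-cased, punctuation stripped
        is_apparatus INTEGER NOT NULL   -- 1 if the word is in the textual apparatus
    )
""")
conn.execute("CREATE INDEX idx_words_search ON words (search_form)")

def store_page_words(page_id, tokens, apparatus=False):
    """Store (original form, search form) pairs for one page, in page order."""
    conn.executemany(
        "INSERT INTO words (page_id, word_order, word_form, search_form, is_apparatus) "
        "VALUES (?, ?, ?, ?, ?)",
        [(page_id, i, orig, search, int(apparatus))
         for i, (orig, search) in enumerate(tokens, start=1)],
    )

store_page_words(1, [("Qu", "qu"), ("il", "il"), ("Leycestre", "leycestre")])
```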

I ran some test queries on the local version of the database and all looked pretty promising, but after copying the data to the server and running the same queries it appeared that the server was unusably slow.  On my desktop a query to find all occurrences of ‘jour’, with the word table joined to the page table and then to the text table, completed in less than 0.5 seconds, but on the server the same query took more than 16 seconds, so about 32 times slower.  I tried the same query a couple of times and the results were roughly the same each time.  My desktop PC is a Core i5 with 32GB of RAM, and the database is running on an NVMe M.2 SSD, which no doubt makes things quicker, but I wouldn’t expect it to be 32 times quicker.
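
The slow query is essentially a three-table join of the shape below (table and column names carried over from the earlier sketch, so they’re assumptions rather than the real schema, and a ‘pages’ and a ‘texts’ table are assumed to exist alongside ‘words’); a small timing wrapper like this can be used to compare runs on the desktop and the server:

```python
import time

# Roughly the shape of the problematic query: find every occurrence of a
# search term, joining each word to its page and then to the text the page
# belongs to, ordered for display.
SEARCH_QUERY = """
    SELECT w.word_form, w.word_order, p.page_number, t.title
    FROM words w
    JOIN pages p ON p.id = w.page_id
    JOIN texts t ON t.id = p.text_id
    WHERE w.search_form = ?
    ORDER BY t.title, p.page_number, w.word_order
"""

def timed_search(conn, term):
    """Run the search and report how long it took, e.g. timed_search(conn, 'jour')."""
    start = time.perf_counter()
    rows = conn.execute(SEARCH_QUERY, (term,)).fetchall()
    print(f"{len(rows)} hits for '{term}' in {time.perf_counter() - start:.3f}s")
    return rows
```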

I then did some further experiments with the server.  When I queried the table containing the millions of rows on its own the query was fast (much less than a second).  I added a further index to the column that is used for the join to the page table (previously it was indexed, but only in combination with other columns), and when limiting the query to just these two tables it ran at a fairly decent speed (about 0.5 seconds).  However, the full query involving all three tables still took far too long, and I’m not sure why.  It’s very odd, as there are indexes on the joining columns and the additional table is not big – it only has 77 rows.  I read somewhere that ordering the results by a column in the joined table can make things slower, as can ordering by a column in descending order, so I tried updating the ordering, but this had no effect.  It’s really weird – I just can’t figure out why adding the table has such a negative effect on the performance, and I may end up having to incorporate some of the columns from the text table into the page table, even though it will mean duplicating data.  I also still don’t know why the performance is so different on my local PC.
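
For reference, the sort of statements involved in this diagnosis – a dedicated single-column index on the joining column, plus MySQL’s EXPLAIN to see which index each table in the join actually uses and whether the ORDER BY forces a filesort – look roughly like this (table and column names again assumed from the sketches above):

```python
# Hypothetical MySQL statements used while diagnosing the join performance.

# A dedicated single-column index on the column used for the join to the
# page table (previously it was only indexed as part of a composite index).
ADD_JOIN_INDEX = "ALTER TABLE words ADD INDEX idx_words_page_id (page_id)"

# EXPLAIN reports, for each table in the join, which index (if any) is used
# and whether the ORDER BY triggers a filesort or temporary table.
EXPLAIN_SEARCH = """
    EXPLAIN
    SELECT w.word_form, p.page_number, t.title
    FROM words w
    JOIN pages p ON p.id = w.page_id
    JOIN texts t ON t.id = p.text_id
    WHERE w.search_form = 'jour'
    ORDER BY t.title, p.page_number, w.word_order
"""
```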

One final thing I tried was changing the database storage engine.  I noticed that the three tables were set to use MyISAM storage rather than InnoDB, which the rest of the database uses.  I migrated the tables to InnoDB in the hope that this might speed things up, but it actually slowed things down, both on my local PC and on the server.  The two-table query now took several seconds, while the three-table query took about the same – quicker than before, but still too slow.  On my desktop PC the query time doubled to about 1 second.  I therefore reverted to using MyISAM.
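
The engine switch itself is just an ALTER TABLE per table, along these lines (table names again carried over from the sketches above, so they’re assumptions):

```python
# Sketch of the engine switch: convert the three search tables from MyISAM
# to InnoDB, and back again afterwards.
SEARCH_TABLES = ["words", "pages", "texts"]

def engine_statements(engine):
    """Build the ALTER TABLE statements for switching storage engine."""
    return [f"ALTER TABLE {table} ENGINE = {engine}" for table in SEARCH_TABLES]

print(engine_statements("InnoDB"))
# ['ALTER TABLE words ENGINE = InnoDB', 'ALTER TABLE pages ENGINE = InnoDB', ...]
```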

I decided to leave the issue of database speed at that point and to focus on other things instead.  I added a new ‘genre’ column to the texts and added in the required categorisation.  I then updated the API to add in this new column and updated the ‘browse’ and ‘view’ front-ends so that genre now gets displayed.  I then began work on the front-end for the search, focussing on the options for listing texts by genre and adding in the options to select / deselect specific texts or entire genres of text.  This required quite a bit of HTML, JavaScript and CSS work and made a nice change from all of the data processing.  By the end of the week I’d completed work on the text selection facility, and next week I’ll tackle the actual processing of the search, at which point I’ll know whether my database way of handling things will be sufficiently speedy.

Also this week I had a chat with Eleanor Lawson about the STAR project that has recently begun.  There was a project meeting last week that unfortunately I wasn’t able to attend due to my holiday, so we had an email conversation about some of the technical issues that were raised at the meeting, including how it might be possible to view videos side by side and how a user may choose to select multiple videos to be played automatically one after the other.

I also fixed a couple of minor formatting issues for the DSL people and spoke to Katie Halsey, PI of the Books and Borrowing project, about the development of the API for the project and the data export facilities.  I also received further feedback from Kirsteen McCue regarding the Data Management Plan for her AHRC proposal and went through this, responding to the comments and generating a slightly tweaked version of the plan.