Week Beginning 16th August 2021

I continued to work on the new textbase search facilities for the Anglo-Norman Dictionary this week. I completed work on the required endpoints for the API, creating the facilities that process a search term (with optional wildcards), limit the search to selected books and/or genres, and return either the full search results (in the case of an exact search for a term) or a list of possible matching terms and the number of occurrences of each. I then worked on the front-end to enable a query to be processed and submitted to the API based on the choices made by the user.

By default any text entered will match any form that contains it – e.g. enter ‘jour’ (without the apostrophes) and you’ll find all forms containing the characters ‘jour’ anywhere, such as ‘adjourner’ and ‘journ’. If you want an exact match you have to use double quotes – “jour”. You can also use an asterisk at the beginning or end of the term: ‘jour*’ matches forms starting with ‘jour’, ‘*jour’ matches forms ending with it, and an asterisk at both ends (‘*jour*’) will only find forms that contain the term somewhere in the middle. You can also use a question mark wildcard to denote any single character, e.g. ‘am?n*’ will find words beginning ‘aman’, ‘amen’ etc.
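As a rough illustration of how this sort of wildcard handling can be translated into an SQL LIKE pattern, here’s a sketch in Python (this is illustrative only, not the actual API code, and the handling of the middle-only case is an assumption):

```python
def to_like_pattern(term: str) -> str:
    """Translate a user-entered search term into an SQL LIKE pattern.

    Mirrors the behaviour described above:
      "jour"  -> exact match
      jour*   -> forms starting with 'jour'
      *jour   -> forms ending with 'jour'
      *jour*  -> 'jour' somewhere in the middle (see comment below)
      jour    -> 'jour' anywhere (the default)
      ?       -> any single character
    """
    exact = term.startswith('"') and term.endswith('"') and len(term) > 1
    term = term.strip('"')
    # Escape characters LIKE treats specially, then map '?' to LIKE's
    # single-character wildcard '_'.
    term = term.replace('%', r'\%').replace('_', r'\_').replace('?', '_')
    if exact:
        return term
    prefix = term.endswith('*')    # jour* : match at the start of a form
    suffix = term.startswith('*')  # *jour : match at the end of a form
    core = term.strip('*')
    if prefix and suffix:
        # '%jour%' alone also matches the start/end; a middle-only search
        # would additionally need NOT LIKE 'jour%' AND NOT LIKE '%jour'.
        return f'%{core}%'
    if prefix:
        return f'{core}%'
    if suffix:
        return f'%{core}'
    return f'%{core}%'             # default: match anywhere
```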

If your search term matches multiple forms in your selected books / genres then an intermediary page will be displayed, listing the matching forms and a count of the number of times each form appears. This is the same as how the ‘translation’ advanced search works, for example, and I wanted to maintain a consistent way of doing things across the site. Select a specific form and the actual occurrences of that form in the texts will appear. Above this list is a ‘Select another form’ button that returns you to the intermediary page. If your search only brings back one form the intermediary page is skipped, and as all selection options appear in the URL it’s possible to bookmark / cite the search results too.

Whilst working on this I realised that I’d need to regenerate the data, as it became clear that many words had been erroneously joined together due to there being no space between words when one tag is closed and a following one is opened. When the tags are then stripped out the forms get squashed together, which has led to some crazy forms such as ‘amendeezreamendezremaundez’. Previously I’d not added spaces between tags as I was thinking that a space would have to be added before every closing tag (e.g. ‘</’ becomes ‘ </’) and this would potentially mess up words that have tags in them, such as superscript tags in names like McDonald. However, I realised I could instead do a find and replace to add spaces between a closing tag and an opening tag (‘><’ becomes ‘> <’), which would not mess up individual tags within words and wouldn’t have any further implications, as I strip out all additional spaces when processing the texts for search purposes anyway.
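A minimal sketch of the idea, assuming the texts are processed as plain strings (the actual script does rather more than this):

```python
import re

def prepare_text(xml: str) -> str:
    """Insert a space between adjacent closing/opening tags, then strip tags.

    '><' becomes '> <' so that '<w>amendeez</w><w>reamendez</w>' no longer
    collapses into 'amendeezreamendez', while a tag inside a word (e.g. a
    superscript in 'McDonald') stays flush with its surrounding characters.
    """
    xml = xml.replace('><', '> <')
    text = re.sub(r'<[^>]+>', '', xml)         # strip all tags
    return re.sub(r'\s+', ' ', text).strip()   # collapse extra whitespace

print(prepare_text('<w>amendeez</w><w>reamendez</w><w>remaundez</w>'))
# -> 'amendeez reamendez remaundez'
print(prepare_text('M<sup>c</sup>Donald'))
# -> 'McDonald'
```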

I also decided that I should generate the ‘key-word in context’ (KWIC) for each word and store this in the database. I was going to generate this on the fly every time a search results page was displayed, but it seemed more efficient to generate and store it once rather than recalculate it every time. I therefore updated my data processing script to generate the KWIC for each of the 3.5 million words as they were extracted from the texts. This took some time to both implement and execute. I decided to pull out the 10 words on either side of the term, which used the ‘word order’ column that gets generated as each page is processed. Some complications were introduced in cases where the term is either before the tenth word on the page or there are fewer than ten words after the term on the page. In such cases the script needed to look at the page before or after the current page in order to pull out the words and fill out the KWIC with the appropriate words from those pages.
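As a rough sketch of that boundary-handling logic (the data structures and page-lookup approach below are my own illustration rather than the actual script):

```python
WINDOW = 10  # words of context on either side of the term

def kwic_for_word(pages: dict, page_id: int, word_order: int):
    """Return (kwic_left, kwic_right) for the word at `word_order` on page
    `page_id`, borrowing words from the previous/next page when the window
    extends past the start or end of the current page.

    `pages` maps page_id -> list of words in 'word order'; consecutive
    pages are assumed to have consecutive ids.
    """
    words = pages[page_id]
    idx = word_order - 1  # assuming 'word order' is 1-based

    left = words[max(0, idx - WINDOW):idx]
    if len(left) < WINDOW and (page_id - 1) in pages:
        shortfall = WINDOW - len(left)
        left = pages[page_id - 1][-shortfall:] + left

    right = words[idx + 1:idx + 1 + WINDOW]
    if len(right) < WINDOW and (page_id + 1) in pages:
        shortfall = WINDOW - len(right)
        right = right + pages[page_id + 1][:shortfall]

    return ' '.join(left), ' '.join(right)
```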

With the updates to data processing in place and a fair bit of testing of the KWIC facility carried out, I re-ran my scripts to regenerate the data and all looked good. However, after inserting the KWIC data the querying of the tables slowed to a crawl. On my local PC queries that were previously taking 0.5 seconds were taking more than 10 seconds, while on the server execution time was almost 30 seconds. It was really baffling, as the only difference was that the search words table now had two additional fields (KWIC left and KWIC right), neither of which was being queried or returned in the query. It seemed really strange that adding new columns could have such an effect when they were not even being used in a query. I had to spend quite a bit of time investigating this, including looking at MySQL settings such as the key buffer size and trying again to change storage engines, switching from MyISAM to InnoDB and back again to see what was going on. Eventually I looked again at the indexes I’d created for the table and decided to delete them and start over, in case this somehow jump-started the search speed. I previously had the ‘word stripped’ column indexed in a multi-column index together with page ID and word type (either main page or textual apparatus). Instead I created an index on the ‘word stripped’ column on its own, and this immediately boosted performance. Queries that were previously taking close to 30 seconds to execute on the server were now taking less than a second. It was such a relief to have figured out what the issue was, as I had been considering whether my whole approach would need to be dropped and replaced by something completely different.
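Expressed as MySQL statements, the fix amounted to something like the following (the table, column and index names here are guesses for illustration – the real schema isn’t shown in this post):

```python
# Hypothetical names throughout; run against the MySQL database.
# The old index covered (word_stripped, page_id, word_type) together:
DROP_OLD_INDEX = "ALTER TABLE search_words DROP INDEX word_page_type"

# Replacing it with a single-column index on the stripped word form is
# what brought queries back from ~30 seconds to under a second:
CREATE_NEW_INDEX = "CREATE INDEX word_stripped ON search_words (word_stripped)"
```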

As I now had a usable search facility I continued to develop the front-end that would use it. Previously the exact match for a term was bringing up just the term in question and a link through to the page the term appeared on, but now I could begin to incorporate the KWIC text too. My initial idea was to use a tabular layout, with each word of the KWIC in a different column and clickable table headings that would allow the data to be ordered by any of the columns (e.g. order the data alphabetically by the first word to the left of the term). However, after creating such a facility I realised it didn’t work very well. The text didn’t scan well because each column had to be the width of whatever its longest word was, and the text took up too much horizontal space. Instead, I decided to revert to using an unordered list, with the KWIC left and KWIC right in separate spans, the KWIC left span being right-aligned so that its text sits flush against the search term no matter how long it is. I split the KWIC text up into individual words and stored these in an array to enable each search result to be ordered by any word in the KWIC, and began working on a facility to change the order using a select box above the search results. This is as far as I got this week, but I’m pretty confident that I’ll get things finished next week.
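The ordering logic amounts to something like the following (sketched in Python for clarity; the result structure and field names are my own illustration, and the actual front-end presumably does the equivalent in JavaScript):

```python
def order_results(results, position: int, side: str = 'left'):
    """Order search results alphabetically by the word at `position` in
    the KWIC (1 = the word nearest the search term). Results whose
    context is shorter than `position` words sort last.

    Each result is assumed to be a dict holding 'kwic_left' and
    'kwic_right' word arrays.
    """
    def key(result):
        words = result[f'kwic_{side}']
        # For the left context the nearest word is the last in the array.
        ordered = list(reversed(words)) if side == 'left' else words
        if len(ordered) >= position:
            return (0, ordered[position - 1].lower())
        return (1, '')
    return sorted(results, key=key)
```

Here’s a screenshot of how the KWIC looks so far: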

Also this week I had an email conversation with the other College of Arts developers about professional web designers after Stevie Barrett enquired about them, arranged to meet with Gerry Carruthers to discuss the journal he would like us to host, gave some advice to Thomas Clancy about mailing lists and spoke to Joanna Kopaczyk about a website she would like to set up for a conference she’s organising next year.