This week I completed work on a first version of the textbase search facilities for the Anglo-Norman Dictionary. I’ve been working on this over the past three weeks and it’s now fully operational, quick to use and does everything that was required of it. I completed work on the KWIC ordering facilities, adding in a drop-down list that enables the user to order the results either by the term or any word to the left or right of the term. When results are ordered by a word to the left or right of the search term that word is given a yellow highlight so you can easily get your eye on the word that each result is being ordered by. I ran into a few difficulties with the ordering, for example accented initial characters were being sorted after ‘z’, and upper case characters were all sorted before lower case characters, but I’ve fixed these issues. I also updated the textbase page so that when you load a text from the results a link back to the search results appears at the top of the page. You can of course just use the ‘back’ button to return to the search results. Also, all occurrences of the search term throughout the text are highlighted in yellow. There are possibly some further enhancements that could be made here (e.g. we could have a box that hovers on the screen like the ‘Top’ button that contains a summary of your search and a link back to the results, or options to load the next or previous result) but I’ll leave things as they are for now as what’s there might be good enough. I also fixed some bugs that were cropping up, such as an exact search term not appearing in the search box when you return to refine your results (caused by double quotes needing to be changed to the code ‘%22’).
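The two ordering problems described above (accented initials sorting after 'z', upper case sorting before lower case) can be avoided by comparing a folded form of each word rather than the word itself. What follows is a minimal sketch in Python, not the code that runs on the site: the function names, the shape of each result and the sample words are all made up for illustration.

```python
import unicodedata

def sort_key(word):
    """Fold case and strip diacritics so that 'é' sorts with 'e' and 'V' with 'v'."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def order_results(results, side, position):
    """Order KWIC results by the word `position` places to the 'left' or 'right' of the term.

    Each result is assumed (hypothetically) to be a dict holding 'left' and
    'right' lists of words in reading order; position 1 is the word nearest the term.
    """
    def key(result):
        words = result[side]
        if side == "left":
            words = words[::-1]  # nearest word to the term comes first
        word = words[position - 1] if position <= len(words) else ""
        return sort_key(word)
    return sorted(results, key=key)

# Made-up sample data, for illustration only.
results = [
    {"id": 0, "left": ["en", "cel"], "term": "jour", "right": ["Vint", "li"]},
    {"id": 1, "left": ["a", "un"], "term": "jour", "right": ["écrit", "fu"]},
    {"id": 2, "left": ["le", "bon"], "term": "jour", "right": ["avant"]},
]
by_first_right = [r["id"] for r in order_results(results, "right", 1)]
by_first_left = [r["id"] for r in order_results(results, "left", 1)]
```

A result with no word at the requested position is given an empty key here, so it sorts to the top of the list; that is a choice made for the sketch, not something the site necessarily does.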
I then began thinking about the development of a proximity search for the textbase. As with the old site, this will allow the user to enter two search terms and specify the maximum number of words before or after the first term within which the second term must appear. The results will then be displayed in a KWIC form with both terms highlighted. It took quite some time to think through the various possibilities for this feature. The simplest option from a technical point of view would be to process the first term as with the regular search, retrieve the KWIC for each result and then search this for the second term. However, this wouldn’t allow the user to search for an exact match for the second term, or use wildcards, as the KWIC only contains the full text as written, complete with punctuation. Instead I decided to make the proximity search as similar to and as consistent with the regular textbase search as possible. This means the user will be able to enter the two terms with wildcards and two lists of possible exact matches will be displayed, from which the user can select term 1 and term 2. At this point the exact matches for term 1 will be returned and in each case a search will be performed to see whether term 2 is found within the specified number of words before or after term 1. This will rely on the ‘word order’ column that I already added to the database, but will involve some complications when term 1 is near the very start or end of a page (as the search will then need to look at the preceding or following page). I ran a few tests of this process directly via the database and it seemed to work ok, but I’ll just need to see whether there are any speed issues when running such queries on potentially thousands of results.
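The proximity check itself can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions rather than the database query I tested: the `Word` structure, the function name and the sample words are hypothetical, and pages of a text are assumed to have consecutive IDs so that sorting by page and then by word order gives one continuous sequence.

```python
from typing import NamedTuple

class Word(NamedTuple):
    page_id: int     # assumed: pages of one text are numbered consecutively
    word_order: int  # position of the word on its page, starting at 1
    stripped: str    # lower-cased form used for searching

def proximity_hits(words, term1, term2, max_distance):
    """Return (term 1 word, term 2 word) pairs that lie within max_distance words of each other."""
    # Sorting by (page, order) yields one continuous sequence, so a window that
    # runs off the start or end of a page simply continues onto the adjacent page.
    sequence = sorted(words, key=lambda w: (w.page_id, w.word_order))
    hits = []
    for i, word in enumerate(sequence):
        if word.stripped != term1:
            continue
        start = max(0, i - max_distance)
        end = min(len(sequence), i + max_distance + 1)
        for j in range(start, end):
            if j != i and sequence[j].stripped == term2:
                hits.append((word, sequence[j]))
    return hits

# Made-up sample: 'roi' is on page 1, 'jour' is the first word of page 2.
sample = [
    Word(1, 1, "le"), Word(1, 2, "roi"), Word(1, 3, "vint"), Word(1, 4, "un"),
    Word(2, 1, "jour"), Word(2, 2, "a"), Word(2, 3, "la"), Word(2, 4, "cort"),
]
within_three = proximity_hits(sample, "roi", "jour", 3)
within_two = proximity_hits(sample, "roi", "jour", 2)
```

Here 'jour' is three words after 'roi' even though it sits on the following page, so it is found with a distance of 3 but not with a distance of 2.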
Also this week I had an email from Bryony Randall about her upcoming exhibition for her New Modernist Editing project. The exhibition will feature a live website (https://www.blueandgreenproject.com/) running on a tablet in the venue and Bryony was worried that the wifi at the venue wouldn’t be up to scratch. She asked whether I could create a version of the site that would run locally without an internet connection, and I spent some time working on this.
I continued to work on my replica of the site, getting all of the content transferred over. This took longer than I anticipated, as some of the pages are quite complicated (artworks including poetry, images, text and audio) but I managed to get everything done before the end of the week. In the end it turned out that the wifi at the venue was absolutely fine so my replica site wasn’t needed, but it was still a good opportunity to learn about hosting a site on an Android device and to hone my Bootstrap skills.
Also this week I helped Katie Halsey of the Books and Borrowing project with a query about access to images, had a look through the final version of Kirsteen McCue’s AHRC proposal and spoke to Eleanor Lawson about creating some mockups of the interface to the STAR project websites, which I will start on next week.
I continued to work on the new textbase search facilities for the Anglo-Norman Dictionary this week. I completed work on the required endpoints for the API, creating the facilities that would process a search term (with optional wildcards), limit the search to selected books and/or genres and return either full search results in the case of an exact search for a term or a list of possible matching terms and the number of occurrences of each term. I then worked on the front-end to enable a query to be processed and submitted to the API based on the choices made by the user.
By default any text entered will match any term that contains the text – e.g. enter ‘jour’ (without the quotation marks) and you’ll find all forms containing the characters ‘jour’ anywhere, e.g. ‘adjourner’, ‘journ’. If you want to do an exact match you have to use double quotes – “jour”. You can also use an asterisk at the beginning or end to match forms starting or ending with the term (‘jour*’ and ‘*jour’), while an asterisk at both ends (‘*jour*’) will only find forms that contain the term somewhere in the middle. You can also use a question mark wildcard to denote any single character, e.g. ‘am?n*’ will find words beginning ‘aman’, ‘amen’ etc.
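As a rough illustration of these rules, the Python sketch below turns a search term into a regular expression that is tested against a whole form. It is not the API's actual implementation, and it makes one assumption that may not hold on the live site: an asterisk is treated as "zero or more characters". If ‘*jour*’ has to exclude forms that begin or end with ‘jour’, the asterisk would need to mean "one or more characters" instead. The function names are hypothetical.

```python
import re

def term_to_regex(term):
    """Translate a search term into a pattern to be tested against a whole form."""
    term = term.strip().lower()
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        # Double quotes: exact match on the text between the quotes.
        body = re.escape(term[1:-1])
    elif "*" in term or "?" in term:
        parts = []
        for ch in term:
            if ch == "*":
                parts.append(".*")  # assumption: zero or more characters
            elif ch == "?":
                parts.append(".")   # exactly one character
            else:
                parts.append(re.escape(ch))
        body = "".join(parts)
    else:
        # Default: the text may appear anywhere in the form.
        body = ".*" + re.escape(term) + ".*"
    return re.compile(body)

def matches(term, form):
    """True if the form satisfies the search term (case-insensitively)."""
    return term_to_regex(term).fullmatch(form.lower()) is not None
```

For example, `matches('jour', 'adjourner')` is true, whereas `matches('"jour"', 'adjourner')` and `matches('jour*', 'adjourner')` are both false.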
If your selected form in your selected books / genres matches multiple forms then an intermediary page bringing up a list of matching forms and a count of the number of times each form appears will be displayed. This is the same as how the ‘translation’ advanced search works, for example, and I wanted to maintain a consistent way of doing things across the site. Select a specific form and the actual occurrences of each item in the texts will appear. Above this list is a ‘Select another form’ button that returns you to the intermediary page. If your search only brings back one form the intermediary page is skipped, and as all selection options appear in the URL it’s possible to bookmark / cite the search results too.
Whilst working on this I realised that I’d need to regenerate the data, as it became clear that many words had been erroneously joined together due to there being no space between words when one tag is closed and a following one is opened. When the tags are then stripped out the forms get squashed together, which has led to some crazy forms such as ‘amendeezreamendezremaundez’. Previously I’d not added spaces between tags as I was thinking that a space would have to be added before a closing tag (e.g. ‘</’ becomes ‘ </’) and this would potentially mess up words that have tags in them, such as superscript tags in names like McDonald. However, I realised I could instead do a find and replace to add spaces between a closing tag and an opening tag (‘><’ becomes ‘> <’), which would not mess up individual tags within words and wouldn’t have any further implications as I strip out all additional spaces when processing the texts for search purposes anyway.
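The find and replace can be sketched as follows. This is an illustrative Python version rather than my actual processing script, and the function name and the simple tag-matching pattern are assumptions made for the example.

```python
import re

def strip_tags(xml):
    """Remove tags while keeping a space where one tag closes and another opens."""
    # '><' becomes '> <': adjacent tags are separated, tags inside a word are not.
    spaced = xml.replace("><", "> <")
    # Drop the tags themselves (a simple pattern, adequate for this sketch).
    no_tags = re.sub(r"<[^>]+>", "", spaced)
    # Collapse any runs of whitespace, including the spaces added above.
    return re.sub(r"\s+", " ", no_tags).strip()

joined = strip_tags("<item>dEspayne</item><item>La charge</item>")
name = strip_tags("<name>M<sup>c</sup>Donald</name>")
```

With this approach the two list items come out as 'dEspayne La charge', while a name containing a superscript tag is still 'McDonald', because no '><' sequence occurs inside the word.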
I also decided that I should generate the ‘key-word in context’ (KWIC) for each word and store this in the database. I was going to generate this on the fly every time a search results page was displayed but it seems more efficient to generate and store this once rather than do it every time. I therefore updated my data processing script to generate the KWIC for each of the 3.5 million words as they were extracted from the texts. This took some time to both implement and execute. I decided to pull out the 10 words on either side of the term, which used the ‘word order’ column that gets generated as each page is processed. Some complications were introduced in cases where the term is either before the tenth word on the page or there are fewer than ten words after the term on the page. In such cases the script needed to look at the page before or after the current page in order to pull out the words and fill out the KWIC with the appropriate words on the other pages.
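The page-boundary handling can be illustrated with a short Python sketch. It is not the processing script itself: it assumes each page is simply a list of words in order, the function name and sample data are made up, and it only borrows from the one adjacent page, so a very short neighbouring page could still leave the context short.

```python
def build_kwic(pages, page_index, word_index, window=10):
    """Return (left, right) context word lists for the word at pages[page_index][word_index]."""
    page = pages[page_index]
    left = page[max(0, word_index - window):word_index]
    right = page[word_index + 1:word_index + 1 + window]
    # Too few words before the term: borrow from the end of the preceding page.
    if len(left) < window and page_index > 0:
        missing = window - len(left)
        left = pages[page_index - 1][-missing:] + left
    # Too few words after the term: borrow from the start of the following page.
    if len(right) < window and page_index + 1 < len(pages):
        missing = window - len(right)
        right = right + pages[page_index + 1][:missing]
    return left, right

# Made-up pages, with a window of 2 to keep the example short.
pages = [["a", "b", "c"], ["d", "e", "f"], ["g", "h"]]
first_on_page = build_kwic(pages, 1, 0, window=2)
last_on_page = build_kwic(pages, 1, 2, window=2)
```

For the first word on the middle page the left context is taken from the end of the previous page, and for the last word the right context is taken from the start of the next page.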
With the updates to data processing in place and a fair bit of testing of the KWIC facility carried out, I re-ran my scripts to regenerate the data and all looked good. However, after inserting the KWIC data the querying of the tables slowed to a crawl. On my local PC queries which were previously taking 0.5 seconds were taking more than 10 seconds, while on the server execution time was almost 30 seconds. It was really baffling as the only difference was the search words table now had two additional fields (KWIC left and KWIC right), neither of which were being queried or returned in the query. It seemed really strange that adding new columns could have such an effect if they were not even being used in a query. I had to spend quite a bit of time investigating this, including looking at MySQL settings such as key buffer size and trying again to change storage engines, switching from MyISAM to InnoDB and back again to see what was going on. Eventually I looked again at the indexes I’d created for the table, and decided to delete them and start over, in case this somehow jump-started the search speed. I previously had the ‘word stripped’ column indexed in a multiple column index with page ID and word type (either main page or textual apparatus). Instead I created an index of the ‘word stripped’ column on its own, and this immediately boosted performance. Queries that were previously taking close to 30 seconds to execute on the server were now taking less than a second. It was such a relief to have figured out what the issue was, as I had been considering whether my whole approach would need to be dropped and replaced by something completely different.
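One way to check whether a query actually uses an index is to ask the database for its query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since it runs without a server; it does not reproduce how MySQL's optimiser behaves, and the table, column and index names are hypothetical rather than those in the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE words (
           id INTEGER PRIMARY KEY,
           page_id INTEGER,
           word_type TEXT,
           word_stripped TEXT,
           kwic_left TEXT,
           kwic_right TEXT
       )"""
)
# An index on the search column alone, rather than as part of a multi-column index.
conn.execute("CREATE INDEX idx_word_stripped ON words (word_stripped)")
conn.executemany(
    "INSERT INTO words (page_id, word_type, word_stripped, kwic_left, kwic_right) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, "main", "jour", "en cel", "vint li"), (1, "main", "roi", "li bon", "fu")],
)
# Ask SQLite how it would run the search; the last column of each row describes a step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, page_id FROM words WHERE word_stripped = ?",
    ("jour",),
).fetchall()
plan_text = " ".join(row[3] for row in plan)
uses_index = "idx_word_stripped" in plan_text
```

In MySQL the equivalent check would be to prefix the query with `EXPLAIN` and look at which key, if any, is reported as being used.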
As I now had a useable search facility I continued to develop the front-end that would use this facility. Previously the exact match for a term was bringing up just the term in question and a link through to the page the term appeared on, but now I could begin to incorporate the KWIC text too. My initial idea was to use a tabular layout, with each word of the KWIC in a different column, with a clickable table heading that would allow the data to be ordered by any of the columns (e.g. order the data alphabetically by the first word to the left of the term). However, after creating such a facility I realised it didn’t work very well. The text just didn’t scan very well due to columns having to be the width of whatever the longest word in the column was, and the text just took up too much horizontal space. Instead, I decided to revert to using an unordered list, with the KWIC left and KWIC right in separate spans, with KWIC left right aligned to push it up against the search term no matter what the length of the KWIC left text. I split the KWIC text up into individual words and stored this in an array to enable each search result to be ordered by any word in the KWIC, and began working on a facility to change the order using a select box above the search results. This is as far as I got this week, but I’m pretty confident that I’ll get things finished next week. Here’s a screenshot of how the KWIC looks so far:
Also this week I had an email conversation with the other College of Arts developers about professional web designers after Stevie Barrett enquired about them, arranged to meet with Gerry Carruthers to discuss the journal he would like us to host, gave some advice to Thomas Clancy about mailing lists and spoke to Joanna Kopaczyk about a website she would like to set up for a conference she’s organising next year.
I’d taken last week off as our final break of the summer, and we spent it on the Kintyre peninsula. We had a great time and were exceptionally lucky with the weather. The rains began as we headed home and I returned to a regular week of work. My major task for the week was to begin work on the search facilities for the Anglo-Norman Dictionary’s textbase, a collection of almost 80 lengthy texts for which I had previously created facilities to browse and view texts. The editors wanted me to replicate the search options that were available through the old site, which enabled a user to select which texts to search (either individual texts or groups of texts arranged by genre), enter a single term to search (either a full match or partial match at the beginning or end of a word), select a specific term from a list of possible matches and then view each hit via a keyword in context (KWIC) interface, showing a specific number of words before and after the hit, with a link through to the full text opened at that specific point.
This is a pretty major development and I decided initially that I’d have two major tasks to tackle. I’d have to categorise the texts by their genre and I’d have to research how best to handle full text searching including limiting to specific texts, KWIC and reordering KWIC, and linking through to specific pages and highlighting the results. I reckoned it was potentially going to be tricky as I don’t have much experience with this kind of searching. My initial thought was to see whether Apache Solr might be able to offer the required functionality. I used this for the DSL’s advanced search, which searches the full text of the entries and returns snippets featuring the word, with the word highlighted and the word then highlighted throughout the entry when an entry in the results is loaded (e.g. https://dsl.ac.uk/results/dreich/fulltext/withquotes/both/). This isn’t exactly what is required here, but I hoped that there might be further options I can explore. Failing that I wondered whether I could repurpose the code for the Scottish Corpus of Texts and Speech. I didn’t create this site, but I redeveloped it significantly a few years ago and may be able to borrow parts from the concordance search. E.g. https://scottishcorpus.ac.uk/advanced-search/ and select ‘general’ then ‘word search’ then ‘word / phrase (concordance)’ then search for ‘haggis’ and scroll down to the section under the map. When opening a document you can then cycle through the matching terms, which are highlighted, e.g. https://scottishcorpus.ac.uk/document/?documentid=1572&highlight=haggis#match1.
After spending some further time with the old search facility and considering the issues I realised there are a lot of things to be considered regarding preparing the texts for search purposes. I can’t just plug the entire texts in as only certain parts of them should be used for searching – no front or back matter, no notes, textual apparatus or references. In addition, in order to properly ascertain which words follow on from each other all XML tags need to be removed too, and this introduces issues where no space has been entered between tags but a space needs to exist between the contents of the tags, e.g. ‘dEspayne</item><item>La charge’ would otherwise become ‘dEspayneLa charge’.
As I’d need to process the texts no matter which search facility I end up using I decided to focus on this first, and set up some processing scripts and a database on my local PC to work with the texts. Initially I managed to extract the page contents for each required page, remove notes etc and strip the tags and line breaks so that the page content is one continuous block of text.
I realised that the old search seems to be case sensitive, which doesn’t seem very helpful. E.g. search for ‘Leycestre’ and you find nothing – you need to enter ‘leycestre’, even though all 264 occurrences actually have a capital L. I decided to make the new search case insensitive – so searching for ‘Leycestre’, ‘leycestre’ or ‘LEYCESTRE’ will bring back the same results. Also, the old search limits the keyword in context display to pages. E.g. the first ‘Leycestre’ hit has no text after it as it’s the last word on the page. I’m intending to take the same approach as I’m processing text on a page-by-page basis. I may be able to fill out the KWIC with text from the preceding / subsequent page if you consider this to be important, but it would be something I’d have to add in after the main work is completed. The old search also limits the KWIC to text that’s on the same line, e.g. in a search for ‘arcevesque’ the result ‘L’arcevesque puis metre en grant confundei’ has no text before because it’s on a different line (it also chops off the end of ‘confundeisun’ for some reason). The new KWIC will ignore breaks in the text (other than page breaks) when displaying the context. I also realised that I need to know what to do about words that have apostrophes in them. The old search splits words on the apostrophe, so for example you can search for arcevesque but not l’arcevesque. I’m intending to do the same. The old search retains both parts before and after the apostrophe as separate search terms, so for example in “qu’il” you can search for “qu” and “il” (but not “qu’il”).
After some discussions with the editor, I updated my system to include textual apparatus, stored in a separate field to the main page text. With all of the text extracted I decided that I’d just try and make my own system initially, to see whether it would be possible. I therefore created a script that would take each word from the extracted page and textual apparatus fields and store this in a separate table, ensuring that words with apostrophes in them are split into separate words and for search purposes all non-alphanumeric characters are removed and the text is stored as lower-case. I also needed to store the word as it actually appears in the text, the word order on the page and whether the word is a main page word or in the textual apparatus. This is because after finding a word I’ll need to extract those around it for the KWIC display. After running my script I ended up with around 3.5 million rows in the ‘words’ table, and this is where I ran into some difficulties.
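The word extraction step can be sketched roughly as below. This is a simplified Python illustration rather than my actual script: the function name and field names are hypothetical, and keeping the apostrophe attached to the first half of the displayed word is an assumption made for the example.

```python
import re

def tokenise_page(text, word_type="main"):
    """Split page text into rows holding the displayed word, its search form and its order."""
    rows = []
    order = 0
    for chunk in text.split():
        # Split after an apostrophe so that "qu’il" yields "qu’" and "il".
        for piece in re.split(r"(?<=['’])", chunk):
            # Search form: lower case with all non-alphanumeric characters removed.
            stripped = "".join(ch for ch in piece.lower() if ch.isalnum())
            if not stripped:
                continue  # nothing searchable, e.g. stray punctuation
            order += 1
            rows.append(
                {"word": piece, "stripped": stripped, "order": order, "type": word_type}
            )
    return rows

rows = tokenise_page("L’arcevesque puis metre,")
```

Each row keeps the word as written alongside its search form, so a search for 'arcevesque' can match on the stripped form while the display still shows the original text.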
I ran some test queries on the local version of the database and all looked pretty promising, but after copying the data to the server and running the same queries it appeared that the server is unusably slow. On my desktop a query to find all occurrences of ‘jour’, with the word table joined to the page table and then to the text table completed in less than 0.5 seconds but on the server the same query took more than 16 seconds, so about 32 times slower. I tried the same query a couple of times and the results are roughly the same each time. My desktop PC is a Core i5 with 32GB of RAM, and the database is running on an NVMe M.2 SSD, which no doubt makes things quicker, but I wouldn’t expect it to be 32 times quicker.
I then did some further experiments with the server. When I query the table containing the millions of rows on its own the query is fast (much less than a second). I added a further index to the column that is used for the join to the page table (previously it was indexed, but in combination with other columns) and then when limiting the query to just these two tables the query runs at a fairly decent speed (about 0.5 seconds). However, the full query involving all three tables still takes far too long, and I’m not sure why. It’s very odd as there are indexes on the joining columns and the additional table is not big – it only has 77 rows. I read somewhere that ordering the results by a column in the joined table can make things slower, as can using descending order on a column, so I tried updating the ordering but this has had no effect. It’s really weird – I just can’t figure out why adding the table has such a negative effect on the performance and I may end up just having to incorporate some of the columns from the text table into the page table, even though it will mean duplicating data. I also still don’t know why the performance is so different on my local PC either.
One final thing I tried was to change the database storage type. I noticed that the three tables were set to use MyISAM storage rather than InnoDB, which the rest of the database was set to. I migrated the tables to InnoDB in the hope that this might speed things up, but it’s actually slowed things down, both on my local PC and the server. The two-table query now takes several seconds while the three-table query now takes about the same, so is quicker, but still too slow. On my desktop PC the query time has doubled to about 1 second. I therefore reverted to using MyISAM.
Also this week I had a chat with Eleanor Lawson about the STAR project that has recently begun. There was a project meeting last week that unfortunately I wasn’t able to attend due to my holiday, so we had an email conversation about some of the technical issues that were raised at the meeting, including how it might be possible to view videos side by side and how a user may choose to select multiple videos to be played automatically one after the other.
I also fixed a couple of minor formatting issues for the DSL people and spoke to Katie Halsey, PI of the Books and Borrowing project about the development of the API for the project and the data export facilities. I also received further feedback from Kirsteen McCue regarding the Data Management Plan for her AHRC proposal and went through this, responding to the comments and generating a slightly tweaked version of the plan.