Week Beginning 2nd November 2020

I spent a lot of this week continuing to work on the redevelopment of the Anglo-Norman Dictionary website, focussing on the search facilities.  I made some tweaks to the citation search that I’d developed last week, ensuring that the intermediate ‘pick a form’ screen appears even if only one search word is returned, and updating the search results to omit forms and labels but include the citation dates and siglums, the latter opening up pop-ups as on the entry pages.  I also needed to regenerate the search terms, as I’d realised that due to a typo in my script a number of punctuation marks that should have been stripped out remained, meaning some duplicate forms were being listed, sometimes with a punctuation mark such as a colon and other times ‘clean’.

I also realised that I needed to update the way apostrophes were being handled.  In my script these were simply stripped out, but this wasn’t working very well, as forms like ‘s’oreille’ then became ‘soreille’ when really it’s the ‘oreille’ part that’s important.  However, I couldn’t just split words on an apostrophe and use the part on the right, as apostrophes appear elsewhere in the citations too.  I managed to write a script that successfully splits words on apostrophes and retains the sections on both sides as individual search word forms (provided they are alphanumeric).  Whilst writing this script I also fixed an issue with how the data stripped of XML tags is processed.  Occasionally there are no spaces between a word and a tag that contains data, and when my script removed tags to generate the plain text required for extracting the search words, this led to a word and the contents of the following tag being squashed together, resulting in forms such as ‘apresentDsxiii1’.  By adding spaces around the tags before stripping them I got around this problem, as sketched below.
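To give a sense of the clean-up involved, here’s a minimal sketch of the process in PHP; the function and variable names are my own illustration rather than the actual script:

```php
<?php
// Illustrative sketch of the clean-up steps described above.
function extractSearchForms(string $citationXml): array
{
    // Pad tags with spaces before stripping them, so that a word and the
    // contents of an adjacent tag don't get squashed together into
    // forms like 'apresentDsxiii1'.
    $spaced = str_replace('<', ' <', str_replace('>', '> ', $citationXml));
    $plain = strip_tags($spaced);

    $forms = [];
    foreach (preg_split('/\s+/', $plain, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Strip punctuation such as colons, but keep apostrophes for now.
        $word = preg_replace("/[^\p{L}\p{N}']/u", '', mb_strtolower($word));
        // Split on apostrophes and keep both sides as separate forms,
        // so s'oreille yields both 's' and 'oreille'.
        foreach (explode("'", $word) as $part) {
            if ($part !== '' && preg_match('/^[\p{L}\p{N}]+$/u', $part)) {
                $forms[] = $part;
            }
        }
    }
    return array_unique($forms);
}
```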

With these tweaks in place I then moved on to the next advanced search option:  the English translations.  I extracted the translations from the XML and generated the unique words found in each (with a count of their occurrences), also storing the Sense IDs for the senses in which the translations were found.  This allows the translations to be connected up to the citations found within the senses in order to enable a date search (i.e. limiting a search to only those translations that are in a sense that has a citation in a particular year or range of years).  The search works in a similar way to the citation search:  you enter a search term (e.g. ‘bread’) and this leads you to an intermediary page that lists all words in translations that match ‘bread’.  You can then select one to view all of the entries whose translations feature the word, with the word highlighted.  If you supply a year or a range of years then the search connects to the citations and only returns translations for senses that have a citation date in the specified year or range; citations and translations are connected via the ‘senseid’ in the XML.  So, for example, if you only want to find translations containing ‘bread’ that have a citation between 1350 and 1400 you can do so.  There are still some tweaks to be made.  For example, one inconsistency we might need to address is that the number in brackets on the intermediary page refers to the number of translations / citations the word is found in, whereas when you click through to the full results the ‘matched results’ number will likely be different, because it refers to matched entries, and an entry may contain more than one matching translation / citation.
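As an illustration of how this connection via the sense ID might work, here’s a hypothetical sketch in PHP; the table and column names are my own invention and not the actual AND database schema:

```php
<?php
// Hypothetical illustration of limiting a translation search by citation
// date via the shared sense ID. Table and column names are invented.
function searchTranslations(PDO $db, string $word, ?int $from, ?int $to): array
{
    $sql = "SELECT DISTINCT t.entry_id, t.translation
            FROM translation_words w
            JOIN translations t ON t.id = w.translation_id
            WHERE w.word = :word";
    $params = ['word' => $word];
    if ($from !== null && $to !== null) {
        // Only keep translations whose sense has at least one citation
        // dated within the requested range.
        $sql .= " AND EXISTS (SELECT 1 FROM citations c
                  WHERE c.senseid = t.senseid
                  AND c.year BETWEEN :from AND :to)";
        $params['from'] = $from;
        $params['to'] = $to;
    }
    $stmt = $db->prepare($sql);
    $stmt->execute($params);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```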

I then moved on to the final advanced search option, the label search.  This proved to be a pretty tricky undertaking, especially when citation dates also have to be taken into consideration.  I didn’t manage to get the search working this week, but I did get the form for building up your label query in place on the advanced search page.  If you select the ‘Semantic & Usage Labels’ tab you should see a page with a ‘citation date’ box, a section on the left that lists the labels, and a section on the right where your selection gets added.  I considered using tooltips for the semantic label descriptions, but decided against it, as tooltips don’t work so well on touchscreens and I thought the information would be pretty important to see.  Instead the description (where available) appears in a smaller font underneath the label, with all labels appearing in a scrollable area.  The number on the right is the number of senses (not entries) that have the label applied to them, as you can see in the following screenshot:

As mentioned above, things are seriously complicated by the inclusion of citation dates.  Unlike with the other search options, choosing a date or a range here affects the search options that are available.  E.g. if you select the years 1405-1410 then the labels used in this period, and the number of times they are used, differ markedly from the full dataset.  For this reason the ‘citation date’ field appears above the label section, and when you update the ‘citation date’ the label section automatically updates to display only the labels and counts that are relevant to the years you have selected.  Removing everything from the ‘citation date’ field resets the display of labels.
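Behind the scenes this presumably boils down to a per-label count of senses restricted by citation year.  A hypothetical sketch, again with invented table names:

```php
<?php
// Sketch of recalculating label counts when a citation date range is
// entered. Table and column names are hypothetical.
function labelCountsForRange(PDO $db, int $from, int $to): array
{
    // Count senses (not entries) per label, restricted to senses that
    // have at least one citation dated within the range.
    $stmt = $db->prepare(
        "SELECT sl.label_id, COUNT(DISTINCT sl.senseid) AS sense_count
         FROM sense_labels sl
         WHERE EXISTS (SELECT 1 FROM citations c
                       WHERE c.senseid = sl.senseid
                       AND c.year BETWEEN :from AND :to)
         GROUP BY sl.label_id"
    );
    $stmt->execute(['from' => $from, 'to' => $to]);
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```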

When you find a label you want to search for, pressing on the label area adds it to the ‘selected labels’ section on the right.  Pressing it a second time deselects the label and removes it from the ‘selected labels’ section.  If you select more than one label, a Boolean selector appears between the selected label and the one before it, allowing you to choose AND, OR, or NOT, as you can see in the above screenshot.

I made a start on actually processing the search, but it’s not complete yet and I’ll have to return to it next week.  Building complex queries is going to be tricky, as without a formal querying language like SQL there are ambiguities that can’t automatically be resolved by the interface I’m creating.  E.g. how should ‘X AND Y NOT Z OR B’ be interpreted?  Is it ‘(X AND Y) NOT (Z OR B)’, ‘((X AND Y) NOT Z) OR B’, ‘(X AND (Y NOT Z)) OR B’, etc.?  Each interpretation would give markedly different results.  Adding more than two or possibly three labels is likely to lead to confusing results for people.
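One possible way to sidestep the ambiguity would be to evaluate the chain strictly left to right, so that ‘X AND Y NOT Z OR B’ always means ‘((X AND Y) NOT Z) OR B’.  A sketch of that approach in PHP, treating each label as a set of matching sense IDs (the function name is my own):

```php
<?php
// Evaluate a label query strictly left to right using set operations.
// $labelSenses maps each label to the array of sense IDs it applies to.
function evaluateLabelQuery(array $labels, array $operators, array $labelSenses): array
{
    $result = $labelSenses[$labels[0]] ?? [];
    foreach ($operators as $i => $op) {
        $next = $labelSenses[$labels[$i + 1]] ?? [];
        switch ($op) {
            case 'AND': // keep only senses with both labels
                $result = array_intersect($result, $next);
                break;
            case 'OR':  // keep senses with either label
                $result = array_unique(array_merge($result, $next));
                break;
            case 'NOT': // drop senses with the next label
                $result = array_diff($result, $next);
                break;
        }
    }
    return $result;
}
```

With this rule, evaluateLabelQuery(['X', 'Y', 'Z', 'B'], ['AND', 'NOT', 'OR'], $sets) always produces one unambiguous result, though whether it’s the result the user intended is another matter.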

Other than working on the AND I spent some time this week working on the Place-names of Iona project.  We had a team meeting on Friday morning and after that I began work on the interface for the website.  This involved the usual tasks:  installing a theme, customising fonts, selecting colour schemes, adding in logos, and creating menus and an initial site structure.  As with the Mull site, the Iona site is going to be bilingual (English and Gaelic), so I needed to set this up too.  I also worked on the banner image, combining a lovely photo of Iona from Shutterstock with a map image from the NLS.  It’s almost all in place now, but I’ll need to make a few further tweaks next week.  I also set up the CMS for the project, as we have decided not to simply share the Mull CMS.  I migrated the CMS and all of its data across and then worked on a script that picks out only those place-names from the Mull dataset that are of relevance to the Iona project.  I did this by drawing a box around the island using this handy online interface: https://geoman.io/geojson-editor and then grabbing the coordinates.  I needed to swap the latitude and longitude of these, as GeoJSON stores coordinates as [longitude, latitude] rather than the latitude-first order used by other systems, and then I plugged them into a nice little algorithm I discovered for working out which coordinates are within a polygon (see https://assemblysys.com/php-point-in-polygon-algorithm/).  This resulted in about 130 names being identified, but I’ll need to check next week whether my polygon area needs to be increased.
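For reference, the linked algorithm is a PHP implementation of the standard ray-casting test, which in compact form looks something like this (a sketch, not the project’s actual code):

```php
<?php
// Standard ray-casting point-in-polygon test. $polygon is an array of
// [lat, lng] vertex pairs. Remember that GeoJSON stores points as
// [longitude, latitude], so coordinates copied from the geojson-editor
// need swapping before being compared against lat/lng data.
function pointInPolygon(float $lat, float $lng, array $polygon): bool
{
    $inside = false;
    $n = count($polygon);
    for ($i = 0, $j = $n - 1; $i < $n; $j = $i++) {
        [$latI, $lngI] = $polygon[$i];
        [$latJ, $lngJ] = $polygon[$j];
        // Count how many polygon edges a ray from the point crosses;
        // an odd number of crossings means the point is inside.
        if (($latI > $lat) !== ($latJ > $lat)
            && $lng < ($lngJ - $lngI) * ($lat - $latI) / ($latJ - $latI) + $lngI) {
            $inside = !$inside;
        }
    }
    return $inside;
}
```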

For the remainder of the week I upgraded all of the WordPress sites I manage to the most recent version (I manage 39 such sites, so this took a little while).  I also helped Simon Taylor to access the Berwickshire and Kirkcudbrightshire place-names systems again and fixed an access issue with the Books and Borrowing CMS.  I also looked into an issue with the DSL test sites, as the advanced searches on each of these had stopped working.  This was caused by an issue with the Solr indexing server, which thankfully Arts IT Support were able to address.

Next week I’ll continue with the AND redevelopment and also return to working on the DSL for the first time in quite a while.