Week Beginning 18th April 2022

I divided most of my time between the Speak For Yersel project and the Dictionaries of the Scots Language this week.  For Speak For Yersel I continued to work on the user management side of things.  I implemented the registration form (apart from the ‘where you live’ bit, which still requires data) and all now works, uploading the user’s details to our database and saving them within the user’s browser using HTML5 Storage.  I also added checks to ensure that a year of birth and a gender are supplied.
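As a rough illustration of the idea (not the actual site code), the sketch below validates the two required fields and then saves the details via the Web Storage interface. The field names, the ‘sfyUser’ key and the 1900 lower bound on the year are all assumptions made for the example.

```javascript
// Hypothetical field names; the real registration form may use different ones.
function validateRegistration(details) {
  const errors = [];
  const year = Number(details.yearOfBirth);
  // The 1900 lower bound is an arbitrary assumption for this sketch.
  if (!Number.isInteger(year) || year < 1900 || year > new Date().getFullYear()) {
    errors.push("yearOfBirth");
  }
  if (typeof details.gender !== "string" || details.gender.trim() === "") {
    errors.push("gender");
  }
  return errors;
}

// `storage` is anything offering the Web Storage methods setItem / getItem,
// e.g. window.localStorage in a browser. "sfyUser" is a made-up key.
function saveUser(storage, details) {
  const errors = validateRegistration(details);
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  storage.setItem("sfyUser", JSON.stringify(details));
  return { ok: true, errors: [] };
}

// Returns the stored details, or null if nothing has been saved yet.
function loadUser(storage) {
  const raw = storage.getItem("sfyUser");
  return raw === null ? null : JSON.parse(raw);
}
```

In a browser this would be called as saveUser(window.localStorage, details); passing the storage object in also makes it easy to test with a stand-in.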

I then updated all activities and quizzes so that the user’s answers are uploaded to our database, tracking the user throughout the site so we can tell which user has submitted what.  For the ‘click map’ activity I also record the latitude and longitude of the user’s markers when they check their answers, although a user can check their answers multiple times, and each time the answers will be logged, even if the user has pressed ‘view correct locations’ first.  Transcript sections and specific millisecond times are stored in our database for the main click activity now, and I’ve updated the interface for this so that the output is no longer displayed on screen.

With all of this in place I then began working on the maps, replacing the placeholder maps and their sample data with maps that use real data.  Now when a user selects an option a random location within their chosen area is generated and stored along with their answer.  As we still don’t have selectable area data at the point of registration, whenever you register with the site at the moment you are randomly assigned to one of our 411 areas, so by registering and answering some questions test data is then generated.  My first two test users were assigned areas south of Inverness and around Dunoon.
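One common way to generate a random location inside an area is rejection sampling: pick random points in the area’s bounding box until one falls inside the polygon.  The sketch below shows that approach; it is not necessarily how the site does it, and it assumes each area is a single ring of [longitude, latitude] pairs with no holes.

```javascript
// Ray-casting test: is the point [x, y] inside the polygon ring?
function pointInPolygon([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    // Does a horizontal ray from the point cross the edge (j -> i)?
    const crosses =
      (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Rejection sampling within the bounding box of the ring.
// Returns [x, y], or null if no point was found within maxTries attempts.
function randomPointInPolygon(ring, rng = Math.random, maxTries = 1000) {
  const xs = ring.map((p) => p[0]);
  const ys = ring.map((p) => p[1]);
  const minX = Math.min(...xs);
  const maxX = Math.max(...xs);
  const minY = Math.min(...ys);
  const maxY = Math.max(...ys);
  for (let i = 0; i < maxTries; i++) {
    const candidate = [
      minX + rng() * (maxX - minX),
      minY + rng() * (maxY - minY),
    ];
    if (pointInPolygon(candidate, ring)) return candidate;
  }
  return null;
}
```

For small areas, treating longitude and latitude as flat x and y coordinates like this should be accurate enough for placing a marker.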

With location data now being saved for answers I then updated all of the maps on the site to remove the sample data and display the real data.  The quiz and ‘explore’ maps are not working properly yet but the general activity ones are.  I replaced the geographical areas visible on the map with those used in the click map, as requested, but have removed the colours we used on the click map as they were making the markers hard to see.  Acceptability questions use the four rating colours that were used on the sample maps.  Other questions use the ‘lexical’ colours (up to 8 different ones) as specified.

The markers were very small and difficult to spot when there were so few of them so I added a little check that alters their size depending on the number of returned markers.  If there are fewer than 100 then each marker is size 6.  If there are 100 or more then the size is 3.  Previously all markers were size 2.  I may update the marker size or put more granular size options in place in future.  The answer submitted by the current user appears on the map when they view the map, which I think is nice.  There is still a lot to do, though.  I still need to implement a legend for the map so you can actually tell which coloured marker refers to what, and also provide links to the audio clips where applicable.  I also still need to implement the quiz question and ‘explore’ maps as I mentioned.  I’ll look into these issues next week.
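The size rule described above amounts to something like the following.  The function name is made up for illustration, and how the returned size is applied will depend on the mapping library in use.

```javascript
// Marker size based on how many markers the map has to show:
// fewer than 100 markers -> size 6, otherwise size 3.
function markerSize(markerCount) {
  return markerCount < 100 ? 6 : 3;
}
```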

For the DSL I processed the latest data export from the DSL’s editing system and set up a new version of the API that uses it.  The test DSL website now uses this API and is pretty much ready to go live next week.  After that I spent some time tweaking the search facilities of the new site.  Rhona had noticed that searches involving two single character wildcards (question marks) were returning unexpected results and I spent some time investigating this.

The problem turned out to have been caused by two things.  Firstly, question marks are tricky things in URLs as they mean something very specific: they signify the end of the main part of the URL and the beginning of a list of variables passed in the URL.  So for example in a SCOTS corpus URL like https://scottishcorpus.ac.uk/search/?word=scunner&search=Search the question mark tells the browser and the server-side scripts to start looking for variables.  When you want a URL to feature a question mark and for it not to be treated like this you have to encode it, and the URL code for a question mark is ‘%3F’.  This encoding needs to be done in the JavaScript running in the browser before it redirects to the URL.  Unfortunately JavaScript’s string replace function is rather odd in that, when given a plain string as the search pattern, it only finds and replaces the first occurrence and ignores all others.  This is what was happening when you did a search that included two question marks – the first was being replaced with ‘%3F’ and the second stayed as a regular question mark.  When the browser then tried to load the URL it found a regular question mark and cut off everything after it.  This is why a search for ‘sc?’ was being performed and it’s also why all searches ended up as quick searches – the rest of the content in the URL after the second question mark was being ignored, which included details of what type of search to run.
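The behaviour is easy to reproduce.  The snippet below is only a demonstration: the example.org address and its path layout are invented and are not the DSL site’s real URL structure.

```javascript
const term = "sc??m";

// A plain-string pattern replaces only the first match.
const firstOnly = term.replace("?", "%3F"); // "sc%3F?m"

// A regular expression with the g flag replaces every match.
const allReplaced = term.replace(/\?/g, "%3F"); // "sc%3F%3Fm"

// encodeURIComponent encodes every question mark (and other reserved
// characters) in one go, which is usually the safer option.
const encoded = encodeURIComponent(term); // "sc%3F%3Fm"

// What the partially encoded term does to a URL (hypothetical address):
// everything after the stray "?" is treated as the query string.
const broken = new URL("https://example.org/results/" + firstOnly + "/headword");
const brokenPath = broken.pathname;   // ends with "sc%3F"
const brokenQuery = broken.search;    // "?m/headword"
```

Either the regular expression version or encodeURIComponent avoids the truncation, since no unencoded question mark is left in the path.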

A second thing was causing further problems:  A quick search by default performs an exact match search (surrounded by double quotes) if you ignore the dropdown suggestions and press the search button.  But an exact match was set up to be just that – single wildcard characters were not being treated as wildcard characters, meaning a search for “sc??m” was looking for exactly that and finding nothing.  I’ve fixed this now, allowing single character wildcards to appear within an exact search.
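To illustrate what treating ‘?’ as a wildcard inside an exact search means, here is a minimal sketch that turns such a search term into an anchored regular expression.  The real fix was made in the site’s own search code, which may work quite differently; the function names and the sample word list here are invented for the example.

```javascript
// Convert an exact-match term into a RegExp in which each "?" stands for
// exactly one character and everything else is matched literally.
function exactPatternToRegExp(pattern) {
  const source = Array.from(pattern)
    .map((ch) =>
      ch === "?" ? "." : ch.replace(/[.*+^${}()|[\]\\]/g, "\\$&")
    )
    .join("");
  // ^ and $ keep the match exact; "i" ignores case, "u" handles Unicode.
  return new RegExp("^" + source + "$", "iu");
}

// Filter a list of headwords by an exact term that may contain "?".
function exactSearch(headwords, pattern) {
  const re = exactPatternToRegExp(pattern);
  return headwords.filter((word) => re.test(word));
}

// Sample list for demonstration only.
const sampleMatches = exactSearch(["scaum", "scrim", "scum", "scream"], "sc??m");
// sampleMatches is ["scaum", "scrim"]
```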

After fixing this we realised that the new site’s use of the asterisk wildcard didn’t match its use in the live site.  Rhona was expecting a search such as ‘sc*m’ to work on the new site, returning all headwords beginning ‘sc’ and ending in ‘m’.  However, in the new site the asterisk wildcard only matches the beginning or end of words, e.g. ‘wor*’ finds all words beginning with ‘wor’ and ‘*ord’ finds all words ending with ‘ord’.  You can combine the two with a Boolean search, though: ‘sc* AND *m’ and this will work in exactly the same way as ‘sc*m’.

However, I decided to enable the mid-wildcard search on the new site in addition to using Boolean AND, because it’s better to be consistent with the old site, plus I also discovered that the full text search in the new site does allow for mid-asterisk searches.  I therefore spent a bit of time implementing the mid-asterisk search, both in the drop-down list of options in the quick search box as well as the main quick search and the advanced search headword search.
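The logic of a mid-asterisk match can be sketched as a prefix test plus a suffix test, which is why it behaves like the Boolean AND version.  This is only an illustration with invented function names, assuming a single asterisk per term; the real searches are handled by the site’s API.

```javascript
// Does `word` match a term containing a single "*" (zero or more characters)?
function matchesMidAsterisk(word, pattern) {
  const star = pattern.indexOf("*");
  if (star === -1) return word === pattern; // no wildcard: plain comparison
  const prefix = pattern.slice(0, star);
  const suffix = pattern.slice(star + 1);
  return (
    // The prefix and suffix must not overlap within the word.
    word.length >= prefix.length + suffix.length &&
    word.startsWith(prefix) &&
    word.endsWith(suffix)
  );
}

// The Boolean equivalent: "prefix* AND *suffix".
function matchesBooleanAnd(word, prefix, suffix) {
  return word.startsWith(prefix) && word.endsWith(suffix);
}
```

For a term like ‘sc*m’ the two functions always agree.  They can only differ when the start and end of the term could overlap in a word: ‘ababa’ satisfies ‘aba* AND *aba’ but not ‘aba*aba’.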

Rhona then spotted that a full-text mid-asterisk search was listing results alphabetically rather than by relevance.  I looked into this and it seems to be a limitation with that sort of wildcard search in the Solr search engine.  If you look here https://solr.apache.org/guide/8_7/the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser the penultimate bullet point says “Range queries (“[a TO z]”), prefix queries (“a*”), and wildcard queries (“a*b”) are constant-scoring (all matching documents get an equal score).”

I’m guessing the original API that powers the live site uses Lucene rather than Solr’s indexing system, but I don’t really know for certain.  Also, while the live site’s ordering of mid-asterisk wildcard searches is definitely not alphabetical, it doesn’t really seem to be organising properly by relevance either.  I’m afraid we might just have to live with alphabetical ordering for mid-asterisk search results, and I’ll alter the ‘Results are ordered’ statement in such cases to make it clearer that the ordering is alphabetical.

My final DSL tasks for the week were to make some tweaks to the XSLT that processes the layout of bibliographical entries.  This involved fixing the size of author names, ensuring that multiple authors are handled correctly and adding in editors’ names for SND items.  I also spotted a few layout issues that are still cropping up.  The order of some elements is displayed incorrectly, and some individual <bibl> items have multiple titles, which the stylesheet isn’t expecting, so it only displays the first title.  I think I may need to completely rewrite the stylesheet to fix these issues.  As there were lots of rules for arranging the bibliography I wrote the stylesheet to pick out and display specific elements rather than straightforwardly going through the XML and transforming each XML tag into a corresponding HTML tag.  This meant I could ensure (for example) authors always appear first and titles each get indented, but it is rather rigid – any content that isn’t structured as the stylesheet expects may get displayed in the wrong place or not at all (like the unexpected second titles).  I’m afraid I’m not going to have time to rewrite the stylesheet before the launch of the new site next week and this update will need to be added to the list of things to do for a future release.

Also this week I fixed an issue with the Historical Thesaurus which involved shifting a category and its children one level up and helped sort out an issue with an email address for a project using a top-level ‘ac.uk’ domain.  Next week I’ll hopefully launch the new version of the DSL on Tuesday and press on with the outstanding Speak For Yersel exercises.