Week Beginning 18th April 2022

I divided most of my time between the Speak For Yersel project and the Dictionaries of the Scots Language this week.  For Speak For Yersel I continued to work on the user management side of things.  I implemented the registration form (apart from the ‘where you live’ bit, which still requires data) and it all now works, uploading the user’s details to our database and saving them within the user’s browser using HTML5 Storage.  I also added in checks to ensure that a year of birth and a gender are supplied.
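
In outline the flow looks something like the sketch below (the endpoint, field and storage key names are illustrative rather than the project’s actual ones):

// Registration sketch: upload the details to the database, then remember
// the user in the browser with HTML5 Storage.  '/api/register' and
// 'sfyUser' are hypothetical names.
async function registerUser(yearOfBirth, gender) {
  // Both fields must be supplied
  if (!yearOfBirth || !gender) {
    alert('Please supply a year of birth and a gender.');
    return;
  }
  const response = await fetch('/api/register', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ yearOfBirth: yearOfBirth, gender: gender })
  });
  const user = await response.json();
  // localStorage persists even after the browser is closed
  localStorage.setItem('sfyUser', JSON.stringify(user));
}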

I then updated all activities and quizzes so that the user’s answers are uploaded to our database, tracking the user throughout the site so we can tell which user has submitted what.  For the ‘click map’ activity I also record the latitude and longitude of the user’s markers when they check their answers.  A user can check their answers multiple times, and each time the answers will be logged, even if the user has pressed the ‘view correct locations’ button first.  Transcript sections and specific millisecond times are now stored in our database for the main click activity too, and I’ve updated the interface for this so that the output is no longer displayed on screen.
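
The logging itself is a simple POST of the answer details along with the stored user ID.  A rough sketch (the endpoint and field names are hypothetical, not the project’s actual API):

// Log an answer against the registered user.  The user object was saved
// to localStorage at registration (see the registration sketch above).
async function logAnswer(questionId, answerId, lat, lng) {
  const user = JSON.parse(localStorage.getItem('sfyUser'));
  await fetch('/api/answers', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      userId: user.id,          // hypothetical field name
      questionId: questionId,
      answerId: answerId,
      lat: lat,                 // only supplied for the click map activity
      lng: lng
    })
  });
}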

With all of this in place I then began working on the maps, replacing the placeholder maps and their sample data with maps that use real data.  Now when a user selects an option a random location within their chosen area is generated and stored along with their answer.  As we still don’t have selectable area data at the point of registration, whenever you register with the site at the moment you are randomly assigned to one of our 411 areas, so by registering and answering some questions test data is then generated.  My first two test users were assigned areas south of Inverness and around Dunoon.
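
One straightforward way to pick a random point inside a polygonal area is rejection sampling over its bounding box with a point-in-polygon test.  A sketch along these lines (not necessarily the exact method the site uses; ‘ring’ is an array of [lng, lat] pairs, as in the outer ring of a GeoJSON Polygon):

// Pick a random point inside a polygon ring by sampling the bounding
// box and retrying until the point falls inside the ring.
function randomPointInRing(ring) {
  const lngs = ring.map(p => p[0]);
  const lats = ring.map(p => p[1]);
  const minLng = Math.min(...lngs), maxLng = Math.max(...lngs);
  const minLat = Math.min(...lats), maxLat = Math.max(...lats);
  while (true) {
    const lng = minLng + Math.random() * (maxLng - minLng);
    const lat = minLat + Math.random() * (maxLat - minLat);
    if (pointInRing(lng, lat, ring)) return [lng, lat];
  }
}

// Standard ray-casting point-in-polygon test
function pointInRing(lng, lat, ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i], [xj, yj] = ring[j];
    if ((yi > lat) !== (yj > lat) &&
        lng < ((xj - xi) * (lat - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

Rejection sampling can in principle take a while on very long, thin areas, but for postcode-sized polygons it almost always succeeds within a handful of attempts.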

With location data now being saved for answers I then updated all of the maps on the site to remove the sample data and display the real data.  The quiz and ‘explore’ maps are not working properly yet, but the general activity ones are.  I replaced the geographical areas visible on the map with those used in the click map, as requested, but removed the colours we used on the click map as they were making the markers hard to see.  Acceptability questions use the four rating colours that were used on the sample maps, while other questions use the ‘lexical’ colours (up to 8 different ones) as specified.

The markers were very small and difficult to spot when there were so few of them, so I added a little check that alters their size depending on the number of returned markers: if there are fewer than 100 then each marker is size 6; if there are 100 or more then the size is 3 (previously all markers were size 2).  I may update the marker size or put more granular size options in place in future.  The answer submitted by the current user appears on the map when they view it, which I think is a nice touch.  There is still a lot to do, though.  I still need to implement a legend for the map so you can actually tell which coloured marker refers to what, and also provide links to the audio clips where applicable.  I also still need to implement the quiz question and ‘explore’ maps, as mentioned above.  I’ll look into these issues next week.
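
The check itself is trivial.  Assuming Leaflet-style circle markers (an assumption on my part, and the variable names are illustrative), it amounts to something like:

// Fewer than 100 markers get radius 6, otherwise radius 3
function markerRadius(totalMarkers) {
  return totalMarkers < 100 ? 6 : 3;
}

// 'answers' and 'map' stand in for the real data and map objects
const radius = markerRadius(answers.length);
answers.forEach(function (a) {
  L.circleMarker([a.lat, a.lng], { radius: radius, color: a.colour }).addTo(map);
});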

For the DSL I processed the latest data export from the DSL’s editing system and set up a new version of the API that uses it.  The test DSL website now uses this API and is pretty much ready to go live next week.  I then spent some time tweaking the search facilities of the new site.  Rhona had noticed that searches involving two single-character wildcards (question marks) were returning unexpected results, so I investigated.

The problem turned out to have been caused by two things.  Firstly, question marks are tricky things in URLs as they mean something very specific: they signify the end of the main part of the URL and the beginning of a list of variables passed in the URL.  So for example in a SCOTS corpus URL like https://scottishcorpus.ac.uk/search/?word=scunner&search=Search the question mark tells the browser and the server-side scripts to start looking for variables.  When you want a URL to feature a question mark and for it not to be treated like this you have to encode it, and the URL code for a question mark is ‘%3F’.  This encoding needs to be done in the JavaScript running in the browser before it redirects to the URL.  Unfortunately JavaScript’s string replace function is rather odd in that by default it only finds and replaces the first occurrence and ignores all others.  This is what was happening when you did a search that included two question marks – the first was being replaced with ‘%3F’ and the second stayed as a regular question mark.  When the browser then tried to load the URL it found a regular question mark and cut off everything after it.  This is why a search for ‘sc?’ was being performed and it’s also why all searches ended up as quick searches – the rest of the content in the URL after the second question mark was being ignored, which included details of what type of search to run.
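
The behaviour is easy to demonstrate in a browser console:

// Only the FIRST question mark is replaced:
'sc??m'.replace('?', '%3F');      // "sc%3F?m"

// A global regular expression (or replaceAll in newer browsers) catches them all:
'sc??m'.replace(/\?/g, '%3F');    // "sc%3F%3Fm"
'sc??m'.replaceAll('?', '%3F');   // "sc%3F%3Fm"

// encodeURIComponent escapes ? along with every other URL-significant character:
encodeURIComponent('sc??m');      // "sc%3F%3Fm"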

A second thing was causing further problems: a quick search by default performs an exact-match search (surrounded by double quotes) if you ignore the dropdown suggestions and press the search button.  But an exact match was set up to be just that – single wildcard characters were not being treated as wildcards, meaning a search for “sc??m” was looking for exactly that and finding nothing.  I’ve fixed this now, allowing single-character wildcards to appear within an exact search.

After fixing this we realised that the new site’s use of the asterisk wildcard didn’t match its use in the live site.  Rhona had expected a search such as ‘sc*m’ to work on the new site, returning all headwords beginning ‘sc’ and ending in ‘m’.  However, in the new site the asterisk wildcard only matches the beginning or end of words, e.g. ‘wor*’ finds all words beginning with ‘wor’ and ‘*ord’ finds all words ending with ‘ord’.  You can combine the two with a Boolean search, though: ‘sc* AND *m’ works in exactly the same way as ‘sc*m’.

However, I decided to enable the mid-asterisk search on the new site in addition to the Boolean AND approach, partly because it’s better to be consistent with the old site and partly because I discovered that the full-text search in the new site already allows mid-asterisk searches.  I therefore spent a bit of time implementing the mid-asterisk search in the drop-down list of options in the quick search box, the main quick search and the advanced search headword search.
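
Spotting a mid-asterisk term is simple enough; a sketch (the real quick-search code is rather more involved than this):

// An asterisk anywhere other than the first or last character counts as
// a mid-asterisk search, e.g. 'sc*m', but not 'wor*' or '*ord'
function isMidAsteriskSearch(term) {
  return term.slice(1, -1).includes('*');
}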

Rhona then spotted that a full-text mid-asterisk search was listing results alphabetically rather than by relevance.  I looked into this and it seems to be a limitation with that sort of wildcard search in the Solr search engine.  If you look here https://solr.apache.org/guide/8_7/the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser the penultimate bullet point says “Range queries (“[a TO z]”), prefix queries (“a*”), and wildcard queries (“a*b”) are constant-scoring (all matching documents get an equal score).”

I’m guessing the original API that powers the live site uses Lucene rather than Solr’s indexing system, but I don’t know for certain.  Also, while the live site’s ordering of mid-asterisk wildcard searches is definitely not alphabetical, it doesn’t really seem to be ordering properly by relevance either.  I’m afraid we might just have to live with alphabetical ordering for mid-asterisk search results, and I’ll alter the ‘Results are ordered’ statement in such cases to make it clearer that the ordering is alphabetical.

My final DSL tasks for the week were to make some tweaks to the XSLT that processes the layout of bibliographical entries.  This involved fixing the size of author names, ensuring that multiple authors are handled correctly and adding in editors’ names for SND items.  I also spotted a few layout issues that are still cropping up: the order of some elements is displayed incorrectly, and some individual <bibl> items have multiple titles, which the stylesheet isn’t expecting, so only the first ones are displayed.  I think I may need to completely rewrite the stylesheet to fix these issues.  As there were lots of rules for arranging the bibliography I wrote the stylesheet to pick out and display specific elements rather than straightforwardly going through the XML and transforming each XML tag into a corresponding HTML tag.  This meant I could ensure (for example) that authors always appear first and titles each get indented, but it is rather rigid – any content that isn’t structured as the stylesheet expects may get displayed in the wrong place or not at all (like the unexpected second titles).  I’m afraid I’m not going to have time to rewrite the stylesheet before the launch of the new site next week, so this update will need to be added to the list of things to do for a future release.

Also this week I fixed an issue with the Historical Thesaurus which involved shifting a category and its children one level up and helped sort out an issue with an email address for a project using a top-level ‘ac.uk’ domain.  Next week I’ll hopefully launch the new version of the DSL on Tuesday and press on with the outstanding Speak For Yersel exercises.

Week Beginning 11th April 2022

I was back at work on Monday this week after a lovely week off last week.  It was only a four-day week, however, as the week ended with the Good Friday holiday, and I’ll be off next Monday too.  I had rather a lot to squeeze into the four working days.  For the DSL I did some further troubleshooting for integrating Google Analytics with the DSL’s new https://macwordle.co.uk/ site.  I also had discussions about the upcoming switchover to the new DSL website, which we scheduled for the week after next, although later in the week it turned out that all of the data had already been finalised, so I’ll begin processing it next week.

I participated in a meeting for the Historical Thesaurus on Tuesday, after which I investigated the server stats for the site, which needed fixing.  I also enquired about setting up a domain URL for one of the ‘ac.uk’ sites we host, and it turned out to be something that IT Support could set up really quickly, which is good to know for future reference.  I also had a chat with Craig Lamont about a database / timeline / map interface for some data for the Allan Ramsay project, which he would like me to put together to coincide with a book launch at the end of May.  Unfortunately they want this to be part of the University’s T4 website, which makes development somewhat tricky but not impossible.  I had to spend some time familiarising myself with T4 again and arranging for access to the part of the system where the Ramsay content resides.  Now that I have this sorted I’ve agreed to look into developing this in early May.  I also deleted a couple of unnecessary entries from the Anglo-Norman Dictionary after the editor requested their removal, and created a new version of the requirements document for the front-end for the Books and Borrowing project following feedback from the project team on the previous version.

The rest of my week was spent on the Speak For Yersel project, for which I still have an awful lot to do and not much time to do it in.  I had a meeting with the team on Monday to go over some recent developments, and following that I tracked down a few bugs in the existing code (e.g. a couple of ‘undefined’ buttons in the ‘explore’ maps).  I then replaced all of the audio files in the ‘click’ exercise, as the team had decided to use a standardised sentence spoken by many different regional speakers rather than having different speakers saying different things.  As the speakers were not always from the same region as the previous audio clips I needed to change the ‘correct’ regions and also regenerate the MP3 files and transcript data.

I then moved onto a major update to the system: working on the back end.  This took up the rest of the week, and although in terms of the interface nothing much should have changed, behind the scenes things are very different.  I designed and implemented the database that will hold all of the data for the project, including information on respondents, answers and geographical areas, and migrated all of the activity and question data to this database too.  This was a somewhat time-consuming and tedious task, as I needed to input every question and every answer option into the database, but it needed to be done.  If we didn’t have the questions and answer options in the database alongside the answers then it would be rather tricky to analyse the data when the time comes; this way everything is stored in one place and is all interconnected.  Previously the questions were held as JSON data within the JavaScript code for the site, but this was not ideal for the above reason and also because it made updating and manually accessing the question data a bit tricky.

With the new, much tidier arrangement all of the data is stored in a database on the server and the JavaScript code requests the data for an activity when the user loads the activity’s page.  All answer choices and transcript sections also now have their own IDs, which is what we need for recording which specific answer a user has selected.  For example, for the question with ID 10, if the user selects ‘bairn’ then answer ID 36 will be logged for that user.  I’ve set up the database structure to hold these answers and have populated the postcode area table with all of the GeoJSON data for each area.
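
A minimal sketch of the new arrangement (the endpoint and the shape of the returned data are hypothetical):

// Request the question data for an activity when its page loads.
// '/api/activities/' is an illustrative endpoint name.
async function loadActivity(activityId) {
  const response = await fetch('/api/activities/' + activityId);
  const activity = await response.json();
  // e.g. { questions: [ { id: 10, options: [ { id: 36, text: 'bairn' }, ... ] }, ... ] }
  return activity;
}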

The next step will be to populate the table holding specific locations within a postcode area once this data is available.  After that I’ll be able to create the user information form, and then I’ll need to update the activities so the selected options are actually saved.  In the meantime I began to implement the user management system.  A user icon now appears in the top right of every page, either with a green background and a tick if you’ve registered or a red background and a cross if you haven’t.  I haven’t created the registration form yet, but have just included a button to register; when you press this you’ll be registered, and this will be remembered in your browser even if you close your browser or turn your device off.  Pressing on the green tick user icon lets you view the details recorded about the registered person (none yet) and offers an option to sign out if this isn’t you or you want to clear your details.  If you’re not registered and you try to access the activities the page will redirect you to the registration form, as we don’t want unregistered people completing the activities.  I’ll continue with this next week, hopefully getting to the point where the choices a user makes are actually logged in the database.  After that I’ll be able to generate maps with real data, which will be an important step.
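
The gatekeeping can be sketched as follows (the storage key and redirect path are illustrative, matching the registration sketch earlier):

// Activity pages check localStorage for a registered user and bounce
// unregistered visitors to the registration form.
function requireRegistration() {
  const stored = localStorage.getItem('sfyUser');
  if (!stored) {
    window.location.href = '/register';   // illustrative path
    return null;
  }
  return JSON.parse(stored);
}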

Week Beginning 28th March 2022

I was on strike last week, and I’m going to be on holiday next week, so I had a lot to try and cram into this week.  This was made slightly harder when my son tested positive for Covid again on Tuesday evening.  It’s his fourth time, and the last bout was only seven weeks ago.  Thankfully he wasn’t especially ill, but he was off school from Wednesday onwards.

I worked on several different projects this week.  For the Books and Borrowing project I updated the front-end requirements document based on my discussions with the PI and Co-I and sent it on for the rest of the team to give feedback on.  I uploaded a new batch of register images from St Andrews (more than 2,500 page images taking up about 50Gb) and created all of the necessary register and page records, and did the same for a couple of smaller registers from Glasgow.  I also exported spreadsheets of authors, edition formats and edition languages for the team to edit.

For the Anglo-Norman Dictionary I fixed an issue with the advanced search for citations, where entries with multiple citations were having the same date and reference displayed for each snippet rather than the individual dates and references.  I also updated the display of snippets in the search results so they appear in date order.

I also responded to an email from editor Heather Pagan about how language tags are used in the AND XML.  There are 491 entries that have a language tag, and I wrote a little script to list the distinct languages and a count of the number of times each appears.  Here’s the output (bearing in mind that an entry may have multiple language tags):

[Latin] => 79
[M.E.] => 369
[Dutch] => 3
[Arabic] => 12
[Hebrew] => 20
[M.L.] => 4
[Greek] => 2
[A.F._and_M.E.] => 3
[Irish] => 2
[M.E._and_A.F.] => 8
[A-S.] => 3
[Gascon] => 1

There seem to be two ways the language tag appears: within a sense, in which case it appears in the entry, e.g. https://anglo-norman.net/entry/Scotland; and within <head>, in which case it doesn’t currently seem to get displayed.  E.g. https://anglo-norman.net/entry/ganeir has:

<head>
  <language lang="M.E."/>
  <lemma>ganeir</lemma>
</head>

But ‘M.E.’ doesn’t appear anywhere.  I could probably write another little script that moves the language tag into <head> as above, and then update the XSLT so that this type of language tag gets displayed.  Or I could update the XSLT first so we can see how it might look with entries that already have this structure.  I’ll need to hear back from Heather before I do more.

For the Dictionaries of the Scots Language I spent quite a bit of time working with the XSLT for the display of bibliographies.  There are quite a lot of different structures for bibliographical entries, sometimes where the structure of the XML is the same but a different layout is required, so it proved to be rather tricky to get things looking right.  By the end of the week I think I had got everything to display as requested, but I’ll need to see if the team discover any further quirks.

I also wrote a script that extracts citations and their dates from DSL entries.  I created a new citations table that stores the dates, the quotes and the associated entry and bibliography IDs; the table has 747,868 rows in it.  Eventually we’ll be able to use this table for some kind of date search, plus there’s now an easy-to-access record of all of the bib IDs for each entry / entry IDs for each bib, so displaying lists of entries associated with each bibliography should also be straightforward when the time comes.  I also added new firstdate and lastdate columns to the entry table, picking out the earliest and latest date associated with each entry and storing these.  This means we can add first dates to the browse, something I decided to do for test purposes later in the week.
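
Deriving the first and last dates is then a simple pass over each entry’s citations.  In outline (the field names are illustrative, and the real processing runs server-side):

// Pick out the earliest and latest citation year for an entry.
// Some entries have no usable dates, hence the null return.
function dateRange(citations) {
  const years = citations.map(c => c.year).filter(y => y != null);
  if (years.length === 0) return null;
  return { firstdate: Math.min(...years), lastdate: Math.max(...years) };
}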

I added the first recorded date (the display version not the machine readable version) to the ‘browse’ for DOST and SND.  The dates are right-aligned and grey to make them stand out less than the main browse label.  This does however make the date of the currently selected entry in the browse a little hard to read.  Not all entries have dates available.  Any that don’t are entries where either the new date attributes haven’t been applied or haven’t worked.  This is really just a proof of concept and I will remove the dates from the browse before we go live, as we’re not going to do anything with the new date information until a later point.

I also processed the ‘History of Scots’ ancillary pages.  Someone had gone through these to add in links to entries (hundreds of links), but unfortunately they hadn’t got the structure quite right.  The links had been added in Word, meaning regular double quotes had been converted into curly quotes, which are not valid HTML.  Also, the links only included the entry ID rather than the path to the entry page.  A couple of quick ‘find and replace’ jobs fixed these issues, but I also needed to update the API to allow old DSL IDs to be passed without also specifying the source.  I also set up a Google Analytics account for the DSL’s version of Wordle (https://macwordle.co.uk/).
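
Sketched as regular expressions, the two fixes look something like this (the entry-ID pattern and target path are illustrative guesses rather than the DSL’s actual scheme):

const fixed = html
  // Word’s curly double quotes are not valid HTML attribute delimiters
  .replace(/[\u201C\u201D]/g, '"')
  // Expand bare entry IDs into full entry-page paths
  .replace(/href="([^"\/]+)"/g, 'href="/entry/$1"');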

For the Speak For Yersel project I had a meeting with Mary on Thursday to discuss some new exercises that I’ll need to create.  I also spent some time creating the ‘Sounds about right’ activity.  This had a slightly different structure to other activities in that the questionnaire has three parts, each with its own introduction.  This required some major reworking of the code, as things like the question numbering and the progress bar relied on there being one block of questions with no non-question screens in between.  The activity also featured a new question type with multiple sound clips.  I had to process these (converting them from WAV to MP3) and then figure out how to embed them in the questions.
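
Embedding the clips themselves is straightforward with HTML5 audio; something along these lines (the path is illustrative):

// Add a playable sound clip to a question's container element
function addClip(container, src) {
  const audio = document.createElement('audio');
  audio.controls = true;
  audio.src = src;   // e.g. '/audio/sounds-about-right/clip-03.mp3'
  container.appendChild(audio);
}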

Finally, for the Speech Star project I updated the extIPA chart to improve the layout of the playback speed options.  I also made the page remember the speed selection between opening videos – so if you want to view them all ‘slow’ then you don’t need to keep selecting ‘slow’ each time you open one.  I also updated the chart to provide an option to switch between MRI and animation videos and added in two animation MP4s that Eleanor had supplied me with.  I then added the speed selector to the Normalised Speech Database video popups and then created a new ‘Disordered Paediatric Speech Database’, featuring many videos, filters to limit the display of data and the video speed selector.  It was quite a rush to get this finished by the end of the week, but I managed it.
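
Remembering the speed is just a matter of storing the selection once and re-applying it whenever a video opens.  In outline (the names and the 0.5 rate are illustrative):

// Remember the chosen playback speed until the page is reloaded
let selectedSpeed = 'normal';

function onSpeedChange(speed) {
  selectedSpeed = speed;   // e.g. 'slow'
}

// Re-apply the remembered choice each time a popup video is opened
function openVideo(video) {
  video.playbackRate = (selectedSpeed === 'slow') ? 0.5 : 1.0;
  video.play();
}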

I will be on holiday next week so there will be no post from me then.