Week Beginning 14th March 2022

With the help of Raymond at Arts IT Support we migrated the test version of the DSL website to the new server this week, and also set up the Solr free-text indexes for the new DSL data.  This test version of the site will become the live version when we’re ready to launch it in April.  The migration went pretty smoothly, although I did encounter an error with the htaccess script that processes URLs for dictionary pages: underscores didn’t need to be escaped on the old server but require a backslash as an escape character on the new one.

I also replaced the test version’s WordPress database with a copy of the live site’s, and copied over some of the customisations from the live site, such as changes to logos and the content of the header and footer.  This brings the test version’s ancillary content and design into alignment with the live site whilst retaining some of the additional tweaks I’d made to the test site (e.g. the option to hide the ‘browse’ column and the ‘about this entry’ box).

One change to the structure of the DSL data that has been implemented is that dates are now machine readable, with ‘from’, ‘to’ and ‘prefix’ attributes.  I had started to look at extracting these for use in the site (e.g. maybe displaying the earliest citation date alongside the headword in the ‘browse’ lists) when I spotted an issue with the data: rather than having a date in the ‘to’ attribute, some entries had an error code.  For example, 6,278 entries feature a date with ‘PROBLEM6’ as the ‘to’ attribute.  I flagged this up with the DSL people and after some investigation they figured out that the date processing script wasn’t expecting to find a circa in a date ending a range (e.g. c1500-c1512) and was outputting an error code instead of a date when it encountered one (there’s a quick sketch of the kind of parsing involved below).  The DSL people were able to fix the issue and a new data export was prepared, although I won’t be using it just yet, as they will be sending me a further update before we go live and to save time I decided to wait for that.  I also completed work on the XSLT for displaying bibliography entries and created a new ‘versions and changes’ page, linking to it from a statement in the footer that notes the data version number.
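
Just to illustrate the circa-at-the-end-of-a-range case, below is a hypothetical sketch of the kind of parsing involved.  This is not the DSL’s actual processing script (I haven’t seen that); the ‘from’, ‘to’ and ‘prefix’ names come from the new attributes and everything else is assumed.

```typescript
// Hypothetical sketch: parse a citation date such as "c1500-c1512" into
// the new machine-readable attributes.  Not the DSL's actual processing code.
interface ParsedDate {
  from: number;
  to: number;
  prefix?: string; // e.g. "c" for circa
}

function parseDateRange(raw: string): ParsedDate {
  // Each part may carry its own circa marker, e.g. "c1500" or "1512"
  const partPattern = /^(c?)(\d{3,4})$/;
  const [rawFrom, rawTo] = raw.split("-");

  const fromMatch = rawFrom.match(partPattern);
  if (!fromMatch) throw new Error(`Unparseable date: ${raw}`);
  const from = parseInt(fromMatch[2], 10);

  let to = from;
  if (rawTo !== undefined) {
    // The reported bug: a circa on the closing date ("c1512") wasn't expected,
    // so ranges like "c1500-c1512" produced an error code instead of a date.
    const toMatch = rawTo.match(partPattern);
    if (!toMatch) throw new Error(`Unparseable date: ${raw}`);
    to = parseInt(toMatch[2], 10);
  }

  return { from, to, prefix: fromMatch[1] || undefined };
}

console.log(parseDateRange("c1500-c1512")); // { from: 1500, to: 1512, prefix: "c" }
```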

For the ‘Speak For Yersel’ project I made a number of requested updates to the exercises that I’d previously created.  I added a border around the selected answer and ensured the active state of a selected button doesn’t persist, and I added handy ‘skip to quiz’ and ‘skip to explore’ links underneath the grammar and lexical quizzes so we don’t have to click through all those questions to check out the other parts of the exercise.  I italicised ‘you’ and ‘others’ on the activity index pages and fixed a couple of bugs on the grammar questionnaire.  Previously only the map rolled up, which caused an issue if an answer was pressed whilst the map was still animating.  Now the entire question area animates, so it’s impossible to press an answer when the map isn’t available (see the sketch below).  I also updated the quiz questions so they now have the same layout as the questionnaire, with options on the left and the map on the right, and made all of the maps taller to see how this works.
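
As a rough illustration of the questionnaire fix, the idea is simply that answer presses are ignored until the animation has finished.  This is a hypothetical sketch rather than the project’s actual code, and the element IDs and class names are made up.

```typescript
// Hypothetical sketch of the questionnaire fix: the whole question area
// animates between questions, and answer presses are ignored until the
// animation has completed.  Element IDs and class names are invented.
const questionArea = document.getElementById("question-area") as HTMLElement;
let animating = false;

function advanceToNextQuestion(): void {
  animating = true;
  questionArea.classList.add("rolled-up"); // CSS transition rolls the whole area up
  questionArea.addEventListener(
    "transitionend",
    () => {
      // ...swap in the next question's content and map here...
      questionArea.classList.remove("rolled-up");
      animating = false; // answers become pressable again
    },
    { once: true }
  );
}

document.querySelectorAll<HTMLButtonElement>(".answer-btn").forEach((btn) => {
  btn.addEventListener("click", () => {
    if (animating) return; // previously a press at this point caused the bug
    // ...record the answer and display the map here...
    advanceToNextQuestion();
  });
});
```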

For the ‘Who says what where’ exercise the full sentence text is now included and I made the page scroll to the top of the map if this isn’t visible when you press on an item.  I also updated the map and rating colours, although there is still just one placeholder map that loads, so the lexical quiz with its many possible options doesn’t yet have a map that represents its data.  The map still needs some work, e.g. adding in a legend and popups.  I also made all of the requested changes to the lexical question wording, made the ‘v4’ click activity the only version (accessible via the activities menu) and updated the colours for the correct and incorrect click answers.

For the Books and Borrowing project I completed a first version of the requirements for the public website, which has taken a lot of time and a lot of thought to put together, resulting in a document that’s more than 5,000 words long.  On Friday I had a meeting with PI Katie and Co-I Matt to discuss the document.  We spent an hour going through it and a list of questions I’d compiled whilst writing it, and I’ll need to make some modifications to the document based on our discussions.  I also downloaded images of more library registers from St Andrews and one further register from Glasgow, which I will need to process when I’m back at work.

I also spent a bit of time writing a script to export a flat CSV version of the Historical Thesaurus, then made some updates based on feedback from the HT team before exporting a further version.  We also spotted that adjectives of ‘parts of insects’ appeared to be missing from the website, and I investigated what was going on.  It turned out that an empty main category was missing; as all of the other data was held in subcategories, and subcategories need a main category to hang off, none of it appeared.  After adding in the main category all of the data was restored.

Finally, I did a bit of work for the Speech Star project.  Firstly, I fixed a couple of layout issues with the ExtIPA chart symbols.  There was an issue with the diacritics for the symbol that looks like a theta, which were being offset.  I reduced the size of the symbol slightly and adjusted the margins of the symbols above and below, and this seems to have done the trick.  In addition, I did a little bit of research into setting the playback speed, and it looks like this will be pretty easy to do whilst still using the default video player.  See this page: https://stackoverflow.com/questions/3027707/how-to-change-the-playing-speed-of-videos-in-html5.  I added a speed switcher to the popup as a little test to see how it works.  The design would still need some work (buttons with the active option highlighted) but it’s good to have a proof of concept.  Pressing ‘normal’ or ‘slow’ sets the speed for the current video in the popup and works both when the video is playing and when it’s stopped.
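
The speed switcher relies on the standard playbackRate property of the HTML5 video element, as discussed in the Stack Overflow thread above.  The snippet below is a minimal sketch of the approach; the button IDs and the 0.5 ‘slow’ rate are placeholder choices rather than anything final.

```typescript
// Minimal sketch of the playback speed switcher using the default HTML5 player.
// The 'slow' rate of 0.5 and the element IDs are placeholder choices.
const video = document.querySelector<HTMLVideoElement>("#star-video")!;

function setSpeed(rate: number): void {
  // playbackRate takes effect whether the video is currently playing or paused
  video.playbackRate = rate;
}

document.getElementById("speed-normal")?.addEventListener("click", () => setSpeed(1));
document.getElementById("speed-slow")?.addEventListener("click", () => setSpeed(0.5));
```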

Also, I was sure that jumping to points in the videos wasn’t working before, but it seems to work fine now: you can click and drag the progress bar and the video jumps to the required point, either when playing or paused.  I wonder if there was something in the codec that was previously being used that prevented this.  So fingers crossed we’ll be able to just use the standard HTML5 video player to achieve everything the project requires.

I’ll be participating in the UCU strike action for all of next week, so it will be the week beginning the 28th of March before I’m back at work again.

Week Beginning 7th March 2022

This was my first five-day week after the recent UCU strike action and it was pretty full-on, involving many different projects.  I spent about a day working on the Speak For Yersel project.  I added in the content for all 32 ‘I would never say that’ questions and completed work on the new ‘Give your word’ lexical activity, which features a further 30 questions of several types.  This includes questions that have associated images and questions where multiple answers can be selected.  For the latter, no more than three answers can be selected and the question type needs to be handled differently, as we don’t want the map to load as soon as one answer is selected.  Instead the user can select and deselect answers, and if at least one answer is selected a ‘Continue’ button appears under the question.  When you press on this the answers become read only and the map appears.  If three options are already selected you need to deselect one before you can add another (see the sketch below).  I think we’ll need to look into the styling of the buttons, though, as currently ‘active’ (when a button is hovered over or has been pressed and nothing else has yet been pressed) is the same colour as ‘selected’.  So if you select ‘ginger’ then deselect it, the button still looks selected until you press somewhere else, which is confusing.  Also, if you press a fourth button it looks like it has been selected when in actual fact it’s just ‘active’ and isn’t really selected.
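
For reference, here’s a rough sketch of how the ‘no more than three’ rule and the ‘Continue’ button hang together.  The class names and IDs are placeholders and this isn’t the project’s actual code; the blur() call at the end is just one possible way of tackling the ‘active’ versus ‘selected’ confusion mentioned above.

```typescript
// Sketch of the multi-select question type: up to three answers can be
// selected, and a 'Continue' button appears once at least one is chosen.
// Class names and element IDs are placeholders.
const answerButtons = Array.from(
  document.querySelectorAll<HTMLButtonElement>(".lexical-answer")
);
const continueBtn = document.getElementById("continue-btn") as HTMLButtonElement;
const selected = new Set<HTMLButtonElement>();

answerButtons.forEach((btn) => {
  btn.addEventListener("click", () => {
    if (selected.has(btn)) {
      selected.delete(btn); // deselect
      btn.classList.remove("selected");
    } else if (selected.size < 3) {
      selected.add(btn); // select, but never more than three
      btn.classList.add("selected");
    }
    // The map should only load after 'Continue' is pressed, so here we
    // just toggle the button's visibility.
    continueBtn.style.display = selected.size > 0 ? "inline-block" : "none";
    btn.blur(); // drop the 'active' state so it isn't confused with 'selected'
  });
});

continueBtn.addEventListener("click", () => {
  answerButtons.forEach((btn) => (btn.disabled = true)); // answers become read only
  // ...load the map for the selected answers here...
});
```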

I also spent about a day continuing to work on the requirements document for the Books and Borrowing project.  I haven’t quite finished this initial version of the document but I’ve made good progress and I aim to have it completed next week.  Also for the project, I participated in a Zoom call with RA Alex Deans and NLS Maps expert Chris Fleet about a subproject we’re going to develop for B&B, focusing on the Chambers Library in Edinburgh.  This will feature a map-based interface showing where the borrowers lived and will use a historical map layer for the centre of Edinburgh.

Chris also talked about a couple of projects at the NLS that were very useful to see.  The first was the Jamaica journal of Alexander Innes (https://geo.nls.uk/maps/innes/), which features journal entries plotted on a historical map and a slider allowing you to quickly move through the entries.  The second was the Stevenson maps of Scotland (https://maps.nls.uk/projects/stevenson/), which provides options to select different subjects and date periods.  He also mentioned a new crowdsourcing project to transcribe all of the names on the Roy Military Survey of Scotland (1747-55) maps; this launched in February and already has 31,000 first transcriptions in place, which is great.  As with the GB1900 project, the data produced here will be hugely useful for things like place-name projects.

I also participated in a Zoom call with the Historical Thesaurus team where we discussed ongoing work.  This mainly involves a lot of manual linking of the remaining unlinked categories and looking at sensitive words and categories so there’s not much for me to do at this stage, but it was good to be kept up to date.

I continued to work on the new extIPA charts for the Speech Star project, which I had started on last week.  Last week I had some difficulties replicating the required phonetic symbols but this week Eleanor directed me to an existing site that features the extIPA chart (https://teaching.ncl.ac.uk/ipa/consonants-extra.html).  This site uses standard Unicode characters in combinations that work nicely, without requiring any additional fonts to be used.  I’ve therefore copied the relevant codes from there (this is just character codes like b̪ – I haven’t copied anything other than this from the site).  With the symbols in place I managed to complete an initial version of the chart, including pop-ups featuring all of the videos, but unfortunately the videos seem to have been encoded in a format that requires QuickTime for playback.  So although the videos are MP4 they’re not playing properly in browsers on my Windows PC – instead all I can hear is the audio.  It’s very odd as the videos play fine directly from Windows Explorer, but in Firefox, Chrome or MS Edge I just get audio and the static ‘poster’ image.  When I access the site on my iPad the videos play fine (as QuickTime is an Apple product).  Eleanor is still looking into re-encoding the videos and will hopefully get updated versions to me next week.

I also did a bit more work for the Anglo-Norman Dictionary this week.  I fixed a couple of minor issues with the DTD.  For example, the ‘protect’ attribute was an enumerated list that could either be ‘yes’ or ‘no’, but for some entries the attribute was present but empty, which was against the rules.  I looked into whether an enumerated list could also include an empty option (as opposed to the attribute not being present, which is a different matter) but it looks like this is not possible (see for example http://lists.xml.org/archives/xml-dev/200309/msg00129.html).  What I did instead was change the ‘protect’ attribute from an enumerated list with options ‘yes’ and ‘no’ to a regular data field, meaning the attribute can now contain anything (including being empty).  The ‘protect’ attribute is a hangover from the old system and doesn’t do anything whatsoever in the new system, so it shouldn’t really matter, and it does mean that the XML files should now validate.

The AND people also noticed that some entries that are present in the old version of the site are missing from the new version.  I looked through the database and older versions of the data from the new site, and it looks like these entries have never been present in the new site.  The script I originally ran to export the entries from the old site used a list of headwords taken from another dataset (I can’t remember exactly where from), and I can only assume that this list was missing some headwords, which is why these entries are not in the new site.  This is a bit concerning, but thankfully the old site is still accessible.  I managed to write a little script that grabs the entire contents of the browse list from the old website, separating it into two lists, one for main entries and one for xrefs.  I then ran each headword against a local version of the current AND database, separating out homonym numbers and then comparing the headword with the ‘lemma’ field in the database and the hom with the hom (see the sketch below).  Initially I ran the main and xref queries separately, comparing main to main and xref to xref, but I realised that some entries had changed types (legitimately so, I guess) so I stopped making the distinction.
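
The comparison itself is straightforward once the homonym numbers are separated out.  The sketch below shows the general shape of it, working over plain in-memory lists rather than the actual scraped browse list and database, and assuming homonym numbers appear as a trailing number on the headword, which is a simplification.

```typescript
// Sketch of the missing-entry check: split a scraped headword into lemma and
// homonym number, then see whether the pair exists in the current database.
// Data here is in-memory and the trailing-number homonym parsing is a
// simplification; the real script scraped the old site's browse list and
// queried the AND database.
interface DbEntry {
  lemma: string;
  hom: number; // 0 where there is no homonym number
}

function splitHeadword(headword: string): DbEntry {
  const match = headword.trim().match(/^(.*?)\s*(\d+)?$/);
  if (!match) return { lemma: headword.trim(), hom: 0 };
  return {
    lemma: match[1].trim(),
    hom: match[2] ? parseInt(match[2], 10) : 0,
  };
}

function findMissing(oldHeadwords: string[], dbEntries: DbEntry[]): string[] {
  const known = new Set(dbEntries.map((e) => `${e.lemma}#${e.hom}`));
  return oldHeadwords.filter((hw) => {
    const { lemma, hom } = splitHeadword(hw);
    return !known.has(`${lemma}#${hom}`);
  });
}

// Example with made-up data: 'sage 2' is reported missing if the database
// only holds 'sage' with homonym number 1.
console.log(findMissing(["sage 1", "sage 2"], [{ lemma: "sage", hom: 1 }]));
```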

The script output 1,540 missing entries.  This initially looks pretty horrifying, but I’m fairly certain most of them are legitimate.  There are a whole bunch of weird ‘n’ forms in the old site that have a strange character (e.g. ‘nun⋮abilité’) that are not found in the new site, I guess intentionally so.  Also, there are lots of ‘S’ and ‘R’ words, but I think most of these are down to homonyms being joined or split.  Geert, the editor, looked through the output and thankfully it turns out that only a handful of entries are actually missing, and that these were also missing from the old DMS version of the data, so their omission occurred before I became involved in the project.

Finally this week I worked with a new dataset of the Dictionaries of the Scots Language.  I successfully imported the new data and have set up a new ‘dps-v2’ API.  There are 80,319 entries in the new data compared to 80,432 in the previous output from DPS.  I have updated our test site to use the new API and its new data, although I have not yet been able to set up the free-text data in Solr, so the advanced search for full text / quotations only will not work yet.  Everything else should, though.

Also today I began to work on the layout of the bibliography page.  I have completed the display of DOST bibs, including the ‘style guide’ link when a note is present, but haven’t started on SND yet.  I think we may still need to tweak the layout, however.  I’ll continue to work with the new data next week.

Week Beginning 28th February 2022

I participated in the UCU strike action from Monday to Wednesday this week, making it a two-day week for me.  I’d heard earlier in the week that the paper I’d submitted about the redevelopment of the Anglo-Norman Dictionary had been accepted for DH2022 in Tokyo, which was great.  However, the organisers have decided to make the conference online-only, which is disappointing, although probably for the best given the current geopolitical uncertainty.  I didn’t want to participate in an online-only event that would be taking place in Tokyo time (nine hours ahead of the UK) so I’ve asked to withdraw my paper.

On Thursday I had a meeting with the Speak For Yersel project to discuss the content that the team have prepared and what I’ll need to work on next.  I also spent a bit of time looking into creating a geographical word cloud, which would fit word cloud output into a GeoJSON polygon shape.  I found one possible solution here: https://npm.io/package/maptowordcloud but I haven’t managed to make it work yet.

I also received a new set of videos for the Speech Star project, relating to the extIPA consonants, and I began looking into how to present these.  This was complicated by the extIPA symbols not being standard Unicode characters.  I did a bit of research into how these could be presented and found this site: http://www.wazu.jp/gallery/Test_IPA.html#ExtIPAChart, but here the marks appear to the right of the main symbol rather than directly above or below.  I contacted Eleanor to see if she had any other ideas and she got back to me with some alternatives which I’ll need to look into next week.

I spent a bit of time working for the DSL this week too, looking into a question about Google Analytics from Pauline Graham (and finding this very handy suite of free courses on how to interpret Google Analytics: https://analytics.google.com/analytics/academy/).  The DSL people had also wanted me to look into creating a Levenshtein distance option, whereby words that are spelled similarly to an entered term are given as suggestions, in a similar way to this page: http://chrisgilmour.co.uk/scots/levensht.php?search=drech.  I created a test script that allows you to enter a term and view the SND headwords that have a Levenshtein distance of two or less from it, with any headwords at a distance of one highlighted in bold (see the sketch below).  However, Levenshtein is a bit of a blunt tool, and as it stands I’m not sure the results of the script are all that promising.  My test term ‘drech’ brings back 84 matches, including things like ‘french’, which is unfortunately only two edits away from ‘drech’.  I’m fairly certain my script is using the same algorithm as the site linked to above; it’s just that we have a lot more possible matches.  However, this is just a simple Levenshtein test.  We could also add in further tests to limit (or expand) the output, such as a rule that changes vowels in certain places (as in the ‘a’ becomes ‘ai’ example suggested by Rhona at our meeting last week), or we could limit the output to words beginning with the same letter.
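
The test script is essentially a standard dynamic-programming Levenshtein implementation run over the SND headword list.  The sketch below shows the idea; the headword list here is a placeholder array, whereas the real script reads the headwords from the database.

```typescript
// Sketch of the Levenshtein suggestion test: return headwords within a
// distance of 2 from the search term, flagging those at distance 1.
// The headword list is a placeholder; the real script reads SND headwords
// from the database.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

const headwords = ["dreich", "dreck", "driech", "french"]; // placeholder list

function suggest(term: string): { headword: string; close: boolean }[] {
  return headwords
    .map((hw) => ({ headword: hw, distance: levenshtein(term.toLowerCase(), hw.toLowerCase()) }))
    .filter((r) => r.distance <= 2)
    .map((r) => ({ headword: r.headword, close: r.distance === 1 })); // 'close' = shown in bold
}

console.log(suggest("drech")); // 'french' sneaks in at distance 2, as noted above
```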

Also this week I had a chat with the Historical Thesaurus people, arranging a meeting for next week and exporting a recent version of the database for them to use offline.  I also tweaked a couple of entries for the AND and spent an hour or so upgrading all of the WordPress sites I manage to the latest WordPress version.

Week Beginning 21st February 2022

I participated in the UCU strike action for all of last week and on Monday and Tuesday this week.  I divided the remaining three days between three projects:  the Anglo-Norman Dictionary, the Dictionaries of the Scots Language and Books and Borrowing.

For AND I continued to work on the publication of a major update to the letter S.  I had deleted all of the existing S entries and imported all of the new data into our test instance the week before the strike, giving the editors time to check through it all and work on the new data via the content management system of the test instance.  They had noticed that some of the older entries hadn’t been deleted, which was causing some new entries not to be displayed (as both the old and new entries had the same ‘slug’, so the older entry was still getting picked up when the entry’s page was loaded).  It turned out that I had forgotten that not all S entries actually have a headword beginning with ‘s’: there are some that have brackets and square brackets.  There were 119 of these entries still left in the system and I updated my deletion scripts to remove them, ensuring that only the older versions and not the new ones were removed.  This fixed the issue with new entries not appearing.  With the task completed and the data approved by the editors, we replaced the live data with the data from the test instance.

The update involved 2,480 ‘main’ entries, containing 4,109 main senses, 1,295 subsenses, 2,627 locutions, 1,753 locution senses, 204 locution subsenses and 16,450 citations.  In addition, 4,623 ‘xref’ entries were created or updated.  I also created a link checker which goes through every entry, pulls out all cross references from anywhere in the entry’s XML and checks whether each cross-referenced entry actually exists in the system (see the sketch below).  The vast majority of links were working fine, but there were still a substantial number that were broken (around 800).  I’ve passed a list of these over to the editors, who will need to manually fix the entries over time.
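
The link checker simply walks every entry’s XML, collects the targets of any cross-references and checks each one against the entries that exist in the system.  Below is a rough sketch of the shape of it; the cross-reference markup (a ‘target’ attribute) and the data are assumptions for illustration only, not the actual AND markup or code.

```typescript
// Rough sketch of the cross-reference checker.  The real script works over
// the AND entry XML and database; the "target" attribute name and the data
// below are assumptions for illustration only.
interface Entry {
  slug: string;
  xml: string;
}

function findBrokenLinks(entries: Entry[]): { from: string; to: string }[] {
  const existing = new Set(entries.map((e) => e.slug));
  const broken: { from: string; to: string }[] = [];

  for (const entry of entries) {
    // Pull out every cross-reference target anywhere in the entry's XML
    for (const match of entry.xml.matchAll(/target="([^"]+)"/g)) {
      const target = match[1];
      if (!existing.has(target)) {
        broken.push({ from: entry.slug, to: target });
      }
    }
  }
  return broken;
}

// Example with made-up data: the second entry points at a slug that doesn't exist.
console.log(
  findBrokenLinks([
    { slug: "sage_1", xml: '<entry><xref target="sage_2"/></entry>' },
    { slug: "sage_2", xml: '<entry><xref target="sauver_3"/></entry>' },
  ])
);
```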

For the DSL I had a meeting on Thursday morning with Rhona, Ann and Pauline to discuss the major update to the DSL’s data that is going to go live soon.  This involves a new batch of data exported from their new editing system that will have a variety of significant structural changes, such as a redesigned ‘head’ section, and an overhauled method of recording dates.  We will also be migrating the live site to a new server, a new API and a new Solr instance so it’s a pretty major change.  We had been planning to have all of this completed by the end of March, but due to the strike we now think it’s best to push this back to the end of April, although we may launch earlier if I manage to get all of the updates sorted before then.  Following the meeting I made a few updates to our test instance of the system (e.g. reinstating some superscript numbers from SND that we’d previously hidden) and had a further email conversation with Ann about some ancillary pages.

For the Books and Borrowing project I downloaded a new batch of images for five more registers that had been digitised for us by the NLS.  I then processed these, uploaded them to our server and generated register and page records for each page image.  I also processed the data from the Royal High School of Edinburgh that had been sent to me in a spreadsheet.  There were records from five different registers and it took quite some time to write a script that would process all of the data, including splitting up borrower and book data, generating book items where required and linking everything together so that a borrower and a book only exist once in the system even if they are associated with many borrowing records.  Thankfully I’d done this all before for previous external datasets, but the process is always different for each dataset so there was still much in the way of reworking to be done.
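
The core of the import is the deduplication step: each spreadsheet row names a borrower and a book, but both should only be created once and then reused across borrowing records.  Below is a stripped-down sketch of that pattern over a generic row structure; the field names are placeholders rather than the actual spreadsheet columns, and normalising the whitespace in names, as the sketch does, would avoid the sort of stray-space duplicates noted below.

```typescript
// Stripped-down sketch of the register import: each spreadsheet row produces
// a borrowing record, while borrowers and books are only created once and
// reused across records.  Field names are placeholders, not the real columns.
interface Row {
  borrowerName: string;
  bookTitle: string;
  borrowed: string; // date as it appears in the spreadsheet
}

interface Borrowing {
  borrowerId: number;
  bookId: number;
  borrowed: string;
}

function importRows(rows: Row[]): Borrowing[] {
  const borrowers = new Map<string, number>();
  const books = new Map<string, number>();
  const borrowings: Borrowing[] = [];

  const idFor = (map: Map<string, number>, key: string): number => {
    // Normalise whitespace so " Alexander Adam" and "Alexander Adam" match
    const normalised = key.trim().replace(/\s+/g, " ");
    if (!map.has(normalised)) map.set(normalised, map.size + 1);
    return map.get(normalised)!;
  };

  for (const row of rows) {
    borrowings.push({
      borrowerId: idFor(borrowers, row.borrowerName),
      bookId: idFor(books, row.bookTitle),
      borrowed: row.borrowed,
    });
  }
  return borrowings;
}
```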

I completed my scripts and ran them on a test instance of the database running on my local PC to start with.  When everything checked out I ran the scripts on the live server to incorporate the new register data into the main project dataset.  After completing the task there were 19,994 borrowing records across 1,438 register pages, involving 1,932 books and 2,397 borrowers.  Some tweaking of the data may be required (e.g. I noticed there are two ‘Alexander Adam’ borrowers, which seems to have occurred because there was sometimes a space character before the forename) but on the whole it’s all looking good to me.

Next week I’ll be on strike again from Monday to Wednesday.