Week Beginning 18th April 2022

I divided most of my time between the Speak For Yersel project and the Dictionaries of the Scots Language this week.  For Speak For Yersel I continued to work on the user management side of things.  I implemented the registration form (apart from the ‘where you live’ bit, which still requires data) and it all now works, uploading the user’s details to our database and saving them within the user’s browser using HTML5 Storage.  I also added in checks to ensure that a year of birth and gender are supplied.
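A minimal sketch of the browser-side part of this, assuming the localStorage flavour of HTML5 Storage — the storage key and field names here are illustrative, not the project’s actual ones:

```javascript
// Sketch of saving/reading registration details via an HTML5 Storage-style
// object (getItem/setItem). Key and field names are assumptions.

function validateDetails(details) {
  // A year of birth and gender must be supplied
  return Boolean(details.yearOfBirth) && Boolean(details.gender);
}

function registerUser(storage, details) {
  if (!validateDetails(details)) {
    return false;
  }
  // In the real site the details would also be uploaded to the database here
  storage.setItem('sfy_user', JSON.stringify(details));
  return true;
}

function getRegisteredUser(storage) {
  const raw = storage.getItem('sfy_user');
  return raw ? JSON.parse(raw) : null;
}
```

In the browser you would pass `window.localStorage` as the `storage` argument, which is what lets the details survive the browser being closed.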

I then updated all activities and quizzes so that the user’s answers are uploaded to our database, tracking the user throughout the site so we can tell which user has submitted what.  For the ‘click map’ activity I also record the latitude and longitude of the user’s markers when they check their answers.  A user can check their answers multiple times, and each time the answers will be logged, even if the user has pressed ‘view correct locations’ first.  Transcript sections and specific millisecond times are stored in our database for the main click activity now, and I’ve updated the interface for this so that the output is no longer displayed on screen.

With all of this in place I then began working on the maps, replacing the placeholder maps and their sample data with maps that use real data.  Now when a user selects an option a random location within their chosen area is generated and stored along with their answer.  As we still don’t have selectable area data at the point of registration, whenever you register with the site at the moment you are randomly assigned to one of our 411 areas, so by registering and answering some questions test data is then generated.  My first two test users were assigned areas south of Inverness and around Dunoon.
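One common way to generate a random location inside an area polygon is rejection sampling within the polygon’s bounding box, using a ray-casting point-in-polygon test.  The sketch below illustrates that general technique with a GeoJSON-style ring of `[lng, lat]` pairs — it’s not necessarily the exact method the site uses:

```javascript
// Random point inside a polygon ring via rejection sampling.
// `ring` is an array of [lng, lat] pairs (GeoJSON-style outer ring).

function pointInRing(lng, lat, ring) {
  // Standard ray-casting (even-odd) point-in-polygon test
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    if ((yi > lat) !== (yj > lat) &&
        lng < ((xj - xi) * (lat - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

function randomPointInRing(ring, rand = Math.random) {
  const xs = ring.map(p => p[0]);
  const ys = ring.map(p => p[1]);
  const minX = Math.min(...xs), maxX = Math.max(...xs);
  const minY = Math.min(...ys), maxY = Math.max(...ys);
  // Keep picking points in the bounding box until one lands inside
  for (let tries = 0; tries < 1000; tries++) {
    const lng = minX + rand() * (maxX - minX);
    const lat = minY + rand() * (maxY - minY);
    if (pointInRing(lng, lat, ring)) return [lng, lat];
  }
  return null; // give up on a degenerate polygon
}
```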

With location data now being saved for answers I then updated all of the maps on the site to remove the sample data and display the real data.  The quiz and ‘explore’ maps are not working properly yet but the general activity ones are.  I replaced the geographical areas visible on the map with those used in the click map, as requested, but removed the colours we used on the click map as they were making the markers hard to see.  Acceptability questions use the four rating colours that were used on the sample maps.  Other questions use the ‘lexical’ colours (up to 8 different ones) as specified.

The markers were very small and difficult to spot when there were so few of them, so I added a little check that alters their size depending on the number of returned markers.  If there are fewer than 100 then each marker is size 6.  If there are 100 or more then the size is 3.  Previously all markers were size 2.  I may update the marker size or put more granular size options in place in future.  The answer submitted by the current user appears on the map when they view it, which I think is nice.  There is still a lot to do, though.  I still need to implement a legend for the map so you can actually tell which coloured marker refers to what, and also provide links to the audio clips where applicable.  I also still need to implement the quiz question and ‘explore’ maps as I mentioned.  I’ll look into these issues next week.
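The size check is simple threshold logic.  As a pure function, using the sizes described above (the Leaflet-style usage comment is an assumption about the mapping library):

```javascript
// Marker radius based on the number of returned markers:
// fewer than 100 markers -> size 6; 100 or more -> size 3.
// A more granular scale could be swapped in here later.

function markerRadius(markerCount) {
  return markerCount < 100 ? 6 : 3;
}

// Illustrative wiring only (assumes a Leaflet-style API):
// L.circleMarker([lat, lng], { radius: markerRadius(answers.length) });
```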

For the DSL I processed the latest data export from the DSL’s editing system and set up a new version of the API that uses it.  The test DSL website now uses this API and is pretty much ready to go live next week.  After that I spent some time tweaking the search facilities of the new site.  Rhona had noticed that searches involving two single character wildcards (question marks) were returning unexpected results and I spent some time investigating this.

The problem turned out to have been caused by two things.  Firstly, question marks are tricky things in URLs as they mean something very specific: they signify the end of the main part of the URL and the beginning of a list of variables passed in the URL.  So for example in a SCOTS corpus URL like https://scottishcorpus.ac.uk/search/?word=scunner&search=Search the question mark tells the browser and the server-side scripts to start looking for variables.  When you want a URL to feature a question mark and for it not to be treated like this you have to encode it, and the URL code for a question mark is ‘%3F’.  This encoding needs to be done in the JavaScript running in the browser before it redirects to the URL.  Unfortunately JavaScript’s string replace function is rather odd in that by default it only finds and replaces the first occurrence and ignores all others.  This is what was happening when you did a search that included two question marks – the first was being replaced with ‘%3F’ and the second stayed as a regular question mark.  When the browser then tried to load the URL it found a regular question mark and cut off everything after it.  This is why a search for ‘sc?’ was being performed and it’s also why all searches ended up as quick searches – the rest of the content in the URL after the second question mark was being ignored, which included details of what type of search to run.
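The replace behaviour can be demonstrated in a couple of lines, along with two standard fixes — a regular expression with the global flag, or `encodeURIComponent`, which encodes question marks and the other reserved URL characters in one go:

```javascript
// With a string pattern, String.prototype.replace only replaces the FIRST
// occurrence, which is exactly the bug described above:
const broken = 'sc??m'.replace('?', '%3F');     // 'sc%3F?m' – second ? survives

// Fix 1: a regular expression with the global flag replaces every match:
const viaRegex = 'sc??m'.replace(/\?/g, '%3F'); // 'sc%3F%3Fm'

// Fix 2: encodeURIComponent handles ? and all other reserved characters:
const viaEncode = encodeURIComponent('sc??m');  // 'sc%3F%3Fm'
```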

A second thing was causing further problems: a quick search by default performs an exact match search (surrounded by double quotes) if you ignore the dropdown suggestions and press the search button.  But an exact match was set up to be just that – single wildcard characters were not being treated as wildcards, meaning a search for “sc??m” was looking for exactly that and finding nothing.  I’ve fixed this now, allowing single character wildcards to appear within an exact search.
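The actual search backend isn’t described here, but as an illustration of the fix: if the exact-match term were translated into an SQL LIKE pattern, the question mark would become LIKE’s single-character wildcard `_` (the function name and escaping details below are mine, not the site’s code):

```javascript
// Sketch: translate an exact-match search term so that '?' matches
// exactly one character, assuming an SQL LIKE-style backend.

function exactMatchToLike(term) {
  return term
    .replace(/[%_]/g, ch => '\\' + ch) // escape LIKE's own wildcards first
    .replace(/\?/g, '_');              // '?' -> single-character wildcard
}
```

So a search for “sc??m” becomes the pattern `sc__m`, matching ‘sc’ plus any two characters plus ‘m’.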

After fixing this we realised that the new site’s use of the asterisk wildcard didn’t match its use in the live site.  Rhona expected a search such as ‘sc*m’ to work on the new site, returning all headwords beginning ‘sc’ and ending in ‘m’.  However, in the new site the asterisk wildcard only matches the beginning or end of words, e.g. ‘wor*’ finds all words beginning with ‘wor’ and ‘*ord’ finds all words ending with ‘ord’.  You can combine the two with a Boolean search, though: ‘sc* AND *m’ works in exactly the same way as ‘sc*m’.

However, I decided to enable the mid-wildcard search on the new site in addition to the Boolean AND approach, both because it’s better to be consistent with the old site and because I discovered that the full text search in the new site already allows mid-asterisk searches.  I therefore spent a bit of time implementing the mid-asterisk search in the drop-down list of options in the quick search box, the main quick search and the advanced search headword search.

Rhona then spotted that a full-text mid-asterisk search was listing results alphabetically rather than by relevance.  I looked into this and it seems to be a limitation with that sort of wildcard search in the Solr search engine.  If you look here https://solr.apache.org/guide/8_7/the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser the penultimate bullet point says “Range queries (“[a TO z]”), prefix queries (“a*”), and wildcard queries (“a*b”) are constant-scoring (all matching documents get an equal score).”

I’m guessing the original API that powers the live site uses Lucene rather than Solr’s indexing system, but I don’t know for certain.  Also, while the live site’s ordering of mid-asterisk wildcard searches is definitely not alphabetical, it doesn’t really seem to be ordering properly by relevance either.  I’m afraid we might just have to live with alphabetical ordering for mid-asterisk search results, and I’ll alter the ‘Results are ordered’ statement in such cases to make it clearer that the ordering is alphabetical.

My final DSL tasks for the week were to make some tweaks to the XSLT that processes the layout of bibliographical entries.  This involved fixing the size of author names, ensuring that multiple authors are handled correctly and adding in editors’ names for SND items.  I also spotted a few layout issues that are still cropping up.  The order of some elements is displayed incorrectly, and some individual <bibl> items have multiple titles, which the stylesheet isn’t expecting, so only the first is displayed.  I think I may need to completely rewrite the stylesheet to fix these issues.  As there were lots of rules for arranging the bibliography I wrote the stylesheet to pick out and display specific elements rather than straightforwardly going through the XML and transforming each XML tag into a corresponding HTML tag.  This meant I could ensure (for example) that authors always appear first and titles each get indented, but it is rather rigid – any content that isn’t structured as the stylesheet expects may get displayed in the wrong place or not at all (like the unexpected second titles).  I’m afraid I’m not going to have time to rewrite the stylesheet before the launch of the new site next week, so this update will need to be added to the list of things to do for a future release.

Also this week I fixed an issue with the Historical Thesaurus which involved shifting a category and its children one level up and helped sort out an issue with an email address for a project using a top-level ‘ac.uk’ domain.  Next week I’ll hopefully launch the new version of the DSL on Tuesday and press on with the outstanding Speak For Yersel exercises.

Week Beginning 11th April 2022

I was back at work on Monday this week after a lovely week off last week.  It was only a four-day week, however, as the week ended with the Good Friday holiday.  I’ll also be off next Monday too.  I had rather a lot to squeeze into the four working days.  For the DSL I did some further troubleshooting for integrating Google Analytics with the DSL’s new https://macwordle.co.uk/ site.  I also had discussions about the upcoming switchover to the new DSL website, which we scheduled in for the week after next, although later in the week it turned out that all of the data has already been finalised so I’ll begin processing it next week.

I participated in a meeting for the Historical Thesaurus on Tuesday, after which I investigated the server stats for the site, which needed fixing.  I also enquired about setting up a domain URL for one of the ‘ac.uk’ sites we host, and it turned out to be something that IT Support could set up really quickly, which is good to know for future reference.  I also had a chat with Craig Lamont about a database / timeline / map interface for some data for the Allan Ramsay project that he would like me to put together to coincide with a book launch at the end of May.  Unfortunately they want this to be part of the University’s T4 website, which makes development somewhat tricky but not impossible.  I had to spend some time familiarising myself with T4 again and arranging for access to the part of the system where the Ramsay content resides.  Now I have this sorted I’ve agreed to look into developing this in early May.  I also deleted a couple of unnecessary entries from the Anglo-Norman Dictionary after the editor requested their removal and created a new version of the requirements document for the front-end for the Books and Borrowing project following feedback from the project team on the previous version.

The rest of my week was spent on the Speak For Yersel project, for which I still have an awful lot to do and not much time to do it in.  I had a meeting with the team on Monday to go over some recent developments, and following that I tracked down a few bugs in the existing code (e.g. a couple of ‘undefined’ buttons in the ‘explore’ maps).  I then replaced all of the audio files in the ‘click’ exercise, as the team had decided to use a standardised sentence spoken by many different regional speakers rather than having different speakers saying different things.  As the speakers were not always from the same region as the previous audio clips I needed to change the ‘correct’ regions and also regenerate the MP3 files and transcript data.

I then moved onto a major update to the system: working on the back end.  This took up the rest of the week, and although nothing much should have changed in terms of the interface, behind the scenes things are very different.  I designed and implemented the database that will hold all of the data for the project, including information on respondents, answers and geographical areas.  I also migrated all of the activity and question data to this database.  This was a somewhat time-consuming and tedious task as I needed to input every question and every answer option into the database, but it needed to be done.  If we didn’t have the questions and answer options in the database alongside the answers then it would be rather tricky to analyse the data when the time comes, and this way everything is stored in one place and is all interconnected.  Previously the questions were held as JSON data within the JavaScript code for the site, but this was not ideal for the above reason and also because it made updating and manually accessing the question data a bit tricky.

With the new, much tidier arrangement all of the data is stored in a database on the server and the JavaScript code requests the data for an activity when the user loads the activity’s page.  All answer choices and transcript sections also now have their own IDs, which is what we need for recording which specific answer a user has selected.  For example, for the question with the ID 10 if the user selects ‘bairn’ the answer ID 36 will be logged for that user.  I’ve set up the database structure to hold these answers and have populated the postcode area table with all of the GeoJSON data for each area.

The next step will be to populate the table holding specific locations within a postcode area once this data is available.  After that I’ll be able to create the user information form, and then I’ll need to update the activities so the selected options are actually saved.  In the meantime I began to implement the user management system.  A user icon now appears in the top right of every page, either with a green background and a tick if you’ve registered or a red background and a cross if you haven’t.  I haven’t created the registration form yet, but have just included a button to register; when you press this you’ll be registered, and this will be remembered in your browser even if you close your browser or turn your device off.  Pressing on the green tick user icon lets you view the details recorded about the registered person (none yet) and gives you an option to sign out if this isn’t you or you want to clear your details.  If you’re not registered and you try to access the activities the page will redirect you to the registration form, as we don’t want unregistered people completing the activities.  I’ll continue with this next week, hopefully getting to the point where the choices a user makes are actually logged in the database.  After that I’ll be able to generate maps with real data, which will be an important step.
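The redirect for unregistered users boils down to a small guard that runs when an activity page loads.  A sketch of that logic, with the storage key and page name as illustrative assumptions:

```javascript
// Guard for activity pages: returns the URL to redirect to if no
// registered user is found in storage, or null to let the user through.
// Key and page names are illustrative only.

function registrationRedirect(storage, registerUrl = 'register.html') {
  const user = storage.getItem('sfy_user');
  return user ? null : registerUrl;
}

// In the browser this would run on page load, e.g.:
// const target = registrationRedirect(window.localStorage);
// if (target) window.location.href = target;
```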

Week Beginning 14th March 2022

With the help of Raymond at Arts IT Support we migrated the test version of the DSL website to the new server this week, and also set up the Solr free-text indexes for the new DSL data too.  This test version of the site will become the live version when we’re ready to launch it in April and the migration all went pretty smoothly, although I did encounter an error with the htaccess script that processed URLs for dictionary pages due to underscores not needing to be escaped on the old server but requiring a backslash as an escape character on the new server.

I also replaced the test version’s WordPress database with a copy of the live site’s WordPress database, plus copied over some of the customisations from the live site such as changes to logos and the content of the header and the footer, bringing the test version’s ancillary content and design into alignment with the live site whilst retaining some of the additional tweaks I’d made to the test site (e.g. the option to hide the ‘browse’ column and the ‘about this entry’ box).

One change to the structure of the DSL data that has been implemented is that dates are now machine readable, with ‘from’, ‘to’ and ‘prefix’ attributes.  I had started to look at extracting these for use in the site (e.g. maybe displaying the earliest citation date alongside the headword in the ‘browse’ lists) when I spotted an issue with the data:  Rather than having a date in the ‘to’ attribute, some entries had an error code – for example there are 6,278 entries that feature a date with ‘PROBLEM6’ as a ‘to’ attribute.  I flagged this up with the DSL people and after some investigation they figured out that the date processing script wasn’t expecting to find a circa in a date ending a range (e.g. c1500-c1512).  When the script encountered such a case it was giving an error instead.  The DSL people were able to fix this issue and a new data export was prepared, although I won’t be using it just yet, as they will be sending me a further update before we go live and to save time I decided to just wait until they send this on.  I also completed work on the XSLT for displaying bibliography entries and created a new ‘versions and changes’ page, linking to it from a statement in the footer that notes the data version number.
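As a rough illustration of the kind of parsing involved — the DSL’s actual processing script is separate and more sophisticated — a range like ‘c1500-c1512’ needs each side checked for its own circa marker rather than assuming the prefix only appears once.  The prefix letters handled below are my assumptions:

```javascript
// Sketch: parse a citation date or range (e.g. '1500', 'c1500-c1512')
// into machine-readable 'from', 'to' and 'prefix' values. Returns null
// for unparseable input instead of emitting an error code.

function parseDatePart(part) {
  // Assumed prefixes: 'c' (circa), 'a' (ante), '?' (uncertain)
  const m = part.trim().match(/^(c|a|\?)?(\d{3,4})$/);
  if (!m) return null;
  return { prefix: m[1] || '', year: parseInt(m[2], 10) };
}

function parseDateRange(text) {
  const [fromText, toText] = text.split('-');
  const from = parseDatePart(fromText);
  // A range's end may carry its own circa marker, e.g. 'c1500-c1512'
  const to = toText !== undefined ? parseDatePart(toText) : from;
  if (!from || !to) return null;
  return { from: from.year, to: to.year, prefix: from.prefix };
}
```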

For the ‘Speak For Yersel’ project I made a number of requested updates to the exercises that I’d previously created.  I added a border around the selected answer, ensured the active state of a selected button doesn’t persist, and added handy ‘skip to quiz’ and ‘skip to explore’ links underneath the grammar and lexical quizzes so we don’t have to click through all those questions to check out the other parts of the exercise.  I italicised ‘you’ and ‘others’ on the activity index pages and fixed a couple of bugs on the grammar questionnaire.  Previously only the map rolled up, and an issue was caused when an answer was pressed whilst the map was still animating.  Now the entire question area animates, so it’s impossible to press on an answer when the map isn’t available.  I updated the quiz questions so they now have the same layout as the questionnaire, with options on the left and the map on the right, and I made all maps taller to see how this works.

For the ‘Who says what where’ exercise the full sentence text is now included, and I made the page scroll to the top of the map if this isn’t visible when you press on an item.  I also updated the map and rating colours, although there is still just one placeholder map that loads, so the lexical quiz with its many possible options doesn’t yet have a map that represents them.  The map still needs some work – e.g. adding in a legend and popups.  I also made all requested changes to the lexical question wording, made the ‘v4’ click activity the only version, making it accessible via the activities menu, and updated the colours for the correct and incorrect click answers.

For the Books and Borrowing project I completed a first version of the requirements for the public website, which has taken a lot of time and a lot of thought to put together, resulting in a document that’s more than 5,000 words long.  On Friday I had a meeting with PI Katie and Co-I Matt to discuss the document.  We spent an hour going through it and a list of questions I’d compiled whilst writing it, and I’ll need to make some modifications to the document based on our discussions.  I also downloaded images of more library registers from St Andrews and one further register from Glasgow that I will need to process when I’m back at work too.

I also spent a bit of time writing a script to export a flat CSV version of the Historical Thesaurus, then made some updates based on feedback from the HT team before exporting a further version.  We also spotted that adjectives of ‘parts of insects’ appeared to be missing from the website and I investigated what was going on.  It turned out that the (empty) main category was missing, and as all the other data was held in subcategories these didn’t appear, since all subcategories need a main category to hang off.  After adding in a main category all of the data was restored.

Finally, I did a bit of work for the Speech Star project.  Firstly, I fixed a couple of layout issues with the ExtIPA chart symbols.  There was an issue with the diacritics for the symbol that looks like a theta, resulting in them being offset.  I reduced the size of the symbol slightly and have adjusted the margins of the symbols above and below and this seems to have done the trick.  In addition, I did a little bit of research into setting the playback speed and it looks like this will be pretty easy to do whilst still using the default video player.  See this page: https://stackoverflow.com/questions/3027707/how-to-change-the-playing-speed-of-videos-in-html5.  I added a speed switcher to the popup as a little test to see how it works.  The design would still need some work (buttons with the active option highlighted) but it’s good to have a proof of concept.  Pressing ‘normal’ or ‘slow’ sets the speed for the current video in the popup and works both when the video is playing and when it’s stopped.
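For reference, the speed switcher only needs the standard `playbackRate` property of the HTML5 media element, which takes effect whether the video is playing or paused.  A minimal sketch (the speed values are illustrative):

```javascript
// Speed switcher for the default HTML5 video player. The 'normal' and
// 'slow' labels match the buttons described above; the rate values
// themselves are illustrative.

const SPEEDS = { normal: 1.0, slow: 0.5 };

function setSpeed(video, label) {
  const rate = SPEEDS[label];
  if (rate === undefined) return false;
  video.playbackRate = rate; // applies immediately, playing or paused
  return true;
}

// Illustrative wiring in the popup:
// document.querySelector('#slow-btn')
//   .addEventListener('click', () => setSpeed(videoElement, 'slow'));
```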

Also, I was sure that jumping to points in the videos wasn’t working before, but it seems to work fine now – you can click and drag the progress bar and the video jumps to the required point, either when playing or paused.  I wonder if there was something in the codec that was previously being used that prevented this.  So fingers crossed we’ll be able to just use the standard HTML5 video player to achieve everything the project requires.

I’ll be participating in the UCU strike action for all of next week so it will be the week beginning the 28th of March before I’m back in work again.

Week Beginning 7th March 2022

This was my first five-day week after the recent UCU strike action and it was pretty full-on, involving many different projects.  I spent about a day working on the Speak For Yersel project.  I added in the content for all 32 ‘I would never say that’ questions and completed work on the new ‘Give your word’ lexical activity, which features a further 30 questions of several types.  This includes questions that have associated images and questions where multiple answers can be selected.  For the latter no more than three answers are allowed to be selected, and this question type needs to be handled differently as we don’t want the map to load as soon as one answer is selected.  Instead the user can select / deselect answers.  If at least one answer is selected a ‘Continue’ button appears under the question.  When you press on this the answers become read only and the map appears.  I made it so that no more than three options can be selected – you need to deselect one before you can add another.  I think we’ll need to look into the styling of the buttons, though, as currently ‘active’ (when a button is hovered over or has been pressed and nothing else has yet been pressed) is the same colour as ‘selected’.  So if you select ‘ginger’ then deselect it the button still looks selected until you press somewhere else, which is confusing.  Also if you press a fourth button it looks like it has been selected when in actual fact it’s just ‘active’ and isn’t really selected.
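The selection rule can be captured as a small pure function — this is a sketch of the toggle logic only, separate from the real code’s button styling and ‘Continue’ button handling:

```javascript
// Multi-answer selection: an answer toggles on and off, no more than
// three may be selected at once, and pressing a fourth is ignored until
// something is deselected first.

const MAX_SELECTED = 3;

function toggleAnswer(selected, answerId) {
  const next = new Set(selected);
  if (next.has(answerId)) {
    next.delete(answerId);           // deselect
  } else if (next.size < MAX_SELECTED) {
    next.add(answerId);              // select, if there's room
  }
  return next;                       // unchanged if already at the limit
}
```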

I also spent about a day continuing to work on the requirements document for the Books and Borrowing project.  I haven’t quite finished this initial version of the document but I’ve made good progress and I aim to have it completed next week.  Also for the project I participated in a Zoom call with RA Alex Deans and NLS Maps expert Chris Fleet about a subproject we’re going to develop for B&B for the Chambers Library in Edinburgh.  This will feature a map-based interface showing where the borrowers lived and will use a historical map layer for the centre of Edinburgh.

Chris also talked about a couple of projects at the NLS that were very useful to see.  The first one was the Jamaica journal of Alexander Innes (https://geo.nls.uk/maps/innes/) which features journal entries plotted on a historical map and a slider allowing you to quickly move through the journal entries.  The second was the Stevenson maps of Scotland (https://maps.nls.uk/projects/stevenson/) that provides options to select different subjects and date periods.  He also mentioned a new crowdsourcing project to transcribe all of the names on the Roy Military Survey of Scotland (1747-55) maps which launched in February and already has 31,000 first transcriptions in place, which is great.  As with the GB1900 project, the data produced here will be hugely useful for things like place-name projects.

I also participated in a Zoom call with the Historical Thesaurus team where we discussed ongoing work.  This mainly involves a lot of manual linking of the remaining unlinked categories and looking at sensitive words and categories so there’s not much for me to do at this stage, but it was good to be kept up to date.

I continued to work on the new extIPA charts for the Speech Star project, which I had started on last week.  Last week I had some difficulties replicating the required phonetic symbols but this week Eleanor directed me to an existing site that features the extIPA chart (https://teaching.ncl.ac.uk/ipa/consonants-extra.html).  This site uses standard Unicode characters in combinations that work nicely, without requiring any additional fonts to be used.  I’ve therefore copied the relevant codes from there (this is just character codes like &#x62;&#x32A; – I haven’t copied anything other than this from the site).   With the symbols in place I managed to complete an initial version of the chart, including pop-ups featuring all of the videos, but unfortunately the videos seem to have been encoded with an encoder that requires QuickTime for playback.  So although the videos are MP4 they’re not playing properly in browsers on my Windows PC – instead all I can hear is the audio.  It’s very odd as the videos play fine directly from Windows Explorer, but in Firefox, Chrome or MS Edge I just get audio and the static ‘poster’ image.  When I access the site on my iPad the videos play fine (as QuickTime is an Apple product).  Eleanor is still looking into re-encoding the videos and will hopefully get updated versions to me next week.

I also did a bit more work for the Anglo-Norman Dictionary this week.  I fixed a couple of minor issues with the DTD, for example the ‘protect’ attribute was an enumerated list that could either be ‘yes’ or ‘no’ but for some entries the attribute was present but empty, and this was against the rules.  I looked into whether an enumerated list could also include an empty option (as opposed to not being present, which is a different matter) but it looks like this is not possible (see for example http://lists.xml.org/archives/xml-dev/200309/msg00129.html).  What I did instead was to change the ‘protect’ attribute from an enumerated list with options ‘yes’ and ‘no’ to a regular data field, meaning the attribute can now include anything (including being empty).  The ‘protect’ attribute is a hangover from the old system and doesn’t do anything whatsoever in the new system so it shouldn’t really matter.  And it does mean that the XML files should now validate.

The AND people also noticed that some entries that are present in the old version of the site are missing from the new version.  I looked through the database and also older versions of the data from the new site and it looks like these entries have never been present in the new site.  The script I ran to originally export the entries from the old site used a list of headwords taken from another dataset (I can’t remember where from exactly) but I can only assume that this list was missing some headwords and this is why these entries are not in the new site.  This is a bit concerning, but thankfully the old site is still accessible.  I managed to write a little script that grabs the entire contents of the browse list from the old website, separating it into two lists, one for main entries and one for xrefs.  I then ran each headword against a local version of the current AND database, separating out homonym numbers then comparing the headword with the ‘lemma’ field in the DB and the hom with the hom.  Initially I ran main and xref queries separately, comparing main to main and xref to xref, but I realised that some entries had changed types (legitimately so, I guess) so stopped making a distinction.

The script outputted 1540 missing entries.  This initially looks pretty horrifying, but I’m fairly certain most of them are legitimate.  There are a whole bunch of weird ‘n’ forms in the old site that have a strange character (e.g. ‘nun⋮abilité’) that are not found in the new site, I guess intentionally so.  Also, there are lots of ‘S’ and ‘R’ words but I think most of these are because of joining or splitting homonyms.  Geert, the editor, looked through the output and thankfully it turns out that only a handful of entries are missing, and also that these were also missing from the old DMS version of the data so their omission occurred before I became involved in the project.

Finally this week I worked with a new dataset of the Dictionaries of the Scots Language.  I successfully imported the new data and have set up a new ‘dps-v2’ API.  There are 80,319 entries in the new data compared to 80,432 in the previous output from DPS.  I have updated our test site to use the new API and its new data, although I have not been able to set up the free-text data in Solr yet, so the advanced search for full text / quotations only will not work yet.  Everything else should, though.

Also today I began to work on the layout of the bibliography page.  I have completed the display of DOST bibs but haven’t started on SND yet.  This includes the ‘style guide’ link when a note is present.  I think we may still need to tweak the layout, however.  I’ll continue to work with the new data next week.

Week Beginning 28 February 2022

I participated in the UCU strike action from Monday to Wednesday this week, making it a two-day week for me.  I’d heard earlier in the week that the paper I’d submitted about the redevelopment of the Anglo-Norman Dictionary had been accepted for DH2022 in Tokyo, which was great.  However, the organisers have decided to make the conference online only, which is disappointing, although probably for the best given the current geopolitical uncertainty.  I didn’t want to participate in an online only event that would be taking place in Tokyo time (nine hours ahead of the UK) so I’ve asked to withdraw my paper.

On Thursday I had a meeting with the Speak For Yersel project to discuss the content that the team have prepared and what I’ll need to work on next.  I also spent a bit of time looking into creating a geographical word cloud which would fit word cloud output into a geoJSON polygon shape.  I found one possible solution here: https://npm.io/package/maptowordcloud but I haven’t managed to make it work yet.

I also received a new set of videos for the Speech Star project, relating to the extIPA consonants, and I began looking into how to present these.  This was complicated by the extIPA symbols not being standard Unicode characters.  I did a bit of research into how these could be presented, and found this site http://www.wazu.jp/gallery/Test_IPA.html#ExtIPAChart but here the marks appear to the right of the main symbol rather than directly above or below.  I contacted Eleanor to see if she had any other ideas and she got back to me with some alternatives which I’ll need to look into next week.

I spent a bit of time working for the DSL this week too, looking into a question about Google Analytics from Pauline Graham (and finding this very handy suite of free courses on how to interpret Google Analytics here https://analytics.google.com/analytics/academy/).  The DSL people had also wanted me to look into creating a Levenshtein distance option, whereby words that are spelled similarly to an entered term are given as suggestions, in a similar way to this page: http://chrisgilmour.co.uk/scots/levensht.php?search=drech.  I created a test script that allows you to enter a term and view the SND headwords that have a Levenshtein distance of two or less from your term, with any headwords with a distance of one highlighted in bold.  However, Levenshtein is a bit of a blunt tool, and as it stands I’m not sure the results of the script are all that promising.  My test term ‘drech’ brings back 84 matches, including things like ‘french’ which is unfortunately only two letters different from ‘drech’.  I’m fairly certain my script is using the same algorithm as used by the site linked to above, it’s just that we have a lot more possible matches.  However, this is just a simple Levenshtein test – we could also add in further tests to limit (or expand) the output, such as a rule that changes vowels in certain places as in the ‘a’ becomes ‘ai’ example suggested by Rhona at our meeting last week.  Or we could limit the output to words beginning with the same letter.
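For reference, the distance measure here is the standard dynamic-programming Levenshtein algorithm — the minimum number of single-character insertions, deletions and substitutions needed to turn one word into another.  A sketch of the kind of implementation the test script uses:

```javascript
// Classic Levenshtein distance via a dynamic-programming table.
// d[i][j] holds the distance between the first i characters of `a`
// and the first j characters of `b`.

function levenshtein(a, b) {
  const rows = a.length + 1;
  const cols = b.length + 1;
  const d = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (let i = 0; i < rows; i++) d[i][0] = i; // delete everything
  for (let j = 0; j < cols; j++) d[0][j] = j; // insert everything
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (free if chars match)
      );
    }
  }
  return d[a.length][b.length];
}
```

This is also what makes ‘french’ a match for ‘drech’: one substitution (d→f) and one insertion (n) give a distance of exactly 2.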

Also this week I had a chat with the Historical Thesaurus people, arranging a meeting for next week and exporting a recent version of the database for them to use offline.  I also tweaked a couple of entries for the AND and spent an hour or so upgrading all of the WordPress sites I manage to the latest WordPress version.

Week Beginning 3rd January 2022

This was my first week back after the Christmas holidays, and it was a three-day week.  I spent the days almost exclusively on the Books and Borrowing project.  We had received a further batch of images for 23 library registers from the NLS, which I needed to download from the NLS’s server and process.  This involved renaming many thousands of images via a little script I’d written in order to give the images more meaningful filenames and stripping out several thousand images of blank pages that had been included but are not needed by the project.  I then needed to upload the images to the project’s web server and then generate all of the necessary register and page records in the CMS for each page image.

I also needed to update the way folio numbers were generated for the registers.  For the previous batch of images from the NLS I had just assigned the numerical part of the image’s filename as the folio number, but it turns out that most of the images have a hand-written page number in the top-right which starts at 1 for the first actual page of borrowing records.  There are usually a few pages before this, and these need to be given Roman numerals as folio numbers.  I therefore had to write another script that would take into consideration the number of front-matter pages in each register, assign Roman numerals as folio numbers to them and then begin the numbering of borrowing record pages from 1 after that, incrementing through the rest of the volume.
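The numbering logic itself is straightforward; a rough Python sketch (the page counts here are invented, and the real script works against the CMS's page records rather than plain lists):

```python
def to_roman(n: int) -> str:
    """Lower-case Roman numerals for the front-matter folio numbers."""
    vals = [(1000, 'm'), (900, 'cm'), (500, 'd'), (400, 'cd'), (100, 'c'),
            (90, 'xc'), (50, 'l'), (40, 'xl'), (10, 'x'), (9, 'ix'),
            (5, 'v'), (4, 'iv'), (1, 'i')]
    out = ''
    for v, s in vals:
        while n >= v:
            out += s
            n -= v
    return out

def folio_numbers(total_pages, front_matter):
    """Roman numerals for the front matter, then 1, 2, 3... for the
    borrowing-record pages."""
    front = [to_roman(i) for i in range(1, front_matter + 1)]
    rest = [str(i) for i in range(1, total_pages - front_matter + 1)]
    return front + rest

# e.g. a (hypothetical) six-page register with three front-matter pages:
folio_numbers(6, 3)  # ['i', 'ii', 'iii', '1', '2', '3']
```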

I guess it was inevitable with data of this sort, but I ran into some difficulties whilst processing it.  Firstly, there were some problems with the Jpeg images the NLS had sent for two of the volumes.  These didn’t match the Tiff images for the volumes, with each volume having an incorrect number of files.  Thankfully the NLS were able to quickly figure out what had gone wrong and were able to supply updated images.

The next issue to crop up occurred when I began to upload the images to the server.  After uploading about 5GB of images the upload terminated, and soon after that I received emails from the project team saying they were unable to log into the CMS.  It turns out that the server had run out of storage.  Each time someone logs into the CMS the server needs a tiny amount of space to store a session variable, but there wasn’t enough space to store this, meaning it was impossible to log in successfully.  I emailed the IT people at Stirling (where the project server is located) to enquire about getting some further space allocated but I haven’t heard anything back yet.  In the meantime I deleted the images from the partially uploaded volume, which freed up enough space to enable the CMS to function again.  I also figured out a way to free up some further space: the first batch of images from the NLS also included images of blank pages across 13 volumes – several thousand images.  It was only after uploading these and generating page records that we had decided to remove the blank pages, but I only removed the CMS records for these pages – the image files were still stored on the server.  I therefore wrote another script to identify and delete all of the blank page images from the first batch that was uploaded, which freed up 4-5GB of space on the server – enough to complete the upload of the second batch of registers from the NLS.  We will still need more space, though, as there are still many thousands of images left to add.

I also took the opportunity to update the folio numbers of the first batch of NLS registers to bring them into line with the updated method we’d decided on for the second batch (Roman numerals for front-matter and then incrementing page numbers from the first page of borrowing records).  I wrote a script to renumber all of the required volumes, which was mostly a success.

However, I also noticed that the automatically generated folio numbers often became out of step with the hand-written folio numbers found in the top-right corner of the images.  I decided to go through each of the volumes to identify all that became unaligned and to pinpoint exactly which page or pages the misalignment occurred on.  This took some time as there were 32 volumes that needed to be checked, and each time an issue was spotted I needed to look back through the pages and associated images from the last page until I found the point where the page numbers correctly aligned.  I discovered that there were numbering issues with 14 of the 32 volumes, mainly due to whoever wrote the numbers in getting muddled.  There are occasions where a number is missed, or a number is repeated.  In one volume the page numbers advance by 100 from one page to the next.  It should be possible for me to write a script that will update the folio numbers to bring them into alignment with the erroneous handwritten numbers (for example where a number is repeated these will be given ‘a’ and ‘b’ suffixes).  I didn’t have time to write the script this week but will do so next week.
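The handling of repeated numbers in that planned script could look something like this (a hedged Python sketch: the handwritten sequence here is invented, and the real script will read and write folio numbers in the CMS database):

```python
from collections import Counter

def align_folio_numbers(handwritten):
    """Follow the (erroneous) handwritten sequence, disambiguating repeated
    numbers with 'a', 'b', ... suffixes so folio numbers stay unique.
    Skipped numbers are simply left out, mirroring the page images."""
    totals = Counter(handwritten)          # how often each number occurs
    seen = Counter()                       # occurrences encountered so far
    out = []
    for n in handwritten:
        if totals[n] == 1:
            out.append(str(n))
        else:
            out.append(f"{n}{'abcdefgh'[seen[n]]}")
            seen[n] += 1
    return out

# A repeated '5' and a skipped '8' in a hypothetical volume:
align_folio_numbers([4, 5, 5, 6, 7, 9])  # ['4', '5a', '5b', '6', '7', '9']
```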

Also for the project this week I looked through the spreadsheet of borrowing records from the Royal High School of Edinburgh that one of the RAs has been preparing.  I had a couple of questions about the spreadsheet, and I’m hoping to be able to process it next week.  I also exported the records from one register for Gerry McKeever to work on, as these records now need to be split across two volumes rather than one.

Also this week I had an email conversation with Marc Alexander about a few issues, during which he noted that the Historical Thesaurus website was offline.  Further investigation revealed that the entire server was offline, meaning several other websites were down too.  I asked Arts IT Support to look into this, which took a little time as it was a physical issue with the hardware and they were all still working remotely.  However, the following day they were able to investigate and address the issue, which they reckon was caused by a faulty network port.

Week Beginning 20th December 2021

This was the last week before Christmas and it’s a four-day week as the University has generously given us all an extra day’s holiday on Christmas Eve.  I also lost a bit of time due to getting my Covid booster vaccine on Wednesday.  I was booked in for 9:50 and got there at 9:30 to find a massive queue snaking round the carpark.  It took an hour to queue outside, plus about 15 minutes inside, but I finally got my booster just before 11.  The after-effects kicked in during Wednesday night and I wasn’t feeling great on Thursday, but I managed to work.

My major task of the week was to deal with the new Innerpeffray data for the Books and Borrowing project.  I’d previously uploaded data from an existing spreadsheet in the early days of the project, but it turns out that there were quite a lot of issues with the data and therefore one of the RAs has been creating a new spreadsheet containing reworked data.  The RA Kit got back to me this week after I’d checked some issues with her last week and I therefore began the process of deleting the existing data and importing the new data.

It was a pretty torturous process, but I managed to finish deleting the existing Innerpeffray data and importing the new data.  This required a pretty complex amount of processing and checking via a script I wrote this week.  I managed to retain superscript characters in the transcriptions, something that proved to be very tricky as there is no way to find and replace superscript characters in Excel.  Eventually I ended up copying the transcription column into Word, then saving the table as HTML, stripping out all of the rubbish Word adds in when it generates an HTML file and then using this resulting file alongside the main spreadsheet file that I saved as a CSV.  After several attempts at running the script on my local PC, then fixing issues, then rerunning, I eventually reckoned the script was working as it should – adding page, borrowing, borrower, borrower occupation, book holding and book item records as required.  I then ran the script on the server and the data is now available via the CMS.
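The stripping stage boils down to keeping the superscript markup and throwing everything else away.  A rough Python approximation (assuming Word marks superscripts either with `<sup>` tags or with `vertical-align:super` styled spans, which is how its HTML export usually renders them; the sample cell below is invented):

```python
import re

def clean_word_html(cell_html):
    """Keep <sup>...</sup> (the only markup we need to preserve) and strip
    everything else Word adds to its exported HTML."""
    # Normalise Word's span-styled superscripts to plain <sup> tags first
    cell_html = re.sub(
        r"<span[^>]*vertical-align:\s*super[^>]*>(.*?)</span>",
        r"<sup>\1</sup>", cell_html, flags=re.I | re.S)
    # Drop every remaining tag except <sup> and </sup>
    cell_html = re.sub(r"</?(?!sup\b)[a-zA-Z][^>]*>", "", cell_html)
    return cell_html.strip()

clean_word_html(
    '<p class="MsoNormal">W<span style="vertical-align:super">m</span>'
    ' Drummond</p>')
# 'W<sup>m</sup> Drummond'
```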

There were a few normalised occupations that weren’t right and I updated these.  There were also 287 standardised titles that didn’t match any existing book holding records in Innerpeffray.  For these I created a new holding record and (if there’s an ESTC number) linked to a corresponding edition.

Also this week I completed work on the ‘Guess the Category’ quizzes for the Historical Thesaurus.  Fraser had got back to me about the spreadsheets of categories and lexemes that might cause offence and should therefore never appear in the quiz.  I added a new ‘inquiz’ column to both the category and lexeme table which has been set to ‘N’ for each matching category and lexeme.  I also updated the code behind the quiz so that only categories and lexemes with ‘inquiz’ set to ‘Y’ are picked up.

The category exclusions are pretty major – a total of 17,111 are now excluded.  This is due to including child categories where noted, and 8,340 of these are within ‘03.08 Faith’.  For lexemes there are a total of 2,174 that are specifically noted as excluded based on both tabs of the spreadsheet (but note that all lexemes in excluded categories are excluded by default – a total of 69,099).  The quiz picks a category first and then a lexeme within it, so there should never be a case where a lexeme in an excluded category is displayed.  I also ensured that when a non-noun category is returned, if there isn’t a full trail of categories (because there isn’t a parent in the same part of speech) then the trail is populated from the noun categories instead.
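The selection logic reduces to something like this (a Python sketch with invented rows; the real quiz queries the HT database tables directly):

```python
import random

def pick_quiz_question(categories, lexemes):
    """Pick a random in-quiz category, then a random in-quiz lexeme within
    it.  Because the category is chosen first, a lexeme inside an excluded
    category can never be selected, even if its own 'inquiz' flag is 'Y'."""
    pool = [c for c in categories if c['inquiz'] == 'Y']
    cat = random.choice(pool)
    words = [l for l in lexemes
             if l['catid'] == cat['id'] and l['inquiz'] == 'Y']
    return cat, random.choice(words)

# Hypothetical rows standing in for the category and lexeme tables:
cats = [{'id': 1, 'inquiz': 'Y'}, {'id': 2, 'inquiz': 'N'}]
lexs = [{'catid': 1, 'word': 'alpha', 'inquiz': 'Y'},
        {'catid': 1, 'word': 'beta',  'inquiz': 'N'},
        {'catid': 2, 'word': 'gamma', 'inquiz': 'Y'}]
cat, lex = pick_quiz_question(cats, lexs)  # only category 1 / 'alpha' qualify
```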

The two quizzes (a main one and an Old English one) are now live and can be viewed here:

https://ht.ac.uk/guess-the-category/

https://ht.ac.uk/guess-the-oe-category/

Also this week I made a couple of tweaks to the Comparative Kingship place-names systems, adding in Pictish as a language and tweaking how ‘codes’ appear in the map.  I also helped Raymond migrate the Anglo-Norman Dictionary to the new server that was purchased earlier this year.  We had to make a few tweaks to get the site to work at a temporary URL but it’s looking good now.  We’ll update the DNS and make the URL point to the new server in the New Year.

That’s all for this year.  If there is anyone reading this (doubtful, I know) I wish you a merry Christmas and all the best for 2022!

Week Beginning 13th December 2021

My big task of the week was to return to working for the Speak For Yersel project after a couple of weeks when my services haven’t been required.  I had a meeting with PI Jennifer Smith and RA Mary Robinson on Monday where we discussed the current status of the project and the tasks I should focus on next.  Mary had finished work on the geographical areas we are going to use.  These are based on postcode areas but a number of areas have been amalgamated.  We’ll use these to register where a participant is from and also to generate a map marker representing their responses at a random location within their selected area based on the research I did a few weeks ago about randomly positioning a marker in a polygon.
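The usual way to place a random marker inside a polygon is rejection sampling: pick random points in the polygon’s bounding box and keep the first one that falls inside.  A pure-Python sketch of the idea (the project itself does this in the browser via turf.js, and the square polygon below is just for illustration):

```python
import random

def point_in_polygon(x, y, poly):
    """Standard ray-casting test; poly is a list of (lon, lat) vertices."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def random_point_in_polygon(poly, rng=random):
    """Rejection sampling: try bounding-box points until one is inside."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    while True:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, poly):
            return x, y

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
x, y = random_point_in_polygon(square)
```

For a convex-ish postcode area only a handful of attempts are typically needed, so this is cheap enough to run on every answer submission.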

The files that Mary sent me were two exports from ArcGIS, one as JSON and one as GeoJSON.  Unfortunately both files used a different coordinate system rather than latitude and longitude, the GeoJSON file didn’t include any identifiers for the areas so couldn’t really be used, and while the JSON file looked promising, when I tried to use it in Leaflet it gave me an ‘invalid GeoJSON object’ error.  Mary then sent me the original ArcGIS file for me to work with and I spent some time in ArcGIS figuring out how to export the shapefile data as GeoJSON with latitude and longitude.

Using ArcGIS I exported the data by typing in ‘to json’ in the ‘Geoprocessing’ pane on the right of the map then selecting ‘Features to JSON’.  I selected ‘output to GeoJSON’ and also checked ‘Project to WGS_1984’ which converts the ArcGIS coordinates to latitude and longitude.  When not using the ‘formatted JSON option’ (which adds in line breaks and tabs) this gave me a file size of 115MB.  As a starting point I created a Leaflet map that uses this GeoJSON file but I ran into a bit of a problem:  the data takes a long time to load into the map – about 30-60 seconds for me – and the map feels a bit sluggish to navigate around even after it’s loaded in. And this is without there being any actual data.  The map is going to be used by school children, potentially on low-spec mobile devices connecting to slow internet services (or even worse, mobile data that they may have to pay for per MB).  We may have to think about whether using these areas is going to be feasible.  An option might be to reduce the detail in the polygons, which would reduce the size of the JSON file.  The boundaries in the current file are extremely detailed and each twist and turn in the polygon requires a latitude / longitude pair in the data, and there are a lot of twists and turns.  The polygons we used in SCOSYA are much more simplified (see for example https://scotssyntaxatlas.ac.uk/atlas/?j=y#9.75/57.6107/-7.1367/d3/all/areas) but would still suit our needs well enough.  However, manually simplifying each and every polygon would be a monumental and tedious task.  But perhaps there’s a method in ArcGIS that could do this for us.  There’s a tool called ‘Simplify Polygon’: https://desktop.arcgis.com/en/arcmap/latest/tools/cartography-toolbox/simplify-polygon.htm which might work.

I spoke to Mary about this and she agreed to experiment with the tool.  Whilst she worked on this I continued to work with the data.  I extracted all of the 411 areas and stored these in a database, together with all 954 postcode components that are related to these areas.  This will allow us to generate a drop-down list of options as the user types – e.g.  type in ‘G43’ and options ‘G43 2’ and ‘G43 3’ will appear, and both of these are associated with ‘Glasgow South’.
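The drop-down lookup is a simple prefix match over the postcode components.  In Python terms (with a tiny invented sample standing in for the 954 components; the real lookup runs as a database query behind an autocomplete endpoint):

```python
def suggest(prefix, postcode_to_area):
    """Return (component, area) pairs whose postcode component starts with
    what the user has typed so far, for populating the drop-down."""
    prefix = prefix.upper().strip()
    return sorted((pc, area) for pc, area in postcode_to_area.items()
                  if pc.startswith(prefix))

# 'Glasgow South' is a real area from the data; 'G44 3' and its area name
# are made up for the example:
lookup = {'G43 2': 'Glasgow South',
          'G43 3': 'Glasgow South',
          'G44 3': 'Glasgow South East'}
options = suggest('G43', lookup)
# [('G43 2', 'Glasgow South'), ('G43 3', 'Glasgow South')]
```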

I also wrote a script to generate sample data for each of the 411 areas using the ‘turf.js’ script I’d previously used.  For each of the 411 areas a random number of markers between 0 and 100 are generated and stored in the database, each with a random rating of between 1 and 4.  This has resulted in 19,946 sample ratings, which I then added to the map along with the polygonal area data, as you can see here:

Currently these are given the colours red=1, orange=2, light blue=3, dark blue=4, purely for test purposes.  As you can see, including almost 20,000 markers swamps the map when it’s zoomed out, but when you zoom in things look better.  I also realised that we might not even need to display the area boundaries to users.  They can be used in the background to work out where a marker should be positioned (as is the case with the map above) but perhaps they’re not needed for any other reasons?  It might be sufficient to include details of area in a popup or sidebar and if so we might not need to rework the areas at all.

However, whilst working on this Mary had created four different versions of the area polygons using four different algorithms.  These differ in how they simplify the polygons and therefore result in different boundaries – some missing out details such as lochs and inlets.  All four versions were considerably smaller in file size than the original, ranging from 4MB to 20MB.  I created new maps for each of the four simplified polygon outputs.  For each of these I regenerated new random marker data.  For algorithms ‘DP’ and ‘VW’ I limited the number of markers to between 0 and 20 per area, giving around 4000 markers in each map.  For ‘WM’ and ‘ZJ’ I limited the number to between 0 and 50 per area, giving around 10,000 markers per map.

All four new maps look pretty decent to me, with even the smaller JSON files (‘DP’ and ‘VW’) containing a remarkable level of detail.  I think the ‘DP’ one might be the one to go for.  It’s the smallest (just under 4MB compared to 115MB for the original) yet also seems to have more detail than the others.  For example for the smaller lochs to the east of Loch Ness the original and ‘DP’ include the outline of four lochs while the other three only include two.  ‘DP’ also includes more of the smaller islands around the Outer Hebrides.
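If ‘DP’ is Douglas-Peucker (as the abbreviation suggests), the algorithm recursively keeps only the vertices that deviate from a straight line by more than a tolerance, which is why it discards redundant points while preserving genuine detail like loch outlines.  A minimal sketch of the idea (not ArcGIS’s actual implementation):

```python
def perpendicular_distance(pt, a, b):
    """Distance from pt to the infinite line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return ((x - x1) ** 2 + (y - y1) ** 2) ** 0.5
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / (dx * dx + dy * dy) ** 0.5

def douglas_peucker(points, tolerance):
    """Recursively drop vertices that lie within `tolerance` of the chord
    between the segment's endpoints."""
    if len(points) < 3:
        return points
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > tolerance:
        left = douglas_peucker(points[:index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right
    return [points[0], points[-1]]

# A nearly straight run of four vertices collapses to its endpoints:
douglas_peucker([(0, 0), (1, 0.01), (2, -0.01), (3, 0)], 0.1)
# [(0, 0), (3, 0)]
```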

We decided that we don’t need to display the postcode areas on the map to users but instead we’ll just use these to position the map markers.  However, we decided that we do want to display the local authority area so people have a general idea of where the markers are positioned.  My next task was to add these in.  I downloaded the administrative boundaries for Scotland from here: https://raw.githubusercontent.com/martinjc/UK-GeoJSON/master/json/administrative/sco/lad.json as referenced on this website: https://martinjc.github.io/UK-GeoJSON/ and added them into my ‘DP’ sample map, giving the boundaries a dashed light green that turns a darker green when you hover over the area, as you can see from the screenshot below:

Also this week I added in a missing text to the Anglo-Norman Dictionary’s Textbase.  To do this I needed to pass the XML text through several scripts to generate page records and all of the search words and ‘keyword in context’ data for search purposes.  I also began to investigate replacing the Innerpeffray data for Books and Borrowing with a new dataset that Kit has worked on.  This is going to be quite a large and complicated undertaking and after working through the data I had a set of questions to ask Kit before I proceeded to delete any of the existing data.  Unfortunately she is currently on jury duty so I’ll need to wait until she’s available again before I can do anything further.  Also this week a huge batch of images became available to us from the NLS and I spent some time downloading these and moving them to an external hard drive as they’d completely filled up the hard drive of my PC.

I also spoke to Fraser about the new radar diagrams I had been working on for the Historical Thesaurus and also about the ‘guess the category’ quiz that we’re hoping to launch soon.  Fraser sent on a list of categories and words that we want to exclude from the quiz (anything that might cause offence) but I had some questions about this that will need clarification before I take things further.  I’d suggested to Fraser that I could update the radar diagrams to include not only the selected category but also all child categories and he thought this would be worth investigating so I spent some time updating the visualisations.

I was a little worried about the amount of processing that would be required to include child categories but thankfully things seem pretty speedy, even when multiple top-level categories are chosen.  See for example the visualisation of everything within ‘Food and drink’, ‘Faith’ and ‘Leisure’:

This brings back many tens of thousands of lexemes but doesn’t take too long to generate.  I think including child categories will really help make the visualisations more useful as we’re now visualising data at a scale that’s very difficult to get a grasp on simply by looking at the underlying words.  It’s interesting to note in the above visualisation how ‘Leisure’ increases in size dramatically throughout the time periods while ‘Faith’ shrinks in comparison (but still grows overall).  With this visualisation the ‘totals’ rather than the ‘percents’ view is much more revealing.

Week Beginning 6th December 2021

I spent a bit of time this week writing a second draft of a paper for DH2022 after receiving feedback from Marc.  This one targets ‘short papers’ (500-750 words) and I managed to get it submitted before the deadline on Friday.  Now I’ll just need to see if it gets accepted – I should find out one way or the other in February.  I also made some further tweaks to the locution search for the Anglo-Norman Dictionary, ensuring that when a term appears more than once the result is repeated for each occurrence, appearing in the results grouped by each word that matches the term.  So for example ‘quatre tempres, tens’ now appears twice, once amongst the ‘tempres’ and once amongst the ‘tens’ results.

I also had a chat with Heather Pagan about the Irish Dictionary eDIL (http://www.dil.ie/) who are hoping to rework the way they handle dates in a similar way to the AND.  I said that it would be difficult to estimate how much time it would take without seeing their current data structure and getting more of an idea of how they intend to update it.  I’d also need to know what updates would be required to their online resource to incorporate the updated date structure, such as enhanced search facilities, whether further updates to their resource would also be part of the process, and whether any back-end systems would also need to be updated to manage the new data (e.g. if they have a DMS like the AND).

Also this week I helped out with some issues with the Iona place-names website just before their conference started on Thursday.  Someone had reported that the videos of the sessions were only playing briefly and then cutting out, but they all seemed to work for me, having tried them on my PC in Firefox and Edge and on my iPad in Safari.  Eventually I managed to replicate the issue in Chrome on my desktop and in Chrome on my phone, and it seemed to be an issue specifically related to Chrome, and didn’t affect Edge, which is based on Chrome.  The video file plays and then cuts out due to the file being blocked on the server.  I can only assume that the way Chrome accesses the file is different to other browsers and it’s sending multiple requests to the server which is then blocking access due to too many requests being sent (the console in the browser shows a 403 Forbidden error).  Thankfully Raymond at Arts IT Support was able to increase the number of connections allowed per browser and this fixed the issue.  It’s still a bit of a strange one, though.

I also had a chat with the DSL people about when we might be able to replace the current live DSL site with the ‘new’ site, as the server the live site is on will need to be decommissioned soon.  I also had a bit of a catch-up with Stevie Barrett, the developer in Celtic and Gaelic, and had a video call with Luca and his line-manager Kirstie Wild to discuss the current state of Digital Humanities across the College of Arts.  Luca does a similar job to me at college-level and it was good to meet him and Kirstie to see what’s been going on outside of Critical Studies.  I also spoke to Jennifer Smith about the Speak For Yersel project, as I’d not heard anything about it for a couple of weeks.  We’re going to meet on Monday to take things further.

I spent the rest of the week working on the radar diagram visualisations for the Historical Thesaurus, completing an initial version.  I’d previously created a tree browser for the thematic headings, as I discussed last week.  This week I completed work on the processing of data for categories that are selected via the tree browser.  After the data is returned the script works out which lexemes have dates that fall into the four periods (e.g. a word with dates 650-9999 needs to appear in all four periods).  Words are split by Part of speech, and I’ve arranged the axes so that N, V, Aj and Av appear first (if present), with any others following on.  All verb categories have also been merged.
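The period-assignment step is just an overlap test between each word’s date range and the four period ranges.  A Python sketch (the exact period boundaries here are illustrative, not necessarily the ones the visualisation uses):

```python
# Illustrative boundaries for the four periods used in the visualisation:
PERIODS = {'OE': (600, 1149), 'ME': (1150, 1499),
           'EModE': (1500, 1699), 'ModE': (1700, 9999)}

def periods_for(first, last):
    """Return every period a word's date range overlaps; a word dated
    650-9999 therefore lands in all four."""
    return [name for name, (start, end) in PERIODS.items()
            if first <= end and last >= start]

periods_for(650, 9999)  # ['OE', 'ME', 'EModE', 'ModE']
```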

I’m still not sure how widely useful these visualisations will be as they only really work for categories that have several parts of speech.  But there are some nice ones.  See for example a visualisation of ‘Badness/evil’, ‘Goodness, acceptability’ and ‘Mediocrity’ which shows words for ‘Badness/evil’ being much more prevalent in OE and ME while ‘Mediocrity’ barely registers, only for it and ‘Goodness, acceptability’ to grow in relative size in EModE and ModE:

I also added in an option to switch between visualisations which use total counts of words in each selected category’s parts of speech and visualisations that use percentages.  With the latter the scale is fixed at a maximum of 100% across all periods and the points on the axes represent the percentage of the total words in a category that are in a part of speech in your chosen period.  This means categories of different sizes are easier to compare, but does of course mean that the relative sizes of categories are not visualised.  I could also add a further option that fixes the scale at the maximum number of words in the largest POS so the visualisation still represents relative sizes of categories but the scale doesn’t fluctuate between periods (e.g. if there are 363 nouns for a category across all periods then the maximum on the scale would stay fixed at 363 across all periods, even if the maximum number of nouns in OE (for example) is 128).  Here’s the above visualisation using the percentage scale:

The other thing I did was to add in a facility to select a specific category and turn off the others.  So for example if you’ve selected three categories you can press on a category to make it appear bold in the visualisation and to hide the other categories.  Pressing on a category a second time reverts back to displaying all.  Your selection is remembered if you change the scale type or navigate through the periods.  I may not have much more time to work on this before Christmas, but the next thing I’ll do is to add in access to the lexeme data behind the visualisation.  I also need to fix a bug that is causing the ModE period to be missing a word in its counts sometimes.


Week Beginning 29th November 2021

I participated in the UCU strike action on Wednesday to Friday this week, so it was a two-day working week for me.  During this time I gave some help to the students who are migrating the International Journal of Scottish Theatre and Screen and talked to Gerry Carruthers about another project he’s hoping to put together.  I also passed on information about the DNS update to the DSL’s IT people, added a link to the DSL’s new YouTube site to the footer of the DSL site and dealt with a query regarding accessing the DSL’s Google Analytics data.  I also spoke with Luca about arranging a meeting with him and his line manager to discuss digital humanities across the college and updated the listings for several Android apps that I created a few years ago that had been taken down due to their information being out of date.  As central IT services now manages the University Android account I hadn’t received notifications that this was going to take place.  Hopefully the updates have done the trick now.

Other than this I made some further updates to the Anglo-Norman Dictionary’s locution search that I created last week.  This included changing the ordering to list results by the word that was searched for rather than by headword, changing the way the search works so that a wildcard search such as ‘te*’ now matches the start of any word in the locution phrase rather than just the first word, and fixing a number of bugs that had been spotted.

I spent the rest of my available time starting to work on an interactive version of the radar diagram for the Historical Thesaurus.  I’d made a static version of this a couple of months ago which looks at the words in an HT category by part of speech and visualises how the numbers of words in each POS change over time.  What I needed to do was find a way to allow users to select their own categories to visualise.  We had decided to use the broader Thematic Categories for the feature rather than regular HT categories so my first task was to create a Thematic Category browser from ‘AA The World’ to ‘BK Leisure’.  It took a bit of time to rework the existing HT category browser to work with thematic categories, and also to then enable the selection of multiple categories by pressing on the category name.  Selected categories appear to the right of the browser, and I added in an option to remove a selected category if required.  With this in place I began work on the code to actually grab and process the data for the selected categories.  This finds all lexemes and their associated dates for each lexeme in each HT category in each of the selected thematic categories.  For now the data is just returned and I’m still in the middle of processing the dates to work out which period each word needs to appear in.  I’ll hopefully find some time to continue with this next week.  Here’s a screenshot of the browser: