Week Beginning 23rd October 2017

After an enjoyable week’s holiday I returned to work on Monday, spending quite a bit of the day catching up with issues people had emailed me about whilst I was away, such as making further tweaks to the ‘Concise Scots Dictionary’ page on the DSL website for Rhona Alcorn (the page is now live if you’d like to order the book: http://dsl.ac.uk/concise-scots-dictionary/), speaking with Luca about a project he’s involved in planning that’s going to use some of the DSL data, helping Carolyn Jess-Cooke with some issues she was encountering when accessing one of her websites, giving Brianna of the RNSN project some information about timeline tools we might use, and a few other such things.

I also spent some time adding paragraph IDs to the ‘Scots Language’ page of the DSL (http://dsl.ac.uk/about-scots/the-scots-language/) for Ann Fergusson, to enable references to specific paragraphs to be embedded in other pages.  Implementing this was somewhat complicated by the ‘floating’ contents section on the left: when a ‘hash’ is included in a URL the browser automatically jumps to the element whose ID has that value, but for the contents section to float or be fixed to the top of the page depending on which section the user is viewing, the page needs to load at the top so the position can be calculated.  If the page loads halfway down, the contents section remains fixed at the top of the page, which is not much use.  However, I managed to get the ‘jump to paragraph from a URL’ feature working alongside the floating contents section with a bit of a hack.  Basically, I’ve made it so that the ‘hash’ that gets passed to the page doesn’t actually correspond to an element on the page, so the browser doesn’t jump anywhere.  My JavaScript then grabs the hash after the page has loaded, reworks it into a format that does match an actual element and smoothly scrolls to this element.  I’ve tested this in Firefox, Chrome, Internet Explorer and Edge and it works pretty well.
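
As an illustration, the hack boils down to something like the sketch below.  This is written from scratch rather than copied from the DSL site (which may well do its scrolling with jQuery instead), and the ‘#para-’ prefix and the ‘p12’-style paragraph IDs are illustrative names of my own, not necessarily what the page uses:

```typescript
// Minimal sketch of the "jump to paragraph" hack described above.
// Assumption: real paragraph elements have IDs like "p12", and links use a
// prefixed hash such as "#para-p12" that matches no element, so the browser
// does not jump on load and the floating contents section can initialise
// with the page at the top.
window.addEventListener('load', () => {
  const hash = window.location.hash;        // e.g. "#para-p12"
  if (!hash.startsWith('#para-')) {
    return;                                 // no paragraph reference in the URL
  }
  // Rework the hash into the ID of a real element, e.g. "p12".
  const targetId = hash.replace('#para-', '');
  const target = document.getElementById(targetId);
  if (target) {
    // Smoothly scroll to the paragraph once the page (and the floating
    // contents section) has finished setting itself up.
    target.scrollIntoView({ behavior: 'smooth' });
  }
});
```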

I had a couple of queries from Wendy Anderson this week.  The first was for Mapping Metaphor.  Wendy wanted to grab all of the bidirectional metaphors in both the main and OE datasets, including all of their sample lexemes.  I wrote a script that extracted the required data and formatted it as a CSV file, which was just the sort of thing she wanted.  The second query was for all of the metadata associated with the Corpus of Modern Scots Writing texts.  A researcher had contacted Wendy to ask for a copy, but although the metadata is in the database and can be viewed on a per-text basis through the website, we didn’t have the complete dataset in an easy-to-share format.  I wrote a little script that queried the database and retrieved all of the data.  I had to do a little digging into how the database was structured in order to do this, as it is a system that wasn’t developed by me.  However, after a little exploration I managed to write a script that grabbed the data about each text, including the multiple authors that can be associated with each text.  I then formatted this as a CSV file and sent the output to Wendy.
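
For illustration, the kind of export involved looks roughly like the sketch below.  This is a TypeScript/Node sketch rather than the script I actually wrote, and the table and column names (‘texts’, ‘authors’, ‘text_authors’ and so on) are hypothetical stand-ins for the real corpus database schema:

```typescript
// Rough sketch of a metadata-to-CSV export along the lines described above.
// All table, column and connection details here are placeholders.
import { createConnection } from 'mysql2/promise';
import { writeFileSync } from 'fs';

async function exportMetadata(): Promise<void> {
  const db = await createConnection({
    host: 'localhost', user: 'user', password: 'pass', database: 'corpus_db',
  });

  // Grab each text together with its (possibly multiple) authors.
  const [rows] = (await db.query(
    `SELECT t.id, t.title, t.year, GROUP_CONCAT(a.name SEPARATOR '; ') AS authors
     FROM texts t
     LEFT JOIN text_authors ta ON ta.text_id = t.id
     LEFT JOIN authors a ON a.id = ta.author_id
     GROUP BY t.id, t.title, t.year`
  )) as [any[], unknown];

  // Build a simple CSV, quoting every field.
  const header = 'id,title,year,authors';
  const lines = rows.map(r =>
    [r.id, r.title, r.year, r.authors ?? '']
      .map(v => `"${String(v).replace(/"/g, '""')}"`)
      .join(',')
  );
  writeFileSync('corpus-metadata.csv', [header, ...lines].join('\n'));

  await db.end();
}

exportMetadata().catch(console.error);
```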

I met with Gary on Monday to discuss some changes to the SCOSYA atlas and CMS that he wanted me to implement ahead of an event the team are attending next week.  These included adding Google Analytics to the website, updating the legend of the Atlas to make it clearer what the different rating levels mean, separating the grey squares (which mean no data is present) and the grey circles (which mean data is present but doesn’t meet the specified criteria) into separate layers so they can be switched on and off independently of each other, making the map markers a little smaller, and adding facilities to allow Gary to delete codes, attributes and code parents via the CMS.  This all took a fair amount of time to implement, and unfortunately I lost a lot of time on Thursday due to a very strange situation with my access to the server.
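
To give an idea of what separating those markers into independently switchable layers involves, here is a minimal sketch assuming a Leaflet-based map (an assumption on my part rather than a description of how the atlas is actually built):

```typescript
// Sketch of keeping "no data" squares and "data but below criteria" circles
// in separate layers so they can be toggled independently.
// Assumes Leaflet and a container element with id "atlas"; the real atlas
// may be built quite differently.
import * as L from 'leaflet';

const map = L.map('atlas').setView([57.5, -4.5], 7);

// One layer group per marker type.
const noDataSquares = L.layerGroup();         // grey squares: no data at all
const belowCriteriaCircles = L.layerGroup();  // grey circles: data that misses the criteria

// Hypothetical helper that adds a marker to the appropriate group.
function addLocation(lat: number, lng: number, hasData: boolean, meetsCriteria: boolean): void {
  if (!hasData) {
    // A real atlas might draw an actual square with a custom divIcon;
    // a plain marker stands in for it here.
    L.marker([lat, lng]).addTo(noDataSquares);
  } else if (!meetsCriteria) {
    L.circleMarker([lat, lng], { radius: 5, color: 'grey' }).addTo(belowCriteriaCircles);
  }
}

// A layers control lets each group be switched on and off independently.
L.control.layers(undefined, {
  'No data': noDataSquares,
  'Data below criteria': belowCriteriaCircles,
}).addTo(map);

noDataSquares.addTo(map);
belowCriteriaCircles.addTo(map);
```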

I work from home on Thursdays and I had intended to work on the ‘delete’ facilities that day, but when I came to log into the server the files and the database appeared to have reverted to the state they had been in back in May – i.e. it looked like we had lost almost six months of data, plus all of the updates to the code I’d implemented during this time.  This was obviously rather worrying and I spent a lot of time toing and froing with Arts IT Support to try to figure out what had gone wrong.  This included restoring a backup from the weekend before, which strangely still seemed to reflect the state of things in May.  I was getting very concerned about this when Gary noted that he was seeing two different views of the data on his laptop: in Safari his view of the data appeared to have ‘stuck’ at May, while in Chrome he could see the up-to-date dataset.  I then realised that perhaps the issue wasn’t with the server after all, but that my home PC (and Safari on Gary’s laptop) was connecting to the wrong server.

Arts IT Support’s Raymond Brasas suggested it might be an issue with my ‘hosts’ file, and that’s when I realised what had happened.  As the SCOSYA domain is an ‘ac.uk’ domain and it takes a while for these domains to be set up, we had set up the server long before the domain was running, so to allow me to access the server I had added a line to the ‘hosts’ file on my PC to override what happens when the SCOSYA URL is requested: instead of being resolved by a domain name service, my PC pointed straight at the IP address of the server as I had entered it in my ‘hosts’ file.  In May the SCOSYA site was moved to a new server with a new IP address, but the old server had never been switched off, so my home PC was still connecting to this old server.  I had only encountered the issue this week because I hadn’t worked on SCOSYA from home since May.  So, it turned out there was no problem with the server, or the SCOSYA data.  I removed the line from my ‘hosts’ file, restarted my browser and immediately I could access the up-to-date site.  All this took several hours of worry and stress, but it was quite a relief to actually figure out what the issue was and to be able to sort it.
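
For anyone unfamiliar with the mechanism, the ‘hosts’ file entry I’d added looked something like the line below (the IP address and hostname here are placeholders rather than the real values):

```
# Placeholder example of a hosts file entry; not the real SCOSYA hostname or IP.
# Any request for this hostname bypasses DNS and goes straight to the listed IP,
# which is why my PC kept talking to the old server after the site had moved.
192.0.2.10    scosya.example.ac.uk
```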

I had intended to start setting up the server for the SPADE project this week, but the machine has not yet been delivered, so I couldn’t work on this.  I did make a few further tweaks to the SPADE website, however, and responded to a couple of queries from Rachel about the SCOTS data and metadata, which the project will be using.

I also met with Fraser to discuss the ongoing issue of linking up the HT and OED data.  We’re at the stage now where we can think about linking up the actual words within categories.  I’d previously written a script that goes through each HT category that matches an OED category and compares the words in each, checking whether an HT word matches the text found in either the OED ‘ght_lemma’ or ‘lemma’ fields.  After our meeting I updated the HT lexeme table to include extra fields for the ID of a matching OED lexeme and whether the lexeme had been checked.  After that I updated the script to go through every matching category in order to ‘tick off’ the matching words within.  The first time I ran my script it crashed the browser, but with a bit of tweaking I got it to complete successfully the second time.  Here are some stats:
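
As a rough illustration of the matching pass, the logic is essentially as follows.  This runs over in-memory structures rather than the real HT and OED tables (whose layout I’m not reproducing here); the ‘ght_lemma’ and ‘lemma’ field names come from the OED data, while everything else is an illustrative name of my own:

```typescript
// Sketch of the word-matching pass described above.
interface HtLexeme { id: number; word: string; oedLexemeId: number | null; checked: boolean; }
interface OedLexeme { id: number; ght_lemma: string; lemma: string; }

// For one pair of matched categories, tick off HT words that match an OED
// lexeme on either of its two lemma fields.
function matchCategory(htLexemes: HtLexeme[], oedLexemes: OedLexeme[]): void {
  for (const ht of htLexemes) {
    const match = oedLexemes.find(
      oed => oed.ght_lemma === ht.word || oed.lemma === ht.word
    );
    if (match) {
      ht.oedLexemeId = match.id;   // record the ID of the matching OED lexeme
      ht.checked = true;           // and flag the HT lexeme as checked
    }
  }
}
```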

There are 655513 HT lexemes that are now matched up with an OED lexeme.  There are 47074 HT lexemes that only have OE forms, so with 793733 HT lexemes in total this means there are 91146 HT lexemes that should have an OED match but don’t.  Note, however, that we still have 12373 HT categories that don’t match OED categories and these categories contain a total of 25772 lexemes.

On the OED side of things, we have a total of 688817 lexemes, and of these 655513 now match an HT lexeme, meaning there are 33304 OED lexemes that don’t match anything.  At least some of these will also be cleared up by future HT / OED category matches.  Of the 655513 OED lexemes that now match, 243521 are ‘revised’.  There are 262453 ‘revised’ OED lexemes in total, meaning there are 18932 ‘revised’ lexemes that don’t currently match an HT lexeme.  I think this is all pretty encouraging, as it looks like my script has managed to match up the bulk of the data.  It’s just the several thousand edge cases that are going to be a bit more work.

On Wednesday I met with Thomas Widmann of Scots Language Dictionaries to discuss our plans to merge all three of the SLD websites (DSL, SLD and Scuilwab) into one resource that will have the DSL website’s overall look and feel.  We’re going to use WordPress as a CMS for all of the site other than the DSL’s dictionary pages, so as to allow SLD staff to very easily update the content of the site.  It’s going to take a bit of time to migrate things across (e.g. making a new WordPress theme based on the DSL website, creating quick search widgets, updating the DSL dictionary pages to work with the WordPress theme), but we now have the basis of a plan.  I’ll try to get started on this before the year is out.

Finally this week, I responded to a request from Simon Taylor to make a few updates to the REELS system, and I replied to Thomas Clancy about how we might use existing Ordnance Survey data in the Scottish Place-Names survey.  All in all it has been a very busy week.