Week Beginning 24th September 2018

Having left Rob Maslen’s Fantasy blog in a somewhat unfinished state last Friday due to server access issues, I jumped straight into completing this work on Monday morning.  Thankfully I could access the server again and after spending an hour or so tweaking header images, choosing colour schemes and fonts, reinstating widgets, menus and such things I managed to get the site working again with a fully responsive theme: http://fantasy.glasgow.ac.uk/.  I also updated some content on the Burns Paper Database website for Ronnie Young, completed my PDR, responded to a query about TheGlasgowStory and met with Matt Barr in Computing Science to discuss some possible future developments.  I also made some further tweaks to the Seeing Speech and Dynamic Dialect website upgrades that are still ongoing.  Eleanor had created new versions of some of the videos, so I uploaded them, and also updated all of the images in the image carousels for both sites.

I spent a fair amount of time this week updating the maps on the ‘Saints in Scottish Place-Names’ website.  As mentioned in a previous post, the maps on this site all use Google maps, and Google now blocks access to their maps API unless you connect via an account that has a credit card associated with it.  This is not very good for legacy research projects such as this one, so the plan was that I’d migrate the maps from Google to the free and open source Leaflet.js mapping library.  Another advantage of Leaflet is that the scripts are all stored on the same server as the rest of the resource – we’re no longer reliant on a third-party server so there should be less risk of the maps becoming unavailable in future.  Of course the map layers themselves are all stored on other third-party servers, but the ones I’ve chosen (based on the ones I selected for the REELS project) are all free to use, and another benefit of Leaflet is that it’s very simple to switch out one map layer for another – so if one tileset becomes unavailable I can replace it very quickly with another.
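To give a sense of how simple the switch is, here’s a minimal sketch of setting up a Leaflet map and swapping its tile layer (assuming Leaflet is loaded and a div with id ‘map’ exists; the tile URL is just an example of a free tileset rather than necessarily one used on the site):

```javascript
// Minimal sketch: initialise a Leaflet map with a free tileset as the base layer.
// The URL below is an illustrative example, not necessarily the live site's tileset.
var map = L.map('map').setView([56.5, -4.0], 7); // roughly centred on Scotland

var topo = L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    attribution: '&copy; OpenStreetMap contributors'
}).addTo(map);

// If this tileset ever became unavailable, swapping it out is a two-line change:
// map.removeLayer(topo);
// L.tileLayer('https://tiles.example.org/{z}/{x}/{y}.png').addTo(map);
```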

I created a new Leaflet-powered version of the website in a subdirectory so I could test things out without messing up the live site.  As far as I could tell there were four pages that featured maps, each using them in different ways.  I migrated all of them over to the Leaflet mapping library and incorporated base maps and other features from the REELS and KCB map interface (there’s a rough code sketch after the list showing how these controls fit together), namely:

  1. A map ‘display options’ button in the top left of the map that opens a panel through which you can change the base map.
  2. A choice of 6 base maps, as with REELS and KCB:
    1. A default topographical map
    2. A satellite map
    3. A satellite map with things like roads, rivers and settlements marked on it
    4. A modern OS map
    5. A historical OS map from 1840-1888
    6. A historical OS map from 1920-1933
  3. An ‘Attribution and copyright’ popup linked to from the bottom right of the map, which I adapted from REELS.
  4. A ‘full screen’ button in the bottom right of the map that allows you to view any map full screen. I’ve removed the ‘view larger map’ option on the Saints page as I didn’t think this was really necessary when the ‘full screen’ option is available anyway.
  5. A map scale (metric and imperial) appears in the bottom left of the map.
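
Here’s a rough sketch of how most of the above is wired up (the tile URLs and layer names are illustrative placeholders rather than the exact tilesets used, and the ‘display options’ panel is approximated here by Leaflet’s stock layers control):

```javascript
// Rough sketch of the control set-up described above (adapted from the REELS approach).
// Assumes Leaflet is loaded and a <div id="map"> exists; tile URLs and layer names are
// illustrative rather than the exact tilesets used on the live site.
var defaultTopo = L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    attribution: '&copy; OpenStreetMap contributors'
});
var satellite = L.tileLayer('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}', {
    attribution: 'Tiles &copy; Esri'
});

var map = L.map('map', { layers: [defaultTopo] }).setView([56.5, -4.0], 7);

// The 'display options' panel is approximated here by Leaflet's built-in layers control;
// the real interface uses a custom panel, but the idea is the same.
L.control.layers({
    'Default topographical': defaultTopo,
    'Satellite': satellite
    // ...plus the satellite-with-labels, modern OS and the two historical OS layers
}, null, { position: 'topleft' }).addTo(map);

// Metric and imperial scale in the bottom left
L.control.scale({ position: 'bottomleft', metric: true, imperial: true }).addTo(map);

// The full-screen button and the attribution/copyright popup come from additional
// code/plugins and aren't shown here.
```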


Here’s some information about the four map types that I updated:


  1. Place map

This is the simplest map and displays a marker showing the location of the place.  Hover over the marker to view the place-name.
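In Leaflet terms this amounts to little more than the following (the coordinates and the place-name below are placeholders, not real data):

```javascript
// Sketch of the place map: a single marker whose tooltip shows the place-name on hover.
var placeName = 'Example place-name';
var placeLatLng = [56.5, -4.0];

var placeMap = L.map('place-map').setView(placeLatLng, 13);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    attribution: '&copy; OpenStreetMap contributors'
}).addTo(placeMap);

L.marker(placeLatLng).addTo(placeMap).bindTooltip(placeName); // tooltip appears on hover
```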

  2. Saint map

This map colour codes the markers based on ‘certainty’.  I used the same coloured markers as found on the original map.  I also added a map legend to the top right that shows you what the colours represent.  You can turn any of the layers on or off to make it easier to see the markers you’re interested in (e.g. hide all ‘certain’ markers).  I removed the legend section that appeared underneath the original map as this is no longer needed due to the in-map version.
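The grouping and the in-map legend are handled with layer groups and a layers control, roughly as in this sketch (the colours, certainty labels and data shape are illustrative, not the actual site data):

```javascript
// Sketch of the saint map: markers grouped by 'certainty', each group toggleable
// via an in-map legend. Colours, labels and sample data are illustrative only.
var saintMap = L.map('saint-map').setView([56.5, -4.0], 7);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(saintMap);

var certaintyColours = { certain: 'green', probable: 'orange', possible: 'red' };
var certaintyLayers = {};

// Placeholder data standing in for the real place-name records
var places = [
    { lat: 56.1, lng: -3.9, name: 'Example place A', certainty: 'certain' },
    { lat: 55.9, lng: -4.3, name: 'Example place B', certainty: 'possible' }
];

places.forEach(function (p) {
    if (!certaintyLayers[p.certainty]) {
        certaintyLayers[p.certainty] = L.layerGroup().addTo(saintMap);
    }
    L.circleMarker([p.lat, p.lng], {
        color: certaintyColours[p.certainty],
        fillColor: certaintyColours[p.certainty],
        fillOpacity: 0.8,
        radius: 7
    }).bindTooltip(p.name).addTo(certaintyLayers[p.certainty]);
});

// The layers control doubles as the legend, letting you hide e.g. all 'certain' markers
L.control.layers(null, certaintyLayers, { position: 'topright', collapsed: false }).addTo(saintMap);
```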

  3. Search map

As with the original version, when you zoom in on an area any place-names found in the vicinity appear as red dots.  I updated the functionality slightly so that as you pan round the map at one zoom level new markers continue to load (with the previous version you had to change the zoom level to initiate the loading of new markers).  Now as you pan around new red spots appear all over the place like measles.
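The trick is simply to listen for Leaflet’s ‘moveend’ event, which fires after panning as well as zooming, and to request any places within the current map bounds. Here’s a rough sketch (the endpoint name and response format are assumptions for illustration, not the actual server-side script):

```javascript
// Sketch of loading markers as you pan: on every 'moveend' the current bounds are
// sent to the server and any new places come back as red dots.
var searchMap = L.map('search-map').setView([56.5, -4.0], 9);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(searchMap);

var loaded = {}; // keep track of markers we've already added

function loadMarkersInView() {
    var bbox = searchMap.getBounds().toBBoxString(); // "west,south,east,north"
    // 'placenames.php' and the {id, lat, lng, name} response shape are assumptions
    fetch('placenames.php?bbox=' + encodeURIComponent(bbox))
        .then(function (response) { return response.json(); })
        .then(function (places) {
            places.forEach(function (p) {
                if (loaded[p.id]) { return; } // skip markers already on the map
                loaded[p.id] = true;
                L.circleMarker([p.lat, p.lng], { color: 'red', radius: 5 })
                    .bindTooltip(p.name).addTo(searchMap);
            });
        });
}

searchMap.on('moveend', loadMarkersInView); // fires after panning as well as zooming
loadMarkersInView(); // initial load
```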

  4. Search results map

I couldn’t get the original version of this map to work at all, so I think there must have been some problem with it in addition to the Google Maps issue.  Anyway, the new version displays the search results on a map, and if the search included a saint then the results are categorised by ‘certainty’ as with the saint map.  You can turn certainty levels on or off.  You can also open the marker pop-ups to link through to the place-name record and the saint record too.
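The pop-ups themselves are just small chunks of HTML bound to each marker, along these lines (the URL patterns, field names and sample data are purely illustrative):

```javascript
// Sketch of a search-results marker whose popup links through to the place-name
// and saint records. All names, IDs and URLs below are placeholders.
var resultsMap = L.map('results-map').setView([56.5, -4.0], 7);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(resultsMap);

var result = { lat: 56.1, lng: -3.9, name: 'Example place', placeId: 123, saintId: 45 };

L.circleMarker([result.lat, result.lng], { radius: 7 })
    .bindPopup(
        '<strong>' + result.name + '</strong><br>' +
        '<a href="place.php?id=' + result.placeId + '">Place-name record</a><br>' +
        '<a href="saint.php?id=' + result.saintId + '">Saint record</a>'
    )
    .addTo(resultsMap);
```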

There will no doubt be a few further tweaks required before I replace the live site with the new version I’ve been working on, but I reckon the bulk of the work is now done.

I also continued with the Bilingual Thesaurus project, although I didn’t have as much time as I had hoped to work on this.  However, I updated the ‘language of origin’ data for the 1,829 headwords that had no language of origin, assigning ‘uncertain’ to all of them.  I also noticed that 15 headwords have no ‘date of citation’ and I asked Louise whether this was ok.  I also updated the way I’m storing dates.  Previously I had set up a separate table where any number of date fields could be associated with a headword.  Instead I have now added two new columns to the main ‘lexeme’ table: startdate and enddate.  I then wrote a script that went through the originally supplied dates (e.g. [1230,1450]), adding the first date to the startdate column and the second date to the enddate column.  Where an enddate is not supplied or is ‘0’ I’ve added the startdate to this column, just to make it clearer that this is a single year.  Louise had mentioned that some dates would have ‘1450+’ as a second date but I’ve checked the original JSON file I was given and no dates have the plus sign, so I’ve checked with her in case this data has somehow been lost.  I also discovered that there are 16 headwords that have an enddate but no startdate (e.g. the date in the original JSON file is something like [0,1436]) and have asked what should happen to these.  Finally, I made a start on the front-end for the resource.  There is very little in place yet, but I’ve started to create a ‘Bootstrap’ based interface using elements from the other thesaurus websites (e.g. logo, fonts).  Once a basic structure is in place I’ll get the required search and browse facilities up and running and we can then think about things such as colour schemes and site text.
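
For what it’s worth, the date-splitting logic boils down to something like this (a minimal sketch; the function name and data shape are mine rather than the actual script, which works directly against the database):

```javascript
// Sketch of the date-splitting logic, assuming each headword's original dates arrive
// as a two-element array such as [1230, 1450] or [1230, 0].
function splitCitationDates(dates) {
    var startdate = dates[0];
    var enddate = dates[1];
    // Where no end date was supplied (or it is 0), copy the start date across so a
    // single year is stored explicitly as startdate === enddate.
    if (!enddate) {
        enddate = startdate;
    }
    return { startdate: startdate, enddate: enddate };
}

console.log(splitCitationDates([1230, 1450])); // { startdate: 1230, enddate: 1450 }
console.log(splitCitationDates([1230, 0]));    // { startdate: 1230, enddate: 1230 }
// Cases like [0, 1436] (an end date but no start date) are the 16 headwords still
// awaiting a decision and aren't handled here.
```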

I spent the rest of the week on the somewhat Sisyphean task of matching up the HT and OED category data.  This is a task that Fraser and I have been working on, on and off, for over a year now, and it seemed like the end was in sight, as we were down to just a few thousand OED categories that were unmatched.  However, last week I noticed some errors in the category matching, and some further occasions where an OED category has been connected to multiple HT categories.  Marc, Fraser and I met on Monday to discuss the process and Marc suggested we start the matching process from scratch again.  I was rather taken aback by this as there appeared to be only a few thousand erroneous matches out of more than 220,000 and it seemed like a shame to abandon all our previous work.  However, I’ve since realised this is something that needed to be done, mainly because the previous process wasn’t very well documented and could not be easily replicated.  It’s a process that Fraser and I could only focus on between other commitments, and progress was generally tracked via email conversations and a few Word documents.  It was all very experimental, and we often ran a script that matched a group of categories, then altered the script and ran it again, often several times in succession.  We also approached the matching from what I now realise is the wrong angle – starting with the HT categories and trying to match these to the OED categories.  However, it’s the OED categories that need to be matched and it doesn’t really matter if HT categories are left unmatched (as plenty will be, since they are more recent additions or are empty place-holder categories).  We’ve also learned a lot from the initial process and have identified certain scripts and processes that we know are the most likely to result in matches.

It was a bit of a wrench, but we have now abandoned our first stab at category matching and are starting over again.  Of course, I haven’t deleted the previous matches so no data has been lost.  Instead I’ve created new ‘v2’ matching fields and I’m being much more rigorous in documenting the processes that we’re putting the data through and ensuring every script is retained exactly as it was when it performed a specific task rather than tweaking and reusing scripts.

I then ran an initial matching script that looked for identical matches – where the maincat, subcat, part of speech and ‘stripped’ heading were all identical.  This matched 202,030 OED categories, leaving just 27,295 unmatched.  However, it is possible that not all of these 202,030 matches are actually correct.  This is because quite often a category heading is reused – e.g. there are lots of subcats that have the heading ‘pertaining to’ – so it’s possible that a category might look identical but in actual fact be something completely different.  To check for this I ran a script that identified matches where the combination of the stripped heading and the part of speech appears in more than one category.  There are 166,096 matched categories where this happens.  For these the script then compares the total number of words and the last word in each match to see whether the match looks valid.  There were 12,640 where the number of words or the last word were not the same, and I created a further script that then checked whether these had identical parent category headings.  This identified 2,414 that didn’t.  These will need further checking.
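
The ‘looks valid’ test is essentially a comparison of the contents of the two matched categories.  Here’s a rough sketch of the idea (the function name and data shape are illustrative; the real script works directly against the database):

```javascript
// Rough sketch of the check applied where a stripped heading + part of speech is not
// unique: compare how many words each matched category contains and what its last
// word is. The { words: [...] } shape is a stand-in for the real database records.
function matchLooksValid(htCategory, oedCategory) {
    var sameCount = htCategory.words.length === oedCategory.words.length;
    var sameLastWord = htCategory.words[htCategory.words.length - 1] ===
                       oedCategory.words[oedCategory.words.length - 1];
    return sameCount && sameLastWord;
}

console.log(matchLooksValid({ words: ['alpha', 'beta'] }, { words: ['alpha', 'beta'] }));          // true
console.log(matchLooksValid({ words: ['alpha', 'beta'] }, { words: ['alpha', 'beta', 'gamma'] })); // false
```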

I also noticed that a small number of HT categories had a parent whose combination of ‘oedmaincat’, ‘subcat’ and ‘pos’ information was not unique.  This is an error, so I created a further script to list all such categories.  Thankfully there are only 98 and Fraser is going to look at these.  I also created a new stats page for our ‘V2’ matching process, and I will hopefully continue to make good progress with the matching next week.