Week Beginning 4th March 2019

I spent about half of this week working on the SCOSYA project. On Monday I met with Jennifer and E to discuss a new aspect of the project that will be aimed primarily at school children. I can’t say much about it yet as we’re still just getting some ideas together, but it will allow users to submit their own questionnaire responses and see the results. I also started working with the location data that the project’s researchers had completed mapping out. As mentioned in previous posts, I had initially created Voronoi diagrams that extrapolate our point-based questionnaire data to geographic areas. The problem with this approach was that the areas were generated purely on the basis of the position of the points and did not take into consideration things like the varying coastline of Scotland, or the fact that the area for a location on one side of a body of water (e.g. the Firth of Forth) should not really extend to the other side, giving the impression that a feature is exhibited in places where it quite clearly isn’t. Having the areas extend over water also made it difficult to see the outline of Scotland and to get an impression of which cell corresponded to which questionnaire location. So instead of this purely computational approach to generating geographical areas we decided to create them manually, using the Voronoi areas as a starting point but tweaking them to take geographical features into consideration. I’d generated the Voronoi cells as GeoJSON files and the researchers then used the very useful online tool https://geoman.io/studio to import the shapes and tweak them, saving them in multiple files as their large size caused some issues with browsers.
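
For illustration, here’s a rough sketch of how Voronoi cells can be generated from point data and saved out as GeoJSON. It uses Turf.js and made-up coordinates purely as an example – it isn’t the project’s actual code, and not necessarily the library that was actually used.

```javascript
// Rough illustration only: generate Voronoi cells from questionnaire
// locations and save them as GeoJSON. Turf.js and the coordinates here
// are examples rather than the project's actual setup.
const fs = require('fs');
const turf = require('@turf/turf');

// A few made-up questionnaire locations as [longitude, latitude] pairs.
const locations = [
  [-4.2518, 55.8642],
  [-3.1883, 55.9533],
  [-2.0943, 57.1497]
];

// Wrap the points in a FeatureCollection and generate Voronoi polygons,
// clipped to a bounding box roughly covering Scotland.
const points = turf.featureCollection(locations.map(c => turf.point(c)));
const cells = turf.voronoi(points, { bbox: [-8.0, 54.6, -0.7, 61.0] });

// Write the cells out as a GeoJSON file ready for importing into GeoMan Studio.
fs.writeFileSync('voronoi-cells.geojson', JSON.stringify(cells));
```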

Upon receiving these files I then had to extract the data for each individual shape and work out which of our questionnaire locations the shape corresponded to, before adding the data to the database.  Although GeoJSON allows you to incorporate any data you like, in addition to the latitude / longitude pairings, I was not able to incorporate location names and IDs into the GeoJSON file I generated using the Voronoi library (it just didn’t work – see an earlier post for more information), meaning this ‘which shape corresponds to which location’ process needed to be done manually.  This involved grabbing the data for an individual location from the GeoJSON files, saving this and importing it into the GeoMan website, comparing the shape to my initial Voronoi map to find the questionnaire location contained within the area, adding this information to the GeoJSON and then uploading it to the database.  There were 147 areas to do, and the process took slightly over a day to complete.
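
To give an idea of what the end result of the tagging step looks like, here’s a small sketch – the property names and file names are invented for this example rather than being the actual ones used:

```javascript
// Illustration only: attach a questionnaire location's ID and name to an
// area shape exported from GeoMan, ready for uploading to the database.
// The property and file names are made up for this example.
const fs = require('fs');

const feature = JSON.parse(fs.readFileSync('area-falkirk.geojson', 'utf8'));
feature.properties = feature.properties || {};
feature.properties.locationID = 42;          // example questionnaire location ID
feature.properties.locationName = 'Falkirk'; // example location name

fs.writeFileSync('area-falkirk-tagged.geojson', JSON.stringify(feature));
```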

With all of the area data associated with questionnaire locations in the database I could then begin to work on an updated ‘storymap’ interface that would use this data.  I’m basing this new interface on Leaflet’s choropleth example (https://leafletjs.com/examples/choropleth/), which is a really nice interface and is very similar to what we require.  My initial task was to get the data out of the database and format it in such a way that it could appear on the map.  This involved updating the SCOSYA API to incorporate the GeoJSON output for each location, which turned out to be slightly tricky, as my API automatically converts the data exported from the database (e.g. arrays and such things) into JSON using PHP’s json_encode function.  However, applying this to data that is already encoded as JSON (i.e. the new GeoJSON data) results in that data being treated as a string rather than as a JSON object, so the output was garbled.  Instead I had to ensure that the json_encode function was applied to every bit of data except the GeoJSON data, and once I’d done this the API output the GeoJSON in a form that any JavaScript could work with.
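
Just to illustrate the problem, here’s the same issue expressed in JavaScript rather than PHP: serialising a value that is already a JSON string encodes it a second time, whereas decoding it first means the whole structure only gets encoded once.

```javascript
// Illustration of the double-encoding problem, in JavaScript rather than PHP.
const geojsonString = '{"type":"Polygon","coordinates":[[[-4.0,56.0],[-3.5,56.2],[-3.8,56.4],[-4.0,56.0]]]}';

// Garbled: the GeoJSON ends up as one long escaped string, "{\"type\":\"Polygon\", ..."
const garbled = JSON.stringify({ location: 'Falkirk', geojson: geojsonString });

// Fine: decode the stored string first so everything is encoded once,
// leaving the GeoJSON as a proper nested object in the output.
const clean = JSON.stringify({ location: 'Falkirk', geojson: JSON.parse(geojsonString) });
```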

I then produced a ‘proof of concept’ that simply grabbed the location data, pulled all the GeoJSON for each location together and processed it via Leaflet to produce area overlays, as you can see in the following screenshot:

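In essence the proof of concept boils down to something like the following – the endpoint name and response structure here are simplified examples rather than the actual SCOSYA API:

```javascript
// Simplified sketch of the proof of concept: fetch the locations, pull their
// GeoJSON together and hand it to Leaflet as a single layer of area overlays.
// The endpoint and field names are examples, not the real API.
const map = L.map('map').setView([56.5, -4.2], 7);

L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
  attribution: '&copy; OpenStreetMap contributors'
}).addTo(map);

fetch('/api/locations')
  .then(response => response.json())
  .then(locations => {
    // Assumes each location's 'geojson' field holds a GeoJSON Feature.
    const areas = {
      type: 'FeatureCollection',
      features: locations.map(location => location.geojson)
    };
    L.geoJSON(areas).addTo(map);
  });
```
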
With this in place I then began looking at how to incorporate our intended ‘story’ interface with the choropleth map – namely working with a number of ‘slides’ that a user can navigate between, with a different dataset potentially being loaded and displayed on each slide, and a different map position and zoom level set for each.  This is actually proving to be quite a complicated task, as much of the code I’d written for my previous Voronoi version of the storymap used older, obsolete libraries.  Thankfully with the new approach I’m able to use the latest version of Leaflet, meaning features like the ‘full screen’ option and smoother panning and zooming will work.
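
As a rough outline, the slide handling looks something like this – the structure, names and URLs are simplified for illustration and assume the ‘map’ object from the sketch above:

```javascript
// Simplified outline of the slide-based navigation: each slide has its own
// dataset, map centre and zoom level. Names and URLs are illustrative only.
const slides = [
  { dataUrl: '/api/attribute/1', centre: [56.5, -4.2], zoom: 7 },
  { dataUrl: '/api/attribute/2', centre: [55.95, -3.19], zoom: 9 }
];

let areaLayer = null;

function showSlide(index) {
  const slide = slides[index];

  // Smooth pan and zoom to the slide's view, using current Leaflet.
  map.flyTo(slide.centre, slide.zoom);

  // Load the slide's dataset and swap it in for the previous slide's areas.
  fetch(slide.dataUrl)
    .then(response => response.json())
    .then(data => {
      if (areaLayer) {
        map.removeLayer(areaLayer);
      }
      areaLayer = L.geoJSON(data).addTo(map);
    });
}
```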

By the end of the week I’d managed to get the interface to load in data for each slide and colour code the areas.  I’d also managed to get the slide contents to display – both a ‘big’ version that contains things like video clips and a ‘compact’ version that sits to one side, as you can see in the following screenshot:

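The colour coding follows the same pattern as the Leaflet choropleth tutorial – a style callback that picks a fill colour based on a value in each area’s properties. The property name and colour bands below are illustrative rather than the project’s actual settings:

```javascript
// Illustrative colour coding in the style of Leaflet's choropleth example.
// The 'rating' property and the colour bands are examples only.
function getColour(rating) {
  return rating >= 4 ? '#08519c' :
         rating >= 3 ? '#3182bd' :
         rating >= 2 ? '#6baed6' :
         rating >= 1 ? '#bdd7e7' :
                       '#eff3ff';
}

function styleArea(feature) {
  return {
    fillColor: getColour(feature.properties.rating),
    fillOpacity: 0.7,
    weight: 1,
    color: '#ffffff'
  };
}

// Applied when a slide's data is loaded:
// areaLayer = L.geoJSON(data, { style: styleArea }).addTo(map);
```
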
There is still a lot to do, though.  One area is missing its data, which I need to fix, and the ‘click on an area’ functionality is not yet working.  Locations as map points still need to be added in, the formatting of the areas needs some work, and the pan and zoom functionality isn’t there yet either.  However, I hope to get all of this working next week.

Also this week I had a chat with Gavin Miller about the website for his new Medical Humanities project.  We have been granted the top-level ‘.ac.uk’ domain we’d requested, so we can now make a start on the website itself.  I made some further tweaks to the RNSN data based on feedback, and I spent about a day working on the REELS project, creating a script that outputs all of the data in the format required for printing.  The tool allows you to select one or more parishes, or to leave the selection blank to export data for all parishes.  It then formats this in the same way as the printed place-name surveys, such as the Place-Names of Fife.  The resulting output can be pasted into Word with all formatting retained, which will allow the team to finalise the material for publication.

I spent the rest of the week working on Historical Thesaurus tasks.  I met with Marc and Fraser on Friday, and ahead of this meeting I spent some time starting to look at matching up lexemes in the HT and OED datasets.  This involved adding seven new fields to the HT’s lexeme database to track the connection (which needs up to four fields) and to note the status of the connection (e.g. whether it was a manual or automatic match, and which particular process was applied).  I then ran a script that matched up all lexemes found in matched categories where every HT lexeme matches an OED lexeme (based on the ‘stripped’ word field plus first dates).
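
The matching criterion itself is simple enough to sketch out – the following is purely illustrative (the real scripts run against the database and the field names here are invented):

```javascript
// Illustrative sketch of the matching criterion: an HT lexeme and an OED
// lexeme are paired up on the 'stripped' word form plus the first date.
// Field names are invented for this example.
function matchLexemes(htLexemes, oedLexemes) {
  const oedByKey = new Map();
  for (const oed of oedLexemes) {
    oedByKey.set(oed.stripped + '|' + oed.firstDate, oed);
  }
  return htLexemes.map(ht => ({
    ht: ht,
    oed: oedByKey.get(ht.stripped + '|' + ht.firstDate) || null
  }));
}

// A category only counts as fully matched if every HT lexeme found a partner.
function categoryFullyMatched(pairs) {
  return pairs.every(pair => pair.oed !== null);
}
```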

Whilst doing this I’m afraid I realised I got some stats wrong previously.  When I calculated the percentage of matched lexemes in matched categories and got figures of about 89%, this was actually the number of matched lexemes across all categories (whether they were fully matched or not).  The number of matched lexemes in fully matched categories is unfortunately a lot lower.  For ‘01’ there are 173,677 matched lexemes, for ‘02’ there are 45,943 and for ‘03’ there are 110,087.  This gives a total of 329,707 matched lexemes in categories where every HT word matches an OED word (including categories where there are additional OED words) out of 731,307 non-OE words in the HT, which is about 45% matched.  I ticked these off in the database with check code 1, but they will need further checking, as there are some duplicate matches (where the HT lexeme has been joined to more than one OED lexeme).  Where this happens the last occurrence currently overwrites any earlier occurrence.  Some duplicates are caused by words’ resulting ‘stripped’ forms being the same – e.g. ‘chine’ and ‘to-chine’.
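
One way of flagging these duplicates for checking (again just a sketch with made-up field names) would be to gather all of the candidate matches for each key rather than letting the last one silently win:

```javascript
// Sketch of flagging duplicate matches: where several OED lexemes share the
// same stripped form and first date (e.g. 'chine' and 'to-chine' both
// stripping to 'chine'), collect them all so they can be checked manually
// instead of the last occurrence overwriting the earlier ones.
function findDuplicateMatches(htLexemes, oedLexemes) {
  const candidates = new Map();
  for (const oed of oedLexemes) {
    const key = oed.stripped + '|' + oed.firstDate;
    if (!candidates.has(key)) {
      candidates.set(key, []);
    }
    candidates.get(key).push(oed);
  }
  return htLexemes
    .map(ht => ({ ht: ht, matches: candidates.get(ht.stripped + '|' + ht.firstDate) || [] }))
    .filter(entry => entry.matches.length > 1);
}
```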

When we met on Friday we figured out another big long list of updates and new experiments that I would carry out over the next few weeks, but Marc spotted a bit of a flaw in the way we are linking up HT and OED lexemes.  In order to ensure the correct OED lexeme is uniquely identified we rely on the OED’s category ID field.  However, this is likely to be volatile: during future revisions some words will be moved between categories.  Therefore we can’t rely on the category ID field as a means of uniquely identifying an OED lexeme.  This will be a major problem when dealing with future updates from the OED and we will need to try and find a solution – for example updating the OED data structure so that the current category ID is retained in a static field.  This will need further investigation next week.
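
One possible shape for that fix (with entirely hypothetical field names) would be to keep both a live category ID and a snapshot taken at the point the link was made:

```javascript
// Hypothetical illustration of the proposed fix: store a static snapshot of
// the OED category ID at the moment the HT-OED link is made, alongside the
// live ID that future OED revisions may change. Field names are invented.
const linkedLexeme = {
  htLexemeID: 12345,          // example HT lexeme identifier
  oedCategoryIDAtLink: 5555,  // frozen when the match was made; never updated
  oedCategoryIDCurrent: 5555, // may change if the OED moves the word
  checkCode: 1                // status of the match (manual/automatic etc.)
};
```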