My big task of the week was to return to working on the Speak For Yersel project after a couple of weeks when my services hadn’t been required. I had a meeting with PI Jennifer Smith and RA Mary Robinson on Monday where we discussed the current status of the project and the tasks I should focus on next. Mary had finished work on the geographical areas we are going to use. These are based on postcode areas, but a number of areas have been amalgamated. We’ll use these to register where a participant is from and also to generate a map marker representing their responses at a random location within their selected area, based on the research I did a few weeks ago into randomly positioning a marker within a polygon.
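One common way to do that random positioning is rejection sampling: draw uniform points in the polygon’s bounding box and keep the first one that falls inside. turf.js provides helpers for the point-in-polygon test, but the idea can be sketched without any dependencies (the function names here are my own):

```javascript
// Test whether [lng, lat] lies inside a polygon ring using ray casting:
// count how many edges a horizontal ray from the point crosses.
function pointInRing(point, ring) {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    if (((yi > y) !== (yj > y)) &&
        (x < (xj - xi) * (y - yi) / (yj - yi) + xi)) {
      inside = !inside;
    }
  }
  return inside;
}

// Rejection sampling: draw uniform points in the bounding box
// until one lands inside the polygon.
function randomPointInPolygon(ring) {
  const xs = ring.map(p => p[0]), ys = ring.map(p => p[1]);
  const minX = Math.min(...xs), maxX = Math.max(...xs);
  const minY = Math.min(...ys), maxY = Math.max(...ys);
  while (true) {
    const candidate = [
      minX + Math.random() * (maxX - minX),
      minY + Math.random() * (maxY - minY)
    ];
    if (pointInRing(candidate, ring)) return candidate;
  }
}
```

For convex-ish areas this terminates quickly; for very sprawling polygons the rejection rate rises, which is another reason the size and shape of the area polygons matters.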
The original files that Mary sent me were two exports from ArcGIS, one as JSON and one as GeoJSON. Unfortunately both files used a coordinate system other than latitude and longitude. The GeoJSON file didn’t include any identifiers for the areas so couldn’t really be used, and while the JSON file looked promising, when I tried to use it in Leaflet it gave me an ‘invalid GeoJSON object’ error. Mary then sent me the original ArcGIS file to work with and I spent some time in ArcGIS figuring out how to export the shapefile data as GeoJSON with latitude and longitude.
Using ArcGIS I exported the data by typing ‘to json’ in the ‘Geoprocessing’ pane on the right of the map then selecting ‘Features to JSON’. I selected ‘output to GeoJSON’ and also checked ‘Project to WGS_1984’, which converts the ArcGIS coordinates to latitude and longitude. When not using the ‘formatted JSON’ option (which adds in line breaks and tabs) this gave me a file size of 115MB. As a starting point I created a Leaflet map that uses this GeoJSON file, but I ran into a bit of a problem: the data takes a long time to load into the map – about 30-60 seconds for me – and the map feels sluggish to navigate around even after it’s loaded in. And this is without there being any actual data on it. The map is going to be used by school children, potentially on low-spec mobile devices connecting to slow internet services (or even worse, mobile data that they may have to pay for per MB), so we may have to think about whether using these areas is going to be feasible. An option might be to reduce the detail in the polygons, which would reduce the size of the JSON file. The boundaries in the current file are extremely detailed, and each twist and turn in a polygon requires a latitude / longitude pair in the data, and there are a lot of twists and turns. The polygons we used in SCOSYA are much more simplified (see for example https://scotssyntaxatlas.ac.uk/atlas/?j=y#9.75/57.6107/-7.1367/d3/all/areas) but would still suit our needs well enough. However, manually simplifying each and every polygon would be a monumental and tedious task. Perhaps there’s a method in ArcGIS that could do this for us: there’s a tool called ‘Simplify Polygon’ (https://desktop.arcgis.com/en/arcmap/latest/tools/cartography-toolbox/simplify-polygon.htm) which might work.
I spoke to Mary about this and she agreed to experiment with the tool. Whilst she worked on this I continued to work with the data. I extracted all of the 411 areas and stored these in a database, together with all 954 postcode components that are related to these areas. This will allow us to generate a drop-down list of options as the user types – e.g. type in ‘G43’ and options ‘G43 2’ and ‘G43 3’ will appear, and both of these are associated with ‘Glasgow South’.
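The drop-down itself is just prefix matching over the stored components. A minimal sketch with hypothetical rows (the real table holds all 954 components mapped to the 411 areas):

```javascript
// Hypothetical sample of the postcode-component-to-area lookup table;
// the real database holds 954 components across 411 areas.
const components = [
  { component: 'G43 2', area: 'Glasgow South' },
  { component: 'G43 3', area: 'Glasgow South' },
  { component: 'G12 8', area: 'Glasgow West' }
];

// Return the components matching a typed prefix, for the drop-down list.
function suggest(prefix, rows) {
  const p = prefix.trim().toUpperCase();
  return rows.filter(r => r.component.startsWith(p));
}
```

Typing ‘G43’ against the sample rows above returns the two ‘Glasgow South’ components; in practice the query would run server-side against the database as the user types.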
I also wrote a script to generate sample data for each of the 411 areas using the ‘turf.js’ library I’d previously used. For each of the 411 areas a random number of markers (between 0 and 100) is generated and stored in the database, each with a random rating of between 1 and 4. This resulted in 19,946 sample ratings, which I then added to the map along with the polygonal area data, as you can see here:
Currently these are given the colours red=1, orange=2, light blue=3, dark blue=4, purely for test purposes. As you can see, including almost 20,000 markers swamps the map when it’s zoomed out, but when you zoom in things look better. I also realised that we might not even need to display the area boundaries to users. They can be used in the background to work out where a marker should be positioned (as is the case with the map above), but perhaps they’re not needed for any other reason. It might be sufficient to include details of the area in a popup or sidebar, and if so we might not need to rework the areas at all.
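As a sketch of how the sample generation and test colouring fit together (the helper names and hex colour values are my own stand-ins; the real script used turf.js for the point placement):

```javascript
// Random integer in [min, max] inclusive.
function randomInt(min, max) {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Generate sample ratings for one area: a random number of markers (0-100),
// each with a random rating of 1-4. `randomPointInArea` stands in for the
// turf.js-based point placement and is passed in as a function.
function sampleRatingsForArea(areaId, randomPointInArea) {
  const ratings = [];
  const n = randomInt(0, 100);
  for (let i = 0; i < n; i++) {
    ratings.push({
      area: areaId,
      rating: randomInt(1, 4),
      position: randomPointInArea(areaId)
    });
  }
  return ratings;
}

// Test colour scheme from above (red=1, orange=2, light blue=3, dark blue=4);
// the exact hex values are assumptions.
const ratingColours = {
  1: '#d32f2f',  // red
  2: '#f57c00',  // orange
  3: '#81d4fa',  // light blue
  4: '#1565c0'   // dark blue
};

function markerColour(rating) {
  return ratingColours[rating] || '#9e9e9e';  // grey flags unexpected values
}
```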
However, whilst I was working on this, Mary created four different versions of the area polygons using four different algorithms. These differ in how they simplify the polygons and therefore result in different boundaries, some missing out details such as lochs and inlets. All four versions were considerably smaller in file size than the original, ranging from 4MB to 20MB. I created new maps for each of the four simplified polygon outputs, and for each of these I generated new random marker data. For algorithms ‘DP’ and ‘VW’ I limited the number of markers to between 0 and 20 per area, giving around 4,000 markers in each map. For ‘WM’ and ‘ZJ’ I limited the number to between 0 and 50 per area, giving around 10,000 markers per map.
All four new maps look pretty decent to me, with even the smaller JSON files (‘DP’ and ‘VW’) retaining a remarkable level of detail. I think the ‘DP’ one might be the one to go for: it’s the smallest (just under 4MB compared to 115MB for the original) yet also seems to have more detail than the others. For example, for the smaller lochs to the east of Loch Ness the original and ‘DP’ include the outlines of four lochs while the other three only include two. ‘DP’ also includes more of the smaller islands around the Outer Hebrides.
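If ‘DP’ stands for Douglas-Peucker (my assumption from the name), the algorithm keeps a vertex only when it deviates from the chord between already-kept endpoints by more than a tolerance, recursing on either side of the worst offender. A minimal planar sketch:

```javascript
// Perpendicular distance from point p to the line through a and b
// (planar approximation, fine for small tolerances).
function perpDist(p, a, b) {
  const dx = b[0] - a[0], dy = b[1] - a[1];
  const len = Math.hypot(dx, dy);
  if (len === 0) return Math.hypot(p[0] - a[0], p[1] - a[1]);
  return Math.abs(dy * p[0] - dx * p[1] + b[0] * a[1] - b[1] * a[0]) / len;
}

// Douglas-Peucker simplification: drop every vertex that lies within
// `tolerance` of the chord between the retained endpoints.
function simplify(points, tolerance) {
  if (points.length <= 2) return points.slice();
  let maxDist = 0, index = 0;
  const a = points[0], b = points[points.length - 1];
  for (let i = 1; i < points.length - 1; i++) {
    const d = perpDist(points[i], a, b);
    if (d > maxDist) { maxDist = d; index = i; }
  }
  if (maxDist <= tolerance) return [a, b];           // nothing sticks out: keep chord
  const left = simplify(points.slice(0, index + 1), tolerance);
  const right = simplify(points.slice(index), tolerance);
  return left.slice(0, -1).concat(right);            // merge, dropping the shared vertex
}
```

The tolerance is the knob that trades file size against coastline detail, which is presumably what distinguished the four outputs Mary produced.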
We decided that we don’t need to display the postcode areas on the map to users but instead we’ll just use these to position the map markers. However, we decided that we do want to display the local authority area so people have a general idea of where the markers are positioned. My next task was to add these in. I downloaded the administrative boundaries for Scotland from here: https://raw.githubusercontent.com/martinjc/UK-GeoJSON/master/json/administrative/sco/lad.json as referenced on this website: https://martinjc.github.io/UK-GeoJSON/ and added them into my ‘DP’ sample map, giving the boundaries a dashed light green that turns a darker green when you hover over the area, as you can see from the screenshot below:
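The dashed-to-solid hover effect is straightforward to wire up in Leaflet. A sketch of the approach (the hex colour values are stand-ins, and `map` is assumed to be the existing Leaflet map instance):

```javascript
// Style for the local authority boundaries: dashed light green normally,
// solid darker green on hover. Kept as a pure function so it can be
// tested on its own; the hex values are assumptions.
function boundaryStyle(hovered) {
  return {
    color: hovered ? '#2e7d32' : '#8bc34a',
    weight: 2,
    dashArray: hovered ? null : '5, 5',
    fill: false
  };
}

// Browser-only wiring: load the boundaries and attach hover handlers.
if (typeof L !== 'undefined') {
  fetch('lad.json')  // the administrative boundaries file
    .then(r => r.json())
    .then(data => {
      L.geoJSON(data, {
        style: () => boundaryStyle(false),
        onEachFeature: (feature, layer) => {
          layer.on('mouseover', () => layer.setStyle(boundaryStyle(true)));
          layer.on('mouseout', () => layer.setStyle(boundaryStyle(false)));
        }
      }).addTo(map);
    });
}
```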
Also this week I added in a missing text to the Anglo-Norman Dictionary’s Textbase. To do this I needed to pass the XML text through several scripts to generate page records and all of the search words and ‘keyword in context’ data for search purposes. I also began to investigate replacing the Innerpeffray data for Books and Borrowing with a new dataset that Kit has worked on. This is going to be quite a large and complicated undertaking and after working through the data I had a set of questions to ask Kit before I proceeded to delete any of the existing data. Unfortunately she is currently on jury duty so I’ll need to wait until she’s available again before I can do anything further. Also this week a huge batch of images became available to us from the NLS and I spent some time downloading these and moving them to an external hard drive as they’d completely filled up the hard drive of my PC.
I also spoke to Fraser about the new radar diagrams I had been working on for the Historical Thesaurus and also about the ‘guess the category’ quiz that we’re hoping to launch soon. Fraser sent on a list of categories and words that we want to exclude from the quiz (anything that might cause offence) but I had some questions about this that will need clarification before I take things further. I’d suggested to Fraser that I could update the radar diagrams to include not only the selected category but also all child categories and he thought this would be worth investigating so I spent some time updating the visualisations.
I was a little worried about the amount of processing that would be required to include child categories but thankfully things seem pretty speedy, even when multiple top-level categories are chosen. See for example the visualisation of everything within ‘Food and drink’, ‘Faith’ and ‘Leisure’:
This brings back many tens of thousands of lexemes but doesn’t take too long to generate. I think including child categories will really help make the visualisations more useful as we’re now visualising data at a scale that’s very difficult to get a grasp on simply by looking at the underlying words. It’s interesting to note in the above visualisation how ‘Leisure’ increases in size dramatically throughout the time periods while ‘Faith’ shrinks in comparison (but still grows overall). With this visualisation the ‘totals’ rather than the ‘percents’ view is much more revealing.