I split most of my time this week between two projects (SCOSYA and the Historical Thesaurus), with a few tasks from other projects thrown in along the way too. For SCOSYA I continued to rework the atlas interface. Last week I had experimented with different marker styles and after meeting with Gary and discussing things a little further I decided I would go ahead and use the DVF plugin (https://github.com/humangeo/leaflet-dvf) to allow us to have different marker styles, and also to make the markers a ‘fixed’ size on the map, meaning they are easier to see when zoomed out. As I mentioned last week, this does mean the markers cover a smaller area when zoomed in, but Gary didn’t think this was a problem, and due to the maximum zoom level I’ve set it’s not like the marker identifies an individual house or anything like that. I replaced the existing circles on the Atlas with Leaflet circleMarkers as opposed to circles, as these are a fixed size rather than covering a fixed geographical area, and I replaced the existing squares with DVF ‘regularPolygonMarkers’ with four sides (i.e. squares). I haven’t implemented any other shapes yet (e.g. triangles, stars) but at least the groundwork is now set to allow such markers to be used.
I met with Gary a couple of times this week and he has been discussing further ways of changing how the Atlas works with Jennifer. It seems like a lot of what they would like to see would be covered by using the ‘consistency data’ query page I previously created as the basis for the Atlas search form. Gary and Jennifer are going to discuss things further and get back to me with some clearer ideas as Gary wasn’t entirely sure exactly how I should proceed at this stage. Another thing they would like to be able to do is to export the data (which is already possible), work with it in Excel, and then re-upload to the system in order to plot it on the Atlas. I thought this was feasible, so long as the export file structure is not altered, and will investigate how to implement such a feature.
Another thing Jennifer and Gary wanted me to investigate was simplifying the look of the base map that we use for the Atlas. As you can see from screenshots in last week’s post, the basemap included quite a lot of detail – topographical features such as mountains, roads, points of interest, national parks etc. Jennifer wanted to get rid of all of this clutter so the map just showed settlements and rivers. Thankfully Mapbox (https://www.mapbox.com/) provides some very handy tools for working with map base layers. I played around with the tools for a while and created three different map designs. The first was rather traditional and plain looking – blue sea, green land, brown roads. The second was more monotone and didn’t have roads marked on (although I did make another version with roads marked in a dark red). The third was a variation on the current atlas, with waves in the sea but with all land plain white. Jennifer wanted me to use the plain blue and green one, but to get rid of the roads, which I did. Replacing the previous basemap with the new one was a very straightforward process and now we have an Atlas with new markers and a new base, as the following screenshot demonstrates.
I also created an export script that goes through the questionnaire ratings and displays all of the duplicate codes – i.e. where the same code such as ‘A01’ appears more than once in a questionnaire. These are all errors that somehow crept in during the filling in of the questionnaire forms and need to be analysed and fixed. There are 592 such duplicates that will need to be looked at and the script exports these as CSV data. I was going to create an import script that would allow a researcher to fix the problems in Excel using the CSV file and the script would then fix the online database, but Gary thinks it would be better for the researcher to just look at the data in the CMS and manually fix things through that.
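The duplicate check described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the actual export script: the tuple layout `(questionnaire_id, code, rating)`, the function names and the sample data are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def find_duplicate_codes(ratings):
    """Return (questionnaire_id, code, occurrences) for every code that
    appears more than once within the same questionnaire."""
    counts = Counter((q_id, code) for q_id, code, _rating in ratings)
    return sorted((q_id, code, n) for (q_id, code), n in counts.items() if n > 1)


def duplicates_to_csv(duplicates):
    """Serialise the duplicate list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["questionnaire_id", "code", "occurrences"])
    writer.writerows(duplicates)
    return buffer.getvalue()


# Hypothetical sample data: code 'A01' appears twice in questionnaire 1.
sample_ratings = [
    (1, "A01", 4),
    (1, "A02", 2),
    (1, "A01", 5),
    (2, "A01", 3),
]

duplicates = find_duplicate_codes(sample_ratings)
print(duplicates_to_csv(duplicates))
```

The same code in two different questionnaires is not flagged, only repeats within one questionnaire.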
For the Historical Thesaurus I met with Fraser to discuss how we were going to proceed to integrate the OED data with the existing HT data. So far we had managed to do the following for categories (we haven’t got to the point yet where we have tackled the words):
- In the HT ‘category’ table a new ‘oedmaincat’ field has been created. This is based on the ‘v1maincat’ field with various steps as outlined in the ‘OED Changes to v1 Codes’ document applied to it.
- In the HT ‘category’ table there is a new ‘oedcatid’ field that contains the OED ‘cid’ for an OED category that matches the HT ‘oedmaincat’ field to the OED ‘path’ field, plus the ‘sub’ and ‘pos’ fields.
- We have 235,249 HT categories and all but 12,544 match an OED category number / sub / pos. Most of the ones that don’t match are OE or empty headings.
- Looking at the data the other way around, we have 235,929 OED categories and all but 6,692 match an HT category. Most of the ones that don’t match are the OED’s top-level headings that don’t have a POS.
- There is also an issue where the category number, sub and POS are the same in the OED and HT data, but the category headings do not match. We created a script that looked specifically at the noun categories where the case insensitive headings don’t match. There are 124,355 noun categories in the HT data that have an OED category with the same number, sub and POS. Of these there are 21,109 where the category heading doesn’t match. In addition there are a further 6,334 HT categories that don’t have a corresponding OED number, sub and POS. Further rules could also be added to reduce the number of headings that do not match – e.g. changing ‘native/inhabitant’ to ‘native or inhabitant’.
- We then updated the script in order to reduce the non-matching category headings based on comparing Levenshtein scores. This gives a numerical score to a word pairing reflecting the number of single-character steps (insertions, deletions or substitutions) it would take to convert one word into another. The script allows you to set a threshold and view the categories that are either above or below the threshold.
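To illustrate the Levenshtein comparison in the last point, here is a minimal sketch in Python. It is not the script we actually used: the function names and the sample heading pairs are made up for the example, and only the ‘native/inhabitant’ pair comes from the data discussed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def split_by_threshold(pairs, threshold):
    """Split (ht_heading, oed_heading) pairs into those scoring at or below
    the threshold and those above it, comparing case-insensitively."""
    within, beyond = [], []
    for ht, oed in pairs:
        score = levenshtein(ht.lower(), oed.lower())
        (within if score <= threshold else beyond).append((ht, oed, score))
    return within, beyond


# Hypothetical heading pairs sharing the same number, sub and POS.
sample_pairs = [
    ("Native/inhabitant", "native or inhabitant"),
    ("The Earth", "the earth"),
    ("Colour", "Color"),
]

within, beyond = split_by_threshold(sample_pairs, threshold=3)
print(within)
print(beyond)
```

With a threshold of 3, ‘native/inhabitant’ still ends up in the non-matching list (its score is 4), which is why extra rewriting rules for such headings would be useful on top of the score.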
When we met, Fraser and I discussed what our next steps would be. We agreed that it would be useful to have a field in the database that noted whether a category had been checked or still needed fixing. I added such a ‘Y/N’ column and then created a script that marked as checked all of the matches we have between the OED and HT category data. A ‘match’ was made where the HT and OED category numbers, sub-category numbers and part of speech matched and the Levenshtein score when comparing the category headings was 3 or less. Out of 235,249 in the HT category table this has updated 200,431 categories. So we have 34,818 HT categories that do not match an OED category (across all parts of speech, including categories that have no value in the ‘oedmaincat’ column) or 32,975 that have something in the ‘oedmaincat’ field but the match isn’t ‘correct’. I then updated the previous scripts I had made so that they only deal with categories that are marked as ‘not checked’ as we don’t need to bother about the checked ones any more.
I then created a script that takes the HT categories that are ‘not checked’ but where the HT and OED path / sub / pos matches (i.e. everything where the headings don’t match). The script then looks for other subcats in the same maincat (for subcats) or other maincats from the maincat’s parent downwards (for maincats) and checks the category names to work out the one with the lowest Levenshtein score. The output lists the OED heading and the closest HT heading (based on Levenshtein score). There are a few times when the script finds a perfect match but most of the time I’m afraid to say the output is not that helpful – at least not in terms of automatically deducing the ‘correct’ match. It might be helpful for manually working out some of the right ones, though. I then split the output into two separate lists for ‘maincat’ and ‘subcat’. Out of 20,435 categories that have the same path / subcat / pos but different headings in the HT and OED data 13,077 are main categories, and it looks like a lot of these are just caused by the OED changing the words used to describe the same category.
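The ‘closest heading’ lookup could look something like the following Python sketch. It only shows the core idea of picking the candidate with the lowest Levenshtein score: gathering the candidate headings from sibling subcats or from the parent maincat downwards is left out, and the function names and sample headings are invented for the example.

```python
def levenshtein(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def closest_heading(oed_heading, ht_candidates):
    """Return (score, heading) for the HT candidate with the lowest
    case-insensitive Levenshtein score against the OED heading.
    Ties are broken alphabetically; an empty candidate list raises ValueError."""
    return min((levenshtein(oed_heading.lower(), heading.lower()), heading)
               for heading in ht_candidates)


# Hypothetical headings, standing in for the other categories under one parent.
candidates = ["townsman", "Inhabitant of town", "village dweller"]
best = closest_heading("inhabitant of a town", candidates)
print(best)
```

A score of 0 corresponds to the ‘perfect match’ case; anything higher still needs a human to decide whether the suggestion is right.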
I then created a script that attempted to change some words in the HT category headings to see if this resulted in a match in the OED data. For example, changing ‘especially’ to ‘esp’. Going through this list resulted in almost 3,000 further matches being found, which was rather nice.
My final task was to create a script that listed all the ‘non-matches’ between the HT and OED categories – i.e. the ones where the HT category numbers don’t match anything in the OED data, the ones where the OED category numbers don’t match anything in the HT data, and the ones where the numbers match up but the headings don’t. After running through the previous scripts we are left with 38,676 categories that don’t match. The ones where the HT and OED catnums match but the headings are different are listed first. After that come the HT categories that have no OED match. Ones marked with an asterisk only have OE words in them while ones with two asterisks have no words in them. Where there is no ‘oedmaincat’ for an HT category its number displays ‘XX’. Note that all of these are empty categories. It is likely that someone will need to manually go through this list and decide what the match should be, and Fraser has some people lined up who are going to do this.
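As a rough illustration of this kind of word substitution, here is a Python sketch. Only the ‘especially’ to ‘esp’ rule comes from the work described above; the function names and the sample heading pairs are assumptions made for the example.

```python
import re

# 'especially' -> 'esp' is the rule mentioned in the post; further rules
# could be added to this dictionary in the same way.
REPLACEMENTS = {"especially": "esp"}


def apply_replacements(heading, replacements):
    """Lower-case a heading and replace whole words according to the rules."""
    result = heading.lower()
    for old, new in replacements.items():
        result = re.sub(rf"\b{re.escape(old)}\b", new, result)
    return result


def newly_matching(pairs, replacements):
    """Return the (ht_heading, oed_heading) pairs that did not match before
    but do match once the replacements are applied to the HT heading."""
    return [(ht, oed) for ht, oed in pairs
            if ht.lower() != oed.lower()
            and apply_replacements(ht, replacements) == oed.lower()]


# Hypothetical pairs of headings that share a category number, sub and POS.
sample_pairs = [
    ("Bird, especially song-bird", "bird, esp song-bird"),
    ("Dwelling", "Abode"),
]

new_matches = newly_matching(sample_pairs, REPLACEMENTS)
print(new_matches)
```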
My tasks for other projects this week are as follows: I made the ‘Jolly Beggars’ section of the Burns site ‘live’ in time for Burns Night on Wednesday: http://burnsc21.glasgow.ac.uk/the-jolly-beggars/. I replied to emails from Rob Maslen about his blog and Hannah Tweed about the Medical Humanities Network mailing list. I contributed to an email discussion about importing sources into the REELS database and I completed the online fire training course, which I had somehow not managed to do before. Oh, I also continued to migrate some of the STARN materials to T4.
I returned to the SCOSYA project this week. One of the big tasks on my ‘to do’ list was to investigate alternative marker shapes for the Atlas, so we can have markers of different shapes and colours to reflect different combinations of attributes that users have selected, for example a blue circle means attributes A and B are present, while a green triangle means A is present but not B. A yellow star might be used to mean attribute B is present but not A while a red square could mean neither A nor B. Linked to this was the issue of the size of the markers. Gary was concerned that when zoomed out to the point where a user can see the whole of Scotland on screen at once the Atlas markers are currently too small to distinguish the different colours used for them. This is because currently the markers are not ‘pins’ on the map but are instead geographical areas – when zoomed in each ‘marker’ appears to cover a large area (e.g. the whole of Barrhead), but when zoomed out the area of Barrhead (for example) is a tiny part of the overall map and therefore the ‘marker’ is a small dot.
Rather than experiment on the working version of the Atlas I decided to experiment with different markers and marker sizes using test versions of the Atlas. This has allowed me to try things out and break things without affecting Gary’s access to the Atlas as he is using it these days.
First of all I made an update that increased the size of the markers only, but otherwise kept things more or less the same, as the following screenshot demonstrates.
As with the ‘live’ atlas, the markers cover a geographical area rather than being exact points, which means when zoomed in the circles and squares now cover a larger area of the Atlas. This means there is the potential for a lot of overlap and may lead to users making incorrect assumptions about the geographical extent of the survey results. E.g. the search for ‘imjustafter’ when zoomed in to level 11 for the area around Paisley demonstrates that the highlighted area for Paisley covers a large portion of the Johnstone area too whereas if you do the same search on the ‘live’ atlas the circles for Paisley, Johnstone and Barrhead do not overlap at all.
On the above map I also added in a turquoise triangle. This is an image based marker that I was experimenting with. We could have image markers in any shape and colour that we want and can also make them change size at different zoom levels. This occurs with the triangle when you go from zoom level 11 to 10 and from 9 to 8. The triangle looks a bit fuzzy at certain zooms because it’s a bitmap image (a PNG) but we could use vector images (SVG format) instead and these scale smoothly. The downside with using such markers is we need to manually make them, which means every possible permutation would need to be created in advance. E.g. if a user decides to do a search for 20 attributes we’d need to ensure we have created images for every required shape in every required colour to support such a search. The other markers are generated on the fly – the script can just keep picking new colours as required. Also, using an image marker has broken the ‘save map image’ functionality so I’d need to figure out an alternative method for saving the map images if we use image based markers.
My second test was to use the Leaflet DVF (Data Visualisation Framework) plugin (see https://github.com/humangeo/leaflet-dvf). I used this previously to create the donut / coxcomb markers that were rejected as being too complicated at the last team meeting. The plugin allows for polygon and star shaped markers. You can see the results of my experiment in the following screenshot:
For test purposes I’ve updated the ‘Locations’ map so that each ‘number of completed’ level is a polygon of a different number of sides, in addition to being a different colour. So ‘1 completed’ is a triangle, ‘2 completed’ is a square etc. You can specify the rotation of the shapes too – so a square shape can be rotated to make a diamond, for example. Note that these markers are a fixed size rather than covering a geographical area of the map like the markers in the previous map. This means the markers don’t change size when you zoom in and out. This means things get slightly cluttered when you zoom out (but the markers are at least big and easy to see). It also means when you zoom in the marker seems to cover a very specific point on the map. E.g. in the ‘live’ atlas the square for ‘Cambuslang’ covers all of Cambuslang while the DVF marker at maximum zoom is a specific point to the west of a golf club. This specificity might be a problem or not. It probably would be possible to update the radius of the markers automatically when the zoom level is changed, as I did for my triangle test mentioned above. So, for example, at maximum zoom the marker radius could be larger.
On this test atlas I also changed the circle markers to be a fixed size too, as the following map for ‘imjustafter’ shows. The same possible issues when zoomed far in and out occur for these too. Note also that for test purposes I’ve replaced the grey ‘no data’ square with a grey star (see Prestwick – with label hover-over showing too). You can specify the number of points on the star as well, which is a nice feature. It should be noted that using DVF also breaks the ‘save map image’ functionality so I’ll need to explore alternatives if we decide to use this plugin.
Also this week I investigated upgrading Leaflet from version 0.7 to the new version 1 that was released in November last year. I spent a bit of time trying the new version out, but encountered some difficulties with the various plugins I’m using for the Atlas. For some reason when using version 1 none of the map markers are clickable and the labels don’t work either. This is obviously a big issue and although I spent a bit of time trying to figure out why the pop-ups no longer worked I ended up just reverting to version 0.7 as there didn’t seem to be any benefit from moving to version 1. Also, when panning and zooming the markers didn’t animate smoothly with version 1 but instead stayed where they were until after the base map finished updating and then moved, which looked pretty horrible. I think I’ll just stick with version 0.7 for now.
Apart from the above SCOSYA work much of the rest of my week was taken up with administrative duties. I am the fire officer for my building and I had to attend a fire training course on Wednesday morning. I also spent all of Friday updating the more than 20 WordPress instances that I manage to the most recent version of WordPress, which was a bit of a tedious task but needed doing. I also prepared a new section of the Burns website for Kirsteen McCue in preparation for Burns Night next week, made another couple of tweaks to the export facilities for the REELS project and had a brief chat with Fraser about the HT OED data merging task. We’re going to meet to make a workflow for this next week.
This was my first full week back of the new year and I spent quite a bit of it working for the REELS project. We had a project meeting on Tuesday, and at this we discussed a data export facility that had been requested. Such a facility would allow place-name data to be exported as an Excel compatible CSV file. Exactly how this should be formatted took some thinking through and discussion as the place-name data is split across 13 related tables and many different approaches could be taken in order to ‘flatten’ this out into a single two dimensional spreadsheet. We decided to make two different export types to suit different requirements.
The first would list each place-name with one name per row and all of the related data stored in separate columns. A place-name may have any number of historical forms, classifications, parishes, sources and other such data and wherever these ‘one to many’ or ‘many to many’ relationships were encountered these would be added as additional columns. For example, if a place-name has 3 historical forms and each historical form consists of data in 7 columns then there will be 21 columns for historical form data for this place-name. Of course different place-names have different numbers of historical forms and in order for all of the columns to match up I also needed to work out which place-name had the most historical forms and ‘pad out’ the rows that had fewer historical forms with blank fields (well, actually fields with an ‘x’ in) to keep all of the data aligned. This proved slightly complicated to get working, but I got there in the end.
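The padding logic can be sketched as follows in Python. This is a simplified illustration and not the CMS export code: it uses two columns per historical form instead of seven, and the field names, place-names and function name are all hypothetical.

```python
# Hypothetical per-form columns; the real export has 7 columns per form.
FORM_FIELDS = ["spelling", "year"]
PAD = "x"  # filler value for place-names with fewer forms than the maximum


def flatten_place_names(place_names):
    """Return a header row plus one row per place-name, with every row padded
    to the width needed by the place-name with the most historical forms."""
    max_forms = max((len(p["forms"]) for p in place_names), default=0)

    header = ["name"]
    for n in range(1, max_forms + 1):
        header += [f"form{n}_{field}" for field in FORM_FIELDS]

    rows = [header]
    for place in place_names:
        row = [place["name"]]
        for form in place["forms"]:
            row += [form[field] for field in FORM_FIELDS]
        missing = max_forms - len(place["forms"])
        row += [PAD] * (missing * len(FORM_FIELDS))
        rows.append(row)
    return rows


# Made-up sample data: one place-name with three forms, one with a single form.
sample = [
    {"name": "Exampleton", "forms": [
        {"spelling": "Exempletun", "year": "1296"},
        {"spelling": "Exampletoun", "year": "1450"},
        {"spelling": "Exampleton", "year": "1654"},
    ]},
    {"name": "Samplehill", "forms": [
        {"spelling": "Sampilhill", "year": "1510"},
    ]},
]

rows = flatten_place_names(sample)
for row in rows:
    print(row)
```

The same approach extends to the other ‘one to many’ data (classifications, parishes and so on): work out the maximum count for each, then pad every row to that width.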
The second export option would group place-names by their source, so the researchers could see which names were found in which source. The structure of the resulting file is slightly different because ‘source’ is related to a specific historical form for a place-name rather than directly with the place-name itself. Each row in the resulting file has one historical form, relating to one place-name and place-names will appear in multiple rows across the export file – once for each ‘source’ one of their historical forms is linked to. Also, any place-name that doesn’t currently have a historical form will not appear in the file as such place-names will not have sources yet.
I created an ‘export’ page in the CMS that allows the researchers to select the export type, and also to optionally select a start and end date for their export. This allows them to export just a subset of the data that was created or edited during a specific time period rather than for the whole duration of the project. Leaving the date fields blank returns the full dataset. I also updated the system so that the ‘date of last edit’ field now gets updated when data relating to the place-name is changed. Previously this field only updated when the ‘core’ record for the place-name was edited (e.g. the name, the grid reference), whereas now it also gets updated when other information such as the place-name’s elements and historical forms is changed, for example if new elements are added or a historical form is deleted.
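A minimal Python sketch of the source-based export is given below, assuming each historical form carries a single source. The data layout, names and function name are hypothetical and only meant to show the shape of the output.

```python
def rows_by_source(place_names):
    """Return one (source, place_name, spelling) row per historical form,
    sorted so that rows for the same source sit together. Place-names
    without any historical forms produce no rows."""
    rows = []
    for place in place_names:
        for form in place["forms"]:
            rows.append((form["source"], place["name"], form["spelling"]))
    return sorted(rows)


# Made-up sample data.
sample = [
    {"name": "Exampleton", "forms": [
        {"spelling": "Exempletun", "source": "Charter A"},
        {"spelling": "Exampletoun", "source": "Map B"},
    ]},
    {"name": "Samplehill", "forms": [
        {"spelling": "Sampilhill", "source": "Charter A"},
    ]},
    {"name": "Formless Burn", "forms": []},
]

source_rows = rows_by_source(sample)
for row in source_rows:
    print(row)
```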
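The date filtering can be sketched like this in Python. It assumes, for illustration only, that each record has a single `last_edit` date that is set on creation and overwritten by every later edit; the field name, function name and sample records are hypothetical.

```python
from datetime import date


def export_records(records, start=None, end=None):
    """Return the records whose date of last edit falls within the optional
    start and end dates (inclusive). Blank dates (None) leave that side of
    the range open, so leaving both blank returns the full dataset."""
    return [r for r in records
            if (start is None or r["last_edit"] >= start)
            and (end is None or r["last_edit"] <= end)]


# Made-up sample records.
records = [
    {"name": "Exampleton", "last_edit": date(2016, 6, 10)},
    {"name": "Samplehill", "last_edit": date(2016, 1, 20)},
]

everything = export_records(records)
spring = export_records(records, date(2016, 3, 1), date(2016, 6, 30))
print([r["name"] for r in spring])

# A later edit overwrites the stored date, so the record drops out of the
# earlier date range.
records[0]["last_edit"] = date(2017, 1, 15)
spring_after_edit = export_records(records, date(2016, 3, 1), date(2016, 6, 30))
print([r["name"] for r in spring_after_edit])
```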
I also had to make it clear to the team that when a record is edited the existing date of last edit is replaced so if a record was created in January 2016 and last edited in June 2016 it will be found if you limit the period to between March and June 2016, but if the record is then edited again in January 2017 the record will no longer be returned in the March to June 2016 export. My final task related to the export facility was to figure out how to make Excel open a CSV file that uses Unicode text. By default Excel doesn’t display Unicode characters properly and this is something of an issue for us as we use Unicode characters in the pronunciation field and elsewhere. Rather than opening the CSV file directly in Excel (e.g. by double clicking on the file icon) researchers will have to import the data into an existing blank Excel file using Excel’s ‘Get External Data’ option. It’s a bit of a pain but worth it to see the text as it’s supposed to look rather than all garbled.
The rest of my week was spent on a few other tasks. I had some further AHRC review duties, which I dealt with towards the end of the week. I also had a phone call with Marc about a couple of new projects that will be starting up in the coming months. One is to further redevelop ARIES and the Grammar app with new content, which will be great to do. The other involves setting up a small corpus of legal documents. I had an email conversation with Fraser about some LDNA tasks I had been assigned and also the redevelopment of the HT data. I made a couple of further tweaks to the ‘Learning with the Old English Thesaurus’ resource for Carole. I responded to a request from Thomas Widmann of SLD about how to edit some of the ancillary sections of the DSL resource, and to a request from elsewhere in the University about the University app accounts, which I currently manage. I also met with Gary to discuss some further updates to the SCOSYA atlas and fixed a bug he’d spotted with the atlas, whereby the first attribute in each parent category was being omitted.
I had a fairly easy first week back after the Christmas holidays as I was only working on the Thursday and Friday. On Thursday I spent some time catching up with emails and other such administrative tasks. I also spent some time preparing for a meeting I had on Friday with Alice Jenkins. She is putting together a proposal for a project that has a rather large and complicated digital component and before the meeting I read through the materials she had sent me and wrote a few pages of notes about how the technical aspects might be tackled. We then had a good meeting on Friday and we will be taking the proposal forward during the New Year, all being well. I can’t say much more about it here at this stage, though.
I spent some further time on Thursday and on Friday updating the content of the rather ancient ‘Learning with the Thesaurus of Old English’ website for Carole Hough. The whole website needs a complete overhaul but its exercises are built around an old version of the thesaurus that forms part of the resource and is quite different in its functionality from the new TOE online resource. So for now Carole just wanted some of the content of the existing website updated and we’ll leave the full redesign for later. This meant going through a list of changes Carole had compiled and making the necessary updates, which took a bit of time but wasn’t particularly challenging to do – so a good way to start back after the hols.
Other than these tasks I spent the remainder of the week going through the old STELLA resource STARN and migrating it to T4. Before Christmas I had completed ‘Criticism and commentary’ and this week I completed ‘Journalism’ and made a start on ‘Language’. However, this latter section actually has a massive amount of content tucked away in subsections and it is going to take rather a long time to get this all moved over. Luckily there’s no rush to get this done and I’ll just keep pegging away at it whenever I have a free moment or two over the next few months.