Week Beginning 23rd January 2017

I split most of my time this week between two projects (SCOSYA and the Historical Thesaurus), with a few tasks from other projects thrown in along the way too.  For SCOSYA I continued to rework the atlas interface.  Last week I experimented with different marker styles, and after meeting with Gary and discussing things a little further I decided to go ahead and use the DVF plugin (https://github.com/humangeo/leaflet-dvf), which allows us to have different marker styles and also to make the markers a ‘fixed’ size on the map, meaning they are easier to see when zoomed out.  As I mentioned last week, this does mean the markers cover a smaller area when zoomed in, but Gary didn’t think this was a problem, and due to the maximum zoom level I’ve set it’s not as if a marker identifies an individual house or anything like that.  I replaced the existing circles on the Atlas with Leaflet circleMarkers, which have a fixed pixel size rather than covering a fixed geographical area, and I replaced the existing squares with DVF ‘regularPolygonMarkers’ with four sides (i.e. squares).  I haven’t implemented any other shapes yet (e.g. triangles, stars) but at least the groundwork is now in place to allow such markers to be used.
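
Roughly, the marker change looks like the following – a minimal JavaScript sketch rather than the actual Atlas code, with illustrative coordinates, colours and sizes, and with the DVF option names (numberOfSides, rotation, radius) given as my best recollection of the plugin’s API rather than gospel:

    // Assumes Leaflet and the leaflet-dvf plugin have both been loaded.
    var map = L.map('map').setView([56.5, -4.2], 7);

    // Old approach: L.circle covers a fixed geographical area, so it shrinks to a
    // dot when zoomed out.  L.circleMarker keeps a fixed pixel radius instead.
    L.circleMarker([55.95, -3.19], {
        radius: 8,
        color: '#2b6cb0',
        fillOpacity: 0.8
    }).addTo(map);

    // DVF's RegularPolygonMarker with four sides gives a fixed-size square;
    // other shapes (triangles, stars) would just need different options.
    new L.RegularPolygonMarker(new L.LatLng(57.48, -4.22), {
        numberOfSides: 4,
        rotation: 45,
        radius: 8,
        color: '#c05621',
        fillOpacity: 0.8
    }).addTo(map);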

I met with Gary a couple of times this week, and he has been discussing with Jennifer further ways of changing how the Atlas works.  It seems like a lot of what they would like to see would be covered by using the ‘consistency data’ query page I previously created as the basis for the Atlas search form.  Gary and Jennifer are going to discuss things further and get back to me with some clearer ideas, as Gary wasn’t entirely sure exactly how I should proceed at this stage.  Another thing they would like to be able to do is to export the data (which is already possible), work with it in Excel, and then re-upload it to the system in order to plot it on the Atlas.  I thought this was feasible, so long as the export file structure is not altered, and I will investigate how to implement such a feature.

Another thing Jennifer and Gary wanted me to investigate was simplifying the look of the base map we use for the Atlas.  As you can see from the screenshots in last week’s post, the base map included quite a lot of detail – mountains, roads, points of interest, national parks and other such features.  Jennifer wanted to get rid of all of this clutter so that the map just showed settlements and rivers.  Thankfully Mapbox (https://www.mapbox.com/) provides some very handy tools for working with map base layers.  I played around with these for a while and created three different map designs.  The first was rather traditional and plain looking – blue sea, green land, brown roads.  The second was more monotone and didn’t have roads marked (although I did make another version with the roads in dark red).  The third was a variation on the current atlas, with waves in the sea but all land plain white.  Jennifer wanted me to use the plain blue and green one, but without the roads, which I did.  Replacing the previous base map with the new one was a very straightforward process, and we now have an Atlas with new markers and a new base map, as the following screenshot demonstrates.
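
In Leaflet terms the base map swap just means pointing the tile layer at the new Mapbox style.  Something along these lines, where the username, style ID and access token are placeholders rather than the project’s real values and the URL follows Mapbox’s Styles API pattern:

    // Placeholder style details; the real Atlas uses its own Mapbox style and token.
    var map = L.map('map').setView([56.5, -4.2], 6);
    L.tileLayer('https://api.mapbox.com/styles/v1/{username}/{styleId}/tiles/256/{z}/{x}/{y}?access_token={token}', {
        username: 'MAPBOX_USERNAME',
        styleId: 'STYLE_ID',
        token: 'ACCESS_TOKEN',
        attribution: '© Mapbox © OpenStreetMap contributors',
        maxZoom: 14  // illustrative; the Atlas caps its zoom so markers never pinpoint houses
    }).addTo(map);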

I also created an export script that goes through the questionnaire ratings and displays all of the duplicate codes – i.e. where the same code, such as ‘A01’, appears more than once in a questionnaire.  These are all errors that somehow crept in when the questionnaire forms were being filled in, and they need to be analysed and fixed.  There are 592 such duplicates that will need to be looked at, and the script exports these as CSV data.  I was going to create an import script that would let a researcher fix the problems in Excel using the CSV file and then upload the corrections to fix the online database, but Gary thinks it would be better for the researcher to just look at the data in the CMS and manually fix things through that.
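
The duplicate check itself boils down to counting code occurrences per questionnaire.  A simplified JavaScript sketch of the logic (the real script queries the project database, and the field names here are made up):

    // ratings: one row per answer, e.g. { questionnaireId: 12, code: 'A01', rating: 4 }
    function findDuplicateCodes(ratings) {
        var seen = {};        // questionnaireId + code -> number of occurrences
        var duplicates = [];
        ratings.forEach(function (row) {
            var key = row.questionnaireId + '|' + row.code;
            seen[key] = (seen[key] || 0) + 1;
            if (seen[key] === 2) {  // report each duplicated code once
                duplicates.push({ questionnaireId: row.questionnaireId, code: row.code });
            }
        });
        return duplicates;
    }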

For the Historical Thesaurus I met with Fraser to discuss how we were going to proceed with integrating the OED data with the existing HT data.  So far we had managed to do the following for categories (we haven’t yet got to the point of tackling the words):

  1. In the HT ‘category’ table a new ‘oedmaincat’ field has been created. This is based on the ‘v1maincat’ field, with the various steps outlined in the ‘OED Changes to v1 Codes’ document applied to it.
  2. In the HT ‘category’ table there is a new ‘oedcatid’ field that contains the ‘cid’ of the OED category whose ‘path’ matches the HT ‘oedmaincat’ field and whose ‘sub’ and ‘pos’ fields also match.
  3. We have 235,249 HT categories and all but 12,544 match an OED category number / sub / pos. Most of the ones that don’t match are OE-only or empty categories.
  4. Looking at the data the other way around, we have 235,929 OED categories and all but 6,692 match an HT category. Most of the ones that don’t match are the OED’s top-level headings that don’t have a POS.
  5. There is also an issue where the category number, sub and POS are the same in the OED and HT data but the category headings do not match. We created a script that looked specifically at the noun categories where the case-insensitive headings don’t match.  There are 124,355 noun categories in the HT data that have an OED category with the same number, sub and POS.  Of these there are 21,109 where the category heading doesn’t match.  In addition there are a further 6,334 HT categories that don’t have a corresponding OED number, sub and POS.  Further rules could be added to reduce the number of non-matching headings – e.g. changing ‘native/inhabitant’ to ‘native or inhabitant’.
  6. We then updated the script in order to reduce the non-matching category headings by comparing Levenshtein scores. This gives a numerical score to a pair of strings reflecting the number of single-character edits it would take to convert one into the other.  The script allows you to set a threshold and view either the categories above the threshold or those below it (see the sketch after this list).
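
To make point 6 a little more concrete, here is a minimal JavaScript sketch of the Levenshtein comparison and the threshold filtering – illustrative only, as the actual script works directly against the database:

    // Standard dynamic-programming Levenshtein distance between two strings.
    function levenshtein(a, b) {
        var prev = [], curr = [], i, j;
        for (j = 0; j <= b.length; j++) prev[j] = j;
        for (i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (j = 1; j <= b.length; j++) {
                var cost = a[i - 1] === b[j - 1] ? 0 : 1;
                curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
            }
            prev = curr.slice();
        }
        return prev[b.length];
    }

    // View the category pairs above or below a chosen threshold.
    function filterByThreshold(pairs, threshold, above) {
        return pairs.filter(function (p) {
            var score = levenshtein(p.htHeading.toLowerCase(), p.oedHeading.toLowerCase());
            return above ? score > threshold : score <= threshold;
        });
    }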

When we met, Fraser and I discussed what our next steps would be.  We agreed that it would be useful to have a field in the database noting whether a category had been checked or still needed fixing.  I added such a ‘Y/N’ column and then created a script that marked as checked all of the matches we have between the OED and HT category data.  A ‘match’ was made where the HT and OED category numbers, sub-category numbers and part of speech matched and the Levenshtein score when comparing the category headings was 3 or less.  Out of the 235,249 rows in the HT category table this updated 200,431 categories.  That leaves us with 34,818 HT categories that do not match an OED category (across all parts of speech, including categories that have no value in the ‘oedmaincat’ column), or 32,975 if we only count those that have something in the ‘oedmaincat’ field but where the match isn’t ‘correct’.  I then updated my previous scripts so that they only deal with categories marked as ‘not checked’, as we don’t need to bother with the checked ones any more.
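
In outline the ‘checked’ pass does something like the following, reusing the levenshtein() function sketched above (illustrative JavaScript with assumed field names, rather than the real script, which updates the database directly):

    // oedByKey: OED categories keyed by path + '|' + sub + '|' + pos.
    // Mark an HT category as checked when an OED category with the same
    // number / sub / pos exists and the headings are within three edits.
    function markChecked(htCats, oedByKey) {
        htCats.forEach(function (cat) {
            var key = cat.oedmaincat + '|' + cat.sub + '|' + cat.pos;
            var oed = oedByKey[key];
            if (oed && levenshtein(cat.heading.toLowerCase(), oed.heading.toLowerCase()) <= 3) {
                cat.checked = 'Y';
            } else {
                cat.checked = 'N';
            }
        });
    }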

I then created a script that takes the HT categories that are ‘not checked’ but where the HT and OED path / sub / pos match (i.e. all the ones where only the headings differ).  The script then looks for other subcats within the same maincat (for subcats), or other maincats from the maincat’s parent downwards (for maincats), and compares the category names to find the one with the lowest Levenshtein score.  The output lists the OED heading and the closest HT heading (based on Levenshtein score).  There are a few occasions when the script finds a perfect match, but most of the time I’m afraid to say the output is not that helpful – at least not in terms of automatically deducing the ‘correct’ match.  It might be helpful for manually working out some of the right ones, though.  I then split the output into two separate lists for ‘maincat’ and ‘subcat’.  Out of the 20,435 categories that have the same path / subcat / pos but different headings in the HT and OED data, 13,077 are main categories, and it looks like a lot of these are just caused by the OED changing the words used to describe the same category.
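
The ‘closest heading’ lookup is essentially a nearest-neighbour search over the sibling category headings, roughly as follows (an illustrative sketch with assumed data structures, again reusing the levenshtein() function above):

    // For an OED heading, find the sibling HT category whose heading has the
    // lowest Levenshtein score.  'siblingCats' would be the other subcats in the
    // same maincat, or the other maincats beneath the parent category.
    function closestSibling(oedHeading, siblingCats) {
        var best = null, bestScore = Infinity;
        siblingCats.forEach(function (cat) {
            var score = levenshtein(oedHeading.toLowerCase(), cat.heading.toLowerCase());
            if (score < bestScore) {
                bestScore = score;
                best = cat;
            }
        });
        return { category: best, score: bestScore };
    }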

I then created a script that attempted to change some words in the HT category headings to see if this resulted in a match in the OED data – for example, changing ‘especially’ to ‘esp’.  Going through this list resulted in almost 3,000 further matches being found, which was rather nice.
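
The rewording pass amounts to a small list of textual substitutions applied before re-checking for an exact match.  A sketch along these lines – only the ‘especially’ to ‘esp’ and ‘native/inhabitant’ style swaps come from the work described here, and anything further would simply be added to the list:

    // Apply known OED-style rewordings to an HT heading and see whether the
    // altered heading now matches an OED heading exactly.
    var substitutions = [
        [/\bespecially\b/g, 'esp'],   // e.g. 'especially' -> 'esp'
        [/\//g, ' or ']               // e.g. 'native/inhabitant' -> 'native or inhabitant'
    ];

    // oedHeadings is assumed to be an array of lower-cased OED headings.
    function rewordedMatch(htHeading, oedHeadings) {
        var candidate = htHeading.toLowerCase();
        substitutions.forEach(function (pair) {
            candidate = candidate.replace(pair[0], pair[1]);
        });
        return oedHeadings.indexOf(candidate) !== -1;
    }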

My final task was to create a script that lists all the ‘non-matches’ between the HT and OED categories – i.e. the ones where the HT category numbers don’t match anything in the OED data, the ones where the OED category numbers don’t match anything in the HT data, and the ones where the numbers match up but the headings don’t.  After running through the previous scripts we are left with 38,676 categories that don’t match.  The ones where the HT and OED catnums match but the headings are different are listed first.  After that come the HT categories that have no OED match.  Ones marked with an asterisk only have OE words in them, while ones with two asterisks have no words in them.  Where there is no ‘oedmaincat’ for an HT category its number displays as ‘XX’; note that all of these are empty categories.  It is likely that someone will need to manually go through this list and decide what the matches should be, and Fraser has some people lined up to do this.
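
For reference, the asterisk markers in the listing correspond to a simple classification along these lines (a sketch of the display logic only, with assumed word-count fields):

    // '*'  = category contains only Old English words
    // '**' = category contains no words at all
    function categoryMarker(cat) {
        if (cat.wordCount === 0) return '**';
        if (cat.oeWordCount === cat.wordCount) return '*';
        return '';
    }

    // Categories with no 'oedmaincat' value display 'XX' instead of a number.
    function displayCatNum(cat) {
        return cat.oedmaincat ? cat.oedmaincat : 'XX';
    }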

My tasks for other projects this week were as follows: I made the ‘Jolly Beggars’ section of the Burns site ‘live’ in time for Burns Night on Wednesday: http://burnsc21.glasgow.ac.uk/the-jolly-beggars/.  I replied to emails from Rob Maslen about his blog and from Hannah Tweed about the Medical Humanities Network mailing list.  I contributed to an email discussion about importing sources into the REELS database, and I completed the online fire training course, which I had somehow not managed to do before.  Oh, and I also continued to migrate some of the STARN materials to T4.