Week Beginning 11th March 2019

I mainly worked on three projects this week: SCOSYA, the Historical Thesaurus and the DSL. For SCOSYA I continued with the new version of my interactive ‘story map’ using Leaflet’s choropleth example and the geographical areas that had been created by the project’s RAs. Last week I’d managed to get the areas working and colour-coded based on the results from the project’s questionnaires. This week I needed to add the interactive aspects, such as being able to navigate through slides, load in new data, and click on areas to view details. After a couple of days I had a version of the interface that did everything my earlier, more crudely laid out Voronoi diagram did, but using the much more pleasing (and useful) geographical areas and more up-to-date underlying technologies. Here’s a screenshot of how things currently look:

If you look at the screenshot from last week’s post you’ll notice that one location (Whithorn) wasn’t getting displayed. This was because the script iterates through the locations with data first, then the other locations, and the code that takes the ratings from the locations with data and adds them to the map was only triggering when the next location also had data. After figuring this out I fixed it. I also figured out why ‘Airdrie’ had been given a darker colour. This was because of a typo in the data: we had both ‘Airdrie’ and ‘Ardrie’, so two layers were being generated, one on top of the other. I’ve fixed ‘Ardrie’ now. I also updated the styles of the layers to make the colours more transparent and the borders less thick, and added in circles representing the actual questionnaire locations. Areas now get a black border when you move the mouse over them, reverting to the dotted white border on mouse out. When you click on an area (as with Harthill in the above screenshot) it is given a thicker black border and becomes more opaque. The name of the location and its rating level appear in the box in the bottom left. Clicking on another area, or clicking a second time on the selected area, deselects the current area. Also, the pan and zoom between slides is now working, using Leaflet’s ‘flyTo’ method, which is a bit smoother than the older method used in the Voronoi storymap. Similarly, switching from one dataset to another is also smoother. Finally, the ‘Full screen’ option in the bottom right of the map works, although I might need to work on the positioning of the ‘slide’ box in this view. I haven’t implemented the ‘transparency slider’ feature that was present in the Voronoi version, as I’m not sure it’s really necessary any more.
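The hover/click behaviour described above boils down to a small amount of state and styling logic. Here is a minimal sketch of that logic as pure functions; the colour thresholds, hex values and function names are my assumptions, not the project’s actual values, and in the real atlas the style objects would be fed to Leaflet’s `L.geoJSON` `style` option and updated in `mouseover`/`mouseout`/`click` handlers:

```javascript
// Map a questionnaire rating level to a fill colour (thresholds assumed).
function getColor(rating) {
  return rating >= 4 ? '#08519c'
       : rating >= 3 ? '#3182bd'
       : rating >= 2 ? '#6baed6'
       : rating >= 1 ? '#bdd7e7'
       : '#eff3ff';
}

// Build the Leaflet path style for an area, depending on interaction state:
// default = thin dotted white border; hover = solid black border;
// selected = thicker black border and a more opaque fill.
function areaStyle(rating, state) {
  if (state === 'selected') {
    return { fillColor: getColor(rating), fillOpacity: 0.7, color: '#000', weight: 3, dashArray: '' };
  }
  if (state === 'hover') {
    return { fillColor: getColor(rating), fillOpacity: 0.4, color: '#000', weight: 2, dashArray: '' };
  }
  return { fillColor: getColor(rating), fillOpacity: 0.4, color: '#fff', weight: 1, dashArray: '3' };
}

// Clicking an area selects it; clicking the selected area again deselects it.
function toggleSelection(currentId, clickedId) {
  return currentId === clickedId ? null : clickedId;
}
```

The pan and zoom between slides would then just be a call along the lines of `map.flyTo(center, zoom)` for each slide’s stored position.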

The underlying data is exactly the same as for the Voronoi example, and is contained in a pretty simple JSON file. So long as the project RAs stick to the same format they should be able to make new stories for different features, and I should be able to just plug a new file into the atlas and display it without any further work. I think this new story map interface is working really well now and I’m very glad we took the time to manually plot out the geographical areas.
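To give a sense of what such a file might contain, here is a purely hypothetical sketch of a story JSON structure and a small validity check — the real field names used by the project aren’t shown in this post, so every property name here is invented for illustration:

```javascript
// Hypothetical example of a 'story' file: a title plus an ordered list of
// slides, each with its text, the dataset to load, and a map position.
const exampleStory = {
  title: 'Example feature',
  slides: [
    { text: 'Intro slide', dataset: 'feature1.json', center: [56.5, -4.2], zoom: 7 },
    { text: 'Zoomed to one area', dataset: 'feature1.json', center: [55.9, -3.6], zoom: 10 }
  ]
};

// Minimal check that a story file has the pieces an atlas would need
// before plugging it in.
function isValidStory(story) {
  return typeof story.title === 'string'
    && Array.isArray(story.slides)
    && story.slides.every(s =>
         typeof s.text === 'string'
         && typeof s.dataset === 'string'
         && Array.isArray(s.center) && s.center.length === 2
         && typeof s.zoom === 'number');
}
```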

Also for SCOSYA this week, E contacted me to say that the team hadn’t been keeping a consistent record of all of the submitted questionnaires over the years, and wondered whether I might be able to write an export script that generated questionnaires in the same format as they were initially uploaded.  I spent a few hours creating such a feature, which at the click of a button iterates through the questionnaires in the database, formats all of the data, generates CSV files for each, adds them to a ZIP file and presents this for download.  I also added a facility to download an individual CSV when looking at a questionnaire’s page.
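The per-questionnaire CSV step of such an export can be sketched as below. The column names and escaping details here are my assumptions (the post doesn’t show the upload format), and the real feature also bundles the individual CSVs into a ZIP on the server before offering the download:

```javascript
// Turn one row of fields into a CSV line, quoting any field that contains
// a comma, quote or newline (doubling embedded quotes).
function toCsvRow(fields) {
  return fields.map(f => {
    const s = String(f ?? '');
    return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  }).join(',');
}

// Build a complete CSV file (header plus data rows) for one questionnaire.
function questionnaireToCsv(header, rows) {
  return [toCsvRow(header), ...rows.map(toCsvRow)].join('\n');
}
```

The export feature would then loop over every questionnaire in the database, call something like `questionnaireToCsv` for each, and add the results to a ZIP archive for download.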

For the HT I continued with the seemingly endless task of matching up the HT and OED data. Last week Fraser had sent me some category matches he’d manually approved that had been outputted by my gap matching script. I ran these through a further script that ticked these matches off. There were 154 matches, bringing the number of unmatched OED categories that have a POS and are not empty down to 995. It feels like something of a milestone to get this figure under a thousand.

Last week we’d realised that using category ID to uniquely identify OED lexemes (as they don’t have a primary key) is not going to work in the long term as during the editing process the OED people can move lexemes between categories.  I’d agreed to write a script that identifies all of the OED lexemes that cannot be uniquely identified when disregarding category ID (i.e. all those OED lexemes that appear in more than one category).  Figuring this out proved to be rather tricky as the script I wrote takes up more memory than the server will allow me to use.  I had to run things on my desktop PC instead, but to do this I needed to export tables from the online database, and these were bigger than the server would let me export too.  So I had to process the XML on my desktop and generate fresh copies of the table that way. Ugh.

Anyway, the script I wrote goes through the new OED lexeme data and counts all the times a specific combination of refid, refentry and lemmaid appears (disregarding the catid). As I expected, the figure is rather large. There are 115,550 times when a combination of refid, refentry and lemmaid appears more than once. Generally the number of times is 2, but looking through the data I’ve seen one combination that appears 7 times. The total number of words with a non-unique combination is 261,028, which is about 35% of the entire dataset. We clearly need some other way of uniquely identifying OED lexemes. Marc’s suggestion last week was to ask the OED to create a legacy ‘catid’ field that is retained in the data as it is now and is never updated in future; this would be sufficient to uniquely identify everything in a (hopefully) persistent way. However, we would still need to deal with new lexemes added in future, which might be an issue.
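The core of the duplicate check can be sketched as follows — the field names follow the post, the sample data shape is assumed, and the real script works against the database tables rather than an in-memory array:

```javascript
// Count how often each (refentry, refid, lemmaid) combination occurs
// when the category ID is disregarded.
function countCombinations(lexemes) {
  const counts = new Map();
  for (const l of lexemes) {
    const key = `${l.refentry}|${l.refid}|${l.lemmaid}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// How many combinations appear more than once, and how many words
// those non-unique combinations cover in total.
function duplicateStats(lexemes) {
  let combos = 0, words = 0;
  for (const n of countCombinations(lexemes).values()) {
    if (n > 1) { combos += 1; words += n; }
  }
  return { combos, words };
}
```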

I then decided to generate a list of all of the OED words where the refentry, refid and lemmaid are the same. Most of the time the word has the same date in each category, but not always. For example, see:


refentry  refid     lemmaid  catid   lemma        sortdate  enddate
654       5154310   0        180932  absenteeism  1957      2005
654       5154310   0        210756  absenteeism  1850      2005

3366      9275424   0        92850   affectuous   1441      1888
3366      9275424   0        136581  affectuous   1441      1888
3366      9275424   0        136701  affectuous   1566      1888

10058     40557440  0        25985   aquiline     1646      1855
10058     40557440  0        39861   aquiline     1745      1855
10058     40557440  0        65014   aquiline     1791      1855

I then updated the script to output data only when refentry, refid, lemmaid, lemma, sortdate and enddate are all the same. There are 97,927 times when a combination of all of these fields appears more than once, and the total number of words where this happens is 213,692 (about 28% of all of the OED lemmas). Note that the output here will include the first two ‘affectuous’ lines listed above while omitting the third. After that I created a script that brings back all HT lexemes that appear in multiple categories but have the same word form (the ‘word’ column), ‘startd’ and ‘endd’ (non-OE words only). There are 71,934 times when a combination of these fields is not unique, and the total number of words where this happens is 154,335. We have 746,658 non-OE lexemes, so this is about 21% of all the HT’s non-OE words. Again, most of these appear in two categories, but not all of them. See for example:

lexeme ID  catid   word                startd  endd
529752     138922  abridge of/from/in  1303    1839
532961     139700  abridge of/from/in  1303    1839
613480     164006  abridge of/from/in  1303    1839

328700     91512   abridged            1370    0
401949     111637  abridged            1370    0
779122     220350  abridged            1370    0

542289     142249  abridgedly          1801    0
774041     218654  abridgedly          1801    0
779129     220352  abridgedly          1801    0

I also created a script that attempted to identify whether the OED categories that had been deleted in their new version of the data, but we had connected up to one of the HT’s categories, had possibly been moved elsewhere rather than being deleted outright.  There were 42 such categories and I created two checks to try and find whether the categories had just been moved.  The first looks for a category in the new data that has the same path, sub and pos while the second looks for a category with the same heading and pos and the highest number of words (looking at the stripped form) that are identical to the deleted category.  Unfortunately neither approach has been very successful.  Check number 1 has identified a few categories, but all are clearly wrong.  It looks very much like where a category has been deleted things lower down the hierarchy have been shifted up.  Check number 2 has identified two possible matches but nothing more.  And unfortunately both of these OED categories are already matched to HT categories and are present in the new OED data too, so perhaps these are simply duplicate categories that have been removed from the new data.
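Check number 2 — finding the candidate category with the same heading and pos that shares the most stripped word forms with the deleted category — can be sketched like this. The field names (`heading`, `pos`, `strippedWords`) are my assumptions; the real script runs against the database:

```javascript
// Count how many word forms two categories share.
function wordOverlap(wordsA, wordsB) {
  const setB = new Set(wordsB);
  return wordsA.filter(w => setB.has(w)).length;
}

// For a deleted category, return the candidate with matching heading and
// pos that shares the most stripped word forms, or null if nothing matches.
function bestMove(deleted, candidates) {
  let best = null, bestScore = 0;
  for (const c of candidates) {
    if (c.heading !== deleted.heading || c.pos !== deleted.pos) continue;
    const score = wordOverlap(deleted.strippedWords, c.strippedWords);
    if (score > bestScore) { best = c; bestScore = score; }
  }
  return best;
}
```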

I then began to use the new OED category table rather than the old one. As expected, when using the new data the number of unmatched not empty OED categories with POS has increased, from 995 to 1952. In order to check how the new OED category data compares to the old data I wrote a script that brings back 100 random matched categories and their words for spot checking. This displays the category and word details for the new OED data, the old OED data and the HT data. I’ve looked through a few output screens and haven’t spotted any issues with the matching yet. However, it’s interesting to note how the path field in the new OED data differs from the old, and from the HT. In many cases the new path is completely different to the old one. In the HT data we use the ‘oedmaincat’ field, which (generally) matches the path in the old data. I added in a new field ‘HT current Tnum’ that displays the current HT catnum and sub, just to see if this matches up with the new OED path. It is generally pretty similar but frequently slightly different. Here are some examples:

OED catid 47373 (HT 42865) ‘Turtle-soup’ is ‘|03 (n)’ in the old data and in the HT’s ‘oedmaincat’ field.  In the new OED data it’s ‘|03 (n)’ while the HT’s current catnum is ‘|03’.

OED catid 98467 (HT 91922) ‘place off centre’ is ‘|04.01 (vt)’ in the old data and oedmaincat.  In the new OED data it’s ‘|04.01 (vt)’ and HT catnum is ‘|04.01’.

OED catid 202508 (HT 192468) ‘Miniature vehicle for use in experiments’ is ‘|13 (n)’ in the old data and oedmaincat.  In the new data it’s ‘|13 (n)’ and the HT catnum (as you probably guessed) is ‘|13’.

As we’re linking categories on the catid it doesn’t really have any bearing on the matching process, but it’s possibly less than ideal that we have three different hierarchical structures on the go.

For the DSL I spent some time this week analysing the DSL API in order to try and figure out why the XML outputted by the API is different to the XML stored in the underlying database that the API apparently uses.  I wasn’t sure whether there was another database on the server that I was unaware of, or whether Peter’s API code was dynamically changing the XML each time it was requested.  It turns out it’s the latter.  As far as I can tell, every time a request for an entry is sent to the API, it grabs the XML in the database, plus some other information stored in other tables relating to citations and bibliographical entries, and then it dynamically updates sections of the XML (e.g. <cit>) to replace sections, adding in IDs, quotes and other such things.  It’s a bit of an odd system, but presumably there was a reason why Peter set it up like this.

Anyway, after figuring out that the API is behaving this way I could then work out a method to grab all of the fully formed XML that the API generates. Basically I’ve written a little script that requests the full details for every word in the dictionary and then saves this information in a new version of the database. It took several hours for the script to complete, but it has now done so. I would appear to have the fully formed XML details for 89,574 entries, and with access to this data I should be able to start working on a new version of the API that will hopefully give us something identical in functionality and content to the old one.
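The harvesting loop can be sketched as below. The `fetchEntry` function and its URL shape are assumptions (the real API’s endpoints aren’t described here), and the real script writes each entry’s XML into a new database rather than holding everything in memory:

```javascript
// Request the fully formed XML for each entry ID in turn and collect the
// results. One request per entry -- over ~89,574 entries this is why the
// real run took several hours.
async function harvestEntries(ids, fetchEntry) {
  const results = new Map();
  for (const id of ids) {
    results.set(id, await fetchEntry(id));
  }
  return results;
}
```

In practice `fetchEntry` might be something like `id => fetch(apiBase + '/entry/' + id).then(r => r.text())`, with `apiBase` pointing at the existing API — again, a hypothetical shape rather than the actual endpoint.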

Also this week I moved offices, which took most of Friday morning to sort out.  I also helped Bryony Randall to get some stats for the New Modernist Editing websites, created a further ‘song story’ for the RNSN project and updated all of the WordPress sites I manage to the latest version of WordPress.