Week Beginning 10th October 2016

It was a four-day week this week as I am off on holiday on Friday and will be away all next week too.  This week I split my time primarily between two projects: SCOSYA and the Historical Thesaurus.  Last week I met with Fraser and Marc to discuss how we were to take the new data that the OED had sent us and amalgamate it with our data.  The OED data was in XML and my first task was extracting this data and converting it to SQL format so that it would be possible to query it against our SQL-based HT database.  I created two new tables in the HT database, ‘category_oed’ and ‘lexeme_oed’, to contain all of the information from the XML.  I then created a handy little PHP script that could go through all of the XML files in a directory and extract the necessary information from elements and attributes.  After testing and tweaking my script I ran it on the full set of XML files and it populated the tables remarkably quickly.  There are 235,893 categories in the ‘category_oed’ table and 688,817 words in the ‘lexeme_oed’ table.  This includes all of the required fields from the XML, such as dates, ‘GHT’ fields, OED references, definitions etc.  I did have to manually change the part of speech values after generating the data so that they match our structure (e.g. ‘n’ rather than ‘noun’), but this only took a minute or so to do.

My script makes use of a very simple XML processor in PHP called ‘simplexml’.  It allows you to load an XML file and then just process it like any PHP object or multidimensional array.  You load your XML file like so:

$xml = simplexml_load_file($file);

And then you can do what you want with it.  For example, to go through each ‘class’ in the OED data (which is what they call a ‘category’) and assign the attributes to PHP variables – note that ‘class’ has to be written as {‘class’} purely because ‘class’ is a reserved word in PHP:

foreach($xml->{'class'} as $c){
    // attribute values are SimpleXMLElement objects, so cast them to plain strings
    $id = (string) $c['id'];
    $path = (string) $c['path'];
}
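
Putting this together, here is a minimal sketch of the sort of directory-processing loop the extraction script uses.  The PDO connection details and the columns in the INSERT are illustrative only – the real script extracts many more fields (dates, ‘GHT’ fields, OED references, definitions and so on):

// connect to the HT database (credentials here are placeholders)
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'username', 'password');
$insert = $db->prepare('INSERT INTO category_oed (cid, path) VALUES (:cid, :path)');

// process every XML file in the directory
foreach(glob('/path/to/oed-xml/*.xml') as $file){
    $xml = simplexml_load_file($file);
    foreach($xml->{'class'} as $c){
        $insert->execute(array(
            ':cid'  => (string) $c['id'],
            ':path' => (string) $c['path']
        ));
    }
}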

With all of the OED data in place I then began on my next task, which was to update our ‘v1maincat’ column so that for each category the contents corresponded to the ‘path’ column in the OED data.  Thankfully Fraser had previously documented the steps required to do this during the SAMUELS project, and equally thankfully, I found a script that I’d previously created to follow these steps.

I completed the steps needed to convert the ‘v1maincat’ field into a new ‘oedmaincat’ field that (together with ‘sub’ and ‘pos’) matches our categories to the OED categories.  This was slightly trickier than I expected because I’d forgotten that some of the steps in Fraser’s ‘OED Changes to v1 Codes’ document weren’t handled by my script but directly through MySQL.  Thankfully I realised this and have now added them all to the script, so the conversion process can now be run in one go in a matter of seconds.

With that done I created a script that goes through each of our categories, picks out the ‘oedmaincat’, ‘subcat’ and ‘pos’ fields and queries these against the ‘path’, ‘sub’ and ‘pos’ fields in the ‘category_oed’ table.  Where it finds a match it displays this on screen and adds the OED ‘cid’ to our ‘category’ table for future use (if this has not already been added).  Where it doesn’t find a match it displays a notification in red, so you can easily spot the problems.
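
The matching logic is roughly as follows.  This is a simplified sketch that assumes the PDO connection ($db) from the earlier sketch, that our category table’s primary key is called ‘catid’ and that the OED category id is stored in an ‘oedcatid’ column:

$categories = $db->query('SELECT catid, oedmaincat, subcat, pos FROM category');
$find = $db->prepare('SELECT cid FROM category_oed WHERE path = :path AND sub = :sub AND pos = :pos');
$update = $db->prepare('UPDATE category SET oedcatid = :cid WHERE catid = :catid AND oedcatid IS NULL');

foreach($categories as $cat){
    $find->execute(array(':path' => $cat['oedmaincat'], ':sub' => $cat['subcat'], ':pos' => $cat['pos']));
    $match = $find->fetch(PDO::FETCH_ASSOC);
    if($match){
        // store the OED category id against our category (if not already stored)
        $update->execute(array(':cid' => $match['cid'], ':catid' => $cat['catid']));
        echo 'Matched ' . $cat['catid'] . ' to OED category ' . $match['cid'] . '<br>';
    } else {
        // highlight problem categories in red so they are easy to spot
        echo '<span style="color:red">No match for category ' . $cat['catid'] . '</span><br>';
    }
}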

We have 235,249 categories while there are 235,893 categories in the OED data (I guess this is because they include empty categories that have no POS).

Of our 235,249 categories, 233,406 have a value in the ‘oedmaincat’ column, and of these 220,866 now have an ‘oedcatid’ – i.e. a connection to a category in the ‘category_oed’ table.  This will allow us to compare lexemes in each category.  We will probably need to investigate what can be done about the 15,000 or so categories that don’t match up – and also check that the matches that are present are actually correct.  I think Fraser will need to get involved to do this.

For SCOSYA I met with Gary and Jennifer a couple of times this week to discuss the consistent / conflicting data view facilities that I created last week, and I extended their functionality significantly.  Where previously the tables were generated without staff being able to specify anything other than the age of the participants, the new facility allows many things to be tweaked, such as which ratings we consider to be ‘high’ or ‘low’ and whether ‘spurious’ data should be included.  It’s a much more flexible and useful tool now.  I also added in averages for each code as well as averages for each location.
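
As a rough illustration of the classification involved – this is only a sketch of the general idea, and the threshold values and field names here are placeholders rather than the real configuration – a code at a location counts as ‘mixed’ when its ratings fall on both sides of the chosen thresholds, and ratings flagged as spurious can be skipped:

// classify the ratings for one code at one location as 'high', 'low' or 'mixed'
function classifyRatings($ratings, $highThreshold = 4, $lowThreshold = 2, $includeSpurious = false){
    $high = 0;
    $low = 0;
    foreach($ratings as $r){
        if(!$includeSpurious && $r['spurious']) continue;
        if($r['score'] >= $highThreshold) $high++;
        if($r['score'] <= $lowThreshold) $low++;
    }
    if($high > 0 && $low > 0) return 'mixed'; // conflicting judgements
    if($high > 0) return 'high';              // consistently high ratings
    if($low > 0) return 'low';                // consistently low ratings
    return 'none';                            // nothing outside the middle range
}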

Jennifer wanted me to look into presenting the data via a chart of some sort as well as via the in-browser and CSV-based tables.  I decided to use the HighCharts JavaScript library (http://www.highcharts.com/demo) as I’ve used it before and found it to be very useful and easy to work with (its syntax is very much like jQuery’s, which is good for me).  I created two charts as follows:

  1. The ‘Code’ chart has the codes on the X axis and the percentage of responses that are ‘mixed’ on the Y axis, with one percentage (the proportion of each code’s responses that are ‘mixed’) plotted per code – see the sketch after this list for how this data might be assembled.
  2. The ‘Location’ chart has the locations on the X axis and the percentage on the Y axis, with one percentage (the proportion of each location’s responses that are ‘mixed’) plotted per location.
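
As a rough sketch of how the data behind the ‘Code’ chart might be assembled – assuming the consistency results have already been tallied into a hypothetical array of ‘mixed’ and total counts per code – the resulting arrays can then be passed to the HighCharts configuration as the X axis categories and the series data:

// $codeResults is assumed to look like: 'code1' => array('mixed' => 3, 'total' => 12), ...
$codes = array();
$percentages = array();
foreach($codeResults as $code => $counts){
    $codes[] = $code;
    $percentages[] = round(($counts['mixed'] / $counts['total']) * 100, 1);
}
// output as JSON for the HighCharts setup in the page's JavaScript
echo json_encode(array('codes' => $codes, 'percentages' => $percentages));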

These charts work rather well, and after completing them Gary and Jennifer tried them out and we had a further meeting to discuss extending the functionality.  I didn’t get a chance to complete all of the new updates that resulted from this meeting, but the main thing I added in was parent categories for codes, which allow us to group codes together.  Gary prepared a spreadsheet containing the initial groupings and I added all of them to a database table, associated the codes with these and updated the CMS to add in facilities to add new parents, browse code parents, see which codes have which parent and assign a parent to a code.  I’ve also made a start on updating the ‘consistency’ data views to take this into consideration, but I haven’t completed this yet.
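
As an illustration of the sort of structure involved – all of the table and column names here are hypothetical, and $db is again an assumed PDO connection – the idea is simply a parent lookup table plus a parent reference on each code:

// a lookup table of parent groupings, plus a reference to it from each code
$db->exec('CREATE TABLE code_parent (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255))');
$db->exec('ALTER TABLE code ADD parent_id INT NULL');

// assigning a parent to a code via the CMS then becomes a simple update
$assign = $db->prepare('UPDATE code SET parent_id = :parent WHERE id = :code');
$assign->execute(array(':parent' => 1, ':code' => 42));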

Other than working on these two projects, I spent a few hours considering possible options for adding sound files and pronunciations to the REELS database and content management system, gave a colleague in TFTS some advice on a WordPress issue, updated the Technical Plan for Carolyn Jess-Cooke’s proposal, and gave feedback and wrote a little bit of text for Jane Stuart-Smith’s ‘Talking Heads’ proposal.

I will be on holiday all of next week so it will be a couple of weeks before my next post.