Week Beginning 24th October 2016

I had a very relaxing holiday last week and returned to work on Monday.  When I got back I spent a bit of time catching up with things: going through my emails, writing ‘to do’ lists and the like.  Once that was out of the way I settled back down into some actual work.

I started off with Mapping Metaphor.  Wendy had noticed a bug with the drop-down ‘show descriptors’ buttons in the search results page, which I swiftly fixed.  Carole had also completed work on all of the Old English metaphor data, so I uploaded this to the database.  Unfortunately, this process didn’t go as smoothly as previous data uploads due to some earlier troubles with this final set of data (this was the dreaded ‘H27’ data, originally a single category that was later split into several smaller ones, which caused problems for Flora’s Access database that the researchers were using).

Rather than updating rows, the data upload added new ones, because the ordering of cat1 and cat2 appeared to have been reversed since stage 4 of the data processing.  For example, in the database cat1 is ‘1A16’ and cat2 is ‘2C01’ but in the spreadsheet these are the other way round.  Thankfully this was consistently the case, so once identified the problem was easy to rectify.  For Old English we now have a complete set of metaphorical connections, consisting of 2488 connections and 4985 example words.  I also fixed a slight bug in the category ordering for OE categories and replied to Wendy about a query she had received regarding access to the underlying metaphor data.  After that I updated a few FAQs and all was complete.
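The fix was essentially a matter of swapping the two columns back during the import.  Sketched in JavaScript (the real upload script is PHP, and the rule used here for deciding the canonical ordering is my assumption):

```javascript
// Sketch: normalise a spreadsheet row so cat1/cat2 match the database
// ordering before matching and updating. Assumes the lower category code
// (e.g. '1A16') always belongs in cat1, as in the example above.
function normaliseCats(row) {
  if (row.cat1 > row.cat2) {
    // The spreadsheet has the pair reversed - swap them back.
    return { ...row, cat1: row.cat2, cat2: row.cat1 };
  }
  return row;
}
```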

Also this week I undertook some more AHRC work, which took up the best part of a day, and I replied to a request from Gavin Miller about a Medical Humanities Network mailing list.  We’ve agreed a strategy to implement such a thing, which I hope to undertake next week.  I also chatted to Chris about migrating the Scots Corpus website to a new server.  The fact that the underlying database is PostgreSQL rather than MySQL is causing a few issues here, but we’ve come up with a solution to this.

I spent a couple of days this week working on the SCOSYA project, continuing with the updates to the ‘consistency data’ views that I had begun before I went away.  I added an option to the page that allows staff to select which ‘code parents’ they want to include in the output.  This defaults to ‘all’ but you can narrow this down to any of them as required.  You can also select or deselect ‘all’ which ticks / unticks all the boxes.  The ‘in-browser table’ view now colour codes the codes based on their parent, with the colour assigned to a parent listed across the top of the page.  The colours are randomly assigned each time the page loads so if two colours are too similar a user can reload the page and different ones will take their place.
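The random colour assignment could be sketched like this in JavaScript (the evenly-spaced-hue approach is an illustration rather than the actual implementation):

```javascript
// Sketch: assign a colour to each code parent on page load. Hues are
// spaced evenly around the colour wheel so they stay distinct, with a
// random offset so a reload produces a different set of colours.
function parentColours(parents) {
  const offset = Math.random() * 360;
  const step = 360 / parents.length;
  const colours = {};
  parents.forEach((p, i) => {
    colours[p] = 'hsl(' + Math.round((offset + i * step) % 360) + ', 70%, 50%)';
  });
  return colours;
}
```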

Colour coding is not possible in the CSV view as CSV files are plain text and can’t have any formatting.  However, I have added the colour coding to the chart view, which colours both the bars and the code text based on each code’s parent.  I’ve also added in a little feature that allows staff to save the charts.

I then added in the last remaining required feature to the consistency data page, namely making the figures for ‘percentage high’ and ‘percentage low’ available in addition to ‘percentage mixed’.  In the ‘in-browser’ and CSV table views these appear as new rows and columns alongside ‘% mixed’, giving you the figures for each location and for each code.  In the Chart view I’ve updated the layout to make a ‘stacked percentage’ bar chart.  Each bar is the same height (100%) but is split into differently coloured sections to reflect the parts that are high, low and mixed.  I’ve made ‘mixed’ appear at the bottom rather than between high and low as mixed is most important and it’s easier to track whatever is at the bottom.  This change in chart style does mean that the bars are no longer colour coded to match the parent code (as three colours are now needed per bar), but the x-axis label still has the parent code colour so you can still see which code belongs to each parent.
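A HighCharts configuration for a stacked percentage chart of this kind would look roughly like this (a sketch; the data values are illustrative, not real figures):

```javascript
// Sketch of a HighCharts 'stacked percentage' column chart with the
// 'mixed' series at the bottom of each bar.
const options = {
  chart: { type: 'column' },
  xAxis: { categories: ['A1', 'A2', 'A3'] },
  yAxis: { max: 100, title: { text: 'Percentage of ratings' } },
  plotOptions: { column: { stacking: 'percent' } },
  // With HighCharts' default stack ordering (yAxis.reversedStacks: true)
  // the first series sits at the top, so 'mixed' is listed last to keep
  // it at the bottom of each bar.
  series: [
    { name: 'high',  data: [40, 20, 60] },
    { name: 'low',   data: [30, 50, 10] },
    { name: 'mixed', data: [30, 30, 30] }
  ]
};
```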

I spent most of the rest of the week working with the new OED data for the Historical Thesaurus.  I had previously made a page that lists all of the HT categories and notes which OED categories match up, or if there are no matching OED categories.  Fraser had suggested that it would be good to be able to approach this from the other side – starting with OED categories and finding which ones have a matching HT category and which ones don’t.  I created such a script, and I also updated both this and the other script so that the output either displays all of the categories or just those that don’t have matches (as these are the ones we need to focus on).

I then focussed on creating a script that matches up HT and OED words for each category where the HT and OED categories match up.  What the script does is as follows:

  1. Finds each HT category that has a matching OED category
  2. Retrieves the lists of HT and OED words in each
  3. For each HT word displays it and the HT ‘fulldate’ field
  4. For each HT word it then checks to see if an OED word matches.  This checks the HT’s ‘wordoed’ column against the OED’s ‘ght_lemma’ column and also the OED’s ‘lemma’ column (as I noticed sometimes the ‘ght_lemma’ column is blank but the ‘lemma’ column matches)
  5. If an OED word matches the script displays it and its dates (OED ‘sortdate’ (=start) and ‘enddate’)
  6. If there are any additional OED words in the category that haven’t been matched to an HT word these are then displayed
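Steps 4–6 boil down to the following matching logic, sketched here in JavaScript (the actual script is PHP; the column names are as described above and the data is illustrative):

```javascript
// Sketch: for each HT word, look for an OED word whose ght_lemma (or,
// failing that, lemma) matches the HT wordoed. Matched OED words are
// removed from the pool so leftovers can be listed afterwards (step 6).
function matchWords(htWords, oedWords) {
  const remaining = [...oedWords];
  const rows = htWords.map(ht => {
    const i = remaining.findIndex(o =>
      o.ght_lemma === ht.wordoed || o.lemma === ht.wordoed);
    const oed = i > -1 ? remaining.splice(i, 1)[0] : null;
    return { ht, oed }; // oed is null when no OED word matches
  });
  return { rows, unmatchedOed: remaining };
}
```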


Note that this script has to process every word in the HT thesaurus and every word in the OED thesaurus data, so it’s rather a lot of data.  I tried running it on the full dataset but this resulted in Firefox crashing, and Chrome too.  For this reason I’ve added a limit on the number of categories that are processed.  By default the script starts at 0 and processes 10,000 categories.  ‘Data processing complete’ appears at the bottom of the output so you can tell it’s finished, as sometimes a browser will just silently stop processing.  You can look at a different section of the data by passing parameters to it – ‘start’ (the row to start at) and ‘rows’ (the number of rows to process).  I’ve tried it with 50,000 categories and it worked for me, but any more than that may result in a crashed browser.  I think the output is pretty encouraging.  The majority of OED words appear to match up, and for the OED words that don’t I could create a script that lists these and we could manually decide what to do with them – or we could just automatically add them, but there are definitely some in there that should match – such as HT ‘Roche(‘s) limit’ and OED ‘Roche limit’.  After that I guess we just need to figure out how we handle the OED dates.  Fraser, Marc and I are meeting next week to discuss how to take this further.

I was left with a bit of time on Friday afternoon which I spent attempting to get the ‘essentials of Old English’ app updated.  A few weeks ago it was brought to my attention that some of the ‘C’ entries in the glossary were missing their initial letters.  I’ve fixed this issue in the source code (it took a matter of minutes) but updating the app even for such a small fix takes rather a lot longer than this.  First of all I had to wait until gigabytes of updates had been installed for macOS and Xcode, as I hadn’t used either for a while.  After that I had to update Cordova, and then I had to update the Android developer tools.  Cordova kept failing to build my updated app because it said I hadn’t accepted the license agreement for Android, even though I had!  This was hugely frustrating, and eventually I figured out the problem was I had the Android tools installed in two different locations.  I’d updated Android (and accepted the license agreements) in one place, but Cordova uses the tools installed in the other location.  After realising this I made the necessary updates and finally my project built successfully.  Unfortunately about three hours of my Friday afternoon had by that point been used up and it was time to leave.  I’ll try to get the app updated next week, but I know there are more tedious hoops to jump through before this tiny fix is reflected in the app stores.


Week Beginning 10th October 2016

It was a four-day week this week as I am off on holiday on Friday and will be away all next week too.  This week I split my time primarily between two projects:  SCOSYA and the Historical Thesaurus.  Last week I met with Fraser and Marc to discuss how we were to take the new data that the OED had sent us and amalgamate it with our data.  The OED data was in XML and my first task was extracting this data and converting it to SQL format so that it would be possible to query it against our SQL based HT database.  I created two new tables in the HT database:  ‘category_oed’ and ‘lexeme_oed’ to contain all of the information from the XML.  I then created a handy little PHP script that could go through all of the XML files in a directory and extract the necessary information from elements and attributes.  After testing and tweaking my script I ran it on the full set of XML files and it populated the tables remarkably quickly.  There are 235,893 categories in the ‘category_oed’ table and 688,817 words in the ‘lexeme_oed’ table.  This includes all of the required fields from the XML, such as dates, ‘GHT’ fields, OED references, definitions etc.  I did have to manually change the part of speech after generating the data so that it matches our structure (e.g. ‘n’ rather than ‘noun’).  This only took a minute or so to do, though.

My script makes use of a very simple XML processor in PHP called ‘simplexml’.  It allows you to load an XML file and then just process it like any PHP object or multidimensional array.  You load your XML file like so:

$xml = simplexml_load_file($file);

And then you can do what you want with it, for example, to go through each ‘class’ in the OED data (which is what they call a ‘category’) and assign the attributes to PHP variables:

foreach($xml->{'class'} as $c){
    $id = (string)$c['id']; // cast, as SimpleXML returns attribute objects
    $path = (string)$c['path'];
}


Note that ‘class’ has to appear as {‘class’} purely because ‘class’ is a reserved word in PHP.  With all of the OED data in place I then began on my next task, which was to update our ‘v1maincat’ column so that for each category the contents corresponded to the ‘path’ column in the OED data.  Thankfully Fraser had previously documented the steps required to do this during the SAMUELS project, and equally thankfully, I found a script that I’d previously created to follow these steps.

I completed the steps needed to convert the ‘v1maincat’ field into a new ‘oedmaincat’ field that (together with ‘sub’ and ‘pos’) matches our categories to the OED categories.  It was slightly trickier than I thought it would be to do this because I’d forgotten that some of the steps in Fraser’s ‘OED Changes to v1 Codes’ document weren’t handled via my script but directly through MySQL.  Thankfully I realised this and have now added them all to the script, so the conversion process can now be done all at once in a matter of seconds.

With that done I then created a script that goes through each of our categories, picks out the ‘oedmaincat’, ‘subcat’ and ‘pos’ fields and then queries these against the ‘path’, ‘sub’ and ‘pos’ fields in the ‘category_oed’ table.  Where it finds a match it displays this on screen and adds the OED ‘cid’ to our ‘category’ table for future use (if this has not already been added).  Where it doesn’t find a match it displays a notification in red, so you can easily spot the problems.
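The matching itself can be sketched like this (in JavaScript rather than the actual PHP, with illustrative data; field names are as described above):

```javascript
// Sketch: match our categories to OED categories on the
// oedmaincat/subcat/pos vs path/sub/pos triples, using a keyed lookup
// built once rather than one query per category.
function matchCategories(ours, oed) {
  const byKey = new Map(oed.map(c => [c.path + '|' + c.sub + '|' + c.pos, c]));
  return ours.map(c => ({
    cat: c,
    // null marks the 'red notification' case - no matching OED category
    oedMatch: byKey.get(c.oedmaincat + '|' + c.subcat + '|' + c.pos) || null
  }));
}
```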

We have 235,249 categories and there are 235,893 categories in the OED data (I guess this is because they have empty categories that have no POS).

Of our 235,249 categories there are 233,406 that have a value in the ‘oedmaincat’ column.  Of these there are now 220,866 that have an ‘oedcatid’ – i.e. a connection to a category in the ‘category_oed’ table.  This will allow us to compare lexemes in each category.  We will probably need to investigate what can be done about the 15,000 or so categories that don’t match up – and also check that the matches that are present are actually correct.  I think Fraser will need to get involved to do this.

For SCOSYA I met with Gary and Jennifer a couple of times this week to discuss the consistent / conflicting data view facilities that I created last week.  I extended the functionality of this significantly this week.  Where previously the tables were generated without staff being able to specify anything other than the age of the participants, the new facility allows many things to be tweaked, such as what ratings we consider to be ‘high’ or ‘low’, and whether ‘spurious’ data should be included.  It’s a much more flexible and useful tool now.  I also added in averages for each code as well as averages for each location.

Jennifer wanted me to look into presenting the data via a chart of some sort as well as via in-browser and CSV based tables.  I decided I would use the HighCharts JavaScript library (http://www.highcharts.com/demo) as I’ve used it before and found it to be very useful and easy to work with (its syntax is very much like jQuery, which is good for me).  I created two charts as follows:

  1. The ‘Code’ chart has the codes on the X axis and the percentage of responses that are mixed on the Y axis, with one percentage (the percentage of each code that is ‘mixed’) per code plotted.
  2. The ‘Location’ chart has the locations on the X axis and the percentage on the Y axis, with one percentage (the percentage of each location that is ‘mixed’) per location plotted.
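The ‘Code’ chart is a straightforward HighCharts column chart; its configuration would be along these lines (a sketch with illustrative data):

```javascript
// Sketch of the 'Code' chart: one column per code, plotting the
// percentage of that code's ratings that are 'mixed'.
const codeChart = {
  chart: { type: 'column' },
  title: { text: 'Percentage mixed per code' },
  xAxis: { categories: ['A1', 'A2', 'A3'] },
  yAxis: { max: 100, title: { text: '% mixed' } },
  series: [{ name: '% mixed', data: [25, 50, 0] }]
};
```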

These charts work rather well, and after completing them Gary and Jennifer tried them out and we had a further meeting to discuss extending the functionality further.  I didn’t get a chance to complete all of the new updates that resulted from this meeting, but the main thing I added in was parent categories for codes, which allows us to group codes together.  Gary prepared a spreadsheet containing the initial groupings and I added all of them to a database table, associated the codes with these and updated the CMS to add in facilities to add new parents, browse code parents, see which codes have which parent and assign a parent to a code.  I’ve also made a start on updating the ‘consistency’ data views to take this into consideration, but I haven’t completed this yet.

Other than working on these two projects I spent a few hours considering possible options for adding sound files and pronunciations to the REELS database and content management system, I gave a colleague in TFTS some advice on a WordPress issue, I updated the Technical Plan for Carolyn Jess-Cooke’s proposal and I gave feedback and wrote a little bit of text for Jane Stuart-Smith’s ‘Talking Heads’ proposal.

I will be on holiday all of next week so it will be a couple of weeks before my next post.


Week Beginning 3rd October 2016

I spent most of my time this week split between two projects:  SCOSYA and REELS.  I’ll discuss REELS first.  On Wednesday I attended a project meeting for the REELS project, the first one I’ve attended for several months as I had previously finished work on the content management system and hadn’t had anything to do for the project for a while.  The project has recently appointed their PhD student so it seemed like a good time to have a full project meeting, and it was good to catch up with the project again and meet the new member of staff.  A few updates and fixes to the content management system were requested at the meeting, so I spent some time this week working on these, specifically:

  1. I added a new field to the place-name table for recording whether the place-name is ‘non-core’ or not. This is to allow names like ‘Edrington’, which appear in names like ‘Edrington Castle’ but don’t appear to exist as names in their own right, to be recorded.  It’s a ‘yes/no’ field as with ‘Obsolete’ and ‘Linear’ and appears on the ‘add’ and ‘edit’ place-name pages underneath the ‘Linear’ option.
  2. I fixed the issue caused when selecting the same parish as both ‘current’ and ‘former’. The system was giving an error when this situation arose and I realised this is because the primary key for the table connecting place-name and parish was composed of the IDs for the relevant place-name and parish – i.e. only one join was possible for each place-name / parish pairing.  I fixed this by adding the ‘type’ field (current or former) to the primary key, thus allowing one of each type to appear for each pairing.
  3. I updated the column sorting in the ‘browse place-names’ page so that pressing on a column heading sorts the complete dataset on this column rather than just the 50 rows that are displayed at any one time. Pressing on the column header once orders it ascending and a second time orders it descending.  This required a pretty major overhaul of the ‘browse’ page as sorting had to be done on the server side rather than the client side.  Still, it works a lot better now.
  4. I added a rudimentary search facility to the ‘browse place-names’ page, which replaces the ‘select parish’ facility. The search facility allows you to select a parish and/or a code and/or supply some text that may be found in the place-name field.  All three search options may be combined – e.g. list all place-names that include the text ‘point’ that are coastal in EYM.  The text search is currently pretty basic: it matches any part of the place-name text and no wildcards can be used.  E.g. a search for ‘rock’ finds ‘Brockholes’.  Hopefully this will suffice until we’re thinking about the public website.
  5. I tested adding IPA characters to the ‘pronunciation’ field and this appears to work fine (I’m sure I would have tested this out when I originally created the CMS anyway but just thought I’d check again).

In addition I also met separately with the project’s PhD student to go over the content management system with him.  That’s probably all I will need to do for the project until we come to develop the front end, which I’ll make a start on sometime next year.

For the SCOSYA project this week I finished work on a table in the CMS that shows consistent / conflicted data.  This can be displayed as a table in your browser or saved as a CSV to open in Excel.  The structure of the table is as Gary suggested to me last week:

One row per attribute (e.g. ‘A1’) and one column per location (e.g. Airdrie).  If all of the ratings for an attribute for a location are 4 or 5 then the cell contains ‘High’.  If all of the ratings are 1-2 then the cell contains ‘Low’.  If the ratings are something else then the cell contains ‘Mixed’.  Note that if the attribute was not recorded for a location the cell is left blank.  In the browser based table I’ve given the ‘Mixed’ cells a yellow border so you can more easily see where these appear.

I have also added in a row at the top of the table that contains the percentage of attributes for each location that are ‘Mixed’.  Note that this percentage does not take into consideration any attributes that are not recorded for a location.  E.g. if ‘Location A’ has ‘High’ for attribute A1, ‘Mixed’ for A2 and blank for A3 then the percentage mixed will be 50%.  I have also added in facilities to limit the data to the young or old age groups.  Towards the end of the week I met with Gary again and he suggested some further updates to the table, which I will hopefully implement next week.
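The cell and percentage logic described above can be sketched as follows (in JavaScript rather than the PHP the CMS actually uses):

```javascript
// Sketch of the cell logic: 'High' if all ratings are 4-5, 'Low' if all
// are 1-2, 'Mixed' otherwise, and blank (null) when nothing was recorded.
function classify(ratings) {
  if (ratings.length === 0) return null;
  if (ratings.every(r => r >= 4)) return 'High';
  if (ratings.every(r => r <= 2)) return 'Low';
  return 'Mixed';
}

// Percentage mixed for a location, ignoring unrecorded attributes.
function percentMixed(cells) {
  const recorded = cells.filter(c => c !== null);
  return 100 * recorded.filter(c => c === 'Mixed').length / recorded.length;
}
```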

I also met with Gary and Jennifer this week to discuss the tricky situation with grey squares vs grey circles on the map as discussed in last week’s post.  We decided to include grey circles (i.e. there is data but it doesn’t meet your criteria) for all locations where there is data for the specified attributes, so long as the attribute is not included with a ‘NOT’ joiner.  After the meeting I updated the map to include such grey circles and it appears to be working pretty well.  I also updated the map pop-ups to include more information about each rating, specifically the text for the attribute (as opposed to just the ID) and the age group for each rating.

The last big thing I did for the project was to add in a ‘save image’ facility to the atlas, which allows you to save an image of the map you’re viewing, complete with all markers.  This was a pretty tricky thing to implement as the image needs to be generated in the browser by pulling in and stitching together the visible map tiles, incorporating all of the vector based map marker data and then converting all of this into a raster image.  Thankfully I found a plugin that handled most of this (https://github.com/mapbox/leaflet-image), although it required some tweaking and customisation to get it working.  The PNG data is created as Base64 encoded text, which can then be appended to an image tag’s ‘src’ attribute.

What I really wanted was to have the image automatically work as a download rather than get displayed in the browser.  Unfortunately I didn’t manage to get this working.  I know it is possible if the Base64 data is posted to a server which then fires it back as a file for download (I did this with Mapping Metaphor) but for some reason the server was refusing to accept the data.  Also, I wanted something that worked on the client side rather than posting and then retrieving data to / from the server, which seems rather wasteful.  I managed to get the image to open in a new window, but this meant the full image data appeared in the browser’s address bar, which was horribly messy.  It also meant the user still had to manually select ‘save’.  So instead I decided to have the image open in the page, in an overlay.  The user still has to manually save the image, but it looks neater and it allows the information about image attribution to be displayed too.  The only further issue was that this didn’t work if the atlas was being viewed in ‘full screen’ mode, so I had to figure out a way of programmatically exiting out of full screen mode if the user pressed the ‘save image’ button when in this view.  Thankfully I found a handy function call that did just this: fullScreenApi.cancelFullScreen();
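The overlay approach boils down to building a data URL from the Base64 PNG text and using it as the image ‘src’.  A sketch (the element ID is hypothetical; leafletImage is the plugin’s entry point):

```javascript
// Build an image 'src' value from Base64-encoded PNG data.
function toDataUrl(base64Png) {
  return 'data:image/png;base64,' + base64Png;
}

/* In the browser the plugin renders the map to a canvas, which can then
   be shown in an overlay (canvas.toDataURL() produces the same form of
   string as toDataUrl above):

leafletImage(map, function (err, canvas) {
  if (err) return;
  document.getElementById('atlas-save-overlay-img').src = canvas.toDataURL();
});
*/
```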

Fraser contacted me on Monday to say that Lancaster have finished tagging the EEBO dataset for the LinguisticDNA project and were looking to hand this over to us.  On Tuesday Lancaster placed the zipped data on a server and I managed to grab it, extracting the 11GB zip file into 25,368 XML files (although upon closer inspection the contents don’t really appear to be XML, just tab delimited text with a couple of wrapper tags).  I copied this to the J: drive for Fraser and Marc to look at.

I also had an email chat with Thomas Widmann at SLD about the font used for the DSL website.  Apparently this doesn’t include IPA characters, which is causing the ‘pronunciation’ field to display inconsistently (i.e. the characters the font does include display in the font, while those it doesn’t fall back to the computer’s default sans serif font).  On my PC the difference in character size is minimal but I think it looks worse on Thomas’s computer.  We discussed possible solutions, the easiest of which would be to simply ensure that the ‘pronunciation’ field is fully displayed in the default sans serif font.  He said he’d get back to me about this.  I also gave a little bit of WordPress help to Maria Economou in HATII, who had an issue with a drop-down menu not working in iOS.  We upgraded her theme to one that supported responsive menus and fixed that issue pretty quickly.

I also met with Fraser and Marc on Friday to discuss the new Historical Thesaurus data that we had received from the OED people.  We are going to want to incorporate words and dates from this data into our database, which is going to involve several potentially tricky stages.  The OED data is in XML and as I mentioned in a previous week there is no obvious ID that can be used to link their data to ours.  Thankfully during the SAMUELS project someone had figured out how to take data in one of our columns and rework it so it matches up with the OED category IDs.  My first step will be to extract the OED data from XML and convert it into a format similar to our structure and then create a script that will allow categories in the two datasets to be aligned.  After that we’ll need to compare the contents of the categories (i.e. the words) and work out which are new ones plus which dates don’t match up.  It’s going to be a fairly tricky process but it should be fun.  On Friday afternoon I decided to tidy up the HT database, removing all of the unnecessary backup tables that I had created over the years.  I did a dump of the database before I did this in case I messed up.  It turns out I did mess up as I accidentally deleted the ‘lexeme_search_terms’ table, which broke the HT search facility.  I then discovered that my SQL dump was incomplete and had quit downloading mid-way through without telling me.  Thankfully Chris managed to get the database from a backup file and I’ve reinstated the required table, but it was a rather stressful way to end the week!