Week Beginning 19th December 2016

This is the last working week of 2016 and I worked the full five days.  The University was very quiet by Friday.  I spent most of the week working on the SCOSYA project, working through my list of outstanding items and meeting a few times with Gary, which resulted in more items being added to the list.  At the project meeting a couple of weeks ago Gary had pointed out that some places like Kirkcaldy were not appearing on the atlas and I managed to figure out why this was.  There are 6 questionnaires for Kirkcaldy in the system and the ‘Questionnaire Locations’ map splits the locations into different layers based on the number of questionnaires completed.  There are only four layers as each location was supposed to have 4 questionnaires.  If a location has more than this there was no layer to add it to, so it was getting missed off.  The same issue applied to ‘Fintry’, ‘Ayr’ and ‘Oxgangs’, as they all have five questionnaires.  Once I identified the problem I updated the atlas so that locations with more than 4 questionnaires do now appear in the location view.  These are marked with a black box so you can tell they might need fixing.  Thankfully the data for these locations was already being used as normal in the ‘attribute’ atlas searches.
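To illustrate the shape of the problem, here is a minimal sketch of the sort of grouping logic involved.  I’ve written it in PHP with invented variable names purely for illustration; the actual atlas may well do this grouping client-side in JavaScript.

// A sketch of grouping locations into map layers by questionnaire count
// (names invented for illustration).  Before the fix there was no bucket
// for counts above four, so those locations were silently dropped.
$layers = [1 => [], 2 => [], 3 => [], 4 => [], 'extra' => []];
foreach ($locations as $location) {
    $count = $location['questionnaireCount'];
    if ($count >= 1 && $count <= 4) {
        $layers[$count][] = $location;   // the four expected layers
    } elseif ($count > 4) {
        $layers['extra'][] = $location;  // catch-all layer, marked with a black box
    }
}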

With that out of the way I tackled a bigger item on my list: adding in facilities to allow staff to record information about the usage of codes in the questionnaire transcripts.  I created a spreadsheet listing all of the codes, through which Gary can note whether or not a code is expected to be identifiable in the transcripts, and I updated the database and the CMS to add fields for recording this.  I then updated the ‘view questionnaire’ page in the CMS to add in facilities to add and view information about the use of the codes in the transcripts.

Codes that have ‘Y’ or ‘M’ for whether they appear in recordings are highlighted in the ‘view questionnaire’ page with a blue border, and the ‘code ratings’ table now has four new columns: the number of examples found in the transcript, the examples themselves, whether this matches expectation, and transcript notes (there is no data for these columns in the system yet, though).  You can add data to these columns by pressing the ‘edit’ button at the top of the ‘view questionnaire’ page and then finding the highlighted code rows, which will be the only ones that have text boxes and things in the four new columns.  Add the required data and press the ‘update’ button and the information will be saved.
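For anyone interested in what happens behind the scenes, saving the four new fields when the ‘update’ button is pressed boils down to something like the sketch below.  The table and column names here are my own shorthand for illustration rather than the actual database structure.

// Hypothetical sketch of saving the four new transcript fields against a
// code rating (table and column names invented for illustration).
$sql = "UPDATE code_ratings
        SET transcript_examples_found = :found,
            transcript_examples = :examples,
            matches_expectation = :matches,
            transcript_notes = :notes
        WHERE questionnaire_id = :qid AND code_id = :cid";
$stmt = $pdo->prepare($sql);
$stmt->execute([
    'found' => $numberFound,
    'examples' => $examples,
    'matches' => $matchesExpectation,
    'notes' => $notes,
    'qid' => $questionnaireId,
    'cid' => $codeId
]);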

After implementing this new feature I set about fixing a bug in the Atlas that Gary had alerted me to.  If you perform a search for multiple attributes joined by ‘AND’ which results in no matching data being found, instead of displaying grey circles denoting data that didn’t match your criteria the Atlas was displaying nothing and giving a JavaScript error.  I figured out what was causing this and fixed it.  Basically, when all data was below the specified threshold the API was outputting the data as a JavaScript object containing the data objects rather than as a JavaScript array of data objects, and this change in expected format stopped the atlas script from working.  A relatively quick fix, thankfully.
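For anyone who has run into the same thing: a common cause in PHP (and I’d stress this is just the typical pattern rather than necessarily the exact SCOSYA code) is that filtering or re-keying an array leaves non-sequential keys, at which point json_encode() produces a JSON object rather than an array.  Re-indexing with array_values() before encoding restores the expected format:

// Once an array has been filtered its keys may no longer run 0, 1, 2...,
// and json_encode() will then emit {"2":{...},"5":{...}} instead of
// [{...},{...}], which breaks client-side code expecting an array.
$belowThreshold = array_filter($data, function ($item) use ($threshold) {
    return $item['average'] < $threshold;
});
echo json_encode(array_values($belowThreshold));  // array_values() re-indexes the keys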

After that I started to work on a new ‘interviewed by’ limit for the Atlas that will allow a user to show only data from interviews that were conducted by a fieldworker, or only those that weren’t.  I didn’t get very far with this, however, as Gary instead wanted me to create a new feature that will help him and Jennifer analyse the data.  It’s a script that allows the interview data in the database to be exported in CSV format for further analysis in Excel.

It allows you to select an age group, choose whether or not to include spurious data, and limit the output to particular codes / code parents or view all.  Note that ‘view all’ also includes codes that don’t have parents assigned.

The resulting CSV file lists one column per interviewee, arranged alphabetically by location.  For each interviewee there are then rows for their age group and location.  If you’ve included spurious data a further row gives you a count of the number of spurious ratings for the interviewee.

After these rows there are rows for each code that you’ve asked to include.  Codes are listed with their parent and attributes to make it easier to tell what’s what.  With ‘all codes’ selected there are a lot of empty rows at the top, as codes with no parent are listed first.  Note that if you want to exclude codes that don’t have parents, simply deselect and then reselect the checkbox for the parent ‘AFTER’ in the code selection list.  This means all parents are selected but the ‘All’ box is unselected.

For each code and each interviewee the rating is entered if there was one.  If you’ve selected to include spurious data these ratings are marked with an asterisk.  Where a code wasn’t present in the interview the cell is left blank.
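In case it’s useful to anyone building something similar, the core of this sort of export is straightforward with PHP’s fputcsv().  The sketch below uses invented variable names, assumes the data has already been pulled out of the database, and leaves out the optional spurious-count row for brevity.

// Minimal sketch of writing the transposed layout: interviewees as columns,
// codes as rows.  All names here are illustrative rather than the real code.
header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="scosya-export.csv"');
$out = fopen('php://output', 'w');

// Header rows: one column per interviewee, ordered alphabetically by location.
fputcsv($out, array_merge(['Age group'], array_column($interviewees, 'ageGroup')));
fputcsv($out, array_merge(['Location'], array_column($interviewees, 'location')));

// One row per code, with the rating (asterisked if spurious) or a blank cell.
foreach ($codes as $code) {
    $row = [$code['parent'] . ' - ' . $code['attribute']];
    foreach ($interviewees as $person) {
        if (!isset($ratings[$person['id']][$code['id']])) {
            $row[] = '';  // code not present in this interview
        } else {
            $rating = $ratings[$person['id']][$code['id']];
            $row[] = !empty($rating['spurious']) ? $rating['score'] . '*' : $rating['score'];
        }
    }
    fputcsv($out, $row);
}
fclose($out);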

Other than SCOSYA duties I did a bit more Historical Thesaurus work this week, creating the ‘Levenshtein’ script that Fraser wanted, as discussed last week.  I started to implement a PHP version of the Levenshtein algorithm from the page I linked to in my previous post, but thankfully my text editor highlighted the word ‘Levenshtein’, as it does with existing PHP functions it recognises.  Thank goodness it did, as it turns out PHP has its own ready-to-use Levenshtein function!  See http://php.net/manual/en/function.levenshtein.php

All you have to do is pass it two strings and it spits out a number: the minimum number of single-character edits needed to turn one string into the other, so the lower the number, the more similar the strings.  I therefore updated my script to incorporate this as an option.  You can specify a threshold level and also state whether you want to view those that are under and equal to the threshold or over the threshold.  Add the threshold by adding ‘lev=n’ to the URL (where n is the threshold).  By default it will display those categories that are over the threshold, but to view those that are under or equal instead, add ‘under=y’ to the URL.
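The comparison step itself ends up being just a few lines.  The sketch below is a rough illustration rather than the script itself: the field names are placeholders, and it assumes each category has two heading strings to compare, with punctuation stripped out before comparing (which, as noted below, gave the best results).

// Sketch of the comparison step (placeholder names throughout).
$threshold = isset($_GET['lev']) ? (int)$_GET['lev'] : 3;
$showUnder = isset($_GET['under']) && $_GET['under'] === 'y';

foreach ($categories as $cat) {
    // Strip punctuation before comparing.
    $headingA = preg_replace('/[[:punct:]]/', '', $cat['headingA']);
    $headingB = preg_replace('/[[:punct:]]/', '', $cat['headingB']);
    $distance = levenshtein($headingA, $headingB);

    // By default show categories over the threshold; with under=y show
    // those that are under or equal to it instead.
    if ($showUnder ? $distance <= $threshold : $distance > $threshold) {
        // ... output the category for checking ...
    }
}

One thing to watch with PHP’s levenshtein() is that it returns -1 if either string is longer than 255 characters, though category headings are unlikely to get anywhere near that.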

The test seems to work rather well when you set the threshold to 3 with punctuation removed and look for everything over that.  That gives just 3838 categories that are considered different, compared with the 5770 without the Levenshtein test.  Hopefully after Christmas Fraser will be able to put the script to good use.

I spent the remainder of the week continuing to migrate some of the old STELLA resources to the University’s T4 system.  I completed the migration of the ‘Bibliography of Scottish Literature’, which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/bibliography-of-scottish-literature/.  I then worked through the ‘Analytical index to the publications of the International Phonetic Association 1886-2006’, which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/ipa-index/.  After that I began working through the STARN resource (see http://www.arts.gla.ac.uk/stella/STARN/index.html) and managed to complete work on the first section (Criticism & Commentary).  It’s going to take a long time to get the resource fully migrated over, though, as there’s a lot of content.  The migrated site won’t ‘go live’ until all of the content has been moved.

And that’s pretty much it for this week and this year!