Week Beginning 3rd April 2017

I split my time this week pretty evenly between two projects, the Historical Thesaurus visualisations for the Linguistic DNA project and the New Modernist Editing project.  For the Historical Thesaurus visualisations I continued with the new filter options I had started last week that would allow users to view sparklines for thematic categories that exhibited particular features, such as peaks and plateaus (as well as just viewing sparklines for all categories).  It’s been some pretty complicated work getting this operational as the filters often required a lot of calculations to be made on the fly in my PHP scripts, for example calculating standard deviations for categories within specified periods.

I corresponded with Fraser about various options throughout the week.  I have set up the filter page so that when a particular period is selected by means of select boxes a variety of minimum and maximum values throughout the page are updated by means of an AJAX call.  For example, if you select the period 1500-2000 then the minimum and maximum standard deviation values for this period are calculated and displayed, so you get an idea of what values you should be entering.  See the screenshot below for an idea of how this works:

Eventually I will replace these ‘min’ and ‘max’ values and a textbox for supplying a value with a much nicer slider widget, but that’s still to be done.  I also need to write a script that will cache the values generated by the AJAX call as currently the processing is far too slow.  I’ll write a script that will generate the values for every possible period selection and will store these in a database table, which should mean their retrieval will be pretty much instantaneous rather than taking several seconds as it currently does.
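As a rough illustration of the caching idea, here is a minimal sketch in Python rather than the PHP the site actually uses.  The table name `period_ranges`, the function names and the shape of the input data are all hypothetical, and I've assumed a population standard deviation, which may not match the real script.

```python
import sqlite3
from itertools import combinations
from statistics import pstdev


def build_cache(conn, decade_counts, decades):
    """Precompute min / max standard deviation for every start-end pair.

    decade_counts: {category_id: {decade: word_count}} (hypothetical shape)
    decades: sorted list of decade start years, e.g. [1500, 1510, ...]
    Assumes at least one category is supplied.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS period_ranges ("
        "start_decade INTEGER, end_decade INTEGER, "
        "min_sd REAL, max_sd REAL, "
        "PRIMARY KEY (start_decade, end_decade))"
    )
    # combinations() on a sorted list yields every pair with start < end
    for start, end in combinations(decades, 2):
        sds = [
            pstdev([counts.get(d, 0) for d in decades if start <= d <= end])
            for counts in decade_counts.values()
        ]
        conn.execute(
            "INSERT OR REPLACE INTO period_ranges VALUES (?, ?, ?, ?)",
            (start, end, min(sds), max(sds)),
        )
    conn.commit()


def lookup(conn, start, end):
    """Return (min_sd, max_sd) for a period, or None if it is not cached."""
    return conn.execute(
        "SELECT min_sd, max_sd FROM period_ranges "
        "WHERE start_decade = ? AND end_decade = ?",
        (start, end),
    ).fetchone()
```

The AJAX call would then only need the single-row lookup, with the expensive work done once in advance.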

As you can see from the above screenshot, the user has to choose which type of filter they wish to apply by selecting a radio button.  This is to keep things simple, but we might change this in future.  During the week we decided to remove the ‘peak decade’ option as it seemed unnecessary to include this when the user could already select a period.

After removing the additional ‘peak decade’ selector I realised that it was actually needed.  The initial ‘select the period you’re interested in’ selector specifies the total range you’re interested in and therefore sets the start and end points of each sparkline’s x axis.  The ‘peak decade’ selector allows you to specify when in this period a peak must occur for the category to be returned.  So we need the two selectors to allow you to do something like “I want the sparklines to be from 1010s to 2000s and the peak decade to be between 1600 and 1700”.

Section 4 of the form has 5 possible options.  Currently only ‘All’, ‘peak’ and ‘plateau’ are operational and I’ll need to return to this after my Easter holiday.  ‘All’ brings back sparklines for your selected period and a selected average category size and / or minimum size of largest category.

‘Peaks’ allows you to specify a period (a subset of your initial period selection) within which a category must have its largest value for it to be returned.  You can also select a minimum percentage difference between the largest value and the end value.  The difference is a negative value and if you select -50 the filter will currently only bring back categories where the percentage difference is between -50 and 0.  I was uncertain whether this was right and Fraser confirmed that it should instead be between -50 and -100, so that’s another thing I’ll need to fix.
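To make the intended behaviour concrete, here is a small Python sketch of the peak test using the range Fraser confirmed (between the selected value and -100).  It is not the PHP on the site: the function names are hypothetical, and the formula for the percentage difference, (end - largest) / largest * 100, is my assumption about how the figure is worked out.

```python
def percentage_difference(largest, end_value):
    """Percentage change from the largest value to the end value.

    Returns None when the largest value is zero, since the
    percentage is undefined in that case.
    """
    if largest == 0:
        return None
    return (end_value - largest) / largest * 100


def is_peak(decade_counts, period_start, period_end,
            peak_start, peak_end, selected_diff):
    """decade_counts: {decade: word_count} for one category (hypothetical).

    selected_diff is the negative value chosen by the user, e.g. -50.
    """
    in_period = {d: c for d, c in decade_counts.items()
                 if period_start <= d <= period_end}
    if not in_period:
        return False
    # Earliest decade with the largest value (ties go to the earliest)
    peak_decade = max(sorted(in_period), key=in_period.get)
    if not (peak_start <= peak_decade <= peak_end):
        return False
    end_value = in_period[max(in_period)]  # value in the last decade
    diff = percentage_difference(in_period[peak_decade], end_value)
    # Keep categories that dropped by at least the selected amount
    return diff is not None and -100 <= diff <= selected_diff
```

With a selected value of -50, a category that peaks at 100 words and ends on 20 (a difference of -80) would be returned, while one that ends on 70 (a difference of -30) would not.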

You can also select a minimum standard deviation.  This is calculated based on your initial period selection.  E.g. if you say you’re interested in the period 1400-1600 then the standard deviation is calculated for each category based on the values for this period alone.
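Here is a minimal Python sketch of that calculation (the real scripts are PHP).  The function name and data shape are hypothetical, and I've assumed that both ends of the period are included and that a population standard deviation is used.

```python
from statistics import pstdev


def period_sd(decade_counts, start, end):
    """Standard deviation of a category's decade sizes within a period.

    decade_counts: {decade: word_count} for one category (hypothetical).
    Decades outside start..end are ignored entirely.
    """
    values = [count for decade, count in sorted(decade_counts.items())
              if start <= decade <= end]
    if not values:
        return None  # the category has no data in this period
    return pstdev(values)
```

A category would then be kept only if `period_sd(...)` is at least the minimum the user supplied.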

‘Plateaus’ is something that needs some further tweaking.  The option is currently greyed out until you select a date range that is from 1500 or later.  Currently you can specify a minimum mode, and the script works out the mode for each category for the selected period and if the mode is less than the supplied minimum mode the category is not returned.  I think that specifying the minimum number of times the mode occurs would be a better indicator and I will need to implement this.

You can also specify the ‘minimum frequency of 5% either way from mode’.  For your specified period this currently works out 5% under the mode by multiplying the largest number of words in a decade by 0.05 and subtracting the result from the mode.  5% over is the same value (the largest number of words multiplied by 0.05) added onto the mode.  E.g. if the mode is 261 and the largest is 284 then 5% of the largest is 14.2, so 5% under is 246.8 and 5% over is 275.2.

For each category in your selected period the script counts the number of times the number of words in a decade falls within this range.  If the tally for a category is less than the supplied ‘minimum frequency of 5% either way from mode’ then the category is removed from the results.
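The range and tally described above can be sketched in Python as follows (again, not the actual PHP).  The function names are hypothetical, and taking the smallest value when several values tie for the mode is my assumption, as is treating the ends of the range as inclusive.

```python
from statistics import multimode


def mode_band(values, tolerance=0.05):
    """Return (low, high): the mode minus / plus 5% of the largest value."""
    mode = min(multimode(values))  # assumption: smallest of any tied modes
    margin = max(values) * tolerance
    return mode - margin, mode + margin


def count_in_band(values, tolerance=0.05):
    """Count the decades whose word count falls within the band."""
    low, high = mode_band(values, tolerance)
    return sum(1 for v in values if low <= v <= high)


def passes_plateau(values, min_frequency, tolerance=0.05):
    """Keep a category only if enough decades sit close to the mode."""
    return count_in_band(values, tolerance) >= min_frequency
```

For decade sizes with a mode of 261 and a largest value of 284, `mode_band` gives a range of roughly 246.8 to 275.2, matching the example above.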

I’ve updated the display of results to include some figures about each category as well as the sparkline and the ‘gloss’.  The largest decade, the average decade, the mode, the standard deviation, the frequency of decades that are within 5% under / over the mode and the 5% under / over range itself are displayed when relevant to your search.

There are some issues with the above that I still need to address.  Sometimes the sparklines are not displaying the red dots representing the largest categories, and this is definitely a bug.  Another serious bug also exists in that when some combinations of options are selected PHP encounters an error and the script terminates.  This seems to be dependent on the data and as PHP errors are turned off on the server I can’t see what the problem is.  I’m guessing it’s a divide by zero error or something like that.  I hope to get to the bottom of this soon.

For the New Modernist Editing project I spent a lot of my time preparing materials for the upcoming workshop.  I’m going to be running a two-hour lab on transcription, TEI and XML for post-graduates and I also have a further half-hour session another day where I will be demonstrating the digital edition I’ve been working on.  It’s taken some time to prepare these materials but I feel I’ve made a good start on them now.

I also met with Bryony this week to discuss the upcoming workshop and also to discuss the digital edition as it currently stands.  It was a useful meeting and after it I made a few minor tweaks to the website and a couple of fixes to the XML transcription.  I still need to add in the edited version of the text but I’m afraid that is going to have to wait until after the workshop takes place.

I also spent a little bit of time on other projects, such as reading through the proposal documentation for Jane Stuart-Smith’s new project that I am going to be involved with, and publishing the final ‘song of the week’ for the Burns project (See http://burnsc21.glasgow.ac.uk/my-wifes-a-wanton-wee-thing/).  I also spoke to Alison Wiggins about some good news she had received regarding a funding application.  I can’t say much more about it just now, though.

Also this week I fixed an issue with the SCOSYA Atlas for Gary.  The Atlas suddenly stopped displaying any content, which was a bit strange as I hadn’t made any changes to it around the time it appears to have broken.  A bit of investigation uncovered the source of the problem.  Questionnaire participants are split into two age groups – ‘young’ and ‘old’.  This is based on the participant’s age.  However, two questionnaires had been uploaded for participants whose ages did not quite fit into our ‘young’ and ‘old’ categories and they were therefore being given a null age group.  The Atlas didn’t like this and stopped working when it encountered data for these participants.  I have now updated the script to ensure the participants are within one of our age groups and I’ve also updated things so that if any other people don’t fit in, the whole thing doesn’t come crashing down.

I’m going to be on holiday all next week and will be back at work the Tuesday after that (as the Monday is Easter Monday).