Week Beginning 26th September 2016

I spent a lot of this week continuing to work on the Atlas and the API for the SCOSYA project, tackling a couple of particularly tricky feature additions, amongst other things.  The first tricky thing was adding in the facility to record all of the selected search options in the page URL so as to allow people to bookmark and share URLs.  This feature should be pretty useful, allowing people to save and share specific views of the atlas and to cite exact views in papers, and we’ll be able to add a ‘cite this page’ feature to the atlas too.  It will also form the basis for the ‘history’ feature I’m going to develop, which will track all of the views a user has created during a particular session.  There were two things to consider when implementing this feature: firstly, getting all of the search options added to the address bar; and secondly, adding in the facilities to get the correct options selected and the correct map data loaded when someone loads a page containing all of the search options.  Both tasks were somewhat tricky.

I already use a Leaflet plugin called leaflet-hash, which adds the zoom level, latitude and longitude to the page URL as a hash (the bit after the ‘#’ in a URL).  This nice little plugin already ensures that a page loads the map at the correct location and zoom level if the variables are present in the URL.  I decided to extend this to add the search criteria as additional hash elements.  This meant I had to rework the plugin slightly, as it was set to fail if more than the expected variables were passed in the hash.  With that updated, all I had to do was change the JavaScript that runs when a search is submitted so that all of the submitted options are added to the hash.  I had to test this out a bit as I kept getting an unwanted slash added to the address bar sometimes, but eventually I sorted that.
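To give a rough idea of what this looks like, here’s a simplified sketch of the ‘write’ half of the process – the function names and the hash layout are illustrative rather than the atlas’s actual code, which builds on the leaflet-hash plugin:

```javascript
// A minimal sketch of writing the search options into the location hash after
// a search is submitted. The parameter names (attrs, joiners, age, ratings)
// and the hash layout are illustrative; the real atlas encodes its own options.
function updateSearchHash(map, options) {
    // Keep the zoom/lat/lng part that leaflet-hash maintains
    var centre = map.getCenter();
    var parts = [
        map.getZoom(),
        centre.lat.toFixed(4),
        centre.lng.toFixed(4),
        'attrs=' + options.attributes.join('+'),
        'joiners=' + options.joiners.join('+'),
        'age=' + options.age,
        'ratings=' + options.ratings.join('+')
    ];
    // Replace the whole hash in one go so the map position and the search
    // criteria stay in sync
    window.location.hash = '#' + parts.join('/');
}
```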

So far so good, but I still had to process the address bar to extract the search criteria variables and then build the search facilities, including working out how many ‘attribute’ boxes to generate, which attributes should be selected, which limits should be highlighted and which joiners between variables needed to be selected.  This took some figuring out, but it was hugely satisfying to see it all coming together piece by piece.  With all of the search boxes pre-populated based on the information in the address bar the only thing left to do was automatically fire the ‘perform search’ code.  With that in place we now have a facility to store and share exact views of the atlas, which I’m rather pleased with.
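And a sketch of the ‘read’ half, again with hypothetical helper names (addAttributeBox and performSearch stand in for the atlas’s own functions):

```javascript
// A rough sketch of the reverse process: pull the search criteria back out of
// the hash on page load, rebuild the search boxes and then fire the search.
function restoreSearchFromHash() {
    var parts = window.location.hash.replace('#', '').split('/');
    if (parts.length <= 3) { return; } // only zoom/lat/lng present: nothing to restore
    var options = {};
    parts.slice(3).forEach(function (part) {
        var pair = part.split('=');
        options[pair[0]] = pair[1] ? pair[1].split('+') : [];
    });
    // Generate one attribute box per selected attribute, pre-select it and
    // set the joiner that links it to the previous box
    (options.attrs || []).forEach(function (attrId, i) {
        addAttributeBox(attrId, i > 0 ? options.joiners[i - 1] : null);
    });
    performSearch(options); // automatically fire the 'perform search' code
}
```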

I also decided to implement a ‘full screen’ view for the map, mainly because I’d tried it out on my mobile phone and the non-map parts of the page really cluttered things up and made the map pretty much impossible to use.  Thankfully there are already a few Leaflet plugins that provide ‘full screen’ functionality and I chose to use this one: https://github.com/brunob/leaflet.fullscreen.  It works very nicely – just like the ‘full screen’ option in YouTube – and with it added in the atlas becomes much more usable on mobile devices, and much prettier on any screen, actually.
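Wiring the plugin up only takes a couple of lines.  The snippet below follows the plugin’s documented usage as I remember it; option names may vary slightly between versions, so the plugin’s README is the place to check:

```javascript
// Adding the full-screen control to the atlas map - a sketch based on the
// plugin's documented usage; the map centre, zoom and option values here
// are placeholders.
var map = L.map('map').setView([56.5, -4.2], 7);

L.control.fullscreen({
    position: 'topleft',                  // where the button sits on the map
    title: 'View the atlas full screen',
    titleCancel: 'Exit full screen'
}).addTo(map);
```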

The second big feature addition I focussed on was ensuring that ‘absent’ data points also appear on the map when a search is performed.  There are two different types of ‘absent’ data points – locations where no data exists for the chosen attributes (we decided these would be marked with grey squares) and locations where there is data for the chosen attributes but it doesn’t meet the threshold set in the search criteria (e.g. ratings of 4 or 5 only).  These were to be marked with grey circles.  Adding in the first type of ‘absent’ markers was fairly straightforward, but the second type raised some tricky questions.

For one attribute this is relatively straightforward – if there isn’t at least one rating for the attribute at the location with the supplied limits (age, number of people, rating) then see if there are any ratings without these limits applied.  If so, return these locations and display them with a grey circle.

But what happens if there are multiple attributes?  How should these be handled when different joiners are used between attributes?  If the search is ‘x AND y NOT z’ without any other limits, should a location that has x and y and z be returned as a grey circle?  What about a location that has x but not y?  Or should both these locations just be returned as grey squares because there is no recorded data that matches the criteria?

One option is for grey-circle locations to have to match the criteria (x AND y NOT z) but ignore the other limits – e.g. for the query ‘x in old group rated by 2 people giving it 4-5 AND y in young group rated by 2 people giving it 1-2 NOT z in all ages rated by 1 or more giving it 1-5’.  A further query that ignores the limits would then run, and any locations that appear in it but are not found in the full query would be displayed as grey circles.  All other locations would be displayed as grey squares.
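To make the proposed approach concrete, here’s a rough sketch of the two-query logic, written in JavaScript for readability even though the real work would happen in the API; fetchLocations() is a hypothetical helper:

```javascript
// A sketch of the proposed two-query approach. fetchLocations() is a
// hypothetical helper returning an array of location IDs that match a query;
// in reality this logic would live in the API rather than the browser.
function classifyLocations(attributes, joiners, limits, allLocations) {
    var fullQuery = fetchLocations(attributes, joiners, limits); // criteria plus limits
    var noLimits = fetchLocations(attributes, joiners, null);    // criteria only, limits ignored
    return allLocations.map(function (loc) {
        if (fullQuery.indexOf(loc) !== -1) {
            return { location: loc, marker: 'coloured circle' }; // matches everything
        }
        if (noLimits.indexOf(loc) !== -1) {
            return { location: loc, marker: 'grey circle' };     // matches the criteria but not the limits
        }
        return { location: loc, marker: 'grey square' };         // no matching data at all
    });
}
```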

Deciding on a course of action for this required consultation with other team members, so Gary, Jennifer and I are going to meet next Monday.  I managed to get the grey circles working for a single attribute as described above.  It’s just the multiple attributes that are causing some uncertainty.

Other than SCOSYA work I did a few other things this week.  Windows 10 decided to upgrade to ‘anniversary edition’ without asking me on Tuesday morning when I turned on my PC, which meant I was locked out of my PC for 1.5 hours while the upgrade took place.  This was hugely frustrating as all my work was on the computer and there wasn’t much else I could do.  If only it had given me the option of installing the update when I next shut down the PC I wouldn’t have wasted 1.5 hours of work time.  Very annoying.

Anyway, I did some AHRC review work this week.  I also fixed a couple of bugs in the Mapping Metaphor website.  Some categories were appearing out of order when viewing data via the tabular view.  These were the ‘E’ categories – e.g. ‘1E15’.  It turns out that PHP was treating these category IDs as numbers written in ‘E notation’ (see https://en.wikipedia.org/wiki/Scientific_notation#E_notation).  Even when I explicitly cast the IDs as strings PHP still treated them as numbers, which was rather annoying.  I eventually solved the problem by adding a space character before the category ID for table ordering purposes.  Having this space made PHP treat the value as an actual string rather than a number.  I also took the opportunity to update the staff ‘browse categories’ pages of the main site and the OE site to ensure that the statistics displayed were the same as the ones on the main ‘browse’ pages – i.e. they include duplicate joins, as discussed in a post from a few weeks ago.

I also continued my email conversation with Adrian Chapman about a project he is putting together, and I spent about half a day working on the Historical Thesaurus again.  Over the summer the OED people sent us the latest version of their Thesaurus data, which includes a lot of updated information from the OED, such as more accurate dates for when words were first used.  Marc, Fraser and I had arranged to meet on Friday afternoon to think about how we could incorporate this data into the Glasgow Historical Thesaurus website.  Unfortunately Fraser was ill on Friday so the meeting had to be postponed, but I’d spent most of the morning looking at the OED people’s XML data, plus a variety of emails and Word documents about the data, and had figured out an initial plan for matching up their data with ours.  It’s complicated somewhat by the fact that there is no ‘foreign key’ we can use to link their data to ours.  We have category and lexeme IDs and so do they, but these do not correspond to each other.  They include our category notation information (e.g. 01.01), but we have reordered the HT several times since the OED people got the category notation, so what they consider to be category ’01.01’ and what we do don’t match up.  We’re going to have to work some magic with some scripts in order to reassemble the old numbering.  This could quite easily turn into a rather tricky task.  Thankfully we should only have to do it once, because once it’s done this time I’m going to add new columns to our tables that contain the OED’s IDs, so in future we can just use these.

Week Beginning 19th September 2016

I returned to work on Monday after being off sick on Friday last week.  I spent most of this week working for the SCOSYA project, further developing the atlas interface and the API.  On Monday I met with Jennifer Smith and Gary Thoms to discuss the interface and the outcomes from Gary’s meeting with the other project Co-Is in York the week before.  We agreed that we would continue to use individual data points on the atlas for now and will investigate heatmaps at a later point, and also that for now I will focus on the ‘expert interface’ and we will begin to look at the ‘public interface’ in the New Year.  One feature that Gary wanted to make use of fairly quickly was getting the API to output CSV in addition to JSON formatted data, so that he could use the data in Excel.  I managed to add such a feature to the API this week.  I wasn’t entirely sure how I would add the data selection type to the URL that gets sent to the API, but decided it made sense for this to be the first element that appears in the URL, before the ‘endpoint’.  I updated the API to include this extra information, added in the PHP code necessary to output a CSV file with column headings in the first row, and updated the atlas so that the data requests it sends to the API include the data type.  I then updated the atlas interface to add a nice ‘download CSV’ button.  Now no matter what data the atlas is showing, a simple press of the ‘download’ button gives you the data in Excel, which I think is rather nice.
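As a rough illustration of how the button hangs together (the exact API paths, parameter names and the currentSearchParams variable below are placeholders, not the project’s real ones):

```javascript
// A sketch of the 'download CSV' button. As described above, the output type
// sits at the start of the API URL, before the endpoint; everything else here
// is purely illustrative.
function buildApiUrl(outputType, endpoint, params) {
    return '/api/' + outputType + '/' + endpoint + '?' + jQuery.param(params);
}

// Whatever the atlas is currently showing, request the same data as CSV and
// let the browser handle the download
jQuery('#download-csv').on('click', function () {
    window.location.href = buildApiUrl('csv', 'attributes', currentSearchParams);
});
```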

The biggest update I made as a result of the meeting was to completely overhaul the way limits on the data were handled.  Previously there was one set of limits for all attributes, even if multiple attributes were selected.  These limits let you set one age group and one rating level (see the screenshot in my 29th August post).  But the project team wanted limits to be applied per selected attribute, and for these to be a little more extensive.

So for each selected attribute the user needs to be able to select an age limit (all, 18-25 or 65+), select the minimum number of people at each location that rated the attribute (1-4 for ‘all’ age groups or ‘1-2’ for the others) and then select which rating levels they want to focus on (e.g. only ratings 4-5).  This took a fair amount of time to get working, both in terms of the user interface for the atlas and completely reworking the API to take all these new arguments.

After a lot of work I had an updated version of the user interface that worked as intended, and I hid the limit options away in a drop-down so you have to open them up if you’re really wanting to mess with the defaults.  Updates to the API were pretty far-reaching, but all appears to be working as intended, which is a relief.  You can see a screenshot of the interface with the new limits below:

[Screenshot: scosya-leaflet-3]

I also updated the ‘attribute’ table in the database and the content management system to incorporate a new ‘description’ field that will eventually be displayed somewhere in the atlas (although I’m not sure where, as adding a ‘tooltip’ to a specific option in a select box seems to be somewhat tricky and the interface is already getting pretty cluttered).

I also reworked the markers that I use in the atlas.  The reason for the change is that I’m going to have ‘locations’ marked as squares and ‘attribute rating data’ marked as circles.  So on the ‘attribute’ map, where there is no data for a particular attribute at a location this will still be marked on the map as a grey square.  However, I ran into a bit of an issue with using square markers on the atlas.  The circles I was using were a special sort of marker called a ‘circleMarker’, which has a fixed radius irrespective of zoom level.  Unfortunately, there is no such equivalent for other shapes.  To add a square marker to the map I need to set lat/lng boundaries for the square, meaning the size of the square changes when the zoom level changes (because the square refers to a particular area of the map).

This means when zoomed very far out the squares are tiny and when zoomed very far in the squares are massive.  This made things rather confusing if you were zoomed in on the ‘location’ maps and then switched to an ‘attribute’ map, as the markers changed from huge squares to very small, exact circles.

What I’ve done to alleviate this is to revert to using ‘circle’ markers rather than ‘circleMarker’ markers.  I was using these in the first versions of the atlas and, as with squares, their size is based on geographical distance, meaning they change size based on the zoom level.  I’ve set the radius of the circle to be 2000 metres, which makes them legible when zoomed out and also makes them about the same size as the squares at all zoom levels.  I quite like that these markers give a less ‘exact’ position on the map when zoomed in, as we’re not supposed to be dealing with exact locations.  However, for some areas (e.g. the three Dundee markers) there is rather a lot of overlap, and when zoomed out it is rather difficult to make out the different colours of the markers, so we may have to revert to using circleMarkers instead.  If I do this and abandon squares we could maybe just have a hollow ring for a location with no data and a grey circle for places with data that doesn’t meet the criteria.
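For anyone unfamiliar with the distinction between the marker types, here’s a quick illustrative comparison (coordinates and styling are placeholders):

```javascript
// The three Leaflet marker types discussed above. L.circleMarker is sized in
// screen pixels so it stays the same size at every zoom level; L.circle and
// L.rectangle are sized in real-world units so they grow and shrink as you zoom.
var latlng = [55.95, -3.19];

// Fixed-size marker: radius is in pixels
L.circleMarker(latlng, { radius: 8, color: '#000', fillColor: '#fc0', fillOpacity: 1 }).addTo(map);

// Geographic circle: radius is in metres (2000m keeps it legible when zoomed out)
L.circle(latlng, { radius: 2000, color: '#000', fillColor: '#fc0', fillOpacity: 1 }).addTo(map);

// A square has to be defined by lat/lng bounds, so its apparent size changes with zoom
L.rectangle([[55.93, -3.22], [55.97, -3.16]], { color: '#999', fillColor: '#ccc' }).addTo(map);
```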

I also updated the API to include an index page that gets displayed if no endpoints are passed to it.  This lists the available endpoints and gives some examples of their usage.  I think it works pretty well.  I still have a bunch of things on my ‘To Do’ list for the project that I’ll continue with next week.

Other than SCOSYA duties, I spent a few hours this week in an email conversation with Carolyn Jess-Cooke about a proposal she is putting together and then writing a first draft of a Technical Plan for her.  I’ve sent this off to her so I’ll just need to see if further tweaks are required.  I also spent some time reading through the documentation for Jane Stuart-Smith’s proposal and I participated in a meeting with Matt and Arts IT support about technical costs for the project.  We discussed servers, virtualisation and things like that.  I think we’ve got a clearer idea of what might be required now.

Week Beginning 12th September 2016

I didn’t have any pressing deadlines for any particular projects this week, so I took the opportunity to return to some tasks that had been sitting on my ‘to do’ list for a while.  I made some further changes to the Edinburgh Gazetteer manuscript interface: previously the width of the interface had a maximum value applied to it, meaning that on widescreen monitors the area available to pan and zoom around the newspaper image was much narrower than the screen and there was lots of empty, wasted white space on either side.  I’ve now removed the maximum width restriction, making the page much more usable.

I also continued to work with the Hansard data.  Although the data entry processes have now completed, it is still terribly slow to query the data, due to both the size of the data and the fact that I hadn’t yet added any indexes.  I tried creating an index when I was working from home last week but the operation timed out before it completed.  This week I tried from my office and managed to get a few indexes created.  It took an awfully long time to generate each one, though – between 5 and 10 hours per index.  However, now that the indexes are in place, a query that can utilise an index is much speedier.  I created a little script on my test server that connects to the database, grabs the data for a specified year and then outputs this as a CSV file, and the script only takes a couple of minutes to run.  I’m hoping I’ll be able to get a working version of the visualisation interface for the data up and running, although this will have to be a proof of concept as it will likely still take several minutes for the data to process and display until we can get a heftier database server.

I had a task to perform for the Burns people this week – launching a new section of the website, which can be found here: http://burnsc21.glasgow.ac.uk/performing-burnss-songs-in-his-own-day/.  This section includes performances of many songs, including both audio and video.  I also spent a fair amount of time this week giving advice to staff.  I helped Matt Barr out with a jQuery issue, I advised the MVLS people on some app development issues, I discussed a few server access issues with Chris McGlashan, I responded to an email from Adrian Chapman about a proposal he is hoping to put together, I gave some advice to fellow Arts developer Kirsty Bell who is having some issues with a website she is putting together, I spoke to Andrew Roach from History about web development effort and I spoke to Carolyn Jess-Cooke about a proposal she is putting together.  Wendy also contacted me about an issue with the Mapping Metaphor Staff pages, but thankfully this turned out to be a small matter that I will fix at a later date.  I also met separately with both Gary and Jennifer to discuss the Atlas interface for the SCOSYA project.

Also this week I returned to the ‘Basics of English Metre’ app that I started developing earlier in the year.  I hadn’t had time to work on this since early June so it took quite a bit of time to get back up to speed with things, especially as I’d left off in the middle of a particularly tricky four-stage exercise.  It took a little bit of time to think things through but I managed to get it all working and began dealing with the next exercise, which is unlike any previous exercise type I’ve dealt with as it requires an entire foot to be selected.  I didn’t have the time to complete this exercise so to remind myself for when I next get a chance to work on this:  Next I need to allow the user to click on a foot or feet to select it, which should highlight the foot.  Clicking a second time should deselect it.  Then I need to handle the checking of the answer and the ‘show answer’ option.

On Friday I was due to take part in a conference call about Jane’s big EPSRC proposal, but unfortunately my son was sick during Thursday night and then I caught whatever he had and had to be off work on Friday, both to look after my son and myself.  This was not ideal, but thankfully it only lasted a day and I am going to meet with Jane next week to discuss the technical issues of her project.

Week Beginning 5th September 2016

It was a four-day week for me this week as I’d taken Friday off.  I spent a fair amount of time this week continuing to work on the Atlas interface for the SCOSYA project, in preparation for Wednesday, when Gary was going to demo the Atlas to other project members at a meeting in York.  I spent most of Monday and Tuesday working on the facilities to display multiple attributes through the Atlas.  This has been quite a tricky task and has meant massively overhauling the API as well as the front end so as to allow for multiple attribute IDs and Boolean joining types to be processed.

In the ‘Attribute locations’ section of the ‘Atlas Display Options’ menu underneath the select box there is now an ‘Add another’ button.  Pressing on this slides down a new select box and also options for how the previous select box should be ‘joined’ with the new one (either ‘and’, ‘or’ or ‘not’).  Users can add as many attribute boxes as they want, and can also remove a box by pressing on the ‘Remove’ button underneath it.  This smoothly slides up the box and removes it from the page using the always excellent jQuery library.
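The add/remove behaviour boils down to something like the following sketch – the markup, IDs and the buildAttributeBox() helper are illustrative rather than the actual atlas code:

```javascript
// A simplified sketch of the 'Add another' / 'Remove' behaviour using jQuery's
// slide animations.
jQuery('#add-attribute').on('click', function () {
    var box = jQuery(buildAttributeBox()); // returns a select box plus and/or/not joiner options
    box.hide().appendTo('#attribute-boxes').slideDown();
});

// Delegated handler so it also works for boxes added after page load
jQuery('#attribute-boxes').on('click', '.remove-attribute', function () {
    jQuery(this).closest('.attribute-box').slideUp(function () {
        jQuery(this).remove(); // take the box out of the page once the animation finishes
    });
});
```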

The Boolean operators (‘and’, ‘or’ and ‘not’) can be quite confusing to use in combination, so we’ll have to make sure we explain how we are using them.  E.g. ‘A AND B OR C’ could mean ‘(A AND B) OR C’ or ‘A AND (B OR C)’, and these could give massively different results.  The way I’ve set things up is to go through the attributes and operators sequentially.  So for ‘A AND B OR C’ the API gets the dataset for A, checks this against the dataset for B and makes a new dataset containing only those locations that appear in both datasets.  It then adds all of dataset C to this.  So this is ‘(A AND B) OR C’.  It is possible to do the ‘A AND (B OR C)’ search; you’d just have to rearrange the order so the select boxes are ‘B OR C AND A’.

Adding in ‘not’ works in the same sequential way, so if you do ‘A NOT B OR C’ this gets dataset A then removes from it those places found in dataset B, then adds all of the places found in dataset C.  I would hope people would always put a ‘not’ as the last part of their search, but as the above example shows, they don’t have to.  Multiple ‘nots’ are allowed too – e.g. ‘A NOT B NOT C’ will get the dataset for A, remove those places found in dataset B and then remove any further places found in dataset C.
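Sketched out in JavaScript for clarity (the real implementation sits in the API, and each dataset has already had its own limits applied, as described below), the sequential combination works something like this:

```javascript
// Left-to-right combination of attribute datasets. Each dataset is an array
// of location IDs; the operators array holds the joiner between each pair of
// adjacent datasets ('and', 'or' or 'not').
function combineDatasets(datasets, operators) {
    return datasets.reduce(function (result, dataset, i) {
        if (i === 0) { return dataset.slice(); } // start with dataset A
        var op = operators[i - 1];
        if (op === 'and') { // keep only locations present in both
            return result.filter(function (loc) { return dataset.indexOf(loc) !== -1; });
        }
        if (op === 'not') { // remove locations found in the new dataset
            return result.filter(function (loc) { return dataset.indexOf(loc) === -1; });
        }
        // 'or': add everything in the new dataset that isn't already present
        return result.concat(dataset.filter(function (loc) { return result.indexOf(loc) === -1; }));
    }, []);
}

// 'A AND B OR C' is therefore evaluated as '(A AND B) OR C':
// combineDatasets([a, b, c], ['and', 'or']);
```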

Another thing to note is that the ‘limits’ are applied to the dataset for each attribute independently at the moment.  E.g. for a search for ‘A AND B OR C’ with the limits set to ‘Present’ and age group ‘60+’, each of datasets A, B and C will have these limits applied BEFORE the Boolean operators are processed.  So the ratings in dataset A will only contain those that are ‘Present’ and ‘60+’; these will then be reduced to only include those locations that are also in dataset B (which likewise only includes ratings that are ‘Present’ and ‘60+’), and then all of the ratings for dataset C (again, only those that are ‘Present’ and ‘60+’) will be added to this.

If the limits weren’t imposed until after the Boolean processes had been applied then the results could possibly be different – especially the ‘present’ / ‘absent’ limits as there would be more ratings for these to be applied to.

I met with Gary a couple of times to discuss the above as these were quite significant additions to the Atlas.  It will be good to hear the feedback he gets from the meeting this week and we can then refine the browse facilities accordingly.

I spent some further time this week on AHRC review duties and Scott Spurlock sent me a proposal document for me to review so I spent a bit of time doing so this week as well.  I also spent a bit of time on Mapping Metaphor as Wendy had uncovered a problem with the Old English data.  For some reason an empty category labelled ‘0’ was appearing on the Old English visualisations.  After a bit of investigation it turned out this had been caused by a category that had been removed from the system (B71) still being present in the last batch of OE data that I uploaded last week.  After a bit of discussion with Wendy and Carole I removed the connections that were linking to this non-existent category and all was fine again.

I met with Luca this week to discuss content management systems for transcription projects and I also had a chat this week with Gareth Roy about getting a copy of the Hansard frequencies database from him.  As I mentioned last week, the insertion of the data has now been completed and I wanted to grab a copy of the MySQL data tables so we don’t have to go through all of this again if anything should happen to the test server that Gareth very kindly set up for the database.  Gareth stopped the database and added all of the necessary files to a tar.gz file for me.  The file was 13Gb in size and I managed to quickly copy this across the University network.  I also began trying to add some new indexes to the data to speed up querying but so far I’ve not had much luck with this.  I tried adding an index to the data on my local PC but after several hours the process was still running and I needed to turn off my PC.  I also tried adding an index to the database on Gareth’s server whilst I was working from home on Thursday but after leaving it running for several hours the remote connection timed out and left me with a partial index.  I’m going to have to have another go at this next week.

Week Beginning 29th August 2016

It’s now been four years since I started this job, so that’s four years’ worth of these weekly posts that are up here now.  I have to say I’m still really enjoying the work I’m doing here.  It’s still really rewarding to be working on all of these different research projects.  Another milestone was reached this week too – the Hansard semantic category dataset that I’ve been running through the grid in batches over the past few months in order to insert it into a MySQL database has finally completed!   The database now has 682,327,045 rows in it, which is by some considerable margin the largest database I’ve ever worked with.  Unfortunately as it currently stands it’s not going to be possible to use the database as a data source for web-based visualisations, as a simple ‘Select count(*)’ to return the number of rows took just over 35 minutes to execute!  I will see what can be done to speed things up over the next few weeks, though.  At the moment I believe the database is sitting on what used to be a desktop PC, so it may be that moving it to a meatier machine with lots of memory will speed things up considerably.  We’ll see how that goes.

I met with Scott Spurlock on Tuesday to discuss his potential Kirk Sessions crowdsourcing project.  It was good to catch up with Scott again and we’ve made the beginnings of a plan about how to proceed with a funding application, and also what software infrastructure we’re going to try.  We’re hoping to use the Scripto tool (http://scripto.org/), which in itself is built around MediaWiki, in combination with the Omeka content management system creator (https://omeka.org/), which is a tool I’ve been keen to try out for some time.  This is the approach that was used by the ‘Letters of 1916’ project (http://letters1916.maynoothuniversity.ie/), whose talk at DH2016 I found so useful.  We’ll see how the funding application goes and if we can proceed with this.

I also had my PDR session this week, which took up a fair amount of my time on Wednesday.  It was all very positive and it was a good opportunity to catch up with Marc (my line manager) as I don’t see him very often.  Also on Wednesday I had some communication with Thomas Widmann of the SLD as the DSL website had gone offline.  Thankfully Arts IT Support got it back up and running again a matter of minutes after I alerted them.  Thomas also asked me about the data files for the Scots School Dictionary app, and I was happy to send these on to him.

I gave some advice to Graeme Cannon this week about a project he has been asked to provide technical input costings for, and I also spent some time on AHRC review duties.  Wendy also contacted me about updating the data for the main map and OE maps for Mapping Metaphor, so I spent some time running through the data update processes.  For the main dataset the number of connections has gone down from 15301 to 13932 (due to some connections being reclassified as ‘noise’ or ‘relevant’ rather than ‘metaphor’), while the number of lexemes has gone up from 10715 to 13037.  For the OE data the number of metaphorical connections has gone down from 2662 to 2488 and the number of lexemes has gone up from 3031 to 4654.

The rest of my week was spent on the SCOSYA project, for which I continued to develop the prototype Atlas interface and the API.  By Tuesday I had finished an initial version of the ‘attribute’ map (i.e. it allows you to plot the ratings for a specific feature as noted in the questionnaires).  This version allowed users to select one attribute and to see the dots on a map of Scotland, with different colours representing the rating scores of 1-5 (an average is calculated by the system based on the number of ratings at a given location).  I met with Gary and he pointed out that the questionnaire data in the system currently only has latitude / longitude figures for each speaker’s current address, so we’ve got too many spots on the map.  These need to be grouped more broadly by town for the figures to really make sense.  Settlement names are contained in the questionnaire filenames and I figured out a way of automatically querying Google Maps for this settlement name (plus ‘Scotland’ to disambiguate places) in order to grab a more generic latitude / longitude value for the place – e.g. http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=oxgangs+scotland

There will be some situations where there is some ambiguity and multiple places are returned, but I just grab the first, and the locations can be ‘fine-tuned’ by Gary via the CMS.  I updated the CMS to incorporate just such facilities, in fact, and also updated the questionnaire upload scripts so that the Google Maps data is incorporated automatically from now on.  With this data in place I then updated the API so that it spits out the new data rather than the speaker-specific data, and updated the atlas interface to use the new values too.  The result was a much better map – fewer dots and better grouping.
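The lookup itself is very simple – something along these lines, although the real version runs server-side in the upload script rather than in the browser:

```javascript
// A sketch of the settlement lookup described above, showing the request and
// the shape of Google's geocoding response; the callback wiring is illustrative.
function lookUpSettlement(name, callback) {
    var url = 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address='
        + encodeURIComponent(name + ' scotland');
    jQuery.getJSON(url, function (data) {
        if (data.status === 'OK' && data.results.length > 0) {
            // Just take the first result; ambiguous places get fine-tuned later in the CMS
            var loc = data.results[0].geometry.location;
            callback(loc.lat, loc.lng);
        }
    });
}
```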

I also updated the atlas interface so that it uses Leaflet ‘circleMarkers’ rather than just ‘circles’, as this allows the markers to stay the same size at all map zoom levels, whereas previously they looked tiny when zoomed out but then far too big when zoomed in.  I added a thin black stroke around the markers too, to make the lighter coloured circles stand out a bit more on the map.  Oh, I also changed the colour gradient to a more gradual ‘yellow to red’ approach, which works much better than the colours I was using before.  Another small tweak was to move the atlas’s zoom in and out buttons to the bottom right rather than the top left, as the ‘Atlas Display Options’ slide-out menu was obscuring these.  I’d never noticed this as I just zoom in and out with the mouse scrollwheel and never use the buttons, but Gary pointed out it was annoying to have them covered up.  I also prevented the map from resetting its location and zoom level every time a new search is performed, which makes it easier to compare search results, and I prevented the scrollwheel from zooming the map in and out when the mouse is in the attribute drop-down list.  I haven’t yet figured out a way to make the scrollwheel actually scroll the drop-down list as it really ought to, though.
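For illustration, the restyled markers and the repositioned zoom control boil down to something like this (the colour breakpoints, radius and map setup are placeholders rather than the atlas’s exact values):

```javascript
// Move the zoom buttons out from under the slide-out menu
var map = L.map('map', { zoomControl: false }).setView([56.5, -4.2], 7);
L.control.zoom({ position: 'bottomright' }).addTo(map);

// Map an average rating (1-5) onto a yellow-to-red scale; breakpoints are illustrative
function getColour(averageRating) {
    if (averageRating >= 4.5) { return '#bd0026'; } // deep red for the highest ratings
    if (averageRating >= 3.5) { return '#f03b20'; }
    if (averageRating >= 2.5) { return '#fd8d3c'; }
    if (averageRating >= 1.5) { return '#fecc5c'; }
    return '#ffffb2';                               // pale yellow for the lowest
}

// A fixed-size circleMarker with a thin black stroke and a rating-based fill
function addRatingMarker(latlng, averageRating) {
    return L.circleMarker(latlng, {
        radius: 8,                       // pixels, so the size is zoom-independent
        color: '#000',                   // thin black stroke
        weight: 1,
        fillColor: getColour(averageRating),
        fillOpacity: 1
    }).addTo(map);
}
```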

I made a few visual tweaks to the map pop-up boxes, such as linking to the actual questionnaires from the ‘location’ atlas view (this will be for staff only) and including the actual average rating in the attribute view pop-up so you don’t have to guess what it is from the marker colour.  Adding in links to the questionnaires involved reworking the API somewhat, but it’s worked out ok.

One of the biggest features I implemented this week was the ‘limit’ options.  These allow you to focus on just those locations where two or more speakers rated the attribute at 4 or 5 (currently called ‘Present’), or where all speakers at a location rated it at 1 or 2 (currently called ‘Absent’) or just those locations that don’t meet either of these criteria (currently called ‘Unclear’).  I also added in limits by age group too.  I implemented these queries via the API, which meant minimal changes to the actual JavaScript of the atlas were required (although it did mean quite a large amount of logic had to be added to the API!).
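The classification logic is straightforward enough to sketch – this is an illustrative client-side version of what the API does, with ‘ratings’ being the individual speaker ratings (1-5) for one attribute at one location:

```javascript
// Classify a location as 'Present', 'Absent' or 'Unclear' from its ratings,
// following the rules described above.
function classifyLocation(ratings) {
    var high = ratings.filter(function (r) { return r >= 4; }).length;
    var allLow = ratings.every(function (r) { return r <= 2; });
    if (high >= 2) { return 'Present'; }  // two or more speakers rated it 4 or 5
    if (allLow) { return 'Absent'; }      // every speaker rated it 1 or 2
    return 'Unclear';                     // anything in between
}
```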

The prototype is working very nicely so far.  What I’m going to try to do next week is allow for multiple attributes to be selected, with Boolean operators between them.  This might be rather tricky, but we’ll see.  I’ll finish off with a screenshot of the ‘attribute’ search, so you can compare how it looks now to the screenshot I posted last week:

[Screenshot: scosya-leaflet-2]