This is the last working week of 2016 and I worked the full five days. The University was very quiet by Friday. I spent most of the week working on the SCOSYA project, working through my list of outstanding items and meeting a few times with Gary, which resulted in more items being added to the list. At the project meeting a couple of weeks ago Gary had pointed out that some places like Kirkcaldy were not appearing on the atlas and I managed to figure out why this was. There are 6 questionnaires for Kirkcaldy in the system and the ‘Questionnaire Locations’ map splits the locations into different layers based on the number of questionnaires completed. There are only four layers as each location was supposed to have 4 questionnaires. If a location had more than this then there was no layer to add the location to, so it was getting missed off. The same issue applied to ‘Fintry’, ‘Ayr’ and ‘Oxgangs’ too as they all have five questionnaires. Once I identified the problem I updated the atlas so that locations with more than 4 questionnaires do now appear in the location view. These are marked with a black box so you can tell they might need fixing. Thankfully the data for these locations was already being used as normal in the ‘attribute’ atlas searches.
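To illustrate the kind of fix involved, here is a minimal sketch in Python. It is not the atlas's actual code: the function names and the catch-all ‘over’ layer are hypothetical, and all it shows is the idea of giving locations with more than the expected 4 questionnaires a layer of their own instead of dropping them.

```python
from collections import defaultdict

EXPECTED_MAX = 4  # each location was supposed to have 4 questionnaires


def layer_for(count):
    """Return the (hypothetical) layer name for a questionnaire count."""
    if count < 1:
        raise ValueError("a location needs at least one questionnaire")
    if count > EXPECTED_MAX:
        # Catch-all layer, so that over-full locations are no longer lost.
        return "over"
    return str(count)


def group_locations(counts):
    """Group location names into layers keyed by layer name."""
    layers = defaultdict(list)
    for location, count in sorted(counts.items()):
        layers[layer_for(count)].append(location)
    return dict(layers)


if __name__ == "__main__":
    print(group_locations({"Kirkcaldy": 6, "Fintry": 5, "Airdrie": 4}))
```

Locations that land in the ‘over’ layer are the ones that would get the black box marker.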
With that out of the way I tackled a bigger item on my list: Adding in facilities to allow staff to record information about the usage of codes in the questionnaire transcripts. I created a spreadsheet consisting of all of the codes through which Gary can note whether a code is expected to be identifiable in the transcripts or not and I updated the database and the CMS to add in fields for recording this. I then updated the ‘view questionnaire’ page in the CMS to add in facilities to add / view information about the use of the codes in the transcripts.
Codes that have ‘Y’ or ‘M’ for whether they appear in recordings are highlighted in the ‘view questionnaire’ page with a blue border and the ‘code ratings’ table now has four new columns for number of examples found in the transcript, Examples, whether this matches expectation and transcript notes (there is no data for these columns in the system yet, though). You can add data to these columns by pressing on the ‘edit’ button at the top of the ‘view questionnaire’ page and then finding the highlighted code rows, which will be the only ones that have text boxes and things in the four new columns. Add the required data and press the ‘update’ button and the information will be saved.
After that I started to work on a new ‘interviewed by’ limit for the Atlas that will allow a user to only show data where the interview was conducted by a fieldworker or not. I didn’t get very far with this, however, as Gary instead wanted me to create a new feature that will help him and Jennifer analyse the data. It’s a script that allows the interview data in the database to be exported in CSV format for further analysis in Excel.
It allows you to select an age group, select whether to include spurious data or not and limit the output to particular codes / code parents or view all. Note that ‘view all’ also includes codes that don’t have parents assigned.
The resulting CSV file lists one column per interviewee, arranged alphabetically by location. For each interviewee there are then rows for their age group and location. If you’ve included spurious data a further row gives you a count of the number of spurious ratings for the interviewee.
After these rows there are rows for each code that you’ve asked to include. Codes are listed with their parent and attributes to make it easier to tell what’s what. With ‘all codes’ selected there are a lot of empty rows at the top as codes with no parent are listed first. Note that if you want to exclude codes that don’t have parents in the code selection list simply deselect and reselect the checkbox for parent ‘AFTER’. This means all parents are selected but the ‘All’ box is unselected.
For each code for each interviewee the rating is entered if there was one. If you’ve selected to include spurious data these ratings are marked with an asterisk. Where a code wasn’t present in the interview the cell is left blank.
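The layout described above can be sketched roughly as follows. This is a Python illustration rather than the export script itself: the data structures, the row labels and the function name `export_csv` are all assumptions, as is the choice to leave a cell blank when a spurious rating is excluded.

```python
import csv
import io


def export_csv(interviewees, ratings, codes, include_spurious=False):
    """Build the CSV text: one column per interviewee, one row per code.

    interviewees: list of dicts with 'id', 'location' and 'age_group'.
    ratings: dict mapping (interviewee_id, code) -> (rating, is_spurious).
    """
    # Columns are arranged alphabetically by location.
    people = sorted(interviewees, key=lambda p: (p["location"], p["id"]))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Interviewee"] + [p["id"] for p in people])
    writer.writerow(["Age group"] + [p["age_group"] for p in people])
    writer.writerow(["Location"] + [p["location"] for p in people])
    if include_spurious:
        # Extra row: number of spurious ratings per interviewee.
        counts = [
            sum(1 for (pid, _), (_, spurious) in ratings.items()
                if pid == p["id"] and spurious)
            for p in people
        ]
        writer.writerow(["Spurious ratings"] + counts)
    for code in codes:
        row = [code]
        for p in people:
            entry = ratings.get((p["id"], code))
            if entry is None or (entry[1] and not include_spurious):
                row.append("")  # code not present (or excluded) for this person
            else:
                rating, spurious = entry
                row.append(f"{rating}*" if spurious else str(rating))
        writer.writerow(row)
    return out.getvalue()
```

Calling `export_csv(people, ratings, ["C1"], include_spurious=True)` returns the whole file as a string, which could then be written to disk or sent to the browser as a download.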
Other than SCOSYA duties I did a bit more Historical Thesaurus work this week, creating the ‘Levenshtein’ script that Fraser wanted, as discussed last week. I started to implement a PHP version of the Levenshtein algorithm from the page I linked to in my previous post but thankfully my text editor highlighted the word ‘Levenshtein’, as it does with existing PHP functions it recognises. Thank goodness it did as it turns out PHP has its own ready-to-use Levenshtein function! See http://php.net/manual/en/function.levenshtein.php
All you have to do is pass it two strings and it spits out a number showing you how similar or different they are. I therefore updated my script to incorporate this as an option. You can specify a threshold level and also state whether you want to view those that are under and equal to the threshold or over the threshold. Add the threshold by adding ‘lev=n’ to the URL (where n is the threshold). By default it will display those categories that are over the threshold but to view those that are under or equal instead then add ‘under=y’ to the URL.
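The handling of the two URL options can be sketched like this. The script itself is PHP, so this Python version is only an illustration of the logic: the function names are made up, and it assumes each category pairing already carries its Levenshtein score.

```python
from urllib.parse import parse_qs


def parse_options(query):
    """Read the 'lev' threshold and the 'under' flag from a query string."""
    params = parse_qs(query)
    lev = int(params["lev"][0]) if "lev" in params else None
    under = params.get("under", ["n"])[0] == "y"
    return lev, under


def select(scored, lev, under):
    """Filter (label, score) rows against the threshold.

    By default rows over the threshold are kept; with under=True the
    rows under or equal to the threshold are kept instead.
    """
    if lev is None:
        return list(scored)  # no threshold given: show everything
    if under:
        return [row for row in scored if row[1] <= lev]
    return [row for row in scored if row[1] > lev]


if __name__ == "__main__":
    lev, under = parse_options("lev=3&under=y")
    print(select([("a", 1), ("b", 3), ("c", 9)], lev, under))
```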
The test seems to work rather well when you set the threshold to 3 with punctuation removed and look for everything over that. That gives just 3838 categories that are considered different, compared with the 5770 without the Levenshtein test. Hopefully after Christmas Fraser will be able to put the script to good use.
I spent the remainder of the week continuing to migrate some of the old STELLA resources to the University’s T4 system. I completed the migration of the ‘Bibliography of Scottish Literature’, which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/bibliography-of-scottish-literature/. I then worked through the ‘Analytical index to the publications of the International Phonetic Association 1886-2006’, which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/ipa-index/. I then began working through the STARN resource (see http://www.arts.gla.ac.uk/stella/STARN/index.html) and managed to complete work on the first section (Criticism & Commentary). It’s going to take a long time to get the resource fully migrated over, though, as there’s a lot of content. The migrated site won’t ‘go live’ until all of the content has been moved.
And that’s pretty much it for this week and this year!
I spent a fair amount of this week overhauling some of the STELLA resources and migrating them to the University’s T4 system. This has been pretty tedious and time consuming, but it’s something that will only have to be done once and if I don’t do it no-one else will. I completed the migration of the pages about Jane Stuart-Smith’s ‘Accent Change in Glaswegian’ project (which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/accent-change-in-glaswegian/). I ran into some issues with linking to images in the T4 media library and had to ask the web team to manually approve some of the images. It would appear that linking to images before they have been approved by the system by guessing what their filename will be somehow causes the system to block the approval of the images, so I’ll need to make sure I’m not being too clever in future. I also worked my way through the old STELLA resource ‘A Bibliography of Scottish Literature’ but I haven’t quite finished this yet. I have one section left to do, so hopefully I’ll be able to make this ‘live’ before Christmas.
Other than the legacy STELLA work I spent some time on another AHRC review that I’d been given, made another few tweaks to Carolyn Jess-Cooke’s project website and had an email conversation with Alice Jenkins about a project she is putting together. I’m going to meet with her in the first week of January to discuss this further. I also had some App management duties to attend to, namely giving some staff in MVLS access to app analytics.
Other than these tasks, I spent some time working on the Historical Thesaurus, as Fraser and I are still trying to figure out the best strategy for incorporating the new data from the OED. I created a new script that attempts to work out which categories in the two datasets match up based on their names. First of all it picks out all of the categories that are nouns that match between HT and OED. ‘Match’ means that our ‘oedmaincat’ field (combined with ‘subcat’ where appropriate) matches the OED’s ‘path’ field (combined with ‘sub’ where appropriate). Our ‘oedmaincat’ field is the ‘v1maincat’ field that has had some additional reworking done to it based on the document of changes Fraser had previously sent to me.
These categories can be split into three groups:
- 1. Ones where the HT and OED headings are identical (case insensitive)
- 2. Ones where the HT and OED headings are not identical (case insensitive)
- 3. Ones where there is no matching OED category for the HT category (likely due to our ‘empty categories’)
For our current purposes we’re most interested in number 2 in this list. I therefore created a version of the script that only displayed these categories, outputting a table containing the columns Fraser had requested. I also put the category heading string that was actually searched for in brackets after the heading as it appears in the database.
At the bottom of the script I also outputted some statistics: How many noun categories there are in total (124355), how many there are that don’t match (21109) and how many HT noun categories don’t have a corresponding OED category (6334). I also created a version of the script that outputs all categories rather than just number 2 in the list above. I then made a further version that also strips out punctuation when comparing headings. This converts dashes to spaces, removes commas, full-stops and apostrophes and replaces a slash with ‘ or ’. This has rather a good effect on the categories that don’t match, reducing the number to 5770. At least some of these can be ‘fixed’ by further rules – e.g. a bunch starting at ID 40807 that have the format ‘native/inhabitant’ can be matched by ensuring ‘of’ is added after ‘inhabitant’.
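The punctuation rules are simple enough to sketch. The following is a Python illustration, not the actual script: the function name is hypothetical, and the lower-casing, the handling of curly apostrophes and the collapsing of repeated spaces are my own assumptions on top of the rules listed above.

```python
def normalise(heading):
    """Normalise a category heading before comparing HT and OED versions."""
    text = heading.lower()              # comparison is case insensitive
    text = text.replace("/", " or ")    # slash becomes ' or '
    text = text.replace("-", " ")       # dashes become spaces
    for ch in ",.'\u2019":              # drop commas, full stops, apostrophes
        text = text.replace(ch, "")
    return " ".join(text.split())       # assumption: collapse extra spaces


if __name__ == "__main__":
    for heading in ("Score-book", "native/inhabitant"):
        print(heading, "->", normalise(heading))
```

Two headings would then be treated as matching when their normalised forms are identical.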
Fraser wanted to run some sort of Levenshtein test on the remaining categories to see which ones are closely matched and which ones are clearly very different. I was looking at this page about Levenshtein tests: http://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Fall2006/Assignments/editdistance/Levenshtein%20Distance.htm which includes a handy algorithm for testing the similarity or difference of two strings. The algorithm isn’t available in PHP, but the Java version looks fairly straightforward to migrate to PHP. The algorithm discussed on this page allows you to compare two strings and to be given a number reflecting how similar or different the strings are, based on how many changes would be required to convert one string into another. E.g. a score of zero means the strings are identical. A score of 2 means two changes would be required to turn the first string into the second one (either changing a character or adding / removing a character).
I could incorporate the algorithm on this page into my script, running the 5770 heading pairings through it. We could then set a threshold where we consider the headings to be ‘the same’ or not. E.g. ID 224446 ‘score-book’ and ‘score book’ would give a score of 1 and could therefore be considered ‘the same’, while ID 145656 would give a very high score as the HT heading is ‘as a belief’ while the OED heading is ‘maintains what is disputed or denied’(!).
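For reference, the edit distance itself can be computed with the standard dynamic-programming approach. This is a minimal sketch in Python rather than a port of the Java code on that page, keeping only two rows of the table at a time.

```python
def levenshtein(a, b):
    """Return the number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute if different
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("score-book", "score book"))
```

The ‘score-book’ / ‘score book’ pairing gives 1, as the only change needed is swapping the dash for a space.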
I met with Fraser on Wednesday and we agreed that I would update my script accordingly. I will allow the user (i.e. Fraser) to pass a threshold number to the script that will then display only those categories that are above or below this threshold (depending on what is selected). I’m going to try and complete this next week.
I spent the majority of this week working for the SCOSYA project, in advance of our all-day meeting on Friday. I met with Gary on Monday to discuss some additional changes he wanted made to the ‘consistency data’ view and other parts of the content management system. The biggest update was to add a new search facility to the ‘consistency data’ page that allows you to select whether data is ‘consistent’ or ‘mixed’ based on the distance between the ratings. Previously to work out ‘mixed’ scores you specified which scores were considered ‘low’ and which were considered ‘high’ and everything else was ‘mixed’, but this new way provides a more useful means of grouping the scores. E.g. you can specify that a ‘mixed’ score is anything where the ratings for a location are separated by 3 or more points. So ratings of 1 and 2 are consistent but ratings of 1 and 4 are mixed. In addition users can state whether a pairing of ‘2’ and ‘4’ is always considered ‘mixed’. This is because ‘2’ is generally always a ‘low’ score and ‘4’ is always a ‘high’ score, even though there are only two rating points between the scores.
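The new grouping rule boils down to something like the following. This is a Python sketch with hypothetical names, not the CMS code; the default gap of 3 is simply the example value used above.

```python
def classify(ratings, min_gap=3, two_four_mixed=False):
    """Label the ratings for one attribute at one location.

    ratings: the individual scores given at the location.
    min_gap: ratings separated by this many points or more are 'mixed'.
    two_four_mixed: if True, a pairing of 2 and 4 is always 'mixed'.
    """
    if not ratings:
        raise ValueError("no ratings to classify")
    if max(ratings) - min(ratings) >= min_gap:
        return "mixed"
    if two_four_mixed and 2 in ratings and 4 in ratings:
        # 2 is generally a 'low' score and 4 a 'high' one, even though
        # they are only two points apart.
        return "mixed"
    return "consistent"


if __name__ == "__main__":
    print(classify([1, 2]), classify([1, 4]), classify([2, 4], two_four_mixed=True))
```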
I also updated the system to allow users to focus on locations and attributes where a specific rating has been given. Users can select a rating (e.g. 2) and the table of results only shows which attributes at each location have one or more rating of 2. The matching cells just say ‘present’ while other attributes at each location have blank cells in the table. Instead of %mixed, %high etc there is %present – the percentage of each location and attribute where this rating is found.
I also added in the option to view all of the ‘score groups’ for ratings – i.e. the percentage of each combination of scores for each attribute. E.g. 10% of the ratings for Attribute A are ‘1 and 2’, 50% are ‘4 and 5’.
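One way to work out these ‘score groups’ is sketched below in Python. It assumes the groups are counted per location (each location contributes one combination of scores for the attribute), which is my reading of the feature rather than a description of the actual query; the function name is hypothetical.

```python
from collections import Counter


def score_groups(ratings_by_location):
    """Return the percentage of locations showing each score combination.

    ratings_by_location: dict mapping a location name to the list of
    ratings recorded there for a single attribute.
    """
    groups = Counter(
        " and ".join(str(r) for r in sorted(set(ratings)))
        for ratings in ratings_by_location.values()
        if ratings  # skip locations with no ratings for the attribute
    )
    total = sum(groups.values())
    return {group: 100 * count / total for group, count in groups.items()}


if __name__ == "__main__":
    print(score_groups({"A": [1, 2], "B": [4, 5], "C": [5, 4], "D": [4, 5]}))
```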
With these changes in place I then updated the narrowing of a consistency data search to specific attributes. Previously the search facility allowed staff to select one or more ‘code parents’ to focus on rather than viewing the data for all attributes at once. I’ve now extended this so that users can open up each code parent and select / deselect the individual attributes contained within. This greatly extends the usefulness of the search tool. I also added in another limiting facility, this time allowing the user to select or deselect questionnaires. This can be used to focus on specific locations or to exclude certain questionnaires from a query if these are considered problematic questionnaires.
When I met with Gary on Monday he was keen to have access to the underlying SCOSYA database to maybe try running some queries directly on the SQL himself. We agreed that I would give him an SQL dump of the database and will help him get this set up on his laptop. I realised that we don’t have a document that describes the structure of the project database, which is not very good as without such a document it would be rather difficult for someone else to work with the system. I therefore spent a bit of time creating an entity-relationship diagram showing the structure of the database and writing a document that describes each table, the fields contained in them and the relationships between them. I feel much better knowing this document exists now.
On Friday we had a team meeting, involving the Co-Is for the project: David Adger and Caroline Heycock, in addition to Jennifer and Gary. It was a good meeting, and from a technical point of view it was particularly good to be able to demonstrate the atlas to David and Caroline and receive their feedback on it. For example, it wasn’t clear to either of them whether the ‘select rating’ buttons were selected or deselected, which led to confusing results (e.g. thinking 4-5 was selected but actually having 1-3 selected). This is something I will have to make a lot clearer. We also discussed alternative visualisation styles and the ‘pie chart’ map markers I mentioned in last week’s post. Jennifer thinks these will be just too cluttered on the map so we’re going to have to think of alternative ways of displaying the data – e.g. have a different icon for each combination of selected attribute, or have different layers that allow you to transition between different views of attributes so you can see what changes are introduced.
Other than SCOSYA related activities I completed a number of other tasks this week. I had an email chat with Carole about the Thesaurus of Old English teaching resource. I have now fixed the broken links in the existing version of the resource. However, it looks like there isn’t going to be an updated version any time soon as I pointed out that the resource would have to work with the new TOE website and not the old search options that appear in a frameset in the resource. As the new TOE functions quite differently from the old resource this would mean a complete rewrite of the exercises, which Carole understandably doesn’t want to do. Carole also mentioned that she and others find the new TOE website difficult to use, so we’ll have to see what we can do about that too.
I also spent a bit more time working through the STELLA resources. I spoke to Marc about the changes I’ve been making and we agreed that I should be added to the list of STELLA staff too. I’m going to be ‘STELLA Resources Director’ now, which sounds rather grand. I made a start on migrating the old ‘Bibliography of Scottish Literature’ website to T4 and also Jane’s ‘Accent change in Glaswegian’ resource too. I’ll try and get these completed next week.
I also completed work on the project website for Carolyn Jess-Cooke, and I’m very pleased with how this is looking now. It’s not live yet so I can’t link to it from here at the moment. I also spoke with Fraser about a further script he would like me to write to attempt to match up the historical thesaurus categories and the new data we received from the OED people. I’m going to try to create the script next week and we’re going to meet to discuss it.
I worked on rather a lot of different projects this week. I made some updates to the WordPress site I set up last week for Carolyn Jess-Cooke’s project, such as fixing an issue with the domain’s email forwarding. I replaced the website design of The People’s Voice project website with the new one I was working on last week, and this is now live: http://thepeoplesvoice.glasgow.ac.uk/. I think it looks a lot more visually appealing than the previous design, and I also added in a Twitter feed to the right-hand column. I also had a phone conversation with Maria Dick about her research proposal and we have now agreed on the amount of technical effort she should budget for.
I received some further feedback about the Metre app this week from a colleague of Jean Anderson’s who very helpfully took the time to go through the resource. As a result of this feedback I made the following changes to the app:
- I’ve made the ‘Home’ button bigger
- I’ve fixed the erroneous syllable boundary in ‘ivy’
- When you’re viewing the first or last page in a section a ‘home’ button now appears where otherwise there would be a ‘next’ or ‘previous’ button
- I’ve removed the ‘info’ icon from the start of the text.
Jean also tried to find some introductory text about the app but was unable to do so. She’s asked if Marc can supply some, but I would imagine he’s probably too busy to do so. I’ll have to chase this up or maybe write some text myself as it would be good to be able to get the app completed and published soon.
Also this week I had a phone conversation with Sarah Jones of the Digital Curation Centre about some help and documentation I’ve given her about AHRC Technical Plans for a workshop she’s running. I also helped out with two other non-SCS issues that cropped up. Firstly, the domain for TheGlasgowStory (http://theglasgowstory.com/), which was one of the first websites I worked on had expired and the website had therefore disappeared. As it’s been more than 10 years since the project ended no-one was keeping track of the domain subscription, but thankfully after some chasing about we’ve managed to get the domain ownership managed by IT Services and the renewal fee has now been paid. Secondly, IT Services were wanting to delete a database that belonged to Archive Services (who I used to work for) and I had to check on the status of this.
I also spent a little bit of time this week creating a few mock-up logos / banners for the Survey of Scottish Place-Names, which I’m probably going to be involved with in some capacity and I spoke to Carole about the redevelopment of the Thesaurus of Old English Teaching Package.
Also this week I finally got round to completing training in the University website’s content management system, T4. After completing training I was given access to the STELLA pages within T4 and I’ve started to rework these. I went through the outdated list of links on the old STELLA site and have checked each one, updating URLs or removing links entirely where necessary. I’ve added in a few new ones too (e.g. to the BYU Corpus in the ‘Corpora’ section). This updated content now appears on the ‘English & Scots Links’ page in the University website (http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/englishscotslinks/).
I also moved ‘Staff’ from its own page into a section on the STELLA T4 page to reduce left-hand navigation menu clutter. For the same reason I’ve removed the left-hand links to SCOTS, STARN, The Glasgow Review and the Bibliography of Scottish Literature, as these are all linked to elsewhere. I then renamed the ‘Teaching Packages’ page ‘Projects’ and have updated the content to provide direct links to the redeveloped resources first of all, and then links to all other ‘legacy’ resources, thus removing the need for the separate ‘STELLA digital resources’ page. See here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/. I updated the links to STELLA from my redeveloped resources so they go to this page now too. With all of this done I decided to migrate the old ‘The Glasgow Review’ collection of papers to T4. This was a long and tedious process, but it was satisfying to get it done. The resource can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/glasgowreview/
In addition to the above I also worked on the SCOSYA project, looking into alternative map marker possibilities, specifically how we can show information about multiple attributes through markers on the map. At our team meeting last week I mentioned the possibility of colour coding the selected attributes and representing them on the map using pie charts rather than circles for each map point and this week I found a library that will allow us to do such a thing, and also another form of marker called a ‘coxcomb chart’. I created a test version with both forms, that you can see below:
Note that the map is dark because that’s how the library’s default base map looks. Our map wouldn’t look like this. The library is pretty extensive and has other marker types available too, as you can see from this example page: http://humangeo.github.io/leaflet-dvf/examples/html/markers.html.
So in the above example, there are four attributes selected, and these are displayed for four locations. The coxcomb chart splits the circle into the number of attributes and then the depth of each segment reflects the average score for each attribute. E.g. looking at ‘Arbroath’ you can see at a glance that the ‘red’ attribute has a much higher average score than the ‘green’ attribute, while the marker for ‘Airdrie’ (to the east of Glasgow) has an empty segment where ‘pink’ should be, indicating that this attribute is not present at this location.
The two pie chart examples (Barrhead and Dumbarton) are each handled differently. For Barrhead the average score of each attribute is not taken into consideration at all. The ‘pie’ simply shows which attributes are present. All four are present so the ‘pie’ is split into quarters. If one attribute wasn’t found at this location then it would be omitted and the ‘pie’ would be split into thirds. For Dumbarton average scores are taken into consideration, which changes the size of each segment. You can see that the ‘pink’ attribute has a higher average than the ‘red’ one. However, I think this layout is rather confusing as at a glance it seems to suggest that there are more of the ‘pink’ attribute, rather than the average being higher. It’s probably best not to go with this one.
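The difference between the three marker styles is easier to see as arithmetic. The following Python sketch is not how the mapping library works internally: the function names, the 1-5 rating scale and the maximum radius are assumptions, and it only calculates segment angles (in degrees) and depths from a set of average scores, with `None` standing for an attribute that is not present.

```python
def pie_presence(averages):
    """Barrhead style: equal slices for whichever attributes are present."""
    present = [name for name, avg in averages.items() if avg is not None]
    if not present:
        return {}
    return {name: 360 / len(present) for name in present}


def pie_weighted(averages):
    """Dumbarton style: slice size is proportional to the average score."""
    present = {name: avg for name, avg in averages.items() if avg is not None}
    total = sum(present.values())
    if total == 0:
        return {}
    return {name: 360 * avg / total for name, avg in present.items()}


def coxcomb(averages, max_score=5, max_radius=20):
    """Coxcomb style: equal angles, segment depth reflects the average.

    Returns (angle, radius) per attribute; absent attributes keep their
    slot in the circle but get a radius of zero (an empty segment).
    """
    if not averages:
        return {}
    angle = 360 / len(averages)
    return {
        name: (angle, 0.0 if avg is None else max_radius * avg / max_score)
        for name, avg in averages.items()
    }


if __name__ == "__main__":
    scores = {"red": 4, "green": 2, "blue": 3, "pink": None}
    print(pie_presence(scores))
    print(coxcomb(scores))
```

In the weighted pie a higher average produces a wider slice, which is exactly why it can be misread as ‘more of’ that attribute.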
Both the pies and coxcombs are a fixed size no matter what the zoom level, so when you zoom far out they stay big rather than getting too small to make out. On one hand this is good as it addresses a concern Gary raised about not being able to make out the circles when zoomed out. However, when zoomed out the map is potentially going to get very cluttered, which will introduce new problems. Towards the end of the week I heard back from Gary and Jennifer, and they wanted to meet with me to discuss the possibilities before I proceeded any further with this. We have an all-day team meeting planned for next Friday, which seems like a good opportunity to discuss this.