This week was rather a hectic one as I was contacted by many people who wanted my help and advice with things. I think it’s the time of year – the lecturers are returning from their holidays but the students aren’t back yet, so they start getting on with other things, meaning busy times for me. I had my PDR session on Monday morning, so I spent a fair amount of time on this and then writing things up afterwards. All went fine, and it’s good to know that the work I do is appreciated. After that I had to do a few things for Wendy for Mapping Metaphor. I’d forgotten to run my ‘remove duplicates’ script after I’d made the final update to the MM data, which meant that many of the sample lexemes were appearing twice. Thankfully Wendy spotted this and a quick execution of my script removed 14,286 duplicates in a flash. I also had to update some of the text on the site and change the way search terms are highlighted in the HT so that links through from MM no longer highlight multiple terms. I also wrote a little script that displays the number of strong and weak metaphorical connections for each of the categories, which Wendy wanted.
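For the curious, the duplicate-removal logic amounts to something like the following. This is a JavaScript sketch with made-up field names – the real script is a server-side one that runs against the MM database directly.

```javascript
// Hypothetical sketch of the duplicate-removal logic: keep the first
// occurrence of each (category pair, lexeme) combination and drop the rest.
// The field names (cat1, cat2, lexeme) are invented for illustration.
function removeDuplicates(lexemes) {
  const seen = new Set();
  return lexemes.filter(row => {
    const key = `${row.cat1}|${row.cat2}|${row.lexeme}`;
    if (seen.has(key)) return false; // already seen: a duplicate, drop it
    seen.add(key);
    return true;
  });
}
```

The same idea in SQL would be a self-join delete keeping the lowest ID per duplicate group.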
My big task for the week was to start on the redevelopment of the ARIES app. I had been expecting to receive the materials for this several weeks earlier as Marc wanted the new app to be ready to launch at the beginning of term. As I’d heard nothing I assumed that this was no longer going to happen, but on Monday Marc gave me access to the files and said the launch must still go ahead at the start of term. There is rather a lot to do and very little time to do it in, especially as preparing stuff for the App Store takes so much time once the app is actually developed. Also, Marc is still revising the materials so even though I’m now creating the new version I’m still going to have to go back and make further updates later on. It’s not exactly an ideal situation. However, I did manage to get started on the redevelopment on Tuesday, and spent pretty much all of my time on Tuesday, Wednesday and Thursday on this task. This involved designing a new interface based on the colours found in the logo file, creating the structure of the app, and migrating the static materials that the team had created in HTML to the JSON file I’m creating for the app contents. This included creating new styles for the new content where required and testing things out on various devices to make sure everything works ok. I also implemented two of the new quizzes, which also took quite a bit of time, firstly because I needed to manually migrate the quiz contents to a format that my scripts could work with and secondly because although the quizzes were similar to ones I’ve written before they were not identical in structure, so needed some reworking in order to meet the requirements. I’m pretty happy with how things are developing, but progress is slow. I’ve only completed the content for three subsections of the app, and there are a further nine sections remaining. 
Hopefully the pace will quicken as I proceed, but I’m worried that the app is not going to be ready for the start of term, especially as the quizzes should really be tested out by the team and possibly tweaked before launch.
I spent most of Friday this week writing the Technical Plan for Thomas Clancy’s new place-name project. Last week I’d sent off a long list of questions about the project and Thomas got back to me with some very helpful answers this week, which really helped in writing the plan. It’s still only a first version and will need further work, but I think the bulk of the technical issues have been addressed now.
Other than these tasks, I responded to a query from Moira Rankin from the Archives about an old project I was involved with; I helped Michael Shaw deal with some more data for The People’s Voice project; I had a chat with Catriona MacDonald about backing up The People’s Voice database; I looked through a database that Ronnie Young had sent me, which I will (hopefully) be turning into an online resource sometime soon; I replied to Gerry McKeever about a project he’s running that’s just starting up, which I will be involved with; and I replied to John Davies in History about a website query he had sent me. Unfortunately I didn’t get a chance to continue with the Edinburgh Gazetteer work I’d started last week, but I’ll hopefully get a chance to do some further work on this next week.
I was on holiday last week but was back to work on Monday this week. I’d kept tabs on my emails whilst I was away but as usual there were a number of issues that had cropped up in my absence that I needed to sort out. I spent some time on Monday going through emails and updating my ‘to do’ list and generally getting back up to speed again after a lazy week off.
I had rather a lot of meetings and other such things to prepare for and attend this week. On Monday I met with Bryony Randall for a final ‘sign off’ meeting for the New Modernist Editing project. I’ve really enjoyed working on this project, both the creation of the digital edition and taking part in the project workshop. We have now moved the digital edition of Virginia Woolf’s short story ‘Ode written partly in prose on seeing the name of Cutbush above a butcher’s shop in Pentonville’ to what will hopefully be its final and official URL and you can now access it here: http://nme-digital-ode.glasgow.ac.uk
On Tuesday I was on the interview panel for Jane Stuart-Smith’s SPADE project, which I’m also working on for a small percentage of my time. After the interviews I also had a further meeting with Jane to discuss some of the technical aspects of her project. On Wednesday I met with Alison Wiggins to discuss her ‘Archives and Writing Lives’ project, which is due to begin next month. This project will involve creating digital editions of several account books from the 16th century. When we were putting the bid together I did quite a bit of work creating a possible TEI schema for the account books and working out how best to represent all of the various data contained within the account entries. Although this approach would work perfectly well, now that Alison has started transcribing some entries herself we’ve realised that managing complex relational structures via taxonomies in TEI using the Oxygen editor is a bit of a cumbersome process. Instead Alison herself investigated using a relational database structure and had created her own Access database. We went through the structure when we met and everything seems to be pretty nicely organised. It should be possible to record all of the types of data and the relationships between these types using the Access database and so we’ve decided that Alison should just continue to use this for her project. I did suggest making a MySQL database and creating a PHP based content management system for the project, but as there’s only one member of staff doing the work and Alison is very happy using Access it seemed to make sense to just stick with this approach. Later on in the project I will then extract the data from Access, create a MySQL database out of it and develop a nice website for searching, browsing and visualising the data. I will also write a script to migrate the data to our original TEI XML structure as this might prove useful in other projects.
It’s Performance and Development Review time again, and I have my meeting with my line manager coming up, so I spent about a day this week reviewing last year’s objectives and writing all of the required sections for this year. Thankfully having my weekly blog posts makes it easier to figure out exactly what I’ve been up to in the review period.
Other than these tasks I helped Jane Roberts out with an issue with the Thesaurus of Old English, I fixed an issue with the STARN website that Jean Anderson had alerted me to, I had an email conversation with Rhona Brown about her Edinburgh Gazetteer project and I discussed data management issues with Stuart Gillespie. I also uploaded the final set of metaphor data to the Mapping Metaphor database. That’s all of the data processing for this project now completed, which is absolutely brilliant. All categories are now complete and the number of metaphors has gone down from 12,938 to 11,883, while the number of sample lexemes (including first lexemes) has gone up from 25,129 to a whopping 45,108.
Other than the above I attended the ‘Future proof IT’ event on Friday. This was an all-day event organised by the University’s IT services and included speakers from JISC, Microsoft, Cisco and various IT related people across the University. It was an interesting day with some excellent speakers, although the talks weren’t as relevant to my role as I’d hoped they would be. I did get to see Microsoft’s HoloLens technology in action, which was great, although I didn’t personally get a chance to try the headset on, which was a little disappointing.
This week was a shorter one than usual as Monday was the May Day holiday and I was off work on Wednesday afternoon to attend a funeral. I worked on a variety of different tasks during the time available. Wendy is continuing to work on the data for Mapping Metaphor and had another batch of it for me to process this week. After dealing with the upload we now have a further nine categories marked off and a total of 12,938 metaphorical connections and 25,129 sample lexemes. I also returned to looking at integrating the new OED data into the Historical Thesaurus. Fraser had enlisted the help of some students to manually check connections between HT and OED categories and I set up a script that will allow us to mark off a few thousand more categories as ‘checked’. Before that Fraser needs to QA their selections and I wrote a further script that will help with this. Hopefully next week I’ll be able to actually mark off the selections.
I also returned to SCOSYA for the first time since before Easter. I managed to track down and fix a few bugs that Gary had identified. Firstly, Gary was running into difficulties when importing and displaying data using the ‘my map data’ feature. The imported data simply wouldn’t display at all in the Safari browser and after a bit of investigation I figured out why. It turned out there was a missing square bracket in my code, which rather strangely was being silently fixed in other browsers but was causing issues in Safari. Adding in the missing bracket fixed the issue straight away. The other issue arose when Gary did some work on the CSV file exported from the Atlas and then reimported it. When he did so, the import failed to upload any ratings. It turned out that Excel had added some extra columns to the CSV file whilst Gary was working with it, and this change to the structure meant that each row failed the validation checks I had put in place. I decided to rectify this in two ways: firstly, the upload no longer checks the number of columns; secondly, I added more informative error messages. It’s all working a lot better now.
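The more tolerant approach boils down to looking cells up by header name rather than by position, so extra columns added by Excel are simply ignored. Here is a rough sketch – the header names and the real import script’s structure are assumptions for illustration:

```javascript
// Sketch of a column-count-agnostic CSV import: find each needed column
// by its header name, validate values, and collect informative errors.
// Header names ('location', 'rating') are illustrative assumptions.
function parseRatings(csvText) {
  const [headerLine, ...lines] = csvText.trim().split('\n');
  const headers = headerLine.split(',');
  const idx = name => headers.indexOf(name);
  const errors = [];
  const ratings = [];
  lines.forEach((line, n) => {
    const cells = line.split(',');
    const rating = Number(cells[idx('rating')]);
    if (!(rating >= 1 && rating <= 5)) {
      // row numbers are 1-based and include the header row
      errors.push(`Row ${n + 2}: rating '${cells[idx('rating')]}' is not between 1 and 5`);
      return;
    }
    ratings.push({ location: cells[idx('location')], rating });
  });
  return { ratings, errors };
}
```

(A real parser would also need to handle quoted fields containing commas.)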
With these things out of the way I set to work on a larger update to the map. Previously an ‘AND’ search limited results by location rather than by participant. For example, if you did a search that said ‘show me attributes D19 AND D30, all age groups with a rating of 4-5’ a spot for a location would be returned if any combination of participants matched this. As there are up to four participants per location, a location could be returned as meeting the criteria even if no individual participant actually met them. For example, participants A and B give D19 a score of 5 but only give D30 a score of 3, while participants C and D only give D19 a score of 3 and give D30 a score of 5. In combination, therefore, this location meets the criteria even though none of the participants actually do. Gary reckoned this wasn’t the best way to handle the search and I agreed. So, instead I updated the ‘AND’ search to check whether individual participants met the criteria. This meant a fairly large reworking of the API and a fair amount of testing, but it looks like the ‘AND’ search now works at a participant level. The ‘OR’ search doesn’t need to be updated because an ‘OR’ search by its very nature is looking for any combination.
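To illustrate the participant-level check, here is a minimal sketch. The data shapes and names are invented for illustration – the real logic lives in the project’s API:

```javascript
// Participant-level 'AND' search: a location matches only if at least one
// individual participant meets the rating threshold for EVERY requested
// attribute. (The old location-level search would pool all participants.)
function locationMatchesAnd(participants, attrs, min, max) {
  return participants.some(p =>
    attrs.every(a => p.ratings[a] >= min && p.ratings[a] <= max)
  );
}
```

Using the worked example from the post (A and B rate D19 at 5 but D30 at 3; C and D rate D19 at 3 but D30 at 5), a 4-5 search for D19 AND D30 now correctly excludes the location, because no single participant meets both.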
I spent the remainder of the week on LDNA duties, continuing to work on the ‘sparkline’ visualisations for thematic heading categories. Most of the time was actually spent creating a new API for the Historical Thesaurus, which at this stage is used solely to output data for the visualisations. It took a fair amount of time to get the required endpoints working, and to create a nice index page that lists the endpoints with examples of how each can be used. It seems to be working pretty well now, though, including facilities to output the data in JSON or CSV format. The latter proved to be slightly tricky to implement due to the way that the data for each decade was formatted. I wanted each decade to appear in its own column, so as to roughly match the format of Marc’s original Excel spreadsheet, and this meant having to rework how the multi-level associative array was processed.
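The CSV reshaping boils down to something like the following sketch. The real endpoint is written in PHP and the field names will differ, but the idea of pivoting a category-by-decade associative array into one column per decade is the same:

```javascript
// Sketch: turn a nested map of category -> decade -> count into CSV rows
// with one column per decade, roughly matching a spreadsheet layout.
// Decades missing from a category's data are filled with 0.
function toCsv(data, decades) {
  const header = ['category', ...decades].join(',');
  const rows = Object.entries(data).map(([cat, counts]) =>
    [cat, ...decades.map(d => counts[d] ?? 0)].join(',')
  );
  return [header, ...rows].join('\n');
}
```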
This week was a pretty busy one, working on a number of projects and participating in a number of meetings. I spent a bit of time working on Bryony Randall’s New Modernist Editing project. This involved starting to plan the workshop on TEI and XML – sorting out who might be participating, where the workshop might take place, what it might actually involve and things like that. We’re hoping it will be a hands-on session for postgrads with no previous technical experience of transcription, but we’ll need to see if we can get a lab booked that has Oxygen available first. I also worked with the facsimile images of the Woolf short story that we’re going to make a digital edition of. The Woolf estate wants a massive copyright statement to be plastered across the middle of every image, which is a little disappointing as it will definitely affect the usefulness of the images, but we can’t do anything about that. I also started to work with Bryony’s initial Word based transcription of the short story, thinking how best to represent this in TEI. It’s a good opportunity to build up my experience of Oxygen, TEI and XML.
I also updated the data for the Mapping Metaphor project, which Wendy has continued to work on over the past few months. We now have 13,083 metaphorical connections (down from 13,931), 9,823 ‘first lexemes’ (up from 8,766) and 14,800 other lexemes (up from 13,035). We also now have 300 categories completed, up from 256. I also replaced the old ‘Thomas Crawford’ part of the Corpus of Modern Scottish Writing with my reworked version. The old version was a WordPress site that hadn’t been updated since 2010 and was a security risk. The new version (http://www.scottishcorpus.ac.uk/thomascrawford/) consists of nothing more than three very simple PHP pages and is much easier to navigate and use.
I had a few Burns related tasks to take care of this week. Firstly there was the usual ‘song of the week’ to upload, which I published on Wednesday as usual (see http://burnsc21.glasgow.ac.uk/ye-jacobites-by-name/). I also had a chat with Craig Lamont about a Burns bibliography that he is compiling. This is currently in a massive Word document but he wants to make it searchable online so we’re discussing the possibilities and also where the resource might be hosted. On Friday I had a meeting with Ronnie Young to discuss a database of Burns paper that he has compiled. The database currently exists as an Access database with a number of related images and he would like this to be published online as a searchable resource. Ronnie is going to check where the resource should reside and what level of access should be given and we’ll take things from there.
I had been speaking to the other developers across the College about the possibility of meeting up semi-regularly to discuss what we’re all up to and where things are headed and we arranged to have a meeting on Tuesday this week. It was a really useful meeting and we all got a chance to talk about our projects, the technologies we use, any cool developments or problems we’d encountered and future plans. Hopefully we’ll have these meetings every couple of months or so.
We had a bit of a situation with the Historical Thesaurus this week relating to someone running a script to grab every page of the website in order to extract the data from it, which is in clear violation of our terms and conditions. I can’t really go into any details here, but I had to spend some of the week identifying when and how this was done and speaking to Chris about ensuring that it can’t happen again.
The rest of my week was spent on the SCOSYA project. Last week I updated the ‘Atlas Display Options’ to include accordion sections for ‘advanced attribute search’ and ‘my map data’. I’m still waiting to hear back from Gary about how he would like the advanced search to work, so instead I focussed on the ‘my map data’ section. This section will allow people to upload their own map data using the same CSV format as the atlas download files in order to visualise this data on the map. I managed to make some pretty good progress with this feature. First of all I needed to create new database tables to house the uploaded data. Then I needed to add in a facility to upload files. I decided to use the ‘dropzone.js’ scripts that I had previously used for uploading the questionnaires to the CMS. This allows the user to drag and drop one or more files into a section of the browser and for this data to then be processed in an AJAX kind of way. This approach works very well for the atlas as we don’t want the user to have to navigate away from the atlas in order to upload the data – everything needs to be managed from within the ‘display options’ slideout section.
I contemplated adding the facility to process the uploaded files to the API but decided against it as I wanted to keep the API ‘read only’ rather than also handling data uploads and deletions. So instead I created a stand-alone PHP script that takes the uploaded CSV files and adds them to the database tables I had created. This script then echoes out some log messages that then get pulled into a ‘log’ section of the display in an AJAX manner.
I then had to add in a facility to list previously uploaded files. I decided the query for this should be part of the API as it is a ‘GET’ request. However, I needed to ensure that only the currently logged in user was able to access their particular list of files. I didn’t want anyone to be able to pass a username to the API and then get that user’s files – the passed username must also correspond to the currently logged in user. I did some investigation about securing an API, using access tokens and things like that but in the end I decided that accessing the user’s data would only ever be something that we would want to offer through our website and we could therefore just use session authentication to ensure the correct user was logged in. This doesn’t really fit in with the ethos of a RESTful API, but it suits our purposes ok so it’s not really an issue.
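In essence the check is very simple: the username in the request must match the user currently held in the session, otherwise no files are returned. This is a simplified JavaScript sketch of the logic – the real endpoint is a PHP script using PHP session data, and the names here are illustrative:

```javascript
// Sketch of session-gated file listing: refuse to return a user's uploads
// unless the requested username matches the logged-in session user.
function listUserFiles(requestedUser, session, filesByUser) {
  if (!session.user || session.user !== requestedUser) {
    return { error: 'Not authorised', files: [] };
  }
  return { files: filesByUser[requestedUser] || [] };
}
```

As noted above, access tokens would be the more RESTful route, but session authentication is perfectly adequate when the API is only ever consumed by the project’s own website.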
With the API updated to be able to accept requests for listing a user’s data uploads I then created a facility in the front-end for listing these files, ensuring that the list automatically gets updated with each new file upload. You can see the work-in-progress ‘my map data’ section in the following screenshot.
I had a very relaxing holiday last week and returned to work on Monday. When I got back to work I spent a bit of time catching up with things, going through my emails, writing ‘to do’ lists and things like that and once that was out the way I settled back down into some actual work.
I started off with Mapping Metaphor. Wendy had noticed a bug with the drop-down ‘show descriptors’ buttons in the search results page, which I swiftly fixed. Carole had also completed work on all of the Old English metaphor data, so I uploaded this to the database. Unfortunately, this process didn’t go as smoothly as previous data uploads due to some earlier troubles with this final set of data (this data was the dreaded ‘H27’ data, which originally was one category but which was split into several smaller categories, which caused problems for Flora’s Access database that the researchers were using).
Rather than updating rows, the data upload added new ones, and this was because the ordering of cat1 and cat2 appeared to have been reversed since stage 4 of the data processing. For example, in the database cat1 is ‘1A16’ and cat2 is ‘2C01’ but in the spreadsheet these are the other way round. Thankfully this was consistently the case, so once identified it was easy to rectify the problem. For Old English we now have a complete set of metaphorical connections, consisting of 2,488 connections and 4,985 example words. I also fixed a slight bug in the category ordering for OE categories and replied to Wendy about a query she had received regarding access to the underlying metaphor data. After that I updated a few FAQs and all was complete.
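The fix boiled down to matching each spreadsheet row against existing database rows with the category pair tried in either order – something like this sketch (field names are illustrative; the real upload script is PHP):

```javascript
// Sketch: find an existing connection row for a category pair, accepting
// the pair in either order, so a cat1/cat2 reversal in the source data
// results in an update rather than a duplicate insert.
function findExisting(dbRows, cat1, cat2) {
  return dbRows.find(r =>
    (r.cat1 === cat1 && r.cat2 === cat2) ||
    (r.cat1 === cat2 && r.cat2 === cat1)
  ) || null;
}
```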
Also this week I undertook some more AHRC work, which took up the best part of a day, and I replied to a request from Gavin Miller about a Medical Humanities Network mailing list. We’ve agreed a strategy to implement such a thing, which I hope to undertake next week. I also chatted to Chris about migrating the Scots Corpus website to a new server. The fact that the underlying database is PostGreSQL rather than MySQL is causing a few issues here, but we’ve come up with a solution to this.
I spent a couple of days this week working on the SCOSYA project, continuing with the updates to the ‘consistency data’ views that I had begun before I went away. I added an option to the page that allows staff to select which ‘code parents’ they want to include in the output. This defaults to ‘all’ but you can narrow this down to any of them as required. You can also select or deselect ‘all’ which ticks / unticks all the boxes. The ‘in-browser table’ view now colour codes the codes based on their parent, with the colour assigned to a parent listed across the top of the page. The colours are randomly assigned each time the page loads so if two colours are too similar a user can reload the page and different ones will take their place.
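One simple way to get a fresh set of distinguishable colours on each page load is to space hues evenly around the colour wheel from a random starting point. This is only a sketch of the idea – the page itself may well assign its random colours differently:

```javascript
// Sketch: assign each code parent an evenly spaced hue, offset by a random
// amount so that every page reload produces a different palette.
function assignParentColours(parents) {
  const colours = {};
  const step = 360 / parents.length;
  const offset = Math.random() * 360; // new palette on every reload
  parents.forEach((p, i) => {
    colours[p] = `hsl(${Math.round((offset + i * step) % 360)}, 70%, 50%)`;
  });
  return colours;
}
```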
Colour coding is not possible in the CSV view as CSV files are plain text and can’t have any formatting. However, I have added the colour coding to the chart view, which colours both the bars and the code text based on each code’s parent. I’ve also added in a little feature that allows staff to save the charts.
I then added in the last remaining required feature to the consistency data page, namely making the figures for ‘percentage high’ and ‘percentage low’ available in addition to ‘percentage mixed’. In the ‘in-browser’ and CSV table views these appear as new rows and columns alongside ‘% mixed’, giving you the figures for each location and for each code. In the Chart view I’ve updated the layout to make a ‘stacked percentage’ bar chart. Each bar is the same height (100) but is split into differently coloured sections to reflect the parts that are high, low and mixed. I’ve made ‘mixed’ appear at the bottom rather than between high and low as mixed is most important and it’s easier to track whatever is at the bottom. This change in chart style does mean that the bars are no longer colour coded to match the parent code (as three colours are now needed per bar), but the x-axis label still has the parent code colour so you can still see which code belongs to each parent.
I spent most of the rest of the week working with the new OED data for the Historical Thesaurus. I had previously made a page that lists all of the HT categories and notes which OED categories match up, or if there are no matching OED categories. Fraser had suggested that it would be good to be able to approach this from the other side – starting with OED categories and finding which ones have a matching HT category and which ones don’t. I created such a script, and I also updated both this and the other script so that the output either displays all of the categories or just those that don’t have matches (as these are the ones we need to focus on).
I then focussed on creating a script that matches up HT and OED words for each category where the HT and OED categories match up. What the script does is as follows:
- Finds each HT category that has a matching OED category
- Retrieves the lists of HT and OED words in each
- For each HT word displays it and the HT ‘fulldate’ field
- For each HT word it then checks to see if an OED word matches. This checks the HT’s ‘wordoed’ column against the OED’s ‘ght_lemma’ column and also the OED’s ‘lemma’ column (as I noticed sometimes the ‘ght_lemma’ column is blank but the ‘lemma’ column matches)
- If an OED word matches the script displays it and its dates (OED ‘sortdate’ (=start) and ‘enddate’)
- If there are any additional OED words in the category that haven’t been matched to an HT word these are then displayed
Note that this script has to process every word in the HT and every word in the OED thesaurus data, so it’s rather a lot of data. I tried running it on the full dataset but this resulted in Firefox crashing – and Chrome too. For this reason I’ve added a limit on the number of categories that are processed. By default the script starts at 0 and processes 10,000 categories. ‘Data processing complete’ appears at the bottom of the output so you can tell it’s finished, as sometimes a browser will just silently stop processing. You can look at a different section of the data by passing parameters to it – ‘start’ (the row to start at) and ‘rows’ (the number of rows to process). I’ve tried it with 50,000 categories and it worked for me, but any more than that may result in a crashed browser. I think the output is pretty encouraging. The majority of OED words appear to match up, and for the OED words that don’t I could create a script that lists these and we could manually decide what to do with them – or we could just automatically add them, but there are definitely some in there that should match, such as HT ‘Roche(‘s) limit’ and OED ‘Roche limit’. After that I guess we just need to figure out how we handle the OED dates. Fraser, Marc and I are meeting next week to discuss how to take this further.
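The matching steps listed above could be sketched roughly as follows. The column names (wordoed, ght_lemma, lemma) are the ones mentioned in the post; everything else – the function shape, the data structures – is illustrative rather than the script’s actual code:

```javascript
// Sketch of per-category word matching: for each HT word, look for an OED
// word whose ght_lemma OR lemma matches the HT 'wordoed' value (the
// ght_lemma column is sometimes blank where lemma still matches).
// Leftover OED words with no HT match are reported separately.
function matchCategoryWords(htWords, oedWords) {
  const matched = [];
  const usedOed = new Set();
  for (const ht of htWords) {
    const oed = oedWords.find(o =>
      !usedOed.has(o) && (o.ght_lemma === ht.wordoed || o.lemma === ht.wordoed)
    );
    if (oed) usedOed.add(oed);
    matched.push({ ht: ht.wordoed, oed: oed ? oed.lemma : null });
  }
  const unmatchedOed = oedWords.filter(o => !usedOed.has(o)).map(o => o.lemma);
  return { matched, unmatchedOed };
}
```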
I was left with a bit of time on Friday afternoon which I spent attempting to get the ‘essentials of Old English’ app updated. A few weeks ago it was brought to my attention that some of the ‘C’ entries in the glossary were missing their initial letters. I’ve fixed this issue in the source code (it took a matter of minutes) but updating the app even for such a small fix takes rather a lot longer than this. First of all I had to wait until gigabytes of updates had been installed for macOS and Xcode, as I hadn’t used either for a while. After that I had to update Cordova, and then I had to update the Android developer tools. Cordova kept failing to build my updated app because it said I hadn’t accepted the license agreement for Android, even though I had! This was hugely frustrating, and eventually I figured out the problem was I had the Android tools installed in two different locations. I’d updated Android (and accepted the license agreements) in one place, but Cordova uses the tools installed in the other location. After realising this I made the necessary updates and finally my project built successfully. Unfortunately about three hours of my Friday afternoon had by that point been used up and it was time to leave. I’ll try to get the app updated next week, but I know there are more tedious hoops to jump through before this tiny fix is reflected in the app stores.
I spent a lot of this week continuing to work on the Atlas and the API for the SCOSYA project, tackling a couple of particularly tricky feature additions, amongst other things. The first tricky thing was adding in the facility to add all of the selected search options to the page URL so as to allow people to bookmark and share URLs. This feature should be pretty useful for people, allowing them to save and share specific views of the atlas. People will also be able to cite exact views in papers and we’ll be able to add a ‘cite this page’ feature to the atlas too. It will also form the basis for the ‘history’ feature I’m going to develop, which will track all of the views a user has created during a particular session. There were two things to consider when implementing this feature, firstly getting all of the search options added to the address bar, and secondly adding in the facilities to get the correct options selected and the correct map data loaded when someone loads a page containing all of the search options. Both tasks were somewhat tricky.
So far so good, but I still had to process the address bar to extract the search criteria variables and then build the search facilities, including working out how many ‘attribute’ boxes to generate, which attributes should be selected, which limits should be highlighted and which joiners between variables needed to be selected. This took some figuring out, but it was hugely satisfying to see it all coming together piece by piece. With all of the search boxes pre-populated based on the information in the address bar the only thing left to do was automatically fire the ‘perform search’ code. With that in place we now have a facility to store and share exact views of the atlas, which I’m rather pleased with.
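The general idea can be sketched like this: serialise the selected search options into the URL’s query string so a view can be bookmarked, and rebuild the state when such a URL is loaded. The parameter names here are invented for illustration – the real atlas has its own set of options:

```javascript
// Sketch: round-trip atlas search state through a URL query string.
// Attribute boxes are numbered (attr1, attr2, ...) so the page knows how
// many boxes to generate when restoring the state.
function stateToQuery(state) {
  const params = new URLSearchParams();
  state.attributes.forEach((a, i) => params.set(`attr${i + 1}`, a));
  params.set('rating', state.rating);
  params.set('age', state.age);
  return params.toString();
}

function queryToState(query) {
  const params = new URLSearchParams(query);
  const attributes = [];
  for (let i = 1; params.has(`attr${i}`); i++) {
    attributes.push(params.get(`attr${i}`));
  }
  return { attributes, rating: params.get('rating'), age: params.get('age') };
}
```

Once the state has been rebuilt from the URL, the page pre-populates the search boxes and fires the ‘perform search’ code automatically, as described above.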
I also decided to implement a ‘full screen’ view for the map, mainly because I’d tried it out on my mobile phone and the non-map parts of the page really cluttered things up and made the map pretty much impossible to use. Thankfully there are already a few Leaflet plugins that provide ‘full screen’ functionality and I chose to use this one: https://github.com/brunob/leaflet.fullscreen. It works very nicely – just like the ‘full screen’ option in YouTube – and with it added in the atlas becomes much more usable on mobile devices, and much prettier on any screen, actually.
The second big feature addition I focussed on was ensuring that ‘absent’ data points also appear on the map when a search is performed. There are two different types of ‘absent’ data points – locations where no data exists for the chosen attributes (we decided these would be marked with grey squares) and locations where there is data for the chosen attributes but it doesn’t meet the threshold set in the search criteria (e.g. ratings of 4 or 5 only). These were to be marked with grey circles. Adding in the first type of ‘absent’ markers was fairly straightforward, but the second type raised some tricky questions.
For one attribute this is relatively straightforward – if there isn’t at least one rating for the attribute at the location with the supplied limits (age, number of people, rating) then see if there are any ratings without these limits applied. If so then return these and display these locations with a grey circle.
But what happens if there are multiple attributes? How should these be handled when different joiners are used between attributes? If the search is ‘x AND y NOT z’ without any other limits should a location that has x and y and z be returned as a grey circle? What about a location that has x but not y? Or should both these locations just be returned as a grey square because there is no recorded data that matches the criteria?
Should locations with grey circles have to match the criteria (x AND y NOT z) but ignore the other limits – e.g. for the query ‘x in old group rated by 2 people giving it 4-5 AND y in young group rated by 2 people giving it 1-2 NOT z in all ages rated by 1 or more giving it 1-5’? In that case a further query that ignores the limits would run, and any locations that appear in this query but not in the full query would be displayed as grey circles. All other locations would be displayed as grey squares.
Deciding on a course of action for this required consultation with other team members, so Gary, Jennifer and I are going to meet next Monday. I managed to get the grey circles working for a single attribute as described above. It’s just the multiple attributes that are causing some uncertainty.
Other than SCOSYA work I did a few other things this week. Windows 10 decided to upgrade to ‘anniversary edition’ without asking me on Tuesday morning when I turned on my PC, which meant I was locked out of my PC for 1.5 hours while the upgrade took place. This was hugely frustrating as all my work was on the computer and there wasn’t much else I could do. If only it had given me the option of installing the update when I next shut down the PC I wouldn’t have wasted 1.5 hours of work time. Very annoying.
Anyway, I did some AHRC review work this week. I also fixed a couple of bugs in the Mapping Metaphor website. Some categories were appearing out of order when viewing data via the tabular view. These categories were the ‘E’ ones – e.g. ‘1E15’. It turns out that PHP was considering these category IDs to be numbers written using ‘E notation’ (see https://en.wikipedia.org/wiki/Scientific_notation#E_notation). Even when I explicitly cast the IDs as strings PHP still treated them as numbers, which was rather annoying. I eventually solved the problem by adding a space character before the category ID for table ordering purposes. Having this space made PHP treat the value as an actual string rather than a number. I also took the opportunity to update the staff ‘browse categories’ pages of the main site and the OE site to ensure that the statistics displayed were the same as the ones on the main ‘browse’ pages – i.e. they include duplicate joins as discussed in a post from a few weeks ago.
I also continued my email conversation with Adrian Chapman about a project he is putting together and I spent about half a day working on the Historical Thesaurus again. Over the summer the OED people sent us the latest version of their Thesaurus data, which includes a lot of updated information from the OED, such as more accurate dates for when words were first used. Marc, Fraser and I had arranged to meet on Friday afternoon to think about how we could incorporate this data into the Glasgow Historical Thesaurus website. Unfortunately Fraser was ill on Friday so the meeting had to be postponed, but I’d spent most of the morning looking at the OED people’s XML data plus a variety of emails and Word documents about the data and had figured out an initial plan for matching up their data with ours. It’s complicated somewhat because there is no ‘foreign key’ we can use to link their data to ours. We have category and lexeme IDs and so do they, but these do not correspond to each other. They include our category notation information (e.g. 01.01) but we have reordered the HT several times since the OED people got the category notation, so what they consider to be category ’01.01’ and what we do don’t match up. We’re going to have to work some magic with some scripts in order to reassemble the old numbering. This could quite easily turn into a rather tricky task. Thankfully we should only have to do it once, because once it’s done this time I’m going to add new columns to our tables that contain the OED’s IDs so in future we can just use these.
It was a four-day week for me this week as I’d taken Friday off. I spent a fair amount of time this week continuing to work on the Atlas interface for the SCOSYA project, in preparation for Wednesday, when Gary was going to demo the Atlas to other project members at a meeting in York. I spent most of Monday and Tuesday working on the facilities to display multiple attributes through the Atlas. This has been quite a tricky task and has meant massively overhauling the API as well as the front end so as to allow for multiple attribute IDs and Boolean joining types to be processed.
In the ‘Attribute locations’ section of the ‘Atlas Display Options’ menu underneath the select box there is now an ‘Add another’ button. Pressing on this slides down a new select box and also options for how the previous select box should be ‘joined’ with the new one (either ‘and’, ‘or’ or ‘not’). Users can add as many attribute boxes as they want, and can also remove a box by pressing on the ‘Remove’ button underneath it. This smoothly slides up the box and removes it from the page using the always excellent jQuery library.
The Boolean operators (‘and’, ‘or’ and ‘not’) can be quite confusing to use in combination so we’ll have to make sure we explain how we are using them. E.g. ‘A AND B OR C’ could mean ‘(A AND B) OR C’ or ‘A AND (B OR C)’. These could give massively different results. The way I’ve set things up is to go through the attributes and operators sequentially. So for ‘A AND B OR C’ the API gets the dataset for A, checks this against the dataset for B and makes a new dataset containing only those locations that appear in both datasets. It then adds all of dataset C to this. So this is ‘(A AND B) OR C’. It is possible to do the ‘A AND (B OR C)’ search; you’d just have to rearrange the order so the select boxes are ‘B OR C AND A’.
Adding in ‘not’ works in the same sequential way, so if you do ‘A NOT B OR C’ this gets dataset A then removes from it those places found in dataset B, then adds all of the places found in dataset C. I would hope people would always put a ‘not’ as the last part of their search, but as the above example shows, they don’t have to. Multiple ‘nots’ are allowed too – e.g. ‘A NOT B NOT C’ will get the dataset for A, remove those places found in dataset B and then remove any further places found in dataset C.
Another thing to note is that the ‘limits’ are applied to the dataset for each attribute independently at the moment. E.g. for a search for ‘A AND B OR C’ with the limits set to ‘Present’ and age group ‘60+’, each of the datasets A, B and C will have these limits applied BEFORE the Boolean operators are processed. So the ratings in dataset A will only contain those that are ‘Present’ and ‘60+’; these will then be reduced to only include those locations that are also in dataset B (which only includes ratings that are ‘Present’ and ‘60+’), and then all of the ratings for dataset C (again, only those that are ‘Present’ and ‘60+’) will be added to this.
If the limits weren’t imposed until after the Boolean processes had been applied then the results could possibly be different – especially the ‘present’ / ‘absent’ limits as there would be more ratings for these to be applied to.
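The limit-then-combine approach can be sketched roughly as follows. The data shapes and function names here are my own invention for illustration, not the actual SCOSYA schema or API code:

```javascript
// Apply the limits to one attribute's ratings first, returning matching location IDs.
function limitedLocations(ratings, limits) {
  return ratings
    .filter(function (r) { return r.rating >= limits.minRating && limits.ages.includes(r.age); })
    .map(function (r) { return r.location; });
}

// Then combine the already-limited datasets strictly left to right.
function combine(datasets, joiners) {
  var result = new Set(datasets[0]);
  for (var i = 1; i < datasets.length; i++) {
    var next = new Set(datasets[i]);
    if (joiners[i - 1] === 'and') {
      // keep only locations present in both
      result = new Set([...result].filter(function (l) { return next.has(l); }));
    } else if (joiners[i - 1] === 'or') {
      next.forEach(function (l) { result.add(l); });    // add everything from the next dataset
    } else if (joiners[i - 1] === 'not') {
      next.forEach(function (l) { result.delete(l); }); // remove everything in the next dataset
    }
  }
  return [...result];
}

// 'A AND B OR C' evaluated as '(A AND B) OR C':
var andOr = combine([['Ayr', 'Skye'], ['Skye', 'Wick'], ['Oban']], ['and', 'or']);
// 'A NOT B':
var aNotB = combine([['Ayr', 'Skye'], ['Skye']], ['not']);
```

Because each dataset passed to `combine` has already been through `limitedLocations`, the limits take effect before any Boolean joining happens.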
I met with Gary a couple of times to discuss the above as these were quite significant additions to the Atlas. It will be good to hear the feedback he gets from the meeting this week and we can then refine the browse facilities accordingly.
I spent some further time this week on AHRC review duties and Scott Spurlock sent me a proposal document for me to review so I spent a bit of time doing so this week as well. I also spent a bit of time on Mapping Metaphor as Wendy had uncovered a problem with the Old English data. For some reason an empty category labelled ‘0’ was appearing on the Old English visualisations. After a bit of investigation it turned out this had been caused by a category that had been removed from the system (B71) still being present in the last batch of OE data that I uploaded last week. After a bit of discussion with Wendy and Carole I removed the connections that were linking to this non-existent category and all was fine again.
I met with Luca this week to discuss content management systems for transcription projects and I also had a chat this week with Gareth Roy about getting a copy of the Hansard frequencies database from him. As I mentioned last week, the insertion of the data has now been completed and I wanted to grab a copy of the MySQL data tables so we don’t have to go through all of this again if anything should happen to the test server that Gareth very kindly set up for the database. Gareth stopped the database and added all of the necessary files to a tar.gz file for me. The file was 13Gb in size and I managed to quickly copy this across the University network. I also began trying to add some new indexes to the data to speed up querying but so far I’ve not had much luck with this. I tried adding an index to the data on my local PC but after several hours the process was still running and I needed to turn off my PC. I also tried adding an index to the database on Gareth’s server whilst I was working from home on Thursday but after leaving it running for several hours the remote connection timed out and left me with a partial index. I’m going to have to have another go at this next week.
It’s now been four years since I started this job, so that’s four years’ worth of these weekly posts that are up here now. I have to say I’m still really enjoying the work I’m doing here. It’s still really rewarding to be working on all of these different research projects. Another milestone was reached this week too – the Hansard semantic category dataset that I’ve been running through the grid in batches over the past few months in order to insert it into a MySQL database has finally completed! The database now has 682,327,045 rows in it, which is by some considerable margin the largest database I’ve ever worked with. Unfortunately as it currently stands it’s not going to be possible to use the database as a data source for web-based visualisations as a simple ‘Select count(*)’ to return the number of rows took just over 35 minutes to execute! I will see what can be done to speed things up over the next few weeks, though. At the moment I believe the database is sitting on what used to be a desktop PC, so moving it to a meatier machine with lots of memory might speed things up considerably. We’ll see how that goes.
I met with Scott Spurlock on Tuesday to discuss his potential Kirk Sessions crowdsourcing project. It was good to catch up with Scott again and we’ve made the beginnings of a plan about how to proceed with a funding application, and also what software infrastructure we’re going to try. We’re hoping to use the Scripto tool (http://scripto.org/), which in itself is built around MediaWiki, in combination with the Omeka content management system creator (https://omeka.org/), which is a tool I’ve been keen to try out for some time. This is the approach that was used by the ‘Letters of 1916’ project (http://letters1916.maynoothuniversity.ie/), whose talk at DH2016 I found so useful. We’ll see how the funding application goes and if we can proceed with this.
I also had my PDR session this week, which took up a fair amount of my time on Wednesday. It was all very positive and it was a good opportunity to catch up with Marc (my line manager) as I don’t see him very often. Also on Wednesday I had some communication with Thomas Widmann of the SLD as the DSL website had gone offline. Thankfully Arts IT Support got it back up and running again a matter of minutes after I alerted them. Thomas also asked me about the datafiles for the Scots School Dictionary app, and I was happy to send these on to him.
I gave some advice to Graeme Cannon this week about a project he has been asked to provide technical input costings for, and I also spent some time on AHRC review duties. Wendy also contacted me about updating the data for the main map and OE maps for Mapping Metaphor so I spent some time running through the data update processes. For the main dataset the number of connections has gone down from 15301 to 13932 (due to some connections being reclassified as ‘noise’ or ‘relevant’ rather than ‘metaphor’) while the number of lexemes has gone up from 10715 to 13037. For the OE data the number of metaphorical connections has gone down from 2662 to 2488 and the number of lexemes has gone up from 3031 to 4654.
The rest of my week was spent on the SCOSYA project, for which I continued to develop the prototype Atlas interface and the API. By Tuesday I had finished an initial version of the ‘attribute’ map (i.e. it allows you to plot the ratings for a specific feature as noted in the questionnaires). This version allowed users to select one attribute and to see the dots on a map of Scotland, with different colours representing the rating scores of 1-5 (an average is calculated by the system based on the number of ratings at a given location). I met with Gary and he pointed out that the questionnaire data in the system currently only has latitude / longitude figures for each speaker’s current address, so we’ve got too many spots on the map. These need to be grouped more broadly by town for the figures to really make sense. Settlement names are contained in the questionnaire filenames and I figured out a way of automatically querying Google Maps for this settlement name (plus ‘Scotland’ to disambiguate places) in order to grab a more generic latitude / longitude value for the place – e.g. http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=oxgangs+scotland
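The lookup boils down to building a URL like the one above and pulling the coordinates out of the JSON that comes back. A minimal sketch (the real lookup runs server-side in the upload scripts; these helper functions are my own illustration):

```javascript
// Build the geocoding request URL from a settlement name taken from a
// questionnaire filename ('scotland' is appended to disambiguate places).
function geocodeUrl(settlement) {
  var address = encodeURIComponent(settlement.toLowerCase() + ' scotland').replace(/%20/g, '+');
  return 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=' + address;
}

// Pull a generic lat/lng out of the JSON response, using the first result.
function firstLatLng(response) {
  var loc = response.results[0].geometry.location;
  return { lat: loc.lat, lng: loc.lng };
}
```

So `geocodeUrl('Oxgangs')` produces exactly the example URL above, and `firstLatLng` gives a single town-level coordinate to store against the questionnaire.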
There will be situations where there is ambiguity and multiple places are returned, but I just grab the first, and the locations can be ‘fine tuned’ by Gary via the CMS. I updated the CMS to incorporate just such facilities, in fact, and also updated the questionnaire upload scripts so that the Google Maps data is incorporated automatically from now on. With this data in place I then updated the API so that it spits out the new data rather than the speaker-specific data, and updated the atlas interface to use the new values too. The result was a much better map – fewer dots and better grouping.
I also updated the atlas interface so that it uses leaflet ‘circleMarkers’ rather than just ‘circles’, as this allows the markers to stay the same size at all map zoom levels where previously they looked tiny when zoomed out but then far too big when zoomed in. I added a thin black stroke around the markers too, to make the lighter coloured circles stand out a bit more on the map. Oh, I also changed the colour gradient to a more gradual ‘yellow to red’ approach, which works much better than the colours I was using before. Another small tweak was to move the atlas’s zoom in and out buttons to the bottom right rather than the top left, as the ‘Atlas Display Options’ slide-out menu was obscuring these. I never noticed as I never use these buttons as I just zoom in and out with the mouse scrollwheel, but Gary pointed out it was annoying to cover them up. I also prevented the map from resetting its location and zoom level every time a new search was performed, which makes it easier to compare search results. And I also prevented the scrollwheel from zooming in and out when the mouse is in the attribute drop-down list. I haven’t figured out a way to make the scrollwheel actually scroll the drop-down list as it really ought to yet, though.
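The marker styling amounts to a fixed-radius circleMarker with a thin black stroke and a fill colour picked from the gradient. Something along these lines would do it – the exact colour stops and option values here are my own guesses, not necessarily what the atlas uses:

```javascript
// Map an average rating (1-5) onto a yellow-to-red gradient.
function ratingColour(avg) {
  var t = (Math.min(Math.max(avg, 1), 5) - 1) / 4; // 0 at rating 1, 1 at rating 5
  var green = Math.round(255 * (1 - t));           // fade the green out: yellow -> red
  return 'rgb(255,' + green + ',0)';
}

// circleMarker options: a fixed pixel radius at every zoom level, with a thin
// black stroke so the paler markers stand out against the map.
function markerOptions(avg) {
  return { radius: 6, color: '#000', weight: 1, fillColor: ratingColour(avg), fillOpacity: 0.8 };
}

// In the atlas this would be used along the lines of:
// L.circleMarker([lat, lng], markerOptions(avgRating)).addTo(map);
```

The advantage of `circleMarker` over `circle` is exactly that the `radius` is in screen pixels rather than metres, so the dots stay a sensible size at every zoom level.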
I made a few visual tweaks to the map pop-up boxes, such as linking to the actual questionnaires from the ‘location’ atlas view (this will be for staff only) and including the actual average rating in the attribute view pop-up so you don’t have to guess what it is from the marker colour. Adding in links to the questionnaires involved reworking the API somewhat, but it’s worked out ok.
The prototype is working very nicely so far. What I’m going to try to do next week is allow for multiple attributes to be selected, with Boolean operators between them. This might be rather tricky, but we’ll see. I’ll finish off with a screenshot of the ‘attribute’ search, so you can compare how it looks now to the screenshot I posted last week:
This was a very short week for me as I was on holiday until Thursday. I still managed to cram a fair amount into my two days of work, though. On Thursday I spent quite a bit of time dealing with emails that had come in whilst I’d been away. Carole Hough emailed me about a slight bug in the Old English version of the Mapping Metaphor website. With the OE version all metaphorical connections are supposed to default to a strength of ‘both’ rather than ‘strong’ like with the main site. However, when accessing data via the quick and advanced search the default was still set to ‘strong’, which was causing some confusion as this was obviously giving different results to the browse facilities, which defaulted to ‘both’. Thankfully it didn’t take long to identify the problem and fix it. I also had to update a logo for the ‘People’s Voice’ project website, which was another very quick fix. Luca Guariento, who is the new developer for the Curious Travellers project, emailed me this week to ask for some advice on linking proper names in TEI documents to a database of names for search purposes and I explained to him how I am working with this for the ‘People’s Voice’ project, which has similar requirements. I also spoke to Megan Coyer about the ongoing maintenance of her Medical Humanities Network website and fixed an issue with the MemNet blog, which I was previously struggling to update. It would appear that the problem was being caused by an out of date version of the sFTP helper plugin, as once I updated that everything went smoothly.
I also set up a new blog for Rob Maslen, who wants to use it to allow postgrad students and others in the University to post articles about fantasy literature. I also managed to get Rob’s Facebook group integrated with the blog for his fantasy MLitt course. I’ve also got the web space set up for Rhona’s Edinburgh Gazetteer project, and extracted all of the images for this project too. I spent about half of Friday working on the Technical Plan for the proposal Alison Wiggins is putting together and I now have a clearer picture of how the technical aspects of the project should fit together. There is still quite a bit of work to do on this document, however, and a number of further questions I need to speak to Alison about before I can finish things off. Hopefully I’ll get a first draft completed early next week, though.
The remainder of my short working week was spent on the SCOSYA project, working on updates to the CMS. I added in facilities to create codes and attributes through the CMS, and also to browse these types of data. This includes facilities to edit attributes and view which codes have which attributes and vice-versa. I also began work on a new page for displaying data relating to each code – for example which questionnaires the code appears in. There’s still work to be done here, however, and hopefully I’ll get a chance to continue with this next week.
This week was another four-day week for me as I’d taken the Friday off. I will also be off until Thursday next week. I was involved in a lot of different projects and had a few meetings this week. Wendy contacted me this week with a couple of queries regarding Mapping Metaphor. One part of this was easy – adding a new downloadable material to the ‘Metaphoric’ website. This involved updating the ZIP files and changing a JSON file to make the material findable in the ‘browse’ feature. The other issue was a bit more troublesome. In the Mapping Metaphor ‘browse’ facilities in the main site, the OE site and the ‘Metaphoric’ site Carole had noticed that the number of metaphorical connections given for the top level categories didn’t match up with the totals given for the level two categories within these top level ones. E.g. the browse view gives the External World total as 13115, but adding up the individual section totals comes to 17828.
It took quite a bit of investigation to figure out what was causing this discrepancy, but I finally worked out how to make the totals consistent and applied the update to the main site, the OE site and the Metaphoric website (but not the app, as I’ll need to submit a new version to the stores to get the change implemented there).
There were inconsistencies in the totals at both the top level and level 2. These were caused by metaphorical connections within a single category being counted only once (e.g. a connection from Category 1 to Category 2 counts as 2 ‘hits’ – one for Category 1 and another for Category 2 – but a connection from Category 1 to another Category 1 only counts as one ‘hit’). This was also true for Level 2 categories – e.g. 1A to 1B is a ‘hit’ for each category but 1A to another 1A is only one ‘hit’.
It could be argued that this is an acceptable way to count things, but in our browse page we have to go from the bottom up as we display the number of metaphorical connections each Level 3 category is involved in. Here’s another example:
2C has 2 categories, 2C01 and 2C02. 2C01 has 127 metaphorical connections and 2C02 has 141, making a total of 268 connections. However, one of these connections is between 2C01 and 2C02, so in the Level 2 count ‘how many connections are there involving a 2C category in either cat1 or cat2’ this connection was only being counted once, meaning the 2C total was only showing 267 connections instead of 268.
It could be argued that 2C does only have 267 metaphorical connections, but as our browse page shows the individual number of connections for each Level 3 category we need to include these ‘duplicates’ otherwise the numbers for levels 1 and 2 don’t match up.
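The mismatch is easy to reproduce with a tiny dataset. The connection pairs below are invented for illustration, echoing the 2C example:

```javascript
// Each connection is a pair of level 3 category IDs; one connection
// links the two 2C subcategories to each other.
var connections = [
  ['2C01', '2C02'],
  ['2C01', '1A01'],
  ['2C02', '3B04']
];

// Level 3 count: a connection scores a 'hit' for each category it touches.
function hits(cat) {
  return connections.filter(function (c) { return c[0] === cat || c[1] === cat; }).length;
}

// Level 2 total counted as DISTINCT connections involving any 2C category:
var distinctTotal = connections.filter(function (c) {
  return c[0].indexOf('2C') === 0 || c[1].indexOf('2C') === 0;
}).length;

// Level 2 total counted bottom-up from the level 3 figures, as the browse page does:
var bottomUpTotal = hits('2C01') + hits('2C02'); // the 2C01-2C02 link is counted twice
```

Here `distinctTotal` is 3 but `bottomUpTotal` is 4, which is exactly the 267 vs 268 discrepancy from the real 2C data; the fix was to make every level consistently use the bottom-up count.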
Perhaps using the term ‘metaphorical connections’ on the browse page is misleading. We only have a total of 15,301 ‘metaphorical connections’ in our database. What we’re actually counting on the browse page is the number of times a category appears in a metaphorical connection, as either cat1, cat2 or both. But at least the figures used are now consistent.
On Monday I had a meeting with Gary Thoms to discuss further developments of the Content Management System for the SCOSYA project. We agreed that I would work on a number of different tasks for the CMS. This includes adding a new field to the template and ensuring the file upload scripts can process this, adding a facility to manually enter a questionnaire into the CMS rather than uploading a spreadsheet, adding example sentences and ‘attributes’ to the questionnaire codes and providing facilities in the CMS for these to be managed and creating some new ‘browse’ facilities to access the data. It was a very useful meeting and after writing up my notes from it I set to work on some of the tasks. By the end of my working week I had updated the file upload template, the database and the pages for viewing and editing questionnaires in the CMS. I had also created the database tables and fields necessary for holding information about example sentences and attributes and I created the ‘add record’ facility. There is still quite a lot to do here, and I’ll return to this after my little holiday. I’ll also need to get started on the actual map interface for the data too – the actual ‘atlas’.
On Tuesday I had a meeting with Rob Maslen to discuss a new website he wants to set up to allow members of the university to contribute stories and articles involving fantasy literature. We also discussed his existing website and some possible enhancements to this. I’ll aim to get these things done over the summer.
Last week Marc had contacted me about a new batch of Historical Thesaurus data that had been sent to us from the OED people and I spent a bit of time this week looking at the data. The data is XML based and I managed to figure out how it all fits together but as yet I’m having trouble seeing how it relates to our HT data.
For example ‘The Universe (noun)’ in the OED data has an ID of 1628 and a ‘path’ of ‘01.01’, which looks like it should correspond to our hierarchical structure, but in our system ‘The Universe (noun)’ has the number ‘01.01.10 n’. Also, the words listed in the OED data for this category are different to ours. We have the Old English words, which are not part of the OED data, but there are other differences too, e.g. the OED data has ‘creature’ but this is not in the HT data. Dates are different too, e.g. in our data ‘World’ is ‘1390-‘ while in the OED data it’s ‘?c1200’.
It doesn’t look to me like there is anything in the XML that links to our primary keys – at least not the ones in the online HT database. The ID in the XML for ‘The Universe (noun)’ is 1628 but in our system the ID for this category is 5635. The category with ID 1628 in our system is ‘Pool :: artificially confined water :: contrivance for impounding water :: weir :: place of’ which is rather different to ‘The Universe’!
I’ve also checked to see whether there might be an ID for each lexeme that is the same as our ‘HTID’ field (if there was then we could get to the category ID from this) but alas there doesn’t seem to be either. For example, the lexeme ‘world’ has a ‘refentry’ of ‘230262’ but this is the HTID for a completely different word in our system. There are ‘GHT’ (Glasgow Historical Thesaurus) tags for each word but frustratingly an ID isn’t one of them – only original lemma, dates and Roget category. I hope aligning the data is going to be possible as it’s looking more than a little tricky from my initial investigation. I’m going to meet with Marc and Fraser later in the summer to look into this in more detail.
On Wednesday I met with Rhona Brown from Scottish Literature to discuss a project of hers that is just starting and that I will be doing the technical work for. The project is a small grant funded by the Royal Society of Edinburgh and its main focus is to create a digital edition of the Edinburgh Gazetteer, a short-lived but influential journal that was published in the 1790s.
The Mitchell has digitised the journal and this week I managed to see the images for the first time. Our original plan was to run the images through OCR software in order to get some text that would be used behind the scenes for search purposes, with the images being the things the users will directly interact with. However, now I’ve seen the images I’m not so sure this approach is going to work as the print quality of the original materials is pretty poor. I tried running one of the images through Tesseract, which is the OCR engine Google uses for its Google Books project, and the results are not at all promising. Practically every word is wrong, although it looks like it has at least identified multiple columns – in places anyway. However, this is just a first attempt and there are various things I can do to make the images more suitable and possibly to ‘train’ the OCR software too. I will try other OCR software as well. We are also going to produce an interactive map of various societies that emerged around this time so I created an Excel template and some explanatory notes for Rhona to use to compile the information. I also contacted Chris Fleet of the NLS Maps department about the possibility of reusing the base map from 1815 that he very kindly helped us to use for the Burns highland tour feature. Chris got back to me very quickly to say this would be fine, which is great.
On Wednesday I also met with Frank Hopfgartner from HATII to discuss an idea he has had to visualise a corpus of German radio plays. We discussed various visualisation options and technologies, the use of corpus software and topic modelling and hopefully some of this was useful to him. I also spent some time this week chatting to Alison Wiggins via email about the project she is currently putting together. I am going to write the Technical Plan for the proposal so we had a bit of a discussion about the various technical aspects and how things might work. This is another thing that I will have to prioritise when I get back from my holidays. It’s certainly been a busy few days.