I returned to work after the Christmas holidays on Thursday this week, and spent the day dealing with a few issues that had cropped up whilst I’d been away. The DSL Advanced Search had stopped working on Wednesday. I remembered that this had happened a few years ago, caused by an issue with the Apache Solr search engine, which the advanced search uses. Previously, restarting the server had sorted the issue, but this didn’t work this time. Thankfully, after speaking to Chris about this we realised that Solr runs on an Apache Tomcat server rather than the main Apache web server software, and Tomcat had been updated the day before. It would appear that the update had stopped Solr working, but restarting Tomcat got things working again. I also made a minor tweak to the Scots Corpus for Wendy.
After that, and dealing with a few emails, I returned to the Historical Thesaurus timeline visualisations I’d created the day before the Christmas holidays. I’d emailed Marc and Fraser about these before the holidays and they’d got back to me with some encouraging comments. The initial visualisations I’d made only worked with the approximate start and end dates from the Thesaurus database – the ‘apps’ and ‘appe’ dates that give a single start and end date for each word. However, the actual dates for lexemes are considerably more complicated than this. In fact there are 18 fields relating to dates in the underlying database that allow different ranges of dates to be recorded, for example ‘OE + a1400/50–c1475 + a1746– now History’. Writing an algorithm that could process every different possible permutation of the date fields proved to be rather tricky and took quite a bit of time to get my head around. I managed to get an algorithm working by mid-morning on the Friday, and although this still needs quite a bit of detailed testing it does at least seem to work (and does work with the above example), giving a nice series of dots and dashes along a timeline.
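The date-parsing logic itself isn’t shown in the post, but the gist can be sketched as follows. This is a minimal illustration only: it takes the combined date string as input (the real algorithm works on the 18 separate database fields), and the numeric anchors for ‘OE’ are hypothetical.

```python
import re

# Hypothetical numeric anchors for Old English -- the real HT data records
# period information in separate database fields.
OE_START, OE_END = 700, 1149

def parse_point(token):
    """Collapse a token such as 'a1400/50' or 'c1475' (ante/circa prefixes,
    slashed variant years) to the first four-digit year it contains."""
    match = re.search(r"\d{4}", token)
    if not match:
        raise ValueError(f"no year in {token!r}")
    return int(match.group())

def parse_date_string(text):
    """Split a full date string such as 'OE + a1400/50-c1475 + a1746-'
    into (start, end) segments for plotting; open-ended ranges end at None."""
    segments = []
    for part in text.split("+"):
        part = part.strip()
        if part.upper().startswith("OE"):
            segments.append((OE_START, OE_END))
            continue
        ends = [p.strip() for p in re.split(r"[–-]", part)]
        end_token = ends[-1] if len(ends) > 1 else ""
        # A trailing label like 'now History' (no digits) means an open range.
        end = parse_point(end_token) if re.search(r"\d", end_token) else None
        segments.append((parse_point(ends[0]), end))
    return segments
```

Each (start, end) pair then becomes a dash on the timeline, with open-ended ranges drawn to the present day.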
Marc, Fraser and I met on Friday to discuss the timeline and how we might improve on it and integrate it into the site. Our meeting lasted almost three hours and was very useful. It looks like the feature I created just because I had some free time and wanted to experiment is going to be fully integrated with many aspects of the site. The only downside is there is now a massive amount of additional functionality we want to implement, and I know I’m going to be pretty busy with other projects once the new year properly gets under way, so it might take quite a while to get all this up and running. Still, it’s exciting. Also, whilst working through my algorithm I’d spotted some occurrences where the dates were wrong, in that they had a range of dates where a range did not make sense, e.g. ‘OE–c1200–a1500’. I generated a few CSV files with such rows (there are a couple of hundred) and Marc and Fraser are going to try and sort them out. After our lengthy meeting I started to add in pop-ups to the timeline, so that when you click on an item a pop-up opens displaying information about the word (e.g. the word and its full date text). I still need to do some work on this, but it’s good to get the basics in place. Here’s a screenshot showing the timeline using the full date fields and with a pop-up open:
After an enjoyable week’s holiday I returned to work on Monday, spending quite a bit of Monday catching up with some issues people had emailed me about whilst I was away, such as making further tweaks to the ‘Concise Scots Dictionary’ page on the DSL website for Rhona Alcorn (the page is now live if you’d like to order the book: http://dsl.ac.uk/concise-scots-dictionary/), speaking with Luca about a project he’s involved in the planning of that’s going to use some of the DSL data, helping Carolyn Jess-Cooke with some issues she was encountering when accessing one of her websites, giving some information to Brianna of the RNSN project about timeline tools we might use, and a few other such things.
I had a couple of queries from Wendy Anderson this week. The first was for Mapping Metaphor. Wendy wanted to grab all of the bidirectional metaphors in both the main and OE datasets, including all of their sample lexemes. I wrote a script that extracted the required data and formatted it as a CSV file, which is just the sort of thing she wanted. The second query was for all of the metadata associated with the Corpus of Modern Scots Writing texts. A researcher had contacted Wendy to ask for a copy, but although the metadata is in the database and can be viewed on a per-text basis through the website, we didn’t have the complete dataset in an easy-to-share format. I wrote a little script that queried the database and retrieved all of the data. I had to do a little digging into how the database was structured in order to do this, as it is a system that wasn’t developed by me. However, after a little bit of exploration I managed to write a script that grabbed the data about each text, including the multiple authors that can be associated with each text. I then formatted this as a CSV file and sent the outputted file to Wendy.
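The export boils down to a join that returns one row per text-author pair, collapsed into one CSV line per text. A rough Python equivalent of that step (the real script is PHP, and the column names here are made up):

```python
import csv
import io
from collections import OrderedDict

def rows_to_csv(rows):
    """Collapse one-row-per-author join results (text_id, title, author)
    into one CSV line per text, with multiple authors joined by '; '."""
    texts = OrderedDict()
    for text_id, title, author in rows:
        entry = texts.setdefault(text_id, {"title": title, "authors": []})
        if author:
            entry["authors"].append(author)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "title", "authors"])
    for text_id, entry in texts.items():
        writer.writerow([text_id, entry["title"], "; ".join(entry["authors"])])
    return out.getvalue()
```

Joining the authors with a separator keeps the file at one row per text, which is much easier to browse in a spreadsheet than the raw join output.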
I met with Gary on Monday to discuss some changes to the SCOSYA atlas and CMS that he wanted me to implement ahead of an event the team are at next week. This included adding Google Analytics to the website, updating the legend of the Atlas to make it clearer what the different rating levels meant, separating out the grey squares (which mean no data is present) and the grey circles (meaning data is present but doesn’t meet the specified criteria) into separate layers so they can be switched on and off independently of each other, making the map markers a little smaller, and adding in facilities to allow Gary to delete codes, attributes and code parents via the CMS. This all took a fair amount of time to implement, and unfortunately I lost a lot of time on Thursday due to a very strange situation with my access to the server.
I work from home on Thursdays and I had intended to work on the ‘delete’ facilities that day, but when I came to log into the server the files and the database appeared to have reverted to the state they had been in back in May – i.e. it looked like we had lost almost six months of data, plus all of the updates to the code I’d implemented during this time. This was obviously rather worrying and I spent a lot of time toing and froing with Arts IT Support to try and figure out what had gone wrong. This included restoring a backup from the weekend before, which strangely still seemed to reflect the state of things in May. I was getting very concerned about this when Gary noted that he was seeing two different views of the data on his laptop. In Safari his view of the data appeared to have ‘stuck’ at May, while in Chrome he could see the up-to-date dataset. I then realised that perhaps the issue wasn’t with the server after all, but instead that my home PC (and Safari on Gary’s laptop) was connecting to the wrong server. Arts IT Support’s Raymond Brasas suggested it might be an issue with my ‘hosts’ file and that’s when I realised what had happened. As the SCOSYA domain is an ‘ac.uk’ domain, and it takes a while for these domains to be set up, we had set up the server long before the domain was running, so to allow me to access the server I had added a line to the ‘hosts’ file on my PC to override what happens when the SCOSYA URL is requested. Instead of the URL being resolved by a domain name service, my PC went straight to the IP address of the server as I had entered it in my ‘hosts’ file. In May the SCOSYA site was moved to a new server with a new IP address, but the old server had never been switched off, so my home PC was still connecting to this old server. I had only encountered the issue this week because I hadn’t worked on SCOSYA from home since May. So, it turned out there was no problem with the server, or the SCOSYA data.
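For anyone unfamiliar with the trick, a ‘hosts’ file override looks something like this. The IP address below is a documentation placeholder and the hostname is illustrative, not the project’s real values:

```
# /etc/hosts (Linux/Mac) or C:\Windows\System32\drivers\etc\hosts (Windows)
# Requests for the hostname below skip DNS and go straight to this IP --
# which is exactly why the entry kept pointing at the old server after
# the site moved.
192.0.2.10    scosya.gla.ac.uk
```

The danger, as this episode shows, is that the override sits there silently for months and keeps working long after the DNS record it was standing in for has changed.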
I removed the line from my ‘hosts’ file, restarted my browser and immediately I could access the up to date site. All this took several hours of worry and stress, but it was quite a relief to actually figure out what the issue was and to be able to sort it.
I had intended to start setting up the server for the SPADE project this week, but the machine has not yet been delivered, so I couldn’t work on this. I did make a few further tweaks to the SPADE website, however, and responded to a couple of queries from Rachel about the SCOTS data and metadata, which the project will be using.
I also met with Fraser to discuss the ongoing issue of linking up the HT and OED data. We’re at the stage now where we can think about linking up the actual words with categories. I’d previously written a script that goes through each HT category that matches an OED category and compares the words in each, checking whether an HT word matches the text found in either the OED ‘ght_lemma’ or ‘lemma’ fields. After our meeting I updated the HT lexeme table to include extra fields for the ID of a matching OED lexeme and whether the lexeme had been checked. After that I updated the script to go through every matching category in order to ‘tick off’ the matching words within. The first time I ran my script it crashed the browser, but with a bit of tweaking I got it to successfully complete the second time. Here are some stats:
There are 655513 HT lexemes that are now matched up with an OED lexeme. There are 47074 HT lexemes that only have OE forms, so with 793733 HT lexemes in total this means there are 91146 HT lexemes that should have an OED match but don’t. Note, however, that we still have 12373 HT categories that don’t match OED categories and these categories contain a total of 25772 lexemes.
On the OED side of things, we have a total of 688817 lexemes, and of these 655513 now match an HT lexeme, meaning there are 33304 OED lexemes that don’t match anything. At least some of these will also be cleared up by future HT / OED category matches. Of the 655513 OED lexemes that now match, 243521 of them are ‘revised’. There are 262453 ‘revised’ OED lexemes in total, meaning there are 18932 ‘revised’ lexemes that don’t currently match an HT lexeme. I think this is all pretty encouraging as it looks like my script has managed to match up the bulk of the data. It’s just the several thousand edge cases that are going to be a bit more work.
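The figures above hang together arithmetically; a quick sanity check:

```python
# Cross-checking the lexeme counts quoted above.
ht_total, ht_matched, ht_oe_only = 793733, 655513, 47074
assert ht_total - ht_matched - ht_oe_only == 91146   # HT lexemes still unmatched

oed_total, oed_matched = 688817, 655513
assert oed_total - oed_matched == 33304              # OED lexemes unmatched

revised_total, revised_matched = 262453, 243521
assert revised_total - revised_matched == 18932      # 'revised' OED lexemes unmatched

print("all counts consistent")
```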
On Wednesday I met with Thomas Widmann of Scots Language Dictionaries to discuss our plans to merge all three of the SLD websites (DSL, SLD and Scuilwab) into one resource that will have the DSL website’s overall look and feel. We’re going to use WordPress as a CMS for all of the site other than the DSL’s dictionary pages, so as to allow SLD staff to very easily update the content of the site. It’s going to take a bit of time to migrate things across (e.g. making a new WordPress theme based on the DSL website, creating quick search widgets, updating the DSL dictionary pages to work with the WordPress theme), but we now have the basis of a plan. I’ll try to get started on this before the year is out.
Finally this week, I responded to a request from Simon Taylor to make a few updates to the REELS system, and I replied to Thomas Clancy about how we might use existing Ordnance Survey data in the Scottish Place-Names survey. All in all it has been a very busy week.
This week was a pretty busy one, working on a number of projects and participating in a number of meetings. I spent a bit of time working on Bryony Randall’s New Modernist Editing project. This involved starting to plan the workshop on TEI and XML – sorting out who might be participating, where the workshop might take place, what it might actually involve and things like that. We’re hoping it will be a hands-on session for postgrads with no previous technical experience of transcription, but we’ll need to see if we can get a lab booked that has Oxygen available first. I also worked with the facsimile images of the Woolf short story that we’re going to make a digital edition of. The Woolf estate wants a massive copyright statement to be plastered across the middle of every image, which is a little disappointing as it will definitely affect the usefulness of the images, but we can’t do anything about that. I also started to work with Bryony’s initial Word based transcription of the short story, thinking how best to represent this in TEI. It’s a good opportunity to build up my experience of Oxygen, TEI and XML.
I also updated the data for the Mapping Metaphor project, which Wendy has continued to work on over the past few months. We now have 13,083 metaphorical connections (down from 13,931), 9,823 ‘first lexemes’ (up from 8,766) and 14,800 other lexemes (up from 13,035). We also now have 300 categories completed, up from 256. I also replaced the old ‘Thomas Crawford’ part of the Corpus of Modern Scottish Writing with my reworked version. The old version was a WordPress site that hadn’t been updated since 2010 and was a security risk. The new version (http://www.scottishcorpus.ac.uk/thomascrawford/) consists of nothing more than three very simple PHP pages and is much easier to navigate and use.
I had a few Burns related tasks to take care of this week. Firstly there was the usual ‘song of the week’ to upload, which I published on Wednesday as usual (see http://burnsc21.glasgow.ac.uk/ye-jacobites-by-name/). I also had a chat with Craig Lamont about a Burns bibliography that he is compiling. This is currently in a massive Word document but he wants to make it searchable online so we’re discussing the possibilities and also where the resource might be hosted. On Friday I had a meeting with Ronnie Young to discuss a database of Burns paper that he has compiled. The database currently exists as an Access database with a number of related images and he would like this to be published online as a searchable resource. Ronnie is going to check where the resource should reside and what level of access should be given and we’ll take things from there.
I had been speaking to the other developers across the College about the possibility of meeting up semi-regularly to discuss what we’re all up to and where things are headed and we arranged to have a meeting on Tuesday this week. It was a really useful meeting and we all got a chance to talk about our projects, the technologies we use, any cool developments or problems we’d encountered and future plans. Hopefully we’ll have these meetings every couple of months or so.
We had a bit of a situation with the Historical Thesaurus this week relating to someone running a script to grab every page of the website in order to extract the data from it, which is in clear violation of our terms and conditions. I can’t really go into any details here, but I had to spend some of the week identifying when and how this was done and speaking to Chris about ensuring that it can’t happen again.
The rest of my week was spent on the SCOSYA project. Last week I updated the ‘Atlas Display Options’ to include accordion sections for ‘advanced attribute search’ and ‘my map data’. I’m still waiting to hear back from Gary about how he would like the advanced search to work, so instead I focussed on the ‘my map data’ section. This section will allow people to upload their own map data using the same CSV format as the atlas download files in order to visualise this data on the map. I managed to make some pretty good progress with this feature. First of all I needed to create new database tables to house the uploaded data. Then I needed to add in a facility to upload files. I decided to use the ‘dropzone.js’ scripts that I had previously used for uploading the questionnaires to the CMS. This allows the user to drag and drop one or more files into a section of the browser and for this data to then be processed in an AJAX kind of way. This approach works very well for the atlas as we don’t want the user to have to navigate away from the atlas in order to upload the data – it all needs to be managed from within the ‘display options’ slideout section.
I contemplated adding the facility to process the uploaded files to the API but decided against it as I wanted to keep the API ‘read only’ rather than also handling data uploads and deletions. So instead I created a stand-alone PHP script that takes the uploaded CSV files and adds them to the database tables I had created. This script then echoes out some log messages that then get pulled into a ‘log’ section of the display in an AJAX manner.
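The script itself isn’t reproduced here, but the validate-then-insert-then-log pattern it follows can be sketched in Python (the column names and the 1–5 rating range are assumptions on my part):

```python
import csv
import io

def ingest_csv(upload_text, save_row):
    """Parse an uploaded CSV (assumed 'location' and 'rating' columns), hand
    each valid row to save_row(location, rating), and return the log messages
    that get shown in the 'log' panel of the display."""
    log = []
    reader = csv.DictReader(io.StringIO(upload_text))
    for lineno, row in enumerate(reader, start=2):  # header is line 1
        try:
            rating = int(row["rating"])
        except (KeyError, ValueError):
            log.append(f"line {lineno}: skipped (bad or missing rating)")
            continue
        if not 1 <= rating <= 5:
            log.append(f"line {lineno}: skipped (rating out of range)")
            continue
        save_row(row["location"], rating)
        log.append(f"line {lineno}: saved {row['location']}")
    return log
```

Returning per-line messages rather than failing on the first bad row means a user with a slightly messy file still gets most of their data onto the map, plus a clear account of what was skipped.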
I then had to add in a facility to list previously uploaded files. I decided the query for this should be part of the API as it is a ‘GET’ request. However, I needed to ensure that only the currently logged in user was able to access their particular list of files. I didn’t want anyone to be able to pass a username to the API and then get that user’s files – the passed username must also correspond to the currently logged in user. I did some investigation about securing an API, using access tokens and things like that but in the end I decided that accessing the user’s data would only ever be something that we would want to offer through our website and we could therefore just use session authentication to ensure the correct user was logged in. This doesn’t really fit in with the ethos of a RESTful API, but it suits our purposes ok so it’s not really an issue.
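The session check amounts to a simple guard before the query runs. A framework-agnostic sketch of the idea (the names here are mine, not the project’s):

```python
def list_user_uploads(requested_user, session_user, fetch_uploads):
    """Guard for the 'list my uploads' API call: the username passed in the
    request must match the one held in the server-side session. fetch_uploads
    stands in for the real database query."""
    if session_user is None:
        raise PermissionError("not logged in")
    if requested_user != session_user:
        raise PermissionError("username does not match the logged-in user")
    return fetch_uploads(requested_user)
```

Because the session value is set by the server at login and never by the client, a request can’t simply swap in someone else’s username.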
With the API updated to be able to accept requests for listing a user’s data uploads I then created a facility in the front-end for listing these files, ensuring that the list automatically gets updated with each new file upload. You can see the work in progress ‘my map data’ section in the following screenshot.
I continued working on the SCOSYA project this week, further refining the ‘or’ search that I spent much of last week working on. Gary had noted that the ‘or’ search wasn’t working as intended (i.e. different icons representing different combinations of attributes) when the same attribute was selected but with different limit options – e.g. Attribute D3 rated 1-3 by young speakers OR Attribute D3 rated 1-3 by old speakers. Unfortunately, this is because all of my code was written around each attribute in an ‘or’ search being different – if two selected attributes are the same the code for splitting things up and assigning different icons simply doesn’t trigger. To get the display to work for the same attribute required some fairly major reworking of the code, which took rather a long time to get working. By the end of the week I’d got something that was sort of working in place. Now the ‘or’ search checks every selected ‘limit’ option including the attribute selection to see whether the selected attribute is the same or not. This means the ‘or’ search also works for selecting different scores, ‘interviewed by’ options and other limit options in addition to the ‘age’ selection. However, I’m noticing some errors in the choice of icons when more than two attributes are chosen. Specifically, it would appear that different attribute combinations for locations are being assigned the same icon, which is quite clearly a bug, so this is going to need some further work next week I’m afraid. Below is a screenshot of an ‘or’ search for the same attribute with different combinations of limit options, so you can see that (for two selected items at least) the ‘or’ search is now working better than last week.
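The fix, in essence, is to key icon assignment on the whole selection rather than on the attribute alone. A simplified sketch of that idea (the data shapes are assumed):

```python
def assign_icons(location_results):
    """Give every distinct combination of matched (attribute, limit-options)
    pairs its own icon index, so two locations share an icon only when they
    match exactly the same set of selections."""
    icon_for_combo = {}
    assignments = {}
    for location, matches in location_results.items():
        combo = tuple(sorted(matches))  # the full selection, not just attribute IDs
        if combo not in icon_for_combo:
            icon_for_combo[combo] = len(icon_for_combo)
        assignments[location] = icon_for_combo[combo]
    return assignments
```

Keying on the full tuple is what lets ‘D3 rated 1-3 by young speakers’ and ‘D3 rated 1-3 by old speakers’ get different icons even though the attribute is the same.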
I continued with more AHRC review work for a day or so this week and I also spent about a day reworking an old resource that was in desperate need of attention. The Thomas Crawford’s Diary section of the Corpus of Modern Scottish Writing (http://www.scottishcorpus.ac.uk/thomascrawford/) is a WordPress powered site that was set up a couple of years before I started in this post. The WordPress instance hasn’t been updated since the site launched in 2010 and is very out of date and almost certainly a security risk. Unfortunately, the software is so old that I can’t even upgrade it using the WordPress tools – I tried doing so before and the upgrade failed and the entire site broke. The site doesn’t really need to be a WordPress site, at least not now it’s launched and the day-by-day postings are well and truly over. Instead I’ve created a version that uses nothing more than a small amount of PHP scripting and stores all of the diary entries in a PHP array. I think the version I’ve created works a lot better than the existing version (navigating between diary entries is easier and the order they’re listed in makes more sense) and hopefully I’ll be able to replace the existing version with my new version soon. I’ve contacted Wendy to let her approve things before I take down the old site. Hopefully the switchover will be able to take place in the next week or so and this ancient WordPress instance can be deleted.
I had two meetings with members of staff this week. The first was with Johanna Green, who now works in HATII but previously worked for the School of Critical Studies. Whilst she was still in SCS we submitted a Chancellor’s Fund proposal to develop a ‘web app’ based around an exhibition in Special Collections, and this was granted funding. We are now starting to think about developing the app and we met to discuss the options. Helpfully, Johanna had produced a series of Powerpoint based mockups of how she would like the app to look and function. We went through these slides and we have a pretty good idea about how development should proceed. I’ve requested a subdomain for the site and once we have the space available, and Johanna has got back to me with some images and other content, I’ll start to develop an initial version of the app. Note that at this stage we are merely going to create a ‘web app’ – it will use purely client-side scripting and will be optimised for touchscreens but it won’t be ‘wrapped’ as an iOS or Android app. Instead it will be accessed via a web browser. However, I will ensure the code can be ‘wrapped’ at a later date if needs be.
My second meeting was with Hannah Tweed, who is in the process of submitting a proposal for funding for a project. I can’t really go into details about this here, but we met and discussed the data management aspects of her project and after the meeting I wrote a few paragraphs for her data management plan.
I also launched the second of our weekly Burns songs this week (see http://burnsc21.glasgow.ac.uk/contented-wi-little-c/) and I spent the remainder of the week continuing to work through the OED data import for the Historical Thesaurus with Fraser. I created a bunch of new scripts to process potential category matches that Fraser had identified. For example, where the HT has ‘something/something else’ and the OED has ‘something (something else)’ these categories aren’t being flagged as the same, so a little script to switch the formatting and then compare the category names helped to tick off a bunch of categories. At the start of the week we had 38,676 categories across the two datasets that were not marked as ‘checked’ and by the end of the week we had got this down to 23,595, which is pretty good going.
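That particular formatting switch is simple enough to sketch. This is a guess at the shape of the check, not the actual script:

```python
import re

def normalise_heading(heading):
    """Rewrite a trailing parenthesis, 'something (something else)', to the
    slash form 'something/something else' so the two datasets' headings can
    be compared directly."""
    return re.sub(r"\s*\(([^)]*)\)$", r"/\1", heading.strip()).lower()

def same_category(ht_heading, oed_heading):
    """True when the two headings match once both are normalised."""
    return normalise_heading(ht_heading) == normalise_heading(oed_heading)
```

Each such normalisation rule only ticks off a slice of the remaining categories, which is why a ‘bunch of new scripts’ were needed rather than one.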
A brief report this week as I’m off for my Easter hols soon and I don’t have much time to write. I will be off all of next week. It was a four-day week this week as Friday is Good Friday. Last week was rather hectic with project launches and the like but this week was thankfully a little calmer. I spent some time helping Chris out with an old site that urgently needed fixing and I spent about a day on AHRC duties, which I can’t go into here. Other than that I helped Jane with the data management plan for her ESRC bid, which was submitted this week. I also had a meeting with Gavin Miller and Jenny Eklöf to discuss potential collaboration tools for medical humanities people. This was a really interesting meeting and we had a great discussion about the various possible technical solutions for the project they are hoping to put together. I also spoke to Fraser about the Hansard data for SAMUELS but there wasn’t enough time to work through it this week. We are going to get stuck into it after Easter.
This was an important week, as the new version of the Historical Thesaurus went live! I spent most of Monday moving the new site across to its proper URL, testing things, updating links, validating the HTML, putting in redirects from the old site and ensuring everything was working smoothly and the final result was unveiled at the Samuels lecture on Tuesday evening. You can find the new version here:
It was also the week that the new version of the SCOTS corpus and CMSW went live too, and I spent a lot of time on Tuesday and Wednesday working on setting up these new versions, undertaking the same tasks as I did for the HT. The new version of SCOTS and CMSW can be accessed here:
I had to spend some extra time following the relaunch of SCOTS updating the ‘Thomas Crawford’s diary’ microsite. This is a WordPress powered site and I updated the header and footer template files so as to give each page of the microsite the same header and footer as the rest of the CMSW site, which looks a lot better than the old version, which didn’t have any link back to the main CMSW site.
I had a couple of meetings this week, the first with a PhD student who is wanting to OCR some 19th century Scottish court records. I received some very helpful pointers on this from Jenny Bann, who was previously involved with the massive amount of OCR work that went on in the CMSW project and was able to give some good advice on how to improve the success rate of OCR on historical documents.
My second meeting was with a member of staff in English Literature who is putting together an AHRC bid. Her project will involve the use of Google Books and the Ngram interface, plus developing some visualisations of links between themes in novels. We had a good meeting and should hopefully proceed further with the writing of the bid in future weeks. Having not used Ngrams much I spent a bit of time researching it and playing around with it. It’s a pretty amazing system and has lots of potential for research due to the size of the corpus and also the sophisticated query tools that are on offer.
I completed all of the outstanding work on the redevelopment of the CMSW website this week, including replacing the green colour scheme I’d previously chosen with a snazzy purple one. The new version of the site can be found here:
But I will be replacing the old SCOTS and CMSW sites with the new version in the next week or so, at which point the ‘new-design’ URLs will stop working.
The biggest challenge this week was creating a script that could process the digitised images of the texts, including displaying them in-page, displaying a thumbnail index view and generating ‘next’ and ‘previous’ navigation paths. These all used the JSON files I created last week that extracted image file IDs and directory names from the file structure on the server. As I’d hoped, I managed to get one single (and pretty simple) PHP script to handle all of the image management, where previously there existed an individual HTML page for each of the thousands of images in the site. It works very well, although as I’m embedding the full-size images in the page that’s generated (in order to make the standard view bigger than the more limited view the old site gave) it can take a while for the page images to load. If this is an issue I might have to revert to using the smaller images.
Also this week I completed work on the ‘microsites’ – ‘Life in old letters’ and the ‘Burns Kilmarnock Edition’. This mainly required tweaking the image navigation code I had previously created and wasn’t too tricky a task to complete. I’ve had to leave the update of the Thomas Crawford diary microsite until we go live with the new design as it’s a WordPress powered site and I don’t want to update the templates until everything is ready to go. I also updated the search input form to make the multi-select options (Genre and year group) look nicer – HTML multi-select boxes look pretty awful as it’s not possible to apply styles to the selected options so they just stay as the browser’s default colours (e.g. a garish blue in Firefox). Instead of this I’ve implemented checkboxes in a scrollable div, which looks a lot nicer.
There was a strike this week so I lost a day’s work, so there is less to report than normal. Other than the CMSW stuff I wrote a blog post for the Mapping Metaphor project blog and I began looking into the development of the fourth STELLA App – ‘Essentials of Old English’. I’m just at the planning stage with this for the moment though – going through the existing site and noting any possible tricky parts. I also read through the feedback the Course 20 students had given for the three existing STELLA Apps, which was all very positive apart from some suggestions about navigation paths in the Grammar app, something that I definitely need to look into.
This week I polished off the remaining tasks related to the redevelopment of the SCOTS website (mainly just updating site text) and moved on to the CMSW website. I made good progress with the redevelopment and by the end of the week I had completed work on all of the ancillary pages plus the search and browse mechanism. I’m still midway through tackling the ‘view document’ page, but most of the functionality is in place, such as viewing the document, viewing the document metadata, highlighting search terms etc.
There is still work to be done, however, as CMSW is quite different to SCOTS in some respects, namely that it provides access to colour and greyscale scans of the texts. In the original site some kind of HTML generator must have been used to create an individual HTML page for every image in the site (so many hundreds of pages). I could update the design of all of these by creating a batch process, but I think it would be better to use just one single PHP script to generate the pages as required instead. It’s a much more efficient way of doing things and will make updating the page design considerably easier in future. I created a PHP script that ran through all of the image directories and logged the image filenames as a JSON file, which I will then use as the ‘database’ for generating a count of the number of images of each document, the thumbnail index page for each document and the individual webpage for each full-size page image. I aim to get all of this done next week, all being well.
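A Python analogue of that indexing script (the real one is PHP, and the file-type filtering here is an assumption):

```python
import json
import os

def build_image_index(root):
    """Walk the image directories and record, per document directory, the
    sorted image filenames -- the JSON 'database' used to count a document's
    pages and to generate thumbnail and per-image views on demand."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        images = sorted(f for f in filenames
                        if f.lower().endswith((".jpg", ".jpeg", ".png", ".gif")))
        if images:
            index[os.path.relpath(dirpath, root)] = images
    return index

# The index would then be written out once, e.g.:
# with open("image-index.json", "w") as fh:
#     json.dump(build_image_index("documents/"), fh, indent=2)
```

With the filenames in one JSON file, a single page-generating script can look up any document’s images by position, which is all that ‘next’, ‘previous’ and the thumbnail index need.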
In addition to CMSW work I also gave some feedback to Jean about the web hosting agreement between the University and SLD. I’m going to need to arrange a meeting with SLD people soon as well as I will be unable to attend next Thursday’s meeting.
I also attended a couple of meetings this week, firstly a Mapping Metaphor meeting on Monday and then a meeting with the Burns people on Wednesday. The Mapping meeting was useful, especially as we discussed an abstract I have been involved in writing for a conference, and also the plans for the colloquium next year, where I will be running a testing session for the visualisation interface. The Burns meeting was a good opportunity to meet the new staff who have recently started working for the project, and we also discussed some of the outstanding project ideas such as the timeline and the interactive map. Hopefully we can get things moving with the map soon.