I had a few meetings to attend this week, the first of which was with the Burns people. It was good to meet up with the project again as it has been a while. The meeting was to discuss some of the technical implications of the next phase of the project, which is focussed on Burns’ songs for George Thomson. The team have a variety of different sub-projects / online exhibitions and other assorted material that will need to be presented through the project website (http://burnsc21.glasgow.ac.uk/) and we explored some of the technical possibilities for these. There will be another meeting to discuss some more general technical matters (e.g. getting the timeline published, updating the home page) some time in early May.
My next meeting was with Lesley Richmond from Special Collections and some people from HATII to discuss a possible project for visualising medical history. It sounds like an interesting project with lots of relevance to the School of Critical Studies, including the Medical Humanities people and also the semantic tagger and text processing people – plus I’m always interested in visualisations. However, the major sticking point is the timescale as the funders would be wanting the successful team to begin work in June. I’m rather too busy to be the major technical person on another project at this time, which is a shame as it sounds like the kind of project I would have liked to have been involved with. HATII have agreed to look into the possibility of finding a technical person to do the work, but I would hope that Critical Studies (and I) would still be able to contribute to the project in some capacity.
My third meeting was with David Borthwick from the School of Interdisciplinary Studies at the Dumfries campus. He is wanting to develop a digital resource to do with poetry and metaphor and Wendy Anderson told him I might be able to give some technical advice. We had a really interesting meeting and I managed to give him a number of useful pointers about what technical direction he might want to take and some of the issues he may wish to bear in mind. I won’t be able to be involved in his project any further as he is beyond the School of Critical Studies but it was good to provide some help at this stage.
Also this week I attended an event about digital preservation. My old boss Seamus Ross was discussing digital preservation issues past and present with William Kilbride, head of the DPC. It was enlightening and entertaining to hear them both discuss the issues and I wish I could have stayed for the full event but I had to leave after the first hour to collect my son from nursery. I’m glad I managed to sit in for the first hour, though.
In terms of actual development work, I spent a bit more time working with the Hansard data for the SAMUELS project this week. Last week I had managed to unzip the massive archive but had run into problems when attempting to ‘de-tar’ it. On my PC I just kept getting an error message stating that the archive was unreadable. I had left the data with Marc to see if he could extract the files on his Mac and I checked back with him this week and he had managed to successfully ‘de-tar’ the file! Marc suspects that the issue was being caused by long filenames or file extensions used in the tar file that Windows was unable to cope with. But whatever the reason I managed to get access to the actual files at last. Well, when I say the actual files, what I mean is the files that have been all joined together into one file using a method created by one of the guys at Lancaster. I still needed to split the files up into individual XML files using the Ruby Gem that Lancaster had created for this purpose.
And here is where I ran into further problems. The Gem script just kept giving me errors (‘Invalid argument @ rb_sysopen’ Errno::EINVAL) when I tried to run it on the files that Marc had extracted from the tar file. I tried a variety of approaches to check whether this was being caused by the path to the files or because the files were on an external hard drive but I still got the same error. The strange thing was that I wasn’t getting this error when I ran the Gem script on the data I had before the project meeting. After a bit of Googling it looked like the problem might be that the tar file had been extracted in OSX and I was trying to process the results in Windows. So I took the hard drive home and plugged it into my MacBook.
After installing the Gem I ran it and… success! It worked without any errors. So there must be some reason why extracting the tar file in OSX results in the file being unreadable by the Gem script under Windows. A bit more Googling about the error code that was displayed suggests it’s to do with the different way OSX and Windows handle line endings (Windows uses ‘\r\n’ and OSX uses ‘\n’).
I started with the smallest file (lords.mr, which is just over 10Gb) and it took about 6 hours to extract the data. All good so far, so I started extracting the next file (lords.tagged.v3.1.mr at 36Gb). But unfortunately I ran out of space on the external hard drive. This really perplexed me for a long time as based on the sizes returned by OSX’s ‘get info’ there should still have been 250Gb of free space on it! What I’ve since discovered is there is a massive structural overhead on the data, due to it consisting of hundreds of thousands of small files and directories. The extraction of lords.tagged.v3.1.mr got to a directory size of 12.7Gb before the script quit due to running out of space. But the amount of space this directory actually takes up on the disk (as opposed to the size of the actual data) is a whopping 75.9Gb! It’s quite astounding, really. Suddenly I understood why Lancaster had created a script to join all of the data together in one file! Fraser got in touch to say that he has ordered a new 2Tb external hard drive for the project and I will continue with the extraction once I have access to this drive.
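As a rough illustration of where this kind of overhead comes from: a file system hands out space in whole allocation units, so every small file is rounded up to at least one unit. The sketch below works through that arithmetic in Python. The cluster size and file sizes are hypothetical values chosen for illustration, not figures measured from the project drive, and the function name is made up for the example.

```python
def allocated_size(file_sizes, cluster_size):
    """Bytes occupied on disk when each file is rounded up to whole clusters."""
    total = 0
    for size in file_sizes:
        clusters = (size + cluster_size - 1) // cluster_size  # integer ceiling
        total += clusters * cluster_size
    return total


# Hypothetical figures: 100,000 small files of 2,000 bytes each on a volume
# with 32 KiB clusters (an assumed value, not taken from the actual drive).
sizes = [2_000] * 100_000
apparent = sum(sizes)
on_disk = allocated_size(sizes, cluster_size=32 * 1024)
print(f"apparent: {apparent:,} bytes, on disk: {on_disk:,} bytes")
```

With many thousands of tiny files the rounded-up total can be several times the size of the data itself, which is the same kind of gap as the 12.7Gb versus 75.9Gb seen above (roughly six times). Directory entries add further overhead that this sketch ignores.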
I spent the rest of the week continuing with the Essentials of Old English app. Picking up where I left off last week, I completed work on the ‘plus’ book and began work on the exercises. I adapted the structure of the exercises from the Grammar app that I had previously completed and created an infrastructure whereby the question and answer data is stored in a JSON file and is pulled into the exercise page with an exercise ID defining which set of data is used. This approach works quite well, although custom structures need to be written to handle the individual structures of each exercise. Thankfully a lot of the exercises follow the same pattern so only a handful of different structures needed to be written. In order to allow users to enter the Old English characters ‘æ’ and ‘þ’ when typing in answers to questions I decided to bypass the keyboard that a user would normally use (e.g. the on-screen keyboard of a mobile phone) and use my own custom keyboard. This was based on the custom keyboard I had previously created for the ARIES app, but with new keys for the required Old English characters. I think it works quite well and it also ensures that a device’s spell checker doesn’t try to correct any Old English spelling. An example of the keyboard (and the exercises in general) can be found here: http://www.arts.gla.ac.uk/STELLA/apps/eoe/basic-exercise.html?5
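The general idea of keeping question and answer data in JSON and selecting a set by exercise ID can be sketched as follows. This is a Python illustration of the approach rather than the app’s own code: the JSON structure, the field names and the helper functions are all hypothetical, and reading the ‘5’ at the end of the URL above as the exercise ID is an assumption.

```python
import json
from urllib.parse import urlsplit

# Hypothetical exercise data; the real file's structure is not shown in the post.
EXERCISES_JSON = """
{
  "5": {
    "type": "fill-in",
    "questions": [{"prompt": "Type the word: þæt", "answer": "þæt"}]
  }
}
"""


def exercise_id_from_url(url):
    """Assume the bare query string (e.g. '?5') carries the exercise ID."""
    return urlsplit(url).query


def load_exercise(raw_json, exercise_id):
    """Return the data set for one exercise, keyed by its ID."""
    exercises = json.loads(raw_json)
    try:
        return exercises[str(exercise_id)]
    except KeyError:
        raise ValueError(f"No exercise with ID {exercise_id!r}") from None


def check_answer(exercise, question_index, typed):
    """Compare a typed answer with the stored one, ignoring case and padding."""
    expected = exercise["questions"][question_index]["answer"]
    return typed.strip().casefold() == expected.casefold()


url = "http://www.arts.gla.ac.uk/STELLA/apps/eoe/basic-exercise.html?5"
exercise = load_exercise(EXERCISES_JSON, exercise_id_from_url(url))
print(exercise["type"], check_answer(exercise, 0, " Þæt "))
```

The ‘type’ field stands in for the point made above that each exercise pattern needs its own handling: a page could choose which handler to use based on that value.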
I managed to complete the ‘Basic’ exercises and I began work on the ‘Plus’ exercises. There are rather a lot of ‘Plus’ exercises though, so I’m not sure if I’ll get these all completed next week. We’ll see though.
I was off for Easter last week and spent a lovely, sunny week visiting family in Yorkshire. Upon returning from this relaxing week I got stuck into a few projects, the first of which was SAMUELS. At the final project meeting before Easter Fraser was given a hard drive with the complete Hansard data on it – a 40Gb tar.gz file. I got this off Fraser with a view to extracting the data and figuring out exactly what it contained and just what I would need to do with it. Unzipping the file took many hours and resulted in a tar file that was approaching 200Gb in size. Unfortunately, although the unzipping process appeared to complete successfully, when I attempted to ‘de-tar’ the file (i.e. split it up into its individual files) my zip program just gave an error message about the archive being unreadable. I repeated the extraction process, which took many more hours, but alas, the same error was given. I had a meeting with Marc and Fraser on Tuesday and Marc said he’d try to extract the files on his computer so I handed the hard drive over. I haven’t heard anything back from Marc yet but fingers crossed he has managed to make some progress. What I really need is a new desktop PC that has more storage and processing power as I’m currently rather hampered by the hardware I have access to.
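One alternative to the two-step ‘unzip, then de-tar’ route is to read the compressed archive as a stream and unpack its members in a single pass, which avoids writing the large intermediate tar file. The sketch below uses the `tarfile` module from Python’s standard library. It is only an illustration, not what was done here: the function name and paths are hypothetical, and it would not necessarily get round whatever made the archive unreadable on this particular PC.

```python
import tarfile


def extract_tar_gz(archive_path, dest_dir):
    """Unpack a .tar.gz in one streaming pass; return the number of members."""
    count = 0
    # "r|gz" reads the gzip data as a forward-only stream, so no
    # intermediate .tar file is ever written to disk.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            archive.extract(member, path=dest_dir)
            count += 1
    return count
```

A call would look like `extract_tar_gz("hansard.tar.gz", "hansard-data")`, where both names are placeholders. Only archives from a trusted source should be unpacked this way, as member paths are taken from the archive itself.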
The Tuesday meeting with Marc and Fraser was primarily to discuss the Thesaurus of Old English (TOE). There is an online version of this resource which is hosted at Glasgow, but it really needs to be redeveloped along the lines of the main HT website and we discussed how we might proceed with this. I would very much like to get a reworked TOE website up and available as soon as possible to complement the HT website and Marc is of the same opinion. As there is a big Anglo-Saxon conference being held in Glasgow in August (http://www.isas2015.com/) Marc would really like the new TOE to be available for this, alongside the Old English metaphor map which I will be working on in June. We agreed that Marc and Fraser would work on the underlying data and will try to get it to me in the next week or so and I will then adapt the scripts I’ve already created for the HT to work with this data. Structurally the data from each thesaurus are very similar so it shouldn’t be too tricky a task.
One item that has been sitting on my ‘to do’ list for a long time is to redevelop the map interface of the SCOTS corpus website. This was an aspect of the site that I didn’t update significantly when I revamped the SCOTS website previously but I always intended to update it. I’ve updated the map to use the current version of the Google Maps API (version 3). The old map used version 2, which Google no longer supports. Google still allow access to version 2 (they actually migrate calls to version 2 to version 3 at their end), but this facility could be switched off at any time so it’s important that we move to version 3. I updated the map so that it displays a map rather than satellite images – I decided that being able to see placenames and locations would be more useful than seeing the geography. I’ve also removed options to switch from map to satellite and to view street view as these don’t really seem necessary.
I’ve styled the map to make it different from a standard Google map. The map is coloured so that water is the same colour as the website header and land masses are grey. I’ve also set it so that road markings, stations and businesses are not labelled to avoid clutter. I’ve also added a ‘key’ to the info-box on the right by default so people can tell what the icons mean. This gets replaced by document details when a ‘show details’ link is pressed. I was originally intending to replace the icons used on the map with new ones but I think on the grey map the icons still look pretty good. The new version of the map can be found here: http://www.scottishcorpus.ac.uk/advanced-search/
I also updated some favicons (the little icons used in browser tabs) used by a few sites this week. I’d noticed that the Mapping Metaphor icon I had created looked really blocky and horrible on my iPad and realised that higher resolution favicons were required for this site, plus SCOTS and CMSW. I found a website that can create such icons rather nicely (http://www.xiconeditor.com/) and created some new and considerably less pixelated favicons. Much better!
I also spent a bit of time on DSL duties, updating the front end of the ‘dev’ version so that it worked nicely with Peter’s newly released Boolean search functionality. It is now possible to use Boolean keywords AND, OR and NOT, but if these words were found at the beginning or the end of a search string they resulted in an HTTP error being returned. I’ve now added in a check that strips out such words. I also made another couple of tweaks to the search results browser. Once these updates have been approved by Ann I will update the ‘live’ site.
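The check for Boolean keywords at the start or end of a search string can be sketched along these lines. This is a Python illustration of the idea rather than the code used on the DSL site: the function name is made up, and it assumes the keywords are only treated as operators when written in upper case.

```python
BOOLEAN_KEYWORDS = {"AND", "OR", "NOT"}


def strip_dangling_operators(query):
    """Remove Boolean keywords from the start and end of a search string."""
    words = query.split()
    # Drop operators at the front, repeating in case several are stacked up.
    while words and words[0] in BOOLEAN_KEYWORDS:
        words.pop(0)
    # Do the same at the end of the string.
    while words and words[-1] in BOOLEAN_KEYWORDS:
        words.pop()
    return " ".join(words)


print(strip_dangling_operators("AND cat OR dog NOT"))
```

Operators in the middle of the string are left alone, so a search such as ‘cat OR dog’ passes through unchanged. A string made up only of operators comes back empty, which the calling code would need to handle.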
The remainder of the week was mostly spent with Essentials of Old English (EOE). I’ve been meaning to update this ageing and slightly broken website (see http://www.arts.gla.ac.uk/stella/OE/HomePage.html) for some time but other work commitments have taken priority. As I’m awaiting the Hansard data for SAMUELS it seemed like a good opportunity to make a start, plus I think it would be great to have this resource available before ISAS in August too. The old website uses Java applets for the exercises, which is a bit of a pain as most modern browsers recognise Java applets as major security risks these days and refuse to run them without a lot of customisation. It took an hour or so just to get my browser to open the exercises, and even then I’m having trouble getting some of the ‘Plus’ exercises to display. However, I came across the uncompiled Java source files in a directory on the STELLA server so these should be some help.
I’m creating an ‘app’ version of EOE that will sit alongside the three other STELLA apps I’ve previously created, so visually this new app fits in with the previous ones. So far I’ve managed to complete the ‘Basic’ book, the glossary and the ‘about’ pages, leaving the ‘Plus’ book and all of the exercises still to do. You can view a work in progress version here: http://www.arts.gla.ac.uk/STELLA/apps/eoe/ (Note that this URL may cease to function at any time).
I hope to be able to find the time to continue with this app next week, although I have a few meetings and other commitments that might limit how much I can do.
A brief report this week as I’m off for my Easter hols soon and I don’t have much time to write. I will be off all of next week. It was a four-day week this week as Friday is Good Friday. Last week was rather hectic with project launches and the like but this week was thankfully a little calmer. I spent some time helping Chris out with an old site that urgently needed fixing and I spent about a day on AHRC duties, which I can’t go into here. Other than that I helped Jane with the data management plan for her ESRC bid, which was submitted this week. I also had a meeting with Gavin Miller and Jenny Eklöf to discuss potential collaboration tools for medical humanities people. This was a really interesting meeting and we had a great discussion about the various possible technical solutions for the project they are hoping to put together. I also spoke to Fraser about the Hansard data for SAMUELS but there wasn’t enough time to work through it this week. We are going to get stuck into it after Easter.