Week Beginning 20th April 2015

I had a few meetings to attend this week, the first of which was with the Burns people. It was good to meet up with the project again as it has been a while. The meeting was to discuss some of the technical implications of the next phase of the project, which is focussed on Burns’ songs for George Thomson. The team have a variety of different sub-projects / online exhibitions and other assorted material that will need to be presented through the project website (http://burnsc21.glasgow.ac.uk/) and we explored some of the technical possibilities for these. There will be another meeting to discuss some more general technical matters (e.g. getting the timeline published, updating the home page) some time in early May.

My next meeting was with Lesley Richmond from Special Collections and some people from HATII to discuss a possible project for visualising medical history. It sounds like an interesting project with lots of relevance to the School of Critical Studies, including the Medical Humanities people and also the semantic tagger and text processing people – plus I’m always interested in visualisations. However, the major sticking point is the timescale, as the funders want the successful team to begin work in June. I’m rather too busy to be the main technical person on another project at the moment, which is a shame as it sounds like the kind of project I would have liked to be involved with. HATII have agreed to look into the possibility of finding a technical person to do the work, but I would hope that Critical Studies (and I) would still be able to contribute to the project in some capacity.

My third meeting was with David Borthwick from the School of Interdisciplinary Studies at the Dumfries campus. He wants to develop a digital resource relating to poetry and metaphor, and Wendy Anderson told him I might be able to give some technical advice. We had a really interesting meeting and I managed to give him a number of useful pointers about what technical direction he might want to take and some of the issues he may wish to bear in mind. I won’t be able to be involved in his project any further as he is outside the School of Critical Studies, but it was good to provide some help at this stage.

Also this week I attended an event about digital preservation. My old boss Seamus Ross was discussing digital preservation issues past and present with William Kilbride, head of the DPC. It was enlightening and entertaining to hear them both discuss the issues, and I wish I could have stayed for the full event, but I had to leave after the first hour to collect my son from nursery. Still, I’m glad I managed to sit in for as long as I did.

In terms of actual development work, I spent a bit more time working with the Hansard data for the SAMUELS project this week. Last week I had managed to unzip the massive archive but had run into problems when attempting to ‘de-tar’ it: on my PC I just kept getting an error message stating that the archive was unreadable. I had left the data with Marc to see if he could extract the files on his Mac, and when I checked back with him this week he had managed to successfully ‘de-tar’ the file! Marc suspects that the issue was caused by long filenames or file extensions in the tar file that Windows was unable to cope with. But whatever the reason, I managed to get access to the actual files at last. Well, when I say the actual files, what I mean is files that have all been joined together into one file using a method created by one of the guys at Lancaster. I still needed to split these up into individual XML files using the Ruby Gem that Lancaster had created for this purpose.
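For anyone curious about the general approach, here’s a rough sketch of how such a splitter might work in Ruby. I should stress this is not the actual Lancaster Gem – I’m inventing a header format (‘==> filename <==’ before each member file) purely for illustration:

```ruby
# Hypothetical sketch only: the real Gem defines its own join format.
# Here I assume each member file in the joined archive is preceded by a
# header line like "==> S5LV0001P0.xml <==" giving its filename.
require 'fileutils'

out = nil
File.foreach('lords.mr') do |line|
  if line.chomp =~ /\A==> (.+) <==\z/     # invented member header
    out.close if out                      # finish the previous file
    path = Regexp.last_match(1)
    FileUtils.mkdir_p(File.dirname(path)) # create any sub-directories
    out = File.open(path, 'w')
  elsif out
    out.write(line)                       # body line of the current file
  end
end
out.close if out
```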

And here is where I ran into further problems. The Gem script just kept giving me errors (‘Invalid argument @ rb_sysopen’, Errno::EINVAL) when I tried to run it on the files that Marc had extracted from the tar file. I tried a variety of approaches to check whether the error was caused by the path to the files or by the files being on an external hard drive, but I still got the same error. The strange thing was that I hadn’t got this error when I ran the Gem script on the data I had before the project meeting. After a bit of Googling it looked like the problem might be that the tar file had been extracted in OSX and I was trying to process it in Windows. So I took the hard drive home and plugged it into my MacBook.

After installing the Gem I ran it and… success! It worked without any errors. So there must be some reason why extracting the tar file in OSX results in the file being unreadable by the Gem script under Windows. A bit more Googling about the error code suggests it’s to do with the different ways OSX and Windows handle line endings (Windows uses ‘\r\n’ while OSX uses a plain ‘\n’).
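To illustrate the difference, here’s a small Ruby sketch that rewrites a file from Windows-style ‘\r\n’ endings to Unix-style ‘\n’ ones (the filenames are just placeholders – I haven’t needed to run this on the real data):

```ruby
# Stream the file line by line (it's far too big to read into memory)
# and replace any trailing "\r\n" with a bare "\n".
File.open('joined-unix.mr', 'wb') do |out|
  File.open('joined.mr', 'rb') do |inp|   # binary mode: no newline translation
    inp.each_line { |line| out.write(line.sub(/\r\n\z/, "\n")) }
  end
end
```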

I started with the smallest file (lords.mr, which is just over 10Gb) and it took about 6 hours to extract the data. All good so far, so I started extracting the next file (lords.tagged.v3.1.mr at 36Gb). But unfortunately I ran out of space on the external hard drive. This really perplexed me for a long time, as based on the sizes returned by OSX’s ‘get info’ there should still have been 250Gb of free space on it! What I’ve since discovered is that there is a massive structural overhead on the data, due to it consisting of hundreds of thousands of small files and directories: each file, however tiny, takes up at least one whole allocation block on the disk, so the wasted space mounts up very quickly. The extraction of lords.tagged.v3.1.mr got to a directory size of 12.7Gb before the script quit due to running out of space, but the amount of space this directory actually takes up on the disk (as opposed to the size of the actual data) is a whopping 75.9Gb! It’s quite astounding, really. Suddenly I understood why Lancaster had created a script to join all of the data together in one file! Fraser got in touch to say that he has ordered a new 2Tb external hard drive for the project and I will continue with the extraction once I have access to this drive.
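A rough Ruby sketch along these lines (the directory name is a placeholder) shows how to measure the gap between the logical size of the data and the space actually allocated on disk:

```ruby
# Walk a directory tree and total up the logical file sizes versus the
# blocks actually allocated on disk (File::Stat#blocks is in 512-byte
# units and is unavailable on Windows, hence the nil guard).
require 'find'

logical = allocated = 0
Find.find('lords') do |path|
  next unless File.file?(path)
  st = File.stat(path)
  logical   += st.size
  allocated += (st.blocks || 0) * 512
end

printf("logical: %.1fGb, on disk: %.1fGb\n", logical / 1e9, allocated / 1e9)
```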

I spent the rest of the week continuing with the Essentials of Old English app. Picking up where I left off last week, I completed work on the ‘plus’ book and began work on the exercises. I adapted the structure of the exercises from the Grammar app that I had previously completed and created an infrastructure whereby the question and answer data is stored in a JSON file and is pulled into the exercise page, with an exercise ID defining which set of data is used. This approach works quite well, although custom handling needs to be written for the individual structure of each exercise. Thankfully a lot of the exercises follow the same pattern, so only a handful of different structures needed to be written. In order to allow users to enter the Old English characters ‘æ’ and ‘þ’ when typing in answers to questions, I decided to bypass the keyboard that a user would normally use (e.g. the on-screen keyboard of a mobile phone) and use my own custom keyboard instead. This was based on the custom keyboard I had previously created for the ARIES app, but with new keys for the required Old English characters. I think it works quite well, and it also ensures that a device’s spell checker doesn’t try to correct any Old English spelling. An example of the keyboard (and the exercises in general) can be found here: http://www.arts.gla.ac.uk/STELLA/apps/eoe/basic-exercise.html?5
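I won’t reproduce the app’s actual data files here, but to give a flavour of the approach, each exercise’s JSON entry looks something along these lines (the field names and questions are invented for illustration), with the ID in the query string – the ‘?5’ in the URL above – selecting which dataset the page loads:

```json
{
  "exerciseID": 5,
  "title": "Basic exercise 5",
  "questions": [
    { "q": "Give the nominative plural of 'stān'", "a": "stānas" },
    { "q": "Give the nominative plural of 'scip'", "a": "scipu" }
  ]
}
```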

I managed to complete the ‘Basic’ exercises and began work on the ‘Plus’ exercises. There are rather a lot of ‘Plus’ exercises, though, so I’m not sure whether I’ll get them all completed next week. We’ll see.