Week Beginning 15th June 2015

I spent a fair amount of time this week working on AHRC duties, conducting reviews and also finishing off the materials I’d been preparing for a workshop on technical plans. This involved writing a sample ‘bad’ plan (or at least a plan with quite a few issues) and then writing comments on each section stating what was wrong with it. It has been enjoyable to prepare these materials. I’ve been meaning to write a list of “dos and don’ts” for technical plans for some time and it was a good opportunity to get all this information out of my head and written down somewhere. It’s likely that a version of these materials will also be published on the Digital Curation Centre website at some point, and it’s good to know that the information will have a life beyond the workshop.

I continued to wrestle with the Hansard data this week after the problems I encountered with the frequency data last week. Rather than running the ‘mrarchive’ script that Lancaster had written in order to split a file into millions of tiny XML files, I decided to write my own script that would load each line of the archived file, extract the data and upload it directly to a database instead. Steve Wattam at Lancaster emailed me some instructions and an example shell script that splits the archive files and I set to work adapting this. Each line of the archive file (in this case a 10GB file containing the frequency data) consists of two parts, each of which is Base64 encoded. The first part is the filename and the second part is the file contents. All I needed to do for each line was split the two parts and decode each part. I would then have the filename, which includes information such as the year, month and day, plus all of the frequency data for the speech the file refers to. The frequency data consists of a semantic category ID and a count, one per line and separated by a tab, so it would be easy to split this information up and then upload each count for each category for each speech into a database table.
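Purely as an illustration of the parsing step (this isn’t my actual script, and the example filename, function name and row layout are made up), the frequency contents could be turned into rows for a database insert along these lines in Python:

```python
# Illustrative sketch only (not the actual script): parse the decoded contents
# of one frequency file into rows ready for a database insert. The example
# filename and the row layout are assumptions for illustration.

def parse_frequency_file(filename, contents):
    """Return a list of (filename, category_id, count) tuples.

    `contents` is the decoded text of one frequency file: one
    'categoryID<TAB>count' pair per line, as described above.
    """
    rows = []
    for line in contents.splitlines():
        if not line.strip():
            continue  # skip any blank lines
        category_id, count = line.split("\t")
        rows.append((filename, category_id, int(count)))
    return rows


# Example with made-up data:
sample = "S1.1.1\t4\nA5.2\t1\n"
print(parse_frequency_file("1864-02-15-speech123.txt", sample))
```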

It took a little bit of time to get the script running successfully due to some confusion over how the two Base64 encoded parts of each line were separated. In his email, Steve had said that the parts were split by ‘whitespace’, which I took to mean a space character. Unfortunately there didn’t appear to be a space character present, but looking at the encoded lines I could see that each section appeared to be split with an equals sign, so I set my script going using this. I also contacted Steve to check this was right and it turned out that by ‘whitespace’ he’d meant a tab character and that the equals sign I was using to split the data was a padding character that couldn’t be relied upon to always be present. After hearing this I updated my script and set it off again. However, my script is unfortunately not going to be a suitable way to extract the data, as its execution is just too slow for the amount of data we’re dealing with. Having started the process on Wednesday evening it took until Sunday before the script had processed the data for one year. During this time it had extracted more than 7.5 million frequencies relating to tens of thousands of speeches, but at the current rate it will take more than two years to finish processing the 200 or so years of data that we have. A more efficient method is going to be required.
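For reference, the corrected splitting logic amounts to something like the following sketch (again, not my actual script): split each line on whitespace rather than on the equals sign, since ‘=’ is only Base64 padding and may be absent, then decode both halves.

```python
import base64

# Illustrative sketch only (not the actual script): decode one line of the
# archive file. Each line is base64(filename) <whitespace> base64(contents);
# splitting on '=' is unreliable because '=' is just Base64 padding and may
# not be present at all.

def decode_archive_line(line):
    """Return (filename, contents) decoded from one archive line."""
    # str.split() with no argument splits on any run of whitespace, so it
    # handles the tab separator (and would also cope with a space).
    encoded_name, encoded_contents = line.split()
    filename = base64.b64decode(encoded_name).decode("utf-8")
    contents = base64.b64decode(encoded_contents).decode("utf-8")
    return filename, contents


# Example with made-up data:
name_part = base64.b64encode(b"1864-02-15-speech123.txt").decode()
contents_part = base64.b64encode(b"S1.1.1\t4\nA5.2\t1\n").decode()
print(decode_archive_line(name_part + "\t" + contents_part))
```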

Following on from my meeting with Scott Spurlock last week I spent a bit of time researching crowdsourcing tools. I managed to identify three open source tools that might be suitable for Scott’s project (and potentially other projects in future).

The first is one called PyBossa: http://pybossa.com/. It’s written in the Python programming language, which I’m not massively familiar with but have used a bit. The website links through to some crowdsourcing projects that have been created using the tool and one of them is quite similar to what Scott wants to do. The example project gets people to translate badly printed German text into English, an example of which can be found here: http://crowdsourced.micropasts.org/app/NFPA-SetleyNews2/task/40476. Apparently you can create a project for free via a web interface here: http://crowdcrafting.org/, but I haven’t investigated this.

The second is a tool called Hive that was written by the New York Times and has been released for anyone to use: https://github.com/nytlabs/hive, with an article about it here: http://blog.nytlabs.com/2014/12/09/hive-open-source-crowdsourcing-framework/. This is written in ‘Go’, which I have to say I’d never heard of before, so I have no experience of it. The system is used to power a project to crowdsource historical adverts in the NYT, which you can access here: http://madison.nytimes.com/contribute/. It deals with categorising content rather than transcribing it, though. I haven’t found any other examples of projects that use the tool as yet.

The third option is the Zooniverse system, which does appear to be available for download: https://github.com/zooniverse/Scribe. It’s written in Ruby, which I only have a passing knowledge of. I haven’t been able to find any examples of other projects using this software and I’m also not quite sure how the Scribe tool (which says it’s “a framework for crowdsourcing the transcription of text-based documents, particularly documents that are not well suited for Optical Character Recognition”) fits in with other Zooniverse tools, for example Panoptes (https://github.com/zooniverse/Panoptes), which says it’s “The new Zooniverse API for supporting user-created projects.” It could be difficult to get everything set up, but it is probably worth investigating further.

I spent a small amount of time this week dealing with App queries from other parts of the University, and I also communicated briefly with Jane Stuart-Smith about a University data centre. I made a few further tweaks to the SciFiMedHums website for Gavin Miller and talked with Megan Coyer about her upcoming project, which is now due to commence in August, if recruitment goes to plan.

I spent most of what remained of the week on Mapping Metaphor duties. Ellen had sent through the text for the website (which is now likely to go live at the end of the month!) and I made the necessary additions and changes. My last task of the week was to begin processing the additional data that Susan’s students had compiled for the Scots Thesaurus project. I’ve so far managed to process two of these files and there are another few still to go, which I’ll get done on Monday.