I was struck down by a rather unpleasant, feverish throat infection this week. I managed to struggle through Wednesday, even though I should really have been in bed, but then was off sick on Thursday and Friday. It was very frustrating as I am really quite horribly busy at the moment with so many projects on the go and so many people needing advice, and I had to postpone three meetings I’d arranged for Thursday. But it can’t be helped.
I had a couple of meetings this week, one with Carole Hough to help her out with her Cogtop.org site. Whilst I was away on holiday a few weeks ago there were some problems with a multilingual plugin that we use on this site to provide content in English and Danish, and the plugin had to be deactivated in order for content to be added to the site. I met with Carole to discuss who should be responsible for updating the content of the site and what should be done about the multilingual feature. It turns out Carole will be updating the content herself, so I gave her a quick tutorial on managing a WordPress site. I also replaced the multilingual plugin with a newer version that works very well. This plugin is called qTranslate X: https://wordpress.org/plugins/qtranslate-x/ and I would definitely recommend it.
My other meeting was with Gavin Miller, and we discussed the requirements for his bibliography of text relating to Medical Humanities and Science Fiction. I’m going to be creating a little WordPress plugin that he can use to populate the bibliography. We talked through the sorts of data that will need to be managed and Gavin is going to write a document listing the fields and some examples and we’ll take it from there.
I had hoped to be able to continue with the Hansard visualisation stuff on Wednesday this week but I just wasn’t feeling well enough to tackle it. My data extraction script had at least managed to extract frequency data for two whole years of the Commons by Wednesday, though. This may not seem like a lot of data when we have over 200 years to deal with, but it will be enough to test out how the Bookworm system will work with the data. Once I have this test data working and I’m sure that the structure I’ve extracted the data into can be used with Bookworm we can then think about using Cloud or Grid computing to extract chunks of the data in parallel. If we don’t take this approach it will take another two years to complete the extraction of the data!
Instead of working with Hansard, I spent most of Wednesday working with the Thesaurus of Old English data that Fraser had given to me earlier in the week. I’ll be overhauling the old ‘TOE’ website and database and Fraser has been working to get the data into a consistent format. He gave me the data as a spreadsheet and I spent some time on Wednesday creating the necessary database structure for the data and writing scripts that would be able to process and upload the data. I managed to get all of the data uploaded into the new online database, consisting of almost 22,500 categories and 51,500 lexemes. I still need to do some work on the data, specifically fixing length symbols, which currently appear in the data as underscores after the letter (e.g. eorþri_ce) when what is needed is the modern UTF8 character (e.g. eorþrīce). I also need to create the search terms for variant forms in the data, which could prove to be a little tricky.
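The length-symbol fix is essentially a character substitution pass. My processing scripts are PHP, but the idea can be sketched in Python like this; the set of vowels that carry length marks is an assumption about the TOE data, not something confirmed by it:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: replace an underscore length marker that
# follows a vowel with the macron (UTF-8) form of that vowel,
# e.g. eorþri_ce -> eorþrīce. The vowel-to-macron mapping below
# is an assumption about which characters take length marks.

MACRONS = {
    'a': 'ā', 'e': 'ē', 'i': 'ī', 'o': 'ō',
    'u': 'ū', 'y': 'ȳ', 'æ': 'ǣ',
}

def fix_length_marks(form):
    """Convert 'vowel + underscore' sequences to macron vowels."""
    out = []
    for ch in form:
        if ch == '_' and out and out[-1] in MACRONS:
            out[-1] = MACRONS[out[-1]]  # replace vowel with macron form
        else:
            out.append(ch)
    return ''.join(out)

print(fix_length_marks('eorþri_ce'))  # eorþrīce
```

An underscore that doesn’t follow a known vowel is left alone, which seems the safer behaviour when sweeping a whole dataset.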
Other tasks I carried out this week included completing the upload of all of the student created data for the Scots Thesaurus project, investigating the creation of the Google Play account for the STELLA apps and updating a lot of the ancillary content for the Mapping Metaphor website ahead of next week’s launch, a task which took a fair amount of time.
I spent a fair amount of time this week working on AHRC duties, conducting reviews and also finishing off the materials I’d been preparing for a workshop on technical plans. This involved writing a sample ‘bad’ plan (or at least a plan with quite a few issues) and then writing comments on each section stating what was wrong with it. It has been enjoyable to prepare these materials. I’ve been meaning to write a list of “dos and don’ts” for technical plans for some time and it was a good opportunity to get all this information out of my head and written down somewhere. It’s likely that a version of these materials will also be published on the Digital Curation Centre website at some point, and it’s good to know that the information will have a life beyond the workshop.
I continued to wrestle with the Hansard data this week after the problems I encountered with the frequency data last week. Rather than running the ‘mrarchive’ script that Lancaster had written in order to split a file into millions of tiny XML files I decided to write my own script that would load each line of the archived file, extract the data and upload it directly to a database instead. Steve Wattam at Lancaster emailed me some instructions and an example shell script that splits the archive files and I set to work adapting this. Each line of the archive file (in this case a 10Gb file containing the frequency data) consists of two parts, each of which is Base64 encoded. The first part is the filename and the second part is the file contents. All I needed to do for each line was split the two parts and decode each part. I would then have the filename, which includes information such as the year, month and day, plus all of the frequency data for the speech the file refers to. The frequency data consisted of a semantic category ID and a count, one per line and separated by a tab so it would be easy to split this information up and then upload each count for each category for each speech into a database table.
It took a little bit of time to get the script running successfully due to some confusion over how the two base64 encoded parts of each line were separated. In his email, Steve had said that the parts were split by ‘whitespace’, which I took to mean a space character. Unfortunately there didn’t appear to be a space character present but looking at the encoded lines I could see that each section appeared to be split with an equals sign so I set my script going using this. I also contacted Steve to check this was right and it turned out that by ‘whitespace’ he’d meant a tab character and that the equals sign I was using to split the data was a padding character that couldn’t be relied upon to always be present. After hearing this I managed to update my script and set it off again. However, my script is unfortunately not going to be a suitable way to extract the data as its execution is just too slow for the amount of data we’re dealing with. Having started the process on Wednesday evening it took until Sunday before the script had processed the data for one year. During this time it had extracted more than 7.5 million frequencies relating to tens of thousands of speeches, but at the current rate it will take more than two years to finish processing the data for the 200ish years of data that we have. A more efficient method is going to be required.
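For what it’s worth, the corrected per-line processing can be sketched as follows. My actual script is PHP; this is an illustrative Python version, and the filename and category IDs in the demo line are invented for the example. The key points from above are that the two base64 parts are separated by a tab (the ‘=’ characters are just base64 padding), and that the decoded contents hold one ‘category ID, tab, count’ pair per line:

```python
import base64

def parse_archive_line(line):
    """Split one archive line into (filename, [(category, count), ...])."""
    # Two base64-encoded parts, separated by a tab (NOT by '=',
    # which is only base64 padding and may be absent).
    name_b64, content_b64 = line.rstrip('\n').split('\t', 1)
    filename = base64.b64decode(name_b64).decode('utf-8')
    contents = base64.b64decode(content_b64).decode('utf-8')
    # Frequency data: one 'semantic category ID <tab> count' per line.
    freqs = []
    for row in contents.splitlines():
        if row.strip():
            cat, count = row.split('\t')
            freqs.append((cat, int(count)))
    return filename, freqs

# Demo with invented data (filename and category IDs are made up):
demo = base64.b64encode(b'S2.1810-05-03.xml').decode() + '\t' + \
       base64.b64encode(b'A1\t4\nB2.3\t1\n').decode()
print(parse_archive_line(demo))
```

Each decoded filename carries the year, month and day, so the tuple this returns maps straight onto a database row per category per speech.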
Following on from my meeting with Scott Spurlock last week I spent a bit of time researching crowdsourcing tools. I managed to identify three open source tools that might be suitable for Scott’s project (and potentially other projects in future).
First of all is one called PyBossa: http://pybossa.com/. It’s written in the Python programming language, which I’m not massively familiar with but have used a bit. The website links through to some crowdsourcing projects that have been created using the tool and one of them is quite similar to what Scott is wanting to do. The example project is getting people to translate badly printed German text into English, an example of which can be found here: http://crowdsourced.micropasts.org/app/NFPA-SetleyNews2/task/40476. Apparently you can create a project for free via a web interface here: http://crowdcrafting.org/ but I haven’t investigated this.
The second one is a tool called Hive that was written by the New York Times and has been released for anyone to use: https://github.com/nytlabs/hive with an article about it here: http://blog.nytlabs.com/2014/12/09/hive-open-source-crowdsourcing-framework/. This is written in ‘Go’ which I have to say I’d never heard of before so have no experience of. The system is used to power a project to crowdsource historical adverts in the NYT, and you can access this here: http://madison.nytimes.com/contribute/. It deals with categorising content rather than transcribing it, though. I haven’t found any other examples of projects that use the tool as of yet.
The third option is the Zooniverse system, which does appear to be available for download: https://github.com/zooniverse/Scribe. It’s written in Ruby, which I only have a passing knowledge of. I haven’t been able to find any examples of other projects using this software and I’m also not quite sure how the Scribe tool (which says it’s “a framework for crowdsourcing the transcription of text-based documents, particularly documents that are not well suited for Optical Character Recognition”) fits in with other Zooniverse tools, for example Panoptes (https://github.com/zooniverse/Panoptes), which says it’s “The new Zooniverse API for supporting user-created projects.” It could be difficult to get everything set up, but is probably worth investigating further.
I spent a small amount of time this week dealing with App queries from other parts of the University, and I also communicated briefly with Jane Stuart-Smith about a University data centre. I made a few further tweaks for the SciFiMedHums website for Gavin Miller and talked with Megan Coyer about her upcoming project, which is now due to commence in August, if recruitment goes to plan.
What remained of the week after all of the above I mostly spent on Mapping Metaphor duties. Ellen had sent through the text for the website (which is now likely to go live at the end of the month!) and I made the necessary additions and changes. My last task of the week was to begin to process the additional data that Susan’s students had compiled for the Scots Thesaurus project. I’ve so far managed to process two of these files and there are another few still to go, which I’ll get done on Monday.
I returned to the office on Monday to check how the Hansard data extraction was going to discover that our 2TB external hard drive had been completely filled without the extraction completing. I managed to extract the files that have the semantic tags (commons.tagged.v3.1 and lord.tagged.v3.1) but due to the astonishing storage overheads the directory structures and tiny XML files have, the 2TB external drive just wasn’t big enough to hold the full-text files as well. The commons full-text file (commons.mr) is 36.55Gb but when the extraction quit due to using up all available space this file had already taken up 736Gb. Rather strangely, OSX’s ‘get info’ facility gives completely wrong directory size values. After taking about 15 minutes to check through the directories it reckoned the commons.mr extraction directory was taking up just 38.1Gb of disk space and the data itself was taking up just 25.78Gb. I had to run the ‘du’ command at the command line (du -hcs) to get the accurate figure of 736Gb. It makes me wonder what command ‘get info’ is using.
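Some rough arithmetic shows why the blow-up is so dramatic: each file occupies at least one whole allocation block on disk, so millions of tiny XML files cost far more than the bytes they contain. The 4 KiB block size and the average file size below are assumptions (only the 36.55Gb and 736Gb figures come from my measurements); the invented average just happens to be consistent with the observed ratio:

```python
# Hedged back-of-envelope arithmetic for the tiny-file overhead.
# Assumptions: HFS+-style 4 KiB allocation blocks and an invented
# average tiny-XML file size; the 36.55 -> 736 GB figures are the
# only measured numbers.

BLOCK = 4096                       # assumed allocation block size, bytes
avg_file_size = 200                # invented average tiny-XML size, bytes

# Minimum per-file overhead factor: every file uses a whole block.
overhead = BLOCK / avg_file_size   # 20.48x under these assumptions

# The blow-up actually observed on the external drive:
observed = 736 / 36.55             # roughly 20x

print(round(overhead, 2), round(observed, 1))
```

Under those assumptions the theoretical and observed factors line up at around 20x, which at least makes the 736Gb figure believable, however unwelcome it is.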
I had a couple of meetings with Fraser and some chats with Marc about the Hansard data and what it is that we need to do with it, and while we do need access to all of the files I’ve been extracting it turns out that what we really need for the Bookworm visualisations we’re hoping to put together (see a similar example for the US congress here: http://bookworm.culturomics.org/congress/) is the data about the frequency of occurrence for each thematic heading in each speech. This data wasn’t actually located in the files I had previously been extracting but was in a different tar.gz file that we had received from Steve previously. I set to work extracting the data from this file, only to find that the splitting tool kept quitting out during processing.
I had decided to extract the ‘thm’ file first, as this contained the frequencies for the thematic headings, but the commons file, commons.tagged.v3.1.mr.thm.fql.mr, which is 9.51Gb in size, quit out after only 210Mb had been processed when passed through the mrarchive script. I tried this twice and it quit out at the same point each time, having only extracted some of the days from ‘commons 2000-2005’. I then tried to extract the file in Windows rather than on my Mac but encountered the same problem. I tried the other files (HT and sem frequency lists) but ran into the same problem. We contacted Steve at Lancaster about this and he’s given me some helpful pointers about how I can create a script that will be able to process the data from the joined file rather than having to split the file up first and I’m going to try this approach next week.
Other than these rather frustrating Hansard matters I worked on a number of different projects this week. I spent some time on AHRC duties, undertaking more reviews plus writing materials for a workshop on writing Technical Plans that I’m creating in collaboration with colleagues in HATII. I also helped Gavin Miller with his ‘Sci-Fi and the Medical Humanities’ project, creating some graphics for the website, making banners and doing other visual things, which was good fun. I helped Vivien Williams of the Burns project with some issues she’s been having with managing images on the Burns site and I had some admin duties to perform with regards to the University’s Apple Developer account too.
I met with Craig Lamont, a PhD student who is working on a project with Murray Pittock. They are putting together an interactive historical map of Edinburgh with lots of points of interest on it and I helped Craig get this set up. We tried to get the map embedded in the University’s T4 system but unfortunately we didn’t have much success. We have since heard back from the University’s T4 people and it may be possible to embed such content using a different method. I’ll need to try this next time we meet. In the meantime I set up the map interface on Craig’s laptop and showed him how he could add new points to the map, so he will be able to add all the necessary content himself now. I also met with Scott Spurlock on Friday to discuss a project he is putting together involving Kirk records. I can’t really go into detail about this here but I’m going to be helping him to write the bid over the next few weeks.
For the Scots Thesaurus project I had a fair amount of data to format and upload. Magda had sent me some more data late last week before she went away so I added that to our databases. Two students had been working on the project over the past couple of weeks too and they had also produced some datasets which I uploaded. I met with the students and Susan on Thursday to discuss the data and to show them how it was being used in the system.
For the DSL we finally got round to going live with a number of updates, including Boolean searching and the improved search results navigation facilities. There was a scary moment on Thursday morning when the API was broken and it wasn’t possible to access any data, but Peter soon got this sorted and the new facilities are now available for all (see http://www.dsl.ac.uk/advanced-search/).
I was also involved with a few Mapping Metaphor duties this week. After working with the OE data that I had written a script to collate last week, Ellen sent me a version of the data that needed duplicate rows stripped out of it. I passed this file through a script I’d written and this reduced the number of rows from 5732 down to 2864. Ellen then realised that she needed consolidated metaphor codes too (i.e. the code noted for A>B, e.g. ‘metaphor strong’, doesn’t always correspond to the code that is recorded for B>A, e.g. ‘metaphor weak’) so I passed the data through a script that generated these codes too. All in all it’s been another busy week.
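The consolidation step can be sketched roughly as follows. My actual scripts are PHP, the field layout is simplified, and in particular the rule for choosing the consolidated code when A>B and B>A disagree (keeping the stronger of the two) is my invention for the example, not necessarily what the project specifies:

```python
# Hedged sketch: one row per unordered category pair, with a single
# consolidated code. The 'keep the stronger code' rule and the
# strength ranking below are assumptions for illustration only.

STRENGTH = {'metaphor weak': 0, 'metaphor strong': 1}  # assumed ranking

def consolidate(rows):
    """rows: (cat_a, cat_b, code) tuples; returns one row per pair."""
    pairs = {}
    for a, b, code in rows:
        key = tuple(sorted((a, b)))          # A>B and B>A share a key
        prev = pairs.get(key)
        if prev is None or STRENGTH[code] > STRENGTH[prev]:
            pairs[key] = code                # assumed rule: keep stronger
    return [(a, b, c) for (a, b), c in sorted(pairs.items())]

rows = [
    ('1A01', '2B03', 'metaphor strong'),
    ('2B03', '1A01', 'metaphor weak'),   # same pair, other direction
]
print(consolidate(rows))  # one consolidated row for the pair
```

Sorting each pair of category IDs gives the deduplication for free, which is why stripping duplicates and generating consolidated codes naturally end up in the same pass.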
I returned to work this week after spending most of the past two weeks off, which I spent swimming in the warm seas off Turkey and visiting the wonderful city of Stockholm. I did work on Wednesday last week, which I mostly spent catching up with emails, arranging meetings with people, making critical updates to the variety of WordPress installations I’m responsible for, dealing with some issues with the University’s Apple developer account and reading through some AHRC materials.
I have been really rather busy this week and seem to have worked on a large number of projects and proposals. I’ve had six meetings, which I’ll briefly summarise now. On Monday I attended the first meeting of the ‘Metaphor in the Curriculum’ project, the follow-on project for Mapping Metaphor. It was nice to meet up with members of the team again and hear about how the plans for the project have been progressing. We discussed some of the technical requirements for the project and how and when development will proceed. Some focus groups will take place over the next couple of weeks to gather feedback and requirements and I will probably start on the development work towards the end of the summer. We will have another meeting in July to finalise this. At the meeting Wendy mentioned that another batch of Mapping Metaphor data was ready to be uploaded, which Ellen sent to me after the meeting. I ran the file through my batch upload script and the online database now has 17,171 metaphor connections, down from 17,952 (due to the deletion of ‘noise’ and ‘relevant’), and 5,531 sample lexemes, up from 1,235. It’s looking pretty good, I think. Wendy is intending to launch the Mapping Metaphor website sometime in July, so we’re getting close now.
I had a further Metaphor related meeting with Ellen on Thursday to discuss the Old English metaphor data. We’d previously agreed that I would create an Old English version of the metaphor map in June, but there have been some delays getting the data together. The data was located in around 400 Excel spreadsheets and Ellen was wondering whether I could create a script that would automatically extract this data, pick out the columns that she needed and create one big file for her to work on. I spent some time on Thursday and Friday creating such a script, using a handy PHP library called PHPExcel (https://phpexcel.codeplex.com/) to automatically read the spreadsheet files and extract the content. This worked pretty well, although it did take a while for the script to run through all 399 files, and for some reason it silently failed on file number 282, which took some investigation. I think I’ve got a version of the data that Ellen will be able to use now, containing 32,421 rows of data.
On Monday I had an impromptu meeting with Susan and Magda about the Scots Thesaurus. Magda had spent some time working with the tool that I’d created that connects the Historical Thesaurus of English with the Dictionary of the Scots Language to allow for searching between the sites and the creation of category records for the Scots Thesaurus. She’d come up with a few suggestions for improvements so we discussed these and I made a bit of a ‘to do’ list. I also went through the WordPress plugin I’d created with Susan and Magda and showed them how it might be used. They seemed pretty pleased with the way things were working out, although there is still a lot of technical work left to do. Later in the week Magda sent me a CSV file containing a lot more of the content she had created, and I uploaded this to the database for the Tool and the WordPress plugin. I really need to get around to amalgamating these databases and joining the functionality of the tool with that of WordPress. I aim to get this done (together with the updates Magda has requested) in the next couple of weeks.
On Tuesday I had a meeting with Jennifer Smith to discuss a possible dialect resource for high-school children. Jennifer wondered whether creating an app might work out and we discussed some possibilities. As she is wanting one of her students to work on developing the resource we agreed that for the time being the best approach would be for the student to just create the resource using the form capabilities of Google Docs and we’ll think about how this content can then be reshaped in future. We also briefly discussed her big AHRC project. This is due to start in August and I will be involved quite closely with this once it kicks off. It should be an interesting project.
I also met with Carolyn Jess-Cooke on Tuesday to discuss her ideas for a project. I can’t really go into too much detail here but it will probably take the form of an app. We spent some time discussing the possibilities and I wrote a brief outline of how the technical portion of the project may proceed to help with the bid.
I had a further meeting on Friday with Mary Gibson and her PhD student about a project they are hoping to put together, which will probably involve geographical and temporal data. It seems like quite an exciting project, but I can’t really go into any detail here. I will probably be contributing to the technical side of this bid some time towards the end of the summer.
Other than all of the above, I spent several hours this week on AHRC duties and I will have to spend several more over the next few weeks too. I also helped Gavin Miller out with a website he is setting up for a project that I previously advised him on and which recently received funding from the Wellcome Trust. I also heard from Gerry Carruthers that a project he and Catriona Macdonald put together and which I gave technical advice on has received funding from the Carnegie Trust, which is excellent news. I also spoke to Megan Coyer and George Pattison about projects they are working on that will need my input. Christian had also gone through the Essentials of Old English app and had made a list of things for me to change and although I didn’t have time to update these this week I will try to get this done soon as it would be good to launch the app.
I spent some time on Friday making some updates to the development version of the DSL website, including adding in new text for the advanced search and fixing the highlighting of search terms in the entry page when using Boolean terms. No words were being highlighted when Boolean terms such as ‘AND’ were part of the search string but I’ve figured out how to get around this and the terms now get highlighted in their fetching purple.
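The fix boils down to stripping the Boolean operators out of the query before building the highlight patterns. The DSL site isn’t written in Python, so this is just an illustrative sketch; the operator list and the ‘hl’ class name are assumptions, and it ignores the edge case where a search term also appears inside the inserted markup:

```python
import re

# Illustrative sketch of Boolean-aware term highlighting: drop the
# operators from the query, then wrap the remaining terms in spans.
# The operator set and CSS class name are assumptions.

BOOLEAN_OPS = {'AND', 'OR', 'NOT'}

def highlight(entry_text, query):
    """Wrap each non-operator query term in a highlight span."""
    terms = [t for t in re.findall(r'\w+', query) if t not in BOOLEAN_OPS]
    for term in terms:
        entry_text = re.sub(
            r'\b(' + re.escape(term) + r')\b',
            r'<span class="hl">\1</span>',
            entry_text, flags=re.IGNORECASE)
    return entry_text

print(highlight('Burn the water', 'burn AND water'))
```

With this approach a query such as ‘burn AND water’ highlights ‘burn’ and ‘water’ in the entry while leaving ‘AND’ out of the pattern entirely, which is why no words were matching before: the operator was being treated as a literal term.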
Throughout the course of the week I have also been working with the Hansard data for the Samuels project. You may recall from earlier reports that I’d managed to get the script from Lancaster that splits the data up to work using my MacBook but the data was just too massive for the storage capabilities I had at my disposal. Before I went off on my holidays Fraser had given me a 2TB external hard drive, which should be just about big enough for the data. I set about extracting the data on Monday, and it’s a very long process indeed. My poor little laptop is still pegging away at it, having been running constantly day and night for almost 5 days now. I was hoping that the process would have completed by Friday but it’s going to continue into next week. Hopefully the data that is being extracted is going to be usable and complete. I am going to have to ask the Lancaster people to compare the file size and counts of our data with theirs as rather strangely the extracted data appears to be smaller than the joined data, which doesn’t seem right to me. Phew, that’s all for this week.