I returned to the office on Monday to check how the Hansard data extraction was going, only to discover that our 2TB external hard drive had been completely filled before the extraction could finish. I managed to extract the files that have the semantic tags (commons.tagged.v3.1 and lord.tagged.v3.1), but due to the astonishing storage overheads that the directory structures and tiny XML files incur, the 2TB external drive just wasn’t big enough to hold the full-text files as well. The Commons full-text file (commons.mr) is 36.55GB, but by the time the extraction quit after using up all available space this file had already taken up 736GB. Rather strangely, OS X’s ‘Get Info’ facility gives completely wrong directory size values. After taking about 15 minutes to check through the directories it reckoned the commons.mr extraction directory was taking up just 38.1GB of disk space and the data itself just 25.78GB. I had to run the ‘du’ command at the command line (du -hcs) to get the accurate figure of 736GB. It makes me wonder what ‘Get Info’ is actually measuring.
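I don’t know for certain what ‘Get Info’ sums, but the gap between the size of the data and the space it occupies comes down to block allocation: every file, however tiny, takes up at least one whole filesystem block, so millions of small XML files balloon on disk. The two measurements can be sketched in Python (directory_sizes is my own illustrative helper, not part of any tool mentioned above):

```python
import os

def directory_sizes(root):
    """Walk a directory tree and total both the apparent size of the
    files (the figure 'Get Info' seems closer to) and the space actually
    allocated on disk (what 'du' reports). st_blocks is always counted
    in 512-byte units, regardless of the filesystem's real block size."""
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that vanish mid-walk
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated
```

With a 4KB block size, a directory of one-kilobyte XML files would show roughly a fourfold gap between the two figures, which is the kind of overhead that filled the drive.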
I had a couple of meetings with Fraser and some chats with Marc about the Hansard data and what it is that we need to do with it. While we do need access to all of the files I’ve been extracting, it turns out that what we really need for the bookworm visualisations we’re hoping to put together (see a similar example for the US Congress here: http://bookworm.culturomics.org/congress/) is the data about the frequency of occurrence of each thematic heading in each speech. This data wasn’t located in the files I had previously been extracting but in a different tar.gz file that we had received from Steve earlier. I set to work extracting the data from this file, only to find that the splitting tool kept quitting out during processing.
I had decided to extract the ‘thm’ file first, as this contained the frequencies for the thematic headings, but the Commons file (commons.tagged.v3.1.mr.thm.fql.mr, 9.51GB in size) quit out after only 210MB had been processed when passed through the mrarchive script. I tried this twice and it quit at the same point each time, having only extracted some of the days from ‘commons 2000-2005’. I then tried to extract the file in Windows rather than on my Mac but encountered the same problem, and the other files (the HT and sem frequency lists) failed in the same way. We contacted Steve at Lancaster about this and he’s given me some helpful pointers on creating a script that can process the data from the joined file directly, rather than having to split the file up first. I’m going to try this approach next week.
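I haven’t seen Steve’s suggested approach in detail yet, and the internal structure of the .mr archive may well be something else entirely (length-prefixed records, say), but the general idea of processing a joined file without splitting it first can be sketched like this, assuming for illustration that records are separated by some delimiter byte:

```python
def stream_records(path, delimiter=b"\n", chunk_size=1 << 20):
    """Yield records one at a time from a very large file, reading it
    in fixed-size chunks so nothing needs to be split out to disk and
    memory use stays flat. The delimiter is a placeholder: the real
    .mr format may structure its records quite differently."""
    buffer = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            # everything before the last delimiter is complete records;
            # the tail is carried over into the next chunk
            *records, buffer = buffer.split(delimiter)
            for record in records:
                yield record
    if buffer:
        yield buffer
```

Processing the 9.51GB file this way sidesteps both the splitting tool and the disk-space problem, since no per-day files are ever written out.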
Other than these rather frustrating Hansard matters I worked on a number of different projects this week. I spent some time on AHRC duties, undertaking more reviews plus writing materials for a workshop on writing Technical Plans that I’m creating in collaboration with colleagues in HATII. I also helped Gavin Miller with his ‘Sci-Fi and the Medical Humanities’ project, creating some graphics for the website, making banners and doing other visual things, which was good fun. I helped Vivien Williams of the Burns project with some issues she’s been having with managing images on the Burns site, and I had some admin duties to perform with regard to the University’s Apple Developer account too.
I met with Craig Lamont, a PhD student who is working on a project with Murray Pittock. They are putting together an interactive historical map of Edinburgh with lots of points of interest on it and I helped Craig get this set up. We tried to get the map embedded in the University’s T4 system but unfortunately we didn’t have much success. We have since heard back from the University’s T4 people and it may be possible to embed such content using a different method. I’ll need to try this next time we meet. In the meantime I set up the map interface on Craig’s laptop and showed him how he could add new points to the map, so he will be able to add all the necessary content himself now. I also met with Scott Spurlock on Friday to discuss a project he is putting together involving Kirk records. I can’t really go into detail about this here but I’m going to be helping him to write the bid over the next few weeks.
For the Scots Thesaurus project I had a fair amount of data to format and upload. Magda had sent me some more data late last week before she went away so I added that to our databases. Two students had been working on the project over the past couple of weeks too and they had also produced some datasets which I uploaded. I met with the students and Susan on Thursday to discuss the data and to show them how it was being used in the system.
For the DSL we finally got round to going live with a number of updates, including Boolean searching and the improved search results navigation facilities. There was a scary moment on Thursday morning when the API was broken and it wasn’t possible to access any data, but Peter soon got this sorted and the new facilities are now available for all (see http://www.dsl.ac.uk/advanced-search/).
I was also involved with a few Mapping Metaphor duties this week. After working with the OE data that I had written a script to collate last week, Ellen sent me a version of the data that needed duplicate rows stripped out of it. I passed this file through a script I’d written, which reduced the number of rows from 5732 down to 2864. Ellen then realised that she also needed consolidated metaphor codes (the code recorded for A>B, e.g. ‘metaphor strong’, doesn’t always correspond to the code recorded for B>A, e.g. ‘metaphor weak’), so I passed the data through a script that generated these codes too. All in all it’s been another busy week.
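For anyone curious, the rough shape of that processing looks like the sketch below. The column layout (category A, category B, strength code) is a guess at the real spreadsheet’s structure, not the actual script: the row count halving from 5732 to 2864 suggests each connection appears once in each direction, so we keep the first row seen for each unordered category pair and record the codes from both directions.

```python
import csv

def deduplicate_and_consolidate(in_path, out_path):
    """Strip duplicated metaphor connections and attach a consolidated
    pair of codes to each. Assumes (hypothetically) three columns per
    row: category A, category B, strength code. Because the code noted
    for A>B ('metaphor strong') may differ from the one for B>A
    ('metaphor weak'), both directions' codes are kept in the output."""
    seen = {}
    order = []
    with open(in_path, newline="") as fh:
        for cat_a, cat_b, code in csv.reader(fh):
            key = frozenset((cat_a, cat_b))
            if key not in seen:
                seen[key] = [cat_a, cat_b, code, code]
                order.append(key)
            else:
                seen[key][3] = code  # code from the reverse direction
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["cat_a", "cat_b", "code_ab", "code_ba"])
        writer.writerows(seen[key] for key in order)
```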