It’s been another busy week, but I have to keep this report brief as I’m running short of time and I’m off next Monday. I came into work on Monday to find that the script I had left executing on the Grid to extract all of the Hansard data had finished working successfully! It left me with a nice pile of text files containing SQL insert statements – about 10Gb of them. As we don’t currently have a server on which to store the data I instead started a script executing that runs each SQL insert command on my desktop PC and puts the data into a local MySQL database. Unfortunately it looks like it’s going to take a horribly long time to process the data. I’m putting the estimate at about 229 days.
My arithmetic skills are sometimes rather flaky so here’s how I’m working out the estimate. My script is performing about 2000 inserts a minute. There are about 1200 output files and based on the ones I’ve looked at they contain about 550,000 lines each. 550,000 x 1200 = 660,000,000 lines in total. This figure divided by 2000 gives the number of minutes it would take (330,000). Divide this by 60 gives the number of hours (5,500). Divide this by 24 gives the number of days (229). My previous estimate for doing all of the processing and uploading on my desktop PC was more than 2 years, so using the Grid has speeded things up enormously, but we’re going to need something more than my desktop PC to get all of the data into a usable form any time soon. Until we get a server for the database there’s not much more I can do.
On Tuesday this week we had a REELS team meeting where we discussed some of the outstanding issues relating to the structure of the database (amongst other things). This was very useful and I think we all now have a clear idea of how the database will be structured and what it will be able to do. After the meeting I wrote up and distributed an updated version of my database specification document and I also worked with some map images to create a more pleasing interface for the project website (it’s not live yet though, so no URL). Later in the week I also created the first version of the database for the project, based on the specification document I’d written. Things are progressing rather nicely at this stage.
I spent a bit of time fixing some issues that had cropped up with other projects. The Medical Humanities Network people wanted a feature of the site tweaked a little bit, so I did this. I also fixed an issue with the lexeme upload facility of the Scots Corpus, which was running into some maximum form size limits. I had a funeral to attend on Thursday afternoon so I was away from work for that.