On Friday last week I submitted a job to ScotGrid that would extract all of the data from the Hansard dataset that was supplied by Lancaster. I had to submit this job because I’d noticed that the structure of the metadata had changed midway through the data, which had messed up my extraction script. I submitted the 1252 files and left them running over the weekend and by Monday morning they had all completed, giving me a set of 1252 SQL files. None of the error checks I’d added into my extraction script last week had tripped so hopefully the metadata structure doesn’t have any other surprises waiting. On Monday I started running batches of the SQL files into the MySQL database that I have for the data, but it’s going to take quite a while for these to process as I have to send them through to ScotGrid in small batches of around 20; otherwise the poor database receives too many connections and returns an error.
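The batching described above could be sketched in Python roughly as follows. This is an illustration only, not the actual submission script: the `batches` helper and the file names are hypothetical, and the batch size of 20 is simply the figure mentioned above.

```python
from typing import Iterator, List


def batches(items: List[str], size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical file names; the real output files may be named differently.
sql_files = [f"output_{n:04d}.sql" for n in range(1, 1253)]
groups = list(batches(sql_files, 20))

# Number of batches, and the size of the final (partial) batch.
print(len(groups), len(groups[-1]))
```

Each batch would then be submitted only once the previous one has finished, so the database never sees more than about 20 connections at a time.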
I spent most of the rest of the week working on the ‘Basics of English Metre’ app and made some good progress with it. I have now completed Unit 2 and have made a start on Unit 3. I did get rather bogged down in Unit 2 for a while as several of the exercises looked like regular exercises that I had already developed code to process, only to have extra pesky questions added on the end that only appear when the final question on a page is correctly answered. These included selecting the foot type for a set of lines (e.g. Iambic pentameter) or identifying a poem based on its metre. However, I managed to find a solution to all of these quirks and added in some new question styles. I’m currently on page 2 of Unit 3, which consists of four questions that each have four stages. The first is syllable boundary identification, the second is metre analysis, the third is putting in the foot boundaries while the fourth is adding the rhythm. I’ve got all of this working, although I have only supplied data for the fourth stage for the last of the lines on the page. Also there are some more of the pesky additional questions that need to be integrated and rather strangely the existing website doesn’t supply answers for the fourth stage, so I’m going to need to get someone in English Language to supply these.
Other than the above I helped Carolyn Jess-Cooke from English Literature to add a forum to her ‘writing mental health’ website. I also had an email conversation with Rhona Brown about the digitised images and OCR possibilities for her ‘Edinburgh Gazetteer’ project that is starting soon. I had a chat with Graeme Cannon about an on-screen keyboard I had developed for the Essentials of Old English app, as he is going to need a similar feature for one of his projects. I also spoke with Flora about the dreaded H27 error with the OE data for Mapping Metaphor. A solution to this is still eluding her, but I’m afraid I wasn’t able to offer much advice as I don’t know much about the Access database and the forms that were created for the data. I might see if I can extract the data and do something with it using PHP if she hasn’t found a solution soon. I also spoke to Rob Maslen about a new blog he’s wanting to set up for students of his Fantasy course next year and talked to Scott Spurlock about a possible crowdsourcing tool for a project he is putting together.
I am going to be on holiday for the next two weeks so there won’t be a further update until after I’m back.
I spent about a day this week working with the Hansard data again. By Friday morning the frequencies database contained 358,408,449 rows, with just under half of the data processed. However, I’m going to have to go back to square one again as I’ve noticed an inconsistency with the data. I had split the base64 encoded data from Lancaster up into about 1200 separate files and I noticed on Friday that up until about midway through the 49th file the metadata has the following structure:
But then after that the structure changes as follows:
That extra /commons/ in there messed up the part of my script that split this information up and led to the loss of the actual filename from my processed data. It meant that I had to re-run everything through the grid, wipe the database and re-run the insertion jobs.
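To illustrate why an extra path segment can break a positional split, here is a minimal Python sketch. The paths used are made up for the example and are not the real metadata format; `filename_from_metadata` is a hypothetical helper. The idea is simply that taking the last segment of a path is unaffected by extra directories appearing earlier in it.

```python
import posixpath


def filename_from_metadata(path: str) -> str:
    """Return the final segment of a slash-separated path.

    Counting from the end means an extra directory earlier in the
    path does not shift the position of the filename.
    """
    name = posixpath.basename(path.rstrip("/"))
    if not name:
        raise ValueError(f"no filename found in {path!r}")
    return name


# Made-up examples of a shorter and a longer path structure.
short_path = "root/1850/speech.xml"
long_path = "root/commons/1850/speech.xml"

# A fixed index picks different things once the structure changes...
print(short_path.split("/")[2], long_path.split("/")[2])
# ...whereas the last segment is the filename in both cases.
print(filename_from_metadata(short_path), filename_from_metadata(long_path))
```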
I returned to my original shell script that extracted the Base64 data and reworked it to add in some checks for the structure of the data. I also added in some error checking to ensure that if (for example) the ‘year’ field doesn’t contain a number an error is raised. I also took the opportunity to update the SQL statements that were generated, firstly to add in the all-important semi-colon delimiting character that I had missed out first time around and secondly to make the insert statements standard SQL rather than the MySQL specific syntax that I’ve tended to use in the past. The standard way is ‘insert into table(column1, column2) values(‘value1’, ‘value2’);’ while MySQL also allows ‘insert into table set column1 = ‘value1’, column2 = ‘value2’’. Having updated and tested out the script I then submitted a new batch of jobs to ScotGrid, and the output files seemed to work well with both possible metadata structures. I submitted all of the 1200-odd files to run over the weekend.
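The two changes described above (checking the ‘year’ field and emitting standard, semi-colon terminated insert statements) could look something like the following in Python. This is a sketch under assumptions rather than the actual shell script: the table name, the column names and both helper functions are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Quote a string the standard SQL way, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_insert(table: str, row: dict) -> str:
    """Return one standard-SQL INSERT statement, terminated by a semicolon."""
    year = str(row.get("year", ""))
    # Error check: refuse to emit SQL if the year is not a plain number.
    if not (year.isascii() and year.isdigit()):
        raise ValueError(f"year is not a number: {year!r}")
    columns = ", ".join(row)
    values = ", ".join(sql_literal(str(v)) for v in row.values())
    return f"INSERT INTO {table} ({columns}) VALUES ({values});"


# Hypothetical table and columns, for illustration only.
print(build_insert("frequencies", {"year": "1850", "word": "o'clock"}))
```

Note that MySQL, in its default mode, also treats a backslash inside a string as an escape character, so real data containing backslashes would need extra handling that this sketch leaves out.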
In addition to the above work I did a few other tasks. I met with Jane Stuart Smith to discuss a couple of upcoming projects she’s putting together, plus I gave her some further input into the project I advised her on last week. I also upgraded the WordPress installations for a number of sites that I’ve set up over the years as Chris had pointed out that they were running older versions of the software. I was also supposed to meet Flora on Friday to discuss the issue relating to the H27 categories for the Old English data for Mapping Metaphor, but unfortunately Flora was ill and we weren’t able to meet. Hopefully we can fit this in next week.
Last week I started to redevelop the old STELLA resource ‘The Basics of English Metre’ and I spent much of this week continuing with it. The resource is split into three sections, each of which features a variety of interactive exercises throughout. Last week I made a start on the first exercise, and this week I made quite a bit of progress with the content, completing the first 12 out of 13 pages of the first section. As with the previous STELLA resources I redeveloped, I’ve been using the jQueryMobile framework to handle the interface and jQuery itself to handle the logic of the exercises. The contents of each page are stored in a JSON file, with the relevant content pulled in and processed when a page loads. The first exercise I completed required the user to note the syllable boundaries within words. I was thankfully able to reuse a lot of the code from the ARIES app for this. The second exercise type required the user to choose whether the syllables in a word were strongly or weakly stressed. For this I repurposed the ‘part of speech’ selector exercise type I had created for the Essentials of English Grammar app. The third type of exercise was a multi-stage exercise requiring syllable identification for stage 1 and then stress identification for stage 2. Rather than just copying the existing code from the other apps I also refined it as I know a lot more about the workings of jQueryMobile than I did when I put these other apps together. For example, with the ‘part of speech’ selector the different parts of speech were listed in a popup that opened when the user pressed on a dotted box. After a part of speech was selected it then appeared in the dotted box and the popup closed. However, I had previously set things up so that a separate popup was generated for each of the dotted boxes, which is hugely inefficient as the content of each popup is identical.
With the new app there is only one popup and the ID of the dotted box is passed to it when the user presses on it. This is a much better approach. As most of the remaining interactive exercises are variations on the exercises I’ve already tackled I’m hoping that I’ll be able to make fairly rapid progress with the rest of the resource.
Other than working on the ‘Metre’ resource I communicated with Jane Stuart Smith, who is currently putting a new proposal together. I’m not going to be massively involved in it, but will contribute a little so I read through all of the materials and gave some feedback. Bill Kretzschmar is also working on a new proposal that I will have a small role in, so I had an email chat with him about this too. I also completed a second version of the Technical Plan for the proposal Murray Pittock is currently writing. The initial version required some quite major revisions due to changes in how a lot of the materials will be handled, but I think we are now getting close to a final version.
I also spent a little bit of time working with some of the Burns materials for the new section and gave a little bit of advice to a colleague who was looking into incorporating historical maps into a project. I also fixed a couple of bugs with the SciFiMedHums ‘suggest a new bibliographical item’ page and then made it live for Gavin Miller. You can suggest a new item here: http://scifimedhums.glasgow.ac.uk/suggest-new-item/ (but you’ll need to register first). Finally, I continued processing the Hansard data using the ScotGrid infrastructure. By the end of the week 200 of the 1200 SQL files had been processed, resulting in more than 150,000,000 rows. I’ll just keep these scripts running over the next few weeks until all of the data is done. I’m not sure that an SQL database is going to be quick enough to actually process this amount of data, however. I did a simple ‘count rows’ query and it took over two minutes to return an answer, which is a little worrying. It’s possible that after all of the data is inserted I may then have to look for another solution. But we’ll see.
Monday was a bank holiday so this was a four-day week for me. I had yet more AHRC review duties to perform this week so quite a lot of Tuesday was devoted to that. I also had an email conversation with Murray Pittock about the technical plan for a proposal he is currently putting together. We’re still trying to decide on the role of OCR in the project, but I think a bit of progress is being made. I also spent some further time helping to sort out the materials for the new section of the Burns website. This mainly consisted of sorting out a series of video clips of song performances, uploading them to YouTube and embedding them in the appropriate pages. For Mapping Metaphor, Wendy had sent me on some new teaching materials that she wanted me to add to the ‘Metaphoric’ resource (http://mappingmetaphor.arts.gla.ac.uk/metaphoric/teaching-materials.html). This involved uploading the files, adding them to the zip files and updating the browse by type and topic facilities to incorporate the resources.
My two big tasks of the week were working with the Hansard data and starting the redevelopment of another old STELLA resource. As mentioned in previous posts, Gareth Roy of Physics and Astronomy has kindly set up a database server for me where the Hansard data can reside as we’re processing it. Last week I added all of the ancillary tables to the database (e.g. information about speakers) and I ‘fixed’ the SQL files so that MySQL could process them at the command line and I wrote a very simple shell script that takes the path to an SQL file as an argument and then invokes the MySQL command to import that file into the specified database. I tested this out on the first output file, running the shell script on the test server I have in my office and it successfully inserted all 581,409 rows contained in the file into the database. It did take quite a long time for the script to execute, though. About an hour and 20 minutes, in fact. With that individual test successfully completed I wrote another shell script that would submit a batch of jobs to the Grid. It took a little bit of trial and error to get this script to work successfully within the Grid, mainly due to needing to specify the full path to the MySQL binary from the nodes. Thankfully I got the script working in the end and set it to work on the first batch of 9 files (files 2-10 as I had already processed file 1). It took about four hours for the nodes to finish processing the files, which means we’re looking at (very roughly) 30 minutes per file and there are about 1200 files so it might take about 25 days to process them all. That is unless processing more than 9 jobs at a time is faster, but I suspect speed might be limited to some extent at the database server end. I submitted a further 20 jobs before I left the office for the weekend so we’ll just need to see how quickly these are processed.
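The rough estimate above can be checked with a few lines of Python. The figures are the ones quoted in the post (9 files in about four hours, roughly 1200 files in total); `estimate_days` is just a hypothetical helper name, and the calculation assumes throughput stays constant.

```python
def estimate_days(total_files: int, minutes_per_file: float) -> float:
    """Total processing time in days at a constant rate per file."""
    return total_files * minutes_per_file / 60 / 24


# Observed rate: 9 files took about 4 hours.
observed_minutes_per_file = 4 * 60 / 9
print(round(observed_minutes_per_file, 1))

# Using the rounded figure of 30 minutes per file.
print(estimate_days(1200, 30))

# Using the unrounded observed rate instead.
print(round(estimate_days(1200, observed_minutes_per_file), 1))
```

The observed rate is a little under 27 minutes per file, so rounding up to 30 makes the 25-day figure slightly pessimistic; the unrounded rate gives a little over 22 days.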
The STELLA resource I decided to look into redeveloping is ‘The Basics of English Metre’. The old version can be found here: http://www.arts.gla.ac.uk/stella/Metre/MetreHome.html. It’s a rather out-of-date website that only works properly in Internet Explorer. The exercises contained in it also require the Flash plugin in order to function. Despite these shortcomings the actual content (if you can access it) is still very useful, which I think makes it a good candidate for redevelopment. As with the previous STELLA resources I’ve redeveloped, I intend to make web and app versions of the resource. I spent some time this week going through the old resource, figuring out how the exercises function and how the site is structured. After that I began to work on a new version, setting up the basic structure (which in common with the other STELLA resources I’ve redeveloped will use the jQueryMobile framework). By the end of the week I had decided on a structure for the new site (i.e. which of the old pages should be kept as separate pages and which should be merged) and had created a site index (this is already better than the old resource which only featured ‘next’ and ‘previous’ links between pages with no indication of what any of the pages contained or any way to jump to a specific page). I also made a start processing the content, but I only got as far as working on the first exercise. Some of the later exercises are quite complicated and although I will be able to base a lot of the exercise code on the previous resources I had created it is still going to require quite a bit of customisation to get things working for the sorts of questions these exercises ask. I hope to be able to continue with this next week, although I would also like to publish Android versions of the two STELLA apps that are currently only available for iOS (English Grammar: An Introduction and ARIES) so I might focus on this first.
I participated in the University and College Union’s strike action this week so I didn’t work on Wednesday and Thursday. I spent the rest of the week on lots of relatively small bits of work. I had some more AHRC review duties to carry out so I spent some time on that. I also spent a bit of time with the Hansard data and the new database that I have access to on a server in Physics and Astronomy. I migrated all of my existing tables, including data about speeches, members, constituencies and things like that to the new database. I didn’t copy my two-year sample of the frequency data across, but I did copy the structure of the table over, so it is now ready to accept the full 200-year data once I get things going. I looked into the possibility of using Python to run through the SQL files that had previously been generated via the Grid, as Python is available on the Grid nodes. However, in order to use Python with MySQL a new library needs to be installed and I wasn’t sure whether this would be possible on the Grid. I reached the conclusion that writing another shell script to run the MySQL command to process each SQL file probably made more sense than pulling each SQL file into Python and then processing it line by line. The only problem is that MySQL expects each of the insert statements within the SQL file to be terminated by a semi-colon, which is standard SQL syntax. Unfortunately I was so used to processing SQL commands based on line breaks using PHP that I omitted the semi-colon from the shell script that generated the SQL files. So I had over 80GB of SQL files that MySQL itself was unable to process. I could have fixed the shell script and then re-run it on the Grid again, but I must admit I felt a little foolish for having missed off the semi-colon and decided to just fix the files on my own PC instead. I used Notepad++’s ‘find and replace in files’ feature to replace each line break with a semi-colon followed by a line break.
It took a few days for Notepad++ to complete the task, but that was fine as I just left it running in the background whilst I got on with other things. I now have a set of SQL files that the MySQL commands will be able to read. The next step will be to write a shell script that connects to the MySQL server and runs a single SQL file. It should hopefully not be too tricky to create such a script.
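For comparison, the same ‘semi-colon before every line break’ fix could be scripted. The following Python sketch is not what was actually used (that was Notepad++); it assumes one insert statement per line and that no inserted value itself contains a line break. Both function names are hypothetical.

```python
def terminate_line(line: str) -> str:
    """Return the line with a trailing ';' added if it lacks one.

    Blank lines are passed through without a semicolon.
    """
    statement = line.rstrip()
    if statement and not statement.endswith(";"):
        statement += ";"
    return statement + "\n"


def fix_file(src_path: str, dst_path: str) -> None:
    """Stream a large SQL file line by line, writing a terminated copy."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(terminate_line(line))
```

Reading one line at a time keeps memory use small however large the file is, which matters when the files add up to tens of gigabytes.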
I also spent a fair amount of time this week trying to investigate why the H27 categories had somehow been omitted from the Old English Mapping Metaphor data. I started looking into this last week but hadn’t found out what the cause was. This week I did some more investigation, both on my own and with the help of Wendy. We eventually figured out that although the H27 data had been present in the spreadsheet I had generated from the individual category spreadsheets last year, it didn’t appear in the Access database that Flora had created and which was being used by Carole and others to process the ‘Stage 5’ data for the metaphorical connections (i.e. directionality and sample lexemes). I had a long but useful meeting with Wendy, Flora and Carole on Friday where we went through all of this, trying to work out where the omission had occurred. It looks very much like the H27 categories were always processed independently of the main OE dataset, which is why they never appeared in the Access table. We agreed that Flora would update the Access form that Carole was using in order to incorporate the H27 categories, which will hopefully allow these categories to be processed without having any impact on the other categories. Flora is going to work on this next week, all being well.
During the rest of the week I wrote up my notes from the SCOSYA meeting last week and continued to help Vivien out with some updates to the new section of the Burns website she is putting together. I also gave some advice to Luca Guariento, who has this week taken over the management of the Curious Travellers website.