I spent about a day this week working with the Hansard data again. By Friday morning the frequencies database contained 358,408,449 rows, with just under half of the data processed. However, I’m going to have to go back to square one, as I’ve noticed an inconsistency in the data. I had split the Base64-encoded data from Lancaster into about 1,200 separate files, and I noticed on Friday that up until about midway through the 49th file the metadata has the following structure:
But then after that the structure changes as follows:
That extra /commons/ in there messed up the part of my file that split this information up and led to the loss of the actual filename from my processed data. It meant that I had to re-run everything through the grid, wipe the database and re-run the insertion jobs.
I returned to my original shell script that extracted the Base64 data and reworked it to add in some checks for the structure of the data. I also added in some error checking so that if (for example) the ‘year’ field doesn’t contain a number, an error is raised. I also took the opportunity to update the SQL statements that were generated, firstly to add in the all-important semi-colon delimiting character that I had missed out first time around, and secondly to make the insert statements standard SQL rather than the MySQL-specific syntax that I’ve tended to use in the past. The standard way is ‘insert into table(column1, column2) values('value1', 'value2');’ while MySQL also allows ‘insert into table set column1 = 'value1', column2 = 'value2'’. Having updated and tested out the file, I then submitted a new batch of jobs to ScotGrid, and the output files seemed to work well with both possible metadata structures. I submitted all of the 1,200-odd files to run over the weekend.
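To give a rough idea of the kind of checks described above, here is a minimal sketch in Python rather than the shell script I actually used. Everything in it is an assumption made for illustration: the path layout (a year followed by optional folders and then the filename), the function names parse_metadata, sql_literal and build_insert, and the table and column names are all hypothetical and are not taken from the real Hansard metadata or my real script. The idea is simply to take the filename from the last part of the path so that an extra folder such as /commons/ can't knock it out, to raise an error if the year isn't a number, and to generate a standard SQL insert statement that ends in a semi-colon.

```python
def parse_metadata(path):
    """Split a metadata path into a year and a filename.

    Assumed (hypothetical) layout: '<year>/[optional folders/]<filename>'.
    The filename is always taken from the last part, so an extra folder
    in the middle does not affect it.
    """
    parts = [p for p in path.strip().split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"unexpected metadata structure: {path!r}")
    year, filename = parts[0], parts[-1]
    if not year.isdigit():
        raise ValueError(f"year is not a number: {year!r} in {path!r}")
    return {"year": int(year), "filename": filename}


def sql_literal(value):
    """Render a value as an SQL literal: integers bare, text single-quoted."""
    if isinstance(value, int):
        return str(value)
    # Double any single quotes so the string cannot end the literal early.
    return "'" + str(value).replace("'", "''") + "'"


def build_insert(table, row):
    """Build a standard SQL insert statement, ending with a semi-colon."""
    columns = ", ".join(row)
    values = ", ".join(sql_literal(v) for v in row.values())
    return f"INSERT INTO {table} ({columns}) VALUES ({values});"


if __name__ == "__main__":
    # Two made-up paths, one with and one without the extra folder.
    for path in ("1803/file01.xml", "1803/commons/file01.xml"):
        print(build_insert("frequencies", parse_metadata(path)))
```

With this approach both of the made-up paths give the same filename, and a path whose first part isn't numeric stops the job with an error rather than silently producing bad rows.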
In addition to the above work I did a few other tasks. I met with Jane Stuart Smith to discuss a couple of upcoming projects she’s putting together, plus I gave her some further input into the project I advised her on last week. I also upgraded the WordPress installations for a number of sites that I’ve set up over the years as Chris had pointed out that they were running older versions of the software. I was also supposed to meet Flora on Friday to discuss the issue relating to the H27 categories for the Old English data for Mapping Metaphor, but unfortunately Flora was ill and we weren’t able to meet. Hopefully we can fit this in next week.