This was the third week of the strike action and I therefore only worked on Friday. I started the day making a couple of further tweaks to the ‘Storymap’ for the RNSN project. I’d inadvertently uploaded the wrong version of the data just before I left work last week, which meant the embedded audio players weren’t displaying, so I fixed that. I also added a new element language to the REELS database and added the new logo to the SPADE project website (see http://spade.glasgow.ac.uk/).
With these small tasks out of the way I spent the rest of the day on Historical Thesaurus and Linguistic DNA duties. For the HT I had previously created a ‘fixed’ header that appears at the top of the page once you start scrolling down, so you can always see what it is you’re looking at and can quickly jump to other parts of the hierarchy. You can also click on a subcategory to select it, which adds the subcategory ID to the URL, allowing you to quickly bookmark or cite a specific subcategory. I made this live today, and you can test it out here: http://historicalthesaurus.arts.gla.ac.uk/category/#id=157035. I also fixed a layout bug that was making the quick search box appear in less than ideal places at certain screen widths, and I updated the display of the category and tree on narrow screens: the tree now appears beneath the category information, together with a ‘jump to hierarchy’ button. This, in combination with the ‘top’ button, makes navigation much easier on narrow screens.
I then started looking at the tagged EEBO data. This is a massive dataset (about 50GB of text files) containing each word in a subset of EEBO that has been semantically tagged. I need to extract frequency data from this dataset – i.e. how many times each tag appears, both in each text and overall. I have initially started to tackle this using PHP and MySQL, as these are the tools I know best. I’ll see how feasible such an approach is, and if it’s going to take too long to process the whole dataset I’ll investigate using parallel computing and shell scripts, as I did for the Hansard data. I managed to get a test script working that goes through one of the files in about a second, which is encouraging. I did encounter a bit of a problem processing the lines, though. Each line is tab delimited and, rather annoyingly, PHP’s fgetcsv function wasn’t treating ‘empty’ tabs as separate columns. This was giving me really weird results: if a row contained any empty tabs, the data I was expecting to find in particular columns wasn’t there. Instead I had to use the ‘explode’ function on each line, splitting it up by the tab character (\t), and this thankfully worked. I still need confirmation from Fraser that I’m extracting the right columns, as strangely there appear to be thematic heading codes in multiple columns. Once I have confirmation I’ll be able to set the script running on the whole dataset (once I’ve incorporated the queries for inserting the frequency data into the database I’ve created).
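To illustrate the approach, here’s a minimal sketch of the explode-based line processing and tag counting described above. The file name, the column index holding the tag code, and the function name are all my own placeholders for illustration – the actual column layout is exactly what still needs confirming.

```php
<?php

/**
 * Count how often each tag appears in a tab-delimited file.
 * $tagColumn is a hypothetical zero-based index of the column
 * holding the semantic tag code.
 */
function countTagFrequencies(string $path, int $tagColumn): array
{
    $frequencies = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (($line = fgets($handle)) !== false) {
        // explode() keeps empty fields between consecutive tabs,
        // so the column positions stay consistent even when some
        // fields on a row are blank.
        $columns = explode("\t", rtrim($line, "\r\n"));
        $tag = $columns[$tagColumn] ?? '';
        if ($tag !== '') {
            $frequencies[$tag] = ($frequencies[$tag] ?? 0) + 1;
        }
    }
    fclose($handle);
    return $frequencies;
}
```

Running this per file gives the per-text frequencies; summing the resulting arrays across all files would then give the overall counts before they’re inserted into the database.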