I continued to work with the Hansard dataset this week, working with Chris McGlashan to get the dataset onto a server. Once it was there I could access the data, but as there are more than 682 million rows of frequency data things were a little slow to query, especially as no indexes were included in the dump. As I don’t have command-line access to the server I needed to ask Chris to run the commands to create indexes, as each index takes several hours to build. He set one going that indexed the data by year, and after a few hours it had completed, resulting in an 11GB index file. With that in place I could retrieve the data for each year much more swiftly. I’ve let Marc know that this data is now available again, and I just need to wait to hear back from him to see exactly what he wants to do with the dataset.
I spent a fair amount of time this week advising staff on technical aspects of research proposals. It’s the time of year when the students are all away and staff have time to think about such proposals, meaning things get rather busy for me. I created a Data Management Plan for a follow-on project that Bryony Randall in English Literature is putting together. I also started to migrate a project website she had previously set up through WordPress.com onto an instance of WordPress hosted at Glasgow. Her site on WordPress.com was full of horribly intrusive adverts that did not give a good impression and really got in the way; moving to hosting at Glasgow will stop this and give the site a more official-looking URL. It will also ensure the site can continue to be hosted in future, as free commercial hosting is generally not very reliable. I hope to finish the migration next week. I also responded to a query about equipment from Joanna Kopaczyk, discussed a couple of timescale issues with Thomas Clancy and gave some advice to Karen Lury from TFTS about video formats and storage requirements. I also met with Clara Cohen to discuss her Data Management Plan.
Also this week I sorted out my travel arrangements for the DH2019 conference and updated the site layout slightly for the DSL website, and on Wednesday I attended the English Language and Linguistics Christmas lunch, which was lovely. I also continued with my work on the HT / OED category linking, ticking off another batch of matches, which takes us down to 1894 unmatched OED categories that have words and a part of speech.
I also spent about a day continuing to work on the Bilingual Thesaurus. Last week I’d updated the ‘category’ box on the search page to make it an ‘autocomplete’ box that lists matching categories as you type. However, I’d noticed that this was often not helpful as the same title is used for multiple categories (like the three ‘used in buildings’ categories mentioned in last week’s post). I therefore implemented a solution that I think works pretty well. When you type into the ‘category’ box the top two levels of the hierarchy to which a matching category belongs now appear in addition to the category name. If the category is more than two hierarchical levels down this is represented by an ellipsis. Listed categories are now ordered by their ID rather than alphabetically too, so categories in the same part of the tree appear together. So now, for example, if you type in ‘processes’ the list contains ‘Building > Processes’, ‘Building > Processes > … > Other processes’ etc. Hopefully this will make the search much easier to use. I also updated the search results page so the hierarchy is shown in the ‘you searched for’ box too, and I fixed a bug that was preventing the search results page from displaying results if you searched for a category, followed a link through to the category page and then pressed the ‘Back to search results’ button.
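The label-building logic amounts to keeping the top two ancestor levels and eliding anything deeper. A minimal sketch, assuming each category comes with a list of its ancestor headings ordered from the top of the hierarchy down (a hypothetical structure; the live code builds these labels from database queries):

```python
def category_label(name, ancestors):
    """Build an autocomplete label showing the top two hierarchy
    levels, with an ellipsis when the category sits deeper down."""
    if not ancestors:
        return name
    if len(ancestors) <= 2:
        return " > ".join(ancestors + [name])
    # Deeper than two levels: keep the top two and elide the rest
    return " > ".join(ancestors[:2] + ["\u2026", name])
```

So a top-level category appears unchanged, while a deeply nested one gets the ‘Building > Processes > … > Other processes’ treatment described above.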
Louise had noticed that there were two ‘processes’ categories within ‘Building’ so I amalgamated these. I also changed ‘Advanced Search’ back to plain old ‘Search’ again in all locations, and I created a new menu item and page for ‘How to use the thesaurus’.
As the Bilingual Thesaurus is almost ready to go live and it ‘hangs off’ the thesaurus.ac.uk domain I added some content to the homepage of the domain, as you can see in the screenshot below:
It currently just has boxes for the three thesauruses featuring a blurb and a link, with box colours taken from each site’s colour schemes. I did think about adding in the ‘sample category’ feature for each thesaurus here too, but as it might make the top row boxes rather long (if it’s a big category) I decided to keep things simple. I added the tagline ‘The home of academic thesauri’ (‘thesauruses’ seemed a bit clumsy here) just to give visitors a sense of what the site is. I’ll need some feedback from Marc and Fraser before this officially goes live.
Finally this week I spent some time working on some new song stories for the Romantic National Song Network. I managed to create about one and a half, which took several hours to do. I’ll hopefully manage to get the remaining half and maybe even a third one done next week.
As with previous weeks recently, I spent quite a bit of time this week on the HT / OED category linking issue. One of the big things was to look into using the search terms for matching. The HT lexemes have a number of variant forms hidden in the background for search purposes, such as alternative spellings, forms with bracketed text removed or included, and text either side of slashes split up into different terms. Marc wondered whether we could use these to try and match up lexemes with their OED counterparts, which would mean generating similar terms for the OED lexemes too. For the HT I can get variants with or without any bracketed text easily enough, but slashes are not going to be straightforward. The search terms for HT lexemes were generated using multiple passes through the data, which would be very slow to do on the fly when comparing the contents of every category. An option might be to use the existing search terms for the HT and generate a similar set for the OED, but as things stand the HT search terms contain rows that would be too broad for us to use for matching purposes. For example, ‘sway (the sceptre/sword)’ has ‘sword’ on its own as one of the search terms and we wouldn’t want to use this for matching.
Slashes in the HT are used to mean so many different things that it’s really hard to generate an accurate list of possible forms, and this is made even more tricky when brackets are added into the mix. Simple forms would be easy, e.g. for ‘Aimak/Aymag’ just split the form on the slash and treat the before and after parts as separate. This is also the case for some phrases too, e.g. ‘it is (a) wonder/wonder it is’. But then elsewhere the parts on either side of the slash are alternatives that should then be combined with the rest of the term after the word after the slash – e.g. ‘set/start the ball rolling’, or combined with the rest of the term before the word before the slash – e.g. ‘sway (the sceptre/sword)’, or combined with both the beginning and the end of the term while switching stuff out in the middle – e.g. ‘of a/the same suit’. In other places an ‘etc’ appears that shouldn’t be combined with any resulting form – e.g. ‘bear (rule/sway, etc.)’. Then there are a further group where the slash means there’s an alternative ending to the word before the slash – e.g. ‘connecter/-or’. But in other forms the bits after the slash should be added on rather than replacing the final letters – e.g. ‘radiogoniometric/-al’. Sometimes there are multiple slashes that might be treated in one or more of the above ways, e.g. ‘lie of/on/upon’. Then there are multiple slash groups in the same form, e.g. ‘throw/cast a stone/stones’.
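To give a flavour of why this is so messy, here is a sketch (a hypothetical helper, not the actual multi-pass scripts) that handles only the two easiest patterns above: whole-form alternatives like ‘Aimak/Aymag’ and ‘-’ ending replacements like ‘connecter/-or’. Note that even this simple rule gets ‘radiogoniometric/-al’ wrong, since there the suffix should be appended rather than replace the final letters:

```python
def simple_slash_variants(form):
    """Split a form on slashes into alternative terms.
    Only the simplest cases are handled: whole-word alternatives
    ('Aimak/Aymag') and '-' ending replacements ('connecter/-or').
    All the trickier phrase patterns would need their own rules."""
    if "/" not in form:
        return [form]
    parts = form.split("/")
    variants = [parts[0]]
    for part in parts[1:]:
        if part.startswith("-"):
            suffix = part[1:]
            base = variants[-1]
            # Swap out the ending of the previous variant, e.g.
            # 'connecter' + '-or' gives 'connector'
            variants.append(base[: len(base) - len(suffix)] + suffix)
        else:
            variants.append(part)
    return variants
```

Every one of the phrase patterns listed above ('set/start the ball rolling', 'of a/the same suit', the 'etc.' forms and so on) would break this, which is exactly why the original search terms needed multiple passes and still came out imperfect.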
It’s a horrible mess, and even after several passes to generate the search terms I don’t think we managed to generate all the legitimate ones, while we certainly did generate a lot of incorrect ones. The thinking at the time was that the weird forms didn’t matter, as no-one would search for them anyway and they’d never appear on the site. But we should be wary about using them for comparison, as the ‘sword’ example demonstrates.
Thankfully slashes are barely an issue in the OED data: there are only 16 OED lexemes that include a slash, and these are things like ‘AC/DC’, so I could generate some search terms for the OED data without too much risk of forms being incorrect. The HT data, however, is pretty horrible and is going to be an issue when it comes to matching lexemes too.
I met with Marc on Tuesday and we discussed the situation and agreed that we’d just use the existing search terms; I’d generate a similar set for the OED and we’d see how much use these might be. I didn’t have time to implement this during the week, but hopefully will do next week. Other HT tasks I tackled this week included adding a new column to lots of our matching scripts that lists the Levenshtein score between the HT and OED path and subcats. This will help us to spot categories that have moved around a lot. I also updated the sibling matching script so that categories with multiple potential matches are separated out into their own table.
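For reference, the Levenshtein score used in those new columns is just the standard edit distance between two strings (the number of single-character insertions, deletions and substitutions needed to turn one into the other); a minimal implementation, purely to show what the score measures:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings,
    keeping only the previous row of the DP table in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

A score of 0 means the HT and OED paths are identical, and the bigger the score the more a category has moved around.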
I then rearranged the advanced search form to make the choice of language more prominent (i.e. whether ‘Anglo Norman’, ‘Middle English’ or ‘Both’). I used the label ‘Headword Language’ as opposed to ‘Section’ as it seemed to be an accurate description and we needed some sort of label to attach the help icon to. Language choice is now handled by radio buttons rather than a drop-down list so it’s easier to see what the options are.
The thing that took the longest to implement was changing the way ‘category’ works in a search. Whereas before you entered some text and your search was then limited to any individual categories that featured this text in their headings, now as you start typing into the category box a list of matching categories appears, using the jQuery UI AutoComplete widget. You can then select a category from the list and your search is then limited to any categories from this point downwards in the hierarchy. Working out the code for grabbing all ‘descendant’ categories from a specified category took quite some time to do, as every branch of the tree from that point downwards needs to be traversed and its ID and child categories returned. E.g. if you start typing in ‘build’ and select ‘builder (n.)’ from the list and then limit your search to Anglo Norman headwords your results will display AN words from ‘builder (n.)’ and categories within this, such as ‘Plasterer/rough-caster’. Unfortunately I can’t really squeeze the full path into the list of categories that appears as you type into the category box, as that would be too much text, and it’s not possible to style the list using the AutoComplete plugin (e.g. to make the path information smaller than the category heading). This means some category headings are unclear due to a lack of context (e.g. there are 3 ‘Used in building’ categories that appear with nothing to differentiate them). However, the limit by category is a lot more useful now.
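The descendant-gathering logic amounts to walking every branch of the category tree from the selected node downwards. A sketch using a simple parent-to-children mapping (the real code queries the database at each level rather than holding the whole tree in memory):

```python
def descendant_ids(children, root):
    """Collect the ID of a category and of everything beneath it,
    traversing each branch of the tree from that point downwards."""
    ids = [root]
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            ids.append(child)
            stack.append(child)
    return ids
```

The search then simply limits its results to words whose category ID is in the returned list, which is how selecting ‘builder (n.)’ also pulls in categories such as ‘Plasterer/rough-caster’.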
On Wednesday I gave a talk about AHRC Data Management Plans at an ArtsLab workshop. This was basically a repeat of the session I was involved with a month or so ago, and it all went pretty smoothly. I also sent a couple of sample data management plans to Mary Donaldson of the University’s Research Data Management team, as she’d asked whether I had any I could let her see. It was rather a busy week for data management plans, as I also had to spend some time writing an updated plan for a place-names project for Thomas Clancy and gave feedback and suggested updates to a plan for an ESRC project that Clara Cohen is putting together. I also spoke to Bryony Randall about a further plan she needs me to write for a proposal she’s putting together, but I didn’t have time to work on that plan this week.
Also this week I met with Andrew from Scriptate, who I’d previously met to discuss transcription services using an approach similar to the synchronised audio / text facilities that the SCOTS Corpus offers. Andrew has since been working with students in Computing Science to develop some prototypes for this and a corpus of Shakespeare adaptations and he showed me some of the facilities they have been developing. It looks like they are making excellent progress with the functionality and the front-end and I’d say things are progressing very well.
I also had a further chat with Valentina Busin in MVLS about an app she’s wanting to put together and I spoke to Rhona Alcorn of SLD about the Scots School Dictionary app I’d created about four years ago. Rhona wanted to know a bit about the history of the app (the content originally came from the CD-ROM made in the 90s) and how it was put together. It looks like SLD are going to be creating a new version of the app in the near future, although I don’t know at this stage whether this will involve me.
I also spoke to Gavin Miller about a project I’m named on that recently got funded. I can’t say much more about it for now, but will be starting on this in January. I also started to arrange travel and things for the DH2019 conference I’ll be attending next year, and rounded off the week by looking at retrieving the semantically tagged Hansard dataset that Marc wants to be able to access for a paper he’s writing. Thankfully I managed to track down this data, inside a 13GB tar.gz file, which I have now extracted into a 67GB MySQL dataset. I just need to figure out where to stick this so we can query it.
I didn’t have any pressing deadlines for any particular projects this week so I took the opportunity to return to some tasks that had been sitting on my ‘to do’ list for a while. I made some further changes to the Edinburgh Gazetteer manuscript interface: Previously the width of the interface had a maximum value applied to it, meaning that on widescreen monitors the area available to pan and zoom around the newspaper image was much less wide than the screen width and there was lots of empty, wasted white space on either side. I’ve now changed this to remove the maximum width restriction, thus making the page much more usable.
I also continued to work with the Hansard data. Although the data entry processes have now completed it is still terribly slow to query the data, due to both the size of the dataset and the fact that I hadn’t yet added any indexes. I tried creating an index when I was working from home last week but the operation timed out before it completed. This week I tried from my office and managed to get a few indexes created. It took an awfully long time to generate each one, though – between 5 and 10 hours per index. However, with the indexes in place a query that can utilise one is much speedier. I created a little script on my test server that connects to the database, grabs the data for a specified year and outputs this as a CSV file, and the script only takes a couple of minutes to process. I’m hoping I’ll be able to get a working version of the visualisation interface for the data up and running, although this will have to be a proof of concept as it will likely still take several minutes for the data to process and display until we can get a heftier database server.
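The per-year extraction script boils down to ‘index on year, select, write CSV’. A toy version using SQLite in place of MySQL, with assumed column names (`year`, `lemma`, `frequency` are illustrative; the real table structure may differ):

```python
import csv
import io
import sqlite3

def export_year(conn, year):
    """Write all frequency rows for one year out as CSV text.
    The index on the year column is what makes this query fast."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["year", "lemma", "frequency"])
    cur = conn.execute(
        "SELECT year, lemma, frequency FROM frequencies WHERE year = ?",
        (year,))
    writer.writerows(cur.fetchall())
    return out.getvalue()

# A tiny stand-in database to demonstrate the shape of the query
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequencies (year INTEGER, lemma TEXT, frequency INTEGER)")
conn.execute("CREATE INDEX idx_year ON frequencies (year)")  # the crucial index
conn.executemany("INSERT INTO frequencies VALUES (?, ?, ?)",
                 [(1910, "parliament", 42), (1910, "debate", 17), (1911, "war", 99)])
```

Without the index, a query like this has to scan all 682 million rows; with it, the database can jump straight to the block of rows for the requested year.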
I had a task to perform for the Burns people this week – launching a new section of the website, which can be found here: http://burnsc21.glasgow.ac.uk/performing-burnss-songs-in-his-own-day/. This section includes performances of many songs, including both audio and video. I also spent a fair amount of time this week giving advice to staff. I helped Matt Barr out with a jQuery issue, I advised the MVLS people on some app development issues, I discussed a few server access issues with Chris McGlashan, I responded to an email from Adrian Chapman about a proposal he is hoping to put together, I gave some advice to fellow Arts developer Kirsty Bell who is having some issues with a website she is putting together, I spoke to Andrew Roach from History about web development effort and I spoke to Carolyn Jess-Cooke about a proposal she is putting together. Wendy also contacted me about an issue with the Mapping Metaphor Staff pages, but thankfully this turned out to be a small matter that I will fix at a later date. I also met separately with both Gary and Jennifer to discuss the Atlas interface for the SCOSYA project.
Also this week I returned to the ‘Basics of English Metre’ app that I started developing earlier in the year. I hadn’t had time to work on this since early June so it took quite a bit of time to get back up to speed with things, especially as I’d left off in the middle of a particularly tricky four-stage exercise. It took a little bit of time to think things through but I managed to get it all working and began dealing with the next exercise, which is unlike any previous exercise type I’ve dealt with as it requires an entire foot to be selected. I didn’t have the time to complete this exercise so to remind myself for when I next get a chance to work on this: Next I need to allow the user to click on a foot or feet to select it, which should highlight the foot. Clicking a second time should deselect it. Then I need to handle the checking of the answer and the ‘show answer’ option.
On Friday I was due to take part in a conference call about Jane’s big EPSRC proposal, but unfortunately my son was sick during Thursday night and then I caught whatever he had and had to be off work on Friday, both to look after my son and myself. This was not ideal, but thankfully it only lasted a day and I am going to meet with Jane next week to discuss the technical issues of her project.
It was a four-day week for me this week as I’d taken Friday off. I spent a fair amount of time this week continuing to work on the Atlas interface for the SCOSYA project, in preparation for Wednesday, when Gary was going to demo the Atlas to other project members at a meeting in York. I spent most of Monday and Tuesday working on the facilities to display multiple attributes through the Atlas. This has been quite a tricky task and has meant massively overhauling the API as well as the front end so as to allow for multiple attribute IDs and Boolean joining types to be processed.
In the ‘Attribute locations’ section of the ‘Atlas Display Options’ menu underneath the select box there is now an ‘Add another’ button. Pressing on this slides down a new select box and also options for how the previous select box should be ‘joined’ with the new one (either ‘and’, ‘or’ or ‘not’). Users can add as many attribute boxes as they want, and can also remove a box by pressing on the ‘Remove’ button underneath it. This smoothly slides up the box and removes it from the page using the always excellent jQuery library.
The Boolean operators (‘and’, ‘or’ and ‘not’) can be quite confusing to use in combination so we’ll have to make sure we explain how we are using them. E.g. ‘A AND B OR C’ could mean ‘(A AND B) OR C’ or ‘A AND (B OR C)’. These could give massively different results. The way I’ve set things up is to go through the attributes and operators sequentially. So for ‘A AND B OR C’ the API gets the dataset for A, checks this against the dataset for B and makes a new dataset containing only those locations that appear in both datasets. It then adds all of dataset C to this. So this is ‘(A AND B) OR C’. It is possible to do the ‘A AND (B OR C)’ search – you’d just have to rearrange the order so the select boxes are ‘B OR C AND A’.
Adding in ‘not’ works in the same sequential way, so if you do ‘A NOT B OR C’ this gets dataset A then removes from it those places found in dataset B, then adds all of the places found in dataset C. I would hope people would always put a ‘not’ as the last part of their search, but as the above example shows, they don’t have to. Multiple ‘nots’ are allowed too – e.g. ‘A NOT B NOT C’ will get the dataset for A, remove those places found in dataset B and then remove any further places found in dataset C.
Another thing to note is that the ‘limits’ are applied to the dataset for each attribute independently at the moment. E.g. in a search for ‘A AND B OR C’ with the limits set to ‘Present’ and age group ‘60+’, each of datasets A, B and C will have these limits applied BEFORE the Boolean operators are processed. So the ratings in dataset A will only contain those that are ‘Present’ and ‘60+’; these will then be reduced to only include those locations that are also in dataset B (which only includes ratings that are ‘Present’ and ‘60+’), and then all of the ratings for dataset C (again, only those that are ‘Present’ and ‘60+’) will be added to this.
If the limits weren’t imposed until after the Boolean processes had been applied then the results could possibly be different – especially the ‘present’ / ‘absent’ limits as there would be more ratings for these to be applied to.
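The sequential left-to-right evaluation described above can be sketched with plain location sets (the place names here are made up; the real API works on rating records with the limits already applied to each dataset):

```python
def combine(datasets, operators):
    """Fold location sets together strictly left to right, so
    'A AND B OR C' is evaluated as '(A AND B) OR C'."""
    result = set(datasets[0])
    for op, data in zip(operators, datasets[1:]):
        if op == "and":
            result &= data   # keep locations present in both
        elif op == "or":
            result |= data   # add all of the next dataset
        elif op == "not":
            result -= data   # remove the next dataset's locations
    return result
```

Because each operator is applied to the running result rather than to any bracketed grouping, rearranging the select boxes is the only way to change the effective precedence.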
I met with Gary a couple of times to discuss the above as these were quite significant additions to the Atlas. It will be good to hear the feedback he gets from the meeting this week and we can then refine the browse facilities accordingly.
I spent some further time this week on AHRC review duties and Scott Spurlock sent me a proposal document for me to review so I spent a bit of time doing so this week as well. I also spent a bit of time on Mapping Metaphor as Wendy had uncovered a problem with the Old English data. For some reason an empty category labelled ‘0’ was appearing on the Old English visualisations. After a bit of investigation it turned out this had been caused by a category that had been removed from the system (B71) still being present in the last batch of OE data that I uploaded last week. After a bit of discussion with Wendy and Carole I removed the connections that were linking to this non-existent category and all was fine again.
I met with Luca this week to discuss content management systems for transcription projects and I also had a chat this week with Gareth Roy about getting a copy of the Hansard frequencies database from him. As I mentioned last week, the insertion of the data has now been completed and I wanted to grab a copy of the MySQL data tables so we don’t have to go through all of this again if anything should happen to the test server that Gareth very kindly set up for the database. Gareth stopped the database and added all of the necessary files to a tar.gz file for me. The file was 13GB in size and I managed to quickly copy this across the University network. I also began trying to add some new indexes to the data to speed up querying but so far I’ve not had much luck with this. I tried adding an index to the data on my local PC but after several hours the process was still running and I needed to turn off my PC. I also tried adding an index to the database on Gareth’s server whilst I was working from home on Thursday but after leaving it running for several hours the remote connection timed out and left me with a partial index. I’m going to have to have another go at this next week.
It’s now been four years since I started this job, so that’s four years’ worth of these weekly posts that are up here now. I have to say I’m still really enjoying the work I’m doing here. It’s still really rewarding to be working on all of these different research projects. Another milestone was reached this week too – the Hansard semantic category dataset that I’ve been running through the grid in batches over the past few months in order to insert it into a MySQL database has finally completed! The database now has 682,327,045 rows in it, which is by some considerable margin the largest database I’ve ever worked with. Unfortunately as it currently stands it’s not going to be possible to use the database as a data source for web-based visualisations as a simple ‘Select count(*)’ to return the number of rows took just over 35 minutes to execute! I will see what can be done to speed things up over the next few weeks, though. At the moment I believe the database is sitting on what used to be a desktop PC so it may be that moving it to a meatier machine with lots of memory might speed things up considerably. We’ll see how that goes.
I met with Scott Spurlock on Tuesday to discuss his potential Kirk Sessions crowdsourcing project. It was good to catch up with Scott again and we’ve made the beginnings of a plan about how to proceed with a funding application, and also what software infrastructure we’re going to try. We’re hoping to use the Scripto tool (http://scripto.org/), which in itself is built around MediaWiki, in combination with the Omeka content management system creator (https://omeka.org/), which is a tool I’ve been keen to try out for some time. This is the approach that was used by the ‘Letters of 1916’ project (http://letters1916.maynoothuniversity.ie/), whose talk at DH2016 I found so useful. We’ll see how the funding application goes and if we can proceed with this.
I also had my PDR session this week, which took up a fair amount of my time on Wednesday. It was all very positive and it was a good opportunity to catch up with Marc (my line manager) as I don’t see him very often. Also on Wednesday I had some communication with Thomas Widmann of the SLD as the DSL website had gone offline. Thankfully Arts IT Support got it back up and running again a matter of minutes after I alerted them. Thomas also asked me about the datafiles for the Scots School Dictionary app, and I was happy to send these on to him.
I gave some advice to Graeme Cannon this week about a project he has been asked to provide technical input costings for, and I also spent some time on AHRC review duties. Wendy also contacted me about updating the data for the main map and OE maps for Mapping Metaphor so I spent some time running through the data update processes. For the main dataset the number of connections has gone down from 15301 to 13932 (due to some connections being reclassified as ‘noise’ or ‘relevant’ rather than ‘metaphor’) while the number of lexemes has gone up from 10715 to 13037. For the OE data the number of metaphorical connections has gone down from 2662 to 2488 and the number of lexemes has gone up from 3031 to 4654.
The rest of my week was spent on the SCOSYA project, for which I continued to develop the prototype Atlas interface and the API. By Tuesday I had finished an initial version of the ‘attribute’ map (i.e. it allows you to plot the ratings for a specific feature as noted in the questionnaires). This version allowed users to select one attribute and to see the dots on a map of Scotland, with different colours representing the rating scores of 1-5 (an average is calculated by the system based on the number of ratings at a given location). I met with Gary and he pointed out that the questionnaire data in the system currently only has latitude / longitude figures for each speaker’s current address, so we’ve got too many spots on the map. These need to be grouped more broadly by town for the figures to really make sense. Settlement names are contained in the questionnaire filenames and I figured out a way of automatically querying Google Maps for this settlement name (plus ‘Scotland’ to disambiguate places) in order to grab a more generic latitude / longitude value for the place – e.g. http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=oxgangs+scotland
There will be some situations where there is some ambiguity and multiple places are returned, but I just grab the first, and the locations can be ‘fine-tuned’ by Gary via the CMS – I updated the CMS to incorporate just such facilities, in fact. I also updated the questionnaire upload scripts so that the Google Maps data is incorporated automatically from now on. With this data in place I then updated the API so that it spits out the new data rather than the speaker-specific data, and updated the atlas interface to use the new values too. The result was a much better map – fewer dots and better grouping.
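The geocoding step just builds a query like the one above and takes the first result's coordinates. A sketch of the URL construction and response handling (the sample JSON below is abbreviated and hypothetical; the live script obviously fetches from the Google Maps endpoint over the network):

```python
import json
from urllib.parse import urlencode

BASE = "http://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(settlement):
    """Build the lookup URL, appending 'Scotland' to disambiguate
    settlement names found in the questionnaire filenames."""
    return BASE + "?" + urlencode(
        {"sensor": "false", "address": settlement + " Scotland"})

def first_location(response_text):
    """Take the first result's coordinates where multiple places
    are returned, as described above."""
    results = json.loads(response_text)["results"]
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]
```

Any wrong first guesses can then be corrected through the fine-tuning facilities in the CMS.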
I also updated the atlas interface so that it uses leaflet ‘circleMarkers’ rather than just ‘circles’, as this allows the markers to stay the same size at all map zoom levels where previously they looked tiny when zoomed out but then far too big when zoomed in. I added a thin black stroke around the markers too, to make the lighter coloured circles stand out a bit more on the map. Oh, I also changed the colour gradient to a more gradual ‘yellow to red’ approach, which works much better than the colours I was using before. Another small tweak was to move the atlas’s zoom in and out buttons to the bottom right rather than the top left, as the ‘Atlas Display Options’ slide-out menu was obscuring them. I’d never noticed this myself as I just zoom in and out with the mouse scrollwheel rather than using the buttons, but Gary pointed out it was annoying to have them covered up. I also prevented the map from resetting its location and zoom level every time a new search was performed, which makes it easier to compare search results. And I prevented the scrollwheel from zooming the map in and out when the mouse is in the attribute drop-down list. I haven’t yet figured out a way to make the scrollwheel actually scroll the drop-down list as it really ought to, though.
I made a few visual tweaks to the map pop-up boxes, such as linking to the actual questionnaires from the ‘location’ atlas view (this will be for staff only) and including the actual average rating in the attribute view pop-up so you don’t have to guess what it is from the marker colour. Adding in links to the questionnaires involved reworking the API somewhat, but it’s worked out ok.
The prototype is working very nicely so far. What I’m going to try to do next week is allow for multiple attributes to be selected, with Boolean operators between them. This might be rather tricky, but we’ll see. I’ll finish off with a screenshot of the ‘attribute’ search, so you can compare how it looks now to the screenshot I posted last week:
Monday was a holiday this week so I returned to work on Tuesday, after being out of the office for most of the past three weeks on holidays and at the DH2016 conference. A lot of the week was spent catching up with emails and finishing off conference related things, such as writing last week’s lengthy blog post that summarised the conference parallel sessions I attended. I also had to submit my travel expenses and get my remaining Zlotys changed back. Other than these tasks the rest of my week was spent on a range of relatively small tasks. I continued to work with the Hansard data extraction using the ScotGrid infrastructure. By the end of the week the total number of rows extracted and inserted into the MySQL database stood at 123,636,915, and that’s with only 170 files out of over 1,200 processed.
I spent a little bit of time discussing the dreaded H27 issue for the Old English data of the Mapping Metaphor project. Wendy and Ellen have been having a chat about this and it looks like they’ve come up with a plan to get the data sorted. Carole is going to use the content management system I created for the project in order to add in the stage 5 data for the H27 categories. Once this is in place I will then be able to extract this data and pass it over to Flora so she can integrate it with the rest of the data in her Access database. Here’s hoping this strategy will work.
I also had a chat with Gary Thoms about the SCOSYA project and added some new codes to the project database for him. We will be meeting next week to go over plans for the next stage of technical development for the project, but Gary wanted to check a few things out before this, such as whether it would be possible to allow the editors to create records directly through the system rather than uploading CSV files.
I also responded to a request for help from someone in the School of Social and Political Sciences about an interactive online teaching course she was wanting to put together. As I only really work within the School of Critical Studies I couldn’t really get involved too much, but I suggested she speak to the University’s MOOC people as a MOOC (Massive Open Online Course) seemed to be very similar to what she had in mind. I also spent some time in an email conversation with Christine Ferguson and a technical person at Stirling University. Christine has a project starting up and I was supposed to get the project website up and running over the summer. However, Christine is starting a new post at Stirling and the project needs to move with her. After a bit of toing and froing we managed to come up with a plan of action for setting up the website at Stirling, and that should be the end of my involvement with the project, all being well.
Ann Ferguson of Scottish Language Dictionaries contacted me whilst I was on holiday about doing some further work on the DSL website so I also spent a bit of time going through the materials she had sent me and getting back up to speed on the project. There are a few outstanding tasks that we had intended to complete about 18 months ago that Ann would now like to see finalised so I replied to her about how we might go about this.
I also spoke to Rob Maslen about the student blog he is hoping to set up before next term. I’m going to meet with him next week to figure out exactly what is required. Finally, Marc sent me on some new data for the Historical Thesaurus that has come from the OED people. We’re going to have to figure out how best to integrate this over the next couple of months, and it will be really great to have the updated data.
On Friday last week I submitted a job to ScotGrid that would extract all of the data from the Hansard dataset that was supplied by Lancaster. I had to resubmit this job because I’d noticed that the structure of the metadata had changed midway through the data, which had messed up my extraction script. I submitted the 1252 files and left them running over the weekend, and by Monday morning they had all completed, giving me a set of 1252 SQL files. None of the error checks I’d added into my extraction script last week had tripped, so hopefully the metadata structure doesn’t have any other surprises waiting. On Monday I started running batches of the SQL files into the MySQL database that I have for the data, but it’s going to take quite a while for these to process, as I have to send them through to ScotGrid in small batches of around 20, otherwise the poor database has too many connections and returns an error.
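The batching approach described above can be sketched roughly as follows. This is only an illustration: the ‘hansard_N.sql’ naming scheme is a made-up convention, and in the real script each file would be handed to the grid job submission command rather than just echoed.

```shell
#!/bin/sh
# Rough sketch of grouping the SQL files into batches of around 20 before
# submission, so the database server never ends up with too many open
# connections at once. File names and the echo stand-in are assumptions.

BATCH_SIZE=20

files_in_batch() {
  # List the SQL files belonging to batch number $1.
  start=$(( ($1 - 1) * BATCH_SIZE + 1 ))
  end=$(( $1 * BATCH_SIZE ))
  i=$start
  while [ "$i" -le "$end" ]; do
    echo "hansard_$i.sql"
    i=$(( i + 1 ))
  done
}

submit_batch() {
  # Submit one batch: in reality each file would be passed to the grid
  # submission command here instead of being echoed.
  for f in $(files_in_batch "$1"); do
    echo "submitting $f"
  done
}
```

With 1252 files and batches of 20 this gives about 63 batches, each small enough for the database to cope with.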
I spent most of the rest of the week working on the ‘Basics of English Metre’ app and made some good progress with it. I have now completed Unit 2 and have made a start on Unit 3. I did get rather bogged down in Unit 2 for a while, as several of the exercises looked like regular exercises that I had already developed code to process, only to have extra pesky questions added on the end that only appear when the final question on a page is correctly answered. These included selecting the foot type for a set of lines (e.g. iambic pentameter) or identifying a poem based on its metre. However, I managed to find a solution to all of these quirks and added in some new question styles. I’m currently on page 2 of Unit 3, which consists of four questions that each have four stages. The first is syllable boundary identification, the second is metre analysis, the third is putting in the foot boundaries, while the fourth is adding the rhythm. I’ve got all of this working, although I have only supplied data for the fourth stage for the last of the lines on the page. There are also some more of the pesky additional questions that need to be integrated and, rather strangely, the existing website doesn’t supply answers for the fourth stage, so I’m going to need to get someone in English Language to supply these.
Other than the above I helped Carolyn Jess-Cooke from English Literature to add a forum to her ‘writing mental health’ website. I also had an email conversation with Rhona Brown about the digitised images and OCR possibilities for her ‘Edinburgh Gazetteer’ project that is starting soon. I had a chat with Graeme Cannon about an on-screen keyboard I had developed for the Essentials of Old English app, as he is going to need a similar feature for one of his projects. I also spoke with Flora about the dreaded H27 error with the OE data for Mapping Metaphor. A solution to this is still eluding her, but I’m afraid I wasn’t able to offer much advice as I don’t know much about the Access database and the forms that were created for the data. I might see if I can extract the data and do something with it using PHP if she hasn’t found a solution soon. I also spoke to Rob Maslen about a new blog he’s wanting to set up for students of his Fantasy course next year and talked to Scott Spurlock about a possible crowdsourcing tool for a project he is putting together.
I am going to be on holiday for the next two weeks so there won’t be a further update until after I’m back.
I spent about a day this week working with the Hansard data again. By Friday morning the frequencies database contained 358,408,449 rows, with just under half of the data processed. However, I’m going to have to go back to square one again as I’ve noticed an inconsistency with the data. I had split the base64-encoded data from Lancaster up into about 1200 separate files and I noticed on Friday that up until about midway through the 49th file the metadata has the following structure:
But then after that the structure changes as follows:
That extra /commons/ in there messed up the part of my script that split this information up and led to the loss of the actual filename from my processed data. It meant that I had to re-run everything through the grid again, wipe the database and re-run the insertion jobs again.
I returned to my original shell script that extracted the Base64 data and reworked it to add in some checks for the structure of the data. I also added in some error checking to ensure that an error is raised if (for example) the ‘year’ field doesn’t contain a number. I also took the opportunity to update the SQL statements that were generated, firstly to add in the all-important semi-colon delimiting character that I had missed out first time around, and secondly to make the insert statements standard SQL rather than the MySQL-specific syntax that I’ve tended to use in the past. The standard way is "insert into table (column1, column2) values ('value1', 'value2');" while MySQL also allows "insert into table set column1 = 'value1', column2 = 'value2'". Having updated and tested the file I then submitted a new batch of jobs to ScotGrid, and the output files seemed to work well with both possible metadata structures. I submitted all of the 1,200-odd files to run over the weekend.
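A minimal sketch of the kind of checks and output described above is below. The field names and table columns here are illustrative assumptions, not the real Lancaster data layout or the actual script:

```shell
#!/bin/sh
# Illustrative sketch of the error checking added to the extraction script.
# The 'frequencies' table and its columns are assumptions for illustration.

check_year() {
  # Raise an error if the 'year' field is not a four-digit number.
  case "$1" in
    [0-9][0-9][0-9][0-9]) return 0 ;;
    *) echo "ERROR: 'year' field is not a number: $1" >&2; return 1 ;;
  esac
}

emit_insert() {
  # Generate a standard SQL insert statement, remembering the closing
  # semi-colon that was missing the first time around.
  printf "insert into frequencies (word, year, frequency) values ('%s', %s, %s);\n" \
    "$1" "$2" "$3"
}
```

So something like `emit_insert the 1803 42` prints a single standard-SQL insert statement, terminated with a semi-colon, that any SQL database could process.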
In addition to the above work I did a few other tasks. I met with Jane Stuart Smith to discuss a couple of upcoming projects she’s putting together, plus I gave her some further input into the project I advised her on last week. I also upgraded the WordPress installations for a number of sites that I’ve set up over the years, as Chris had pointed out that they were running older versions of the software. I was also supposed to meet Flora on Friday to discuss the issue relating to the H27 categories for the Old English data for Mapping Metaphor, but unfortunately Flora was ill and we weren’t able to meet. Hopefully we can fit this in next week.
Last week I started to redevelop the old STELLA resource ‘The Basics of English Metre’ and I spent much of this week continuing with it. The resource is split into three sections, each of which features a variety of interactive exercises throughout. Last week I made a start on the first exercise, and this week I made quite a bit of progress with the content, completing the first 12 out of 13 pages of the first section. As with the previous STELLA resources I redeveloped, I’ve been using the jQueryMobile framework to handle the interface and jQuery itself to handle the logic of the exercises. The contents of each page are stored in a JSON file, with the relevant content pulled in and processed when a page loads. The first exercise I completed required the user to note the syllable boundaries within words. I was thankfully able to reuse a lot of the code from the ARIES app for this. The second exercise type required the user to choose whether the syllables in a word were strongly or weakly stressed. For this I repurposed the ‘part of speech’ selector exercise type I had created for the Essentials of English Grammar app. The third type of exercise was a multi-stage exercise requiring syllable identification for stage 1 and then stress identification for stage 2. Rather than just copying the existing code from the other apps I also refined it, as I know a lot more about the workings of jQueryMobile than I did when I put these other apps together. For example, with the ‘part of speech’ selector the different parts of speech were listed in a popup that appeared when the user pressed on a dotted box. After a part of speech was selected it then appeared in the dotted box and the popup closed. However, I had previously set things up so that a separate popup was generated for each of the dotted boxes, which is hugely inefficient as the content of each popup is identical.
With the new app there is only one popup, and the ID of the dotted box is passed to it when the user presses on it. This is a much better approach. As most of the remaining interactive exercises are variations on the exercises I’ve already tackled, I’m hoping that I’ll be able to make fairly rapid progress with the rest of the resource.
Other than working on the ‘Metre’ resource I communicated with Jane Stuart Smith, who is currently putting a new proposal together. I’m not going to be massively involved in it, but will contribute a little so I read through all of the materials and gave some feedback. Bill Kretzschmar is also working on a new proposal that I will have a small role in too, so I had an email chat with him about this too. I also completed a second version of the Technical Plan for the proposal Murray Pittock is currently writing. The initial version required some quite major revisions due to changes in how a lot of the materials will be handled, but I think we are now getting close to a final version.
I also spent a little bit of time working with some of the Burns materials for the new section and gave a little bit of advice to a colleague who was looking into incorporating historical maps into a project. I also fixed a couple of bugs with the SciFiMedHums ‘suggest a new bibliographical item’ page and then made it live for Gavin Miller. You can suggest a new item here: http://scifimedhums.glasgow.ac.uk/suggest-new-item/ (but you’ll need to register first). Finally, I continued processing the Hansard data using the ScotGrid infrastructure. By the end of the week 200 of the 1200 SQL files had been processed, resulting in more than 150,000,000 rows. I’ll just keep these scripts running over the next few weeks until all of the data is done. I’m not sure that an SQL database is going to be quick enough to actually process this amount of data, however. I did a simple ‘count rows’ query and it took over two minutes to return an answer, which is a little worrying. It’s possible that after all of the data is inserted I may then have to look for another solution. But we’ll see.
Monday was a bank holiday so this was a four-day week for me. I had yet more AHRC review duties to perform this week so quite a lot of Tuesday was devoted to that. I also had an email conversation with Murray Pittock about the technical plan for a proposal he is currently putting together. We’re still trying to decide on the role of OCR in the project, but I think a bit of progress is being made. I also spent some further time helping to sort out the materials for the new section of the Burns website. This mainly consisted of sorting out a series of video clips of song performances, uploading them to YouTube and embedding them in the appropriate pages. For Mapping Metaphor, Wendy had sent me on some new teaching materials that she wanted me to add to the ‘Metaphoric’ resource (http://mappingmetaphor.arts.gla.ac.uk/metaphoric/teaching-materials.html). This involved uploading the files, adding them to the zip files and updating the browse by type and topic facilities to incorporate the resources.
My two big tasks of the week were working with the Hansard data and starting the redevelopment of another old STELLA resource. As mentioned in previous posts, Gareth Roy of Physics and Astronomy has kindly set up a database server for me where the Hansard data can reside as we’re processing it. Last week I added all of the ancillary tables to the database (e.g. information about speakers) and I ‘fixed’ the SQL files so that MySQL could process them at the command line, and I wrote a very simple shell script that takes the path to an SQL file as an argument and then invokes the MySQL command to import that file into the specified database. I tested this out on the first output file, running the shell script on the test server I have in my office, and it successfully inserted all 581,409 rows contained in the file into the database. It did take quite a long time for the script to execute, though. About an hour and 20 minutes, in fact. With that individual test successfully completed I wrote another shell script that would submit a batch of jobs to the Grid. It took a little bit of trial and error to get this script to work successfully within the Grid, mainly due to needing to specify the full path to the MySQL binary from the nodes. Thankfully I got the script working in the end and set it to work on the first batch of 9 files (files 2-10, as I had already processed file 1). It took about four hours for the nodes to finish processing the files, which means we’re looking at (very roughly) 30 minutes per file, and as there are about 1200 files it might take about 25 days to process them all. That is, unless processing more than 9 jobs at a time is faster, but I suspect speed might be limited to some extent at the database server end. I submitted a further 20 jobs before I left the office for the weekend so we’ll just need to see how quickly these are processed.
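The per-file import wrapper described above amounts to something like the following sketch. The host, user and database names are placeholders rather than the real server details:

```shell
#!/bin/sh
# Sketch of the per-file import wrapper: take the path to an SQL file and
# pipe it through the mysql client. Host, user and database names are
# placeholders, not the real ScotGrid settings.

import_sql_file() {
  sqlfile="$1"
  if [ ! -f "$sqlfile" ]; then
    echo "no such SQL file: $sqlfile" >&2
    return 1
  fi
  # The Grid nodes needed the full path to the mysql binary spelled out.
  /usr/bin/mysql -h db.example.ac.uk -u hansard hansard_db < "$sqlfile"
}

# Usage: import_sql_file hansard_1.sql
```

The batch-submission script then simply hands one SQL file path to each Grid job, with each job running this wrapper.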
The STELLA resource I decided to look into redeveloping is ‘The Basics of English Metre’. The old version can be found here: http://www.arts.gla.ac.uk/stella/Metre/MetreHome.html. It’s a rather out of date website that only works properly in Internet Explorer. The exercises contained in it also require the Flash plugin in order to function. Despite these shortcomings the actual content (if you can access it) is still very useful, which I think makes it a good candidate for redevelopment. As with the previous STELLA resources I’ve redeveloped, I intend to make web and app versions of the resource. I spent some time this week going through the old resource, figuring out how the exercises function and how the site is structured. After that I began to work on a new version, setting up the basic structure (which in common with the other STELLA resources I’ve redeveloped will use the jQueryMobile framework). By the end of the week I had decided on a structure for the new site (i.e. which of the old pages should be kept as separate pages and which should be merged) and had created a site index (this is already better than the old resource, which only featured ‘next’ and ‘previous’ links between pages with no indication of what any of the pages contained or any way to jump to a specific page). I also made a start processing the content, but I only got as far as working on the first exercise. Some of the later exercises are quite complicated and, although I will be able to base a lot of the exercise code on the previous resources I had created, it is still going to require quite a bit of customisation to get things working for the sorts of questions these exercises ask. I hope to be able to continue with this next week, although I would also like to publish Android versions of the two STELLA apps that are currently only available for iOS (English Grammar: An Introduction and ARIES) so I might focus on this first.