A lot of this week was devoted to the Scots Thesaurus project, which we launched on Wednesday. You can now access the website here: http://scotsthesaurus.org/. I spent quite a bit of time on Monday and Tuesday making some last minute updates to the website and visualisations and also preparing my session for Wednesday’s colloquium. The colloquium went well, as did the launch itself. We had considerable media attention and many thousands of page hits and thankfully the website coped with all of this admirably. I still have a number of additional features to implement now the launch is out of the way, and I’ll hopefully get a chance to implement these in the coming weeks.
Other than Scots Thesaurus stuff I had to spend a day or so this week doing my Performance and Development review exercise. This involved preparing materials, having my meeting and then updating the materials. It has been a very successful year for me so the process all went fine.
I spent some of the remainder of the week working on the front end for the bibliographical database for Gavin Miller’s SciFiMedHums project. This involved updating my WordPress plugin to incorporate some functions that could then be called in the front-end template page using WordPress shortcodes. This is a really handy way to add custom content to the WordPress front end and I managed to get a first draft of the bibliographical entry page completed, including lists of associated people, places, organisations and other items, themes, excerpts and other information. It’s all looking pretty good so far, but there is still a lot of functionality to add, for example search and browse facilities and the interlinking of data such as themes (e.g. click on a theme listed in one entry to view all other entries that have been classified with it).
I also spent some time this week starting on the Technical Plan for a new project for the Burns people, but I haven’t got very far with it yet. I’ll be continuing with this on Monday. So, a very short report this week, even though the week itself was really rather hectic.
This week I returned to working a full five days, after the previous two part-time weeks. It was good to have a bit more time to work on the various projects I’m involved with, and to be able to actually get stuck into some development work again. On Monday and Tuesday and a bit of Thursday this week I focussed on the Scots Thesaurus project. The project is ending at the end of September so there’s going to be a bit of a final push over the coming weeks to get all of the outstanding tasks completed.
I spent quite a bit of time continuing to try to enable multiple parts of speech to be represented in the visualisations at the same time, but unfortunately I had to abandon this due to the limitations of my available time. It's quite difficult to explain why allowing multiple parts of speech to appear in the same visualisation is tricky, but I'll try. The difficulty is caused by the way parts of speech and categories are handled in the thesaurus database. A category for each part of speech is considered to be a completely separate entity, with a different unique identifier, different lexemes and different subcategories. For example, there isn't just one category '01.01.11.02.08.02.02 Rain' containing some lexemes that are nouns and others that are verbs. Instead, '01.01.11.02.08.02.02n Rain' is one category (ID 398) and '01.01.11.02.08.02.02v Rain' is another, entirely different category (ID 401). This is useful because categories of different parts of speech can then have different names (e.g. 'Dew' (n) and 'Cover with dew' (v)), but it also means building a multiple part of speech visualisation is tricky because the system is based around the IDs.
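In code terms the distinction looks something like this (a little Python sketch with made-up field names — the real system is a database behind a WordPress plugin, but the IDs are the actual ones mentioned above):

```python
# Minimal sketch of how one thesaurus heading becomes two category
# records, one per part of speech. Field names are illustrative only.
categories = {
    398: {"number": "01.01.11.02.08.02.02", "pos": "n", "heading": "Rain"},
    401: {"number": "01.01.11.02.08.02.02", "pos": "v", "heading": "Rain"},
}

def find_category(number, pos):
    """Look up a category ID by number and part of speech."""
    for cat_id, cat in categories.items():
        if cat["number"] == number and cat["pos"] == pos:
            return cat_id
    return None

# The same category number resolves to a different ID
# depending on the part of speech.
print(find_category("01.01.11.02.08.02.02", "n"))  # 398
print(find_category("01.01.11.02.08.02.02", "v"))  # 401
```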
The tree based visualisations we’re using expect every element to have one parent category and if we try to include multiple parts of speech things get a bit confused as we no longer have a single top-level parent category as the noun categories have a different parent from the verbs etc. I thought of trying to get around this by just taking the category for one part of speech to be the top category but this is a little confusing if the multiple top categories have different names. It also makes it confusing to know where the ‘browse up’ link goes to if multiple parts of speech are displayed.
There is also the potential for confusion relating to the display of categories that are at the same level but with a different part of speech. It’s not currently possible to tell by looking at the visualisation which category ‘belongs’ to which part of speech when multiple parts of speech are selected, so for example if looking at both ‘n’ and ‘v’ we end up with two circles for ‘Rain’ but no way of telling which is ‘n’ and which is ‘v’. We could amalgamate these into one circle but that brings other problems if the categories have different names, like the ‘Dew’ example. Also, what then should happen with subcategories? If an ‘n’ category has 3 subcategories and a ‘v’ category has 2 subcategories and these are amalgamated it’s not possible to tell which main category the subcategories belong to. Also, subcategory numbers can be the same in different categories, so the ‘n’ category may have a subcategory ’01’ and a further one ‘01.01’ while the ‘v’ category may also have ones with the same numbers and it would be difficult to get these to display as separate subcategories.
There is also a further issue with us ending up with too much information in the right-hand column, where the lexemes in each category are displayed. If the user selects 2 or 3 parts of speech we then have to display the category headings and the words for each of these in the right-hand column, which can result in far too much data being displayed.
None of these issues are completely insurmountable, but I decided that given the limited amount of time I have left on the project it would be risky to continue to pursue this approach for the time being. Instead, I implemented a feature that allows users to select a single part of speech to view from a list of available options. Users are able to, for example, switch from viewing 'n' to viewing 'v' and back again, but can't view both 'n' and 'v' at the same time. I think this facility works well enough and considerably cuts down on the potential for confusion.
After completing the part of speech facility I moved onto some of the other outstanding, non-visualisation tasks I still have to tackle, namely a 'browse' facility and the search facilities. Using WordPress shortcodes I created an option that lists all of the top level main categories in the system – i.e. those categories that have no parent category. This option provides a pathway into the thesaurus data and is a handy reference showing which semantic areas the project has so far tackled. I also began work on the search facilities, which will work in a very similar manner to those offered by the Historical Thesaurus of English. So far I've managed to create the required search forms but not the search that this needs to connect to.
After making this progress with non-visualisation features I returned to the visualisations. The visualisation style we had adopted was a radial tree, based on this example: http://bl.ocks.org/mbostock/4063550. This approach worked well for representing the hierarchical nature of the thesaurus, but it was quite hard to read the labels. I decided instead to investigate a more traditional tree approach, initially hoping to get a workable vertical tree, with the parent node at the top and levels down the hierarchy from this expanding down the page. Unfortunately our labels are rather long and this approach meant that there were a lot of categories on the same horizontal line of the visualisation, leading to a massive amount of overlap of labels. So instead I went for a horizontal tree approach, and adapted a very nice collapsible tree style similar to the one found here: http://mbostock.github.io/d3/talk/20111018/tree.html. I continued to work on this on Thursday and I have managed to get a first version integrated with the WordPress plugin I’m developing.
Also on Thursday I met with Susan and Magda to discuss the project and the technical tasks that are still outstanding. We agreed on what I should focus on in my remaining time and we also discussed the launch at the end of the month. We also had a further meeting with Wendy, as a representative of the steering group, and showed her what we'd been working on.
On Wednesday this week I focussed on Medical Humanities. I spent a few hours adding a new facility to the SciFiMedHums database and WordPress plugin to enable bibliographical items to cross-reference any number of other items. This facility adds such a connection in both directions, allowing (for example) Blade Runner to have an 'adapted from' relationship with 'Do Androids Dream of Electric Sheep?' and for the relationship in the other direction to then automatically be recorded as 'adapted into'.
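The underlying logic is simple enough to sketch (in Python here for illustration — the actual plugin is PHP, and any relation names beyond the 'adapted' pair would be defined in the database):

```python
# Sketch of storing a cross-reference in both directions. Each relation
# type knows its inverse, so recording one direction automatically
# records the other.
INVERSE = {
    "adapted from": "adapted into",
    "adapted into": "adapted from",
}

references = []  # each entry: (item, relation, other_item)

def add_cross_reference(item, relation, other):
    """Record the relationship and its automatic inverse."""
    references.append((item, relation, other))
    references.append((other, INVERSE[relation], item))

add_cross_reference("Blade Runner", "adapted from",
                    "Do Androids Dream of Electric Sheep?")
```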
I spent the remainder of Wednesday and some other bits of free time continuing to work on the Medical Humanities Network website and CMS. I have now completed the pages and the management scripts for managing people and projects and have begun work on Keywords. There should be enough in place now to enable the project staff to start uploading content and I will continue to add in the other features (e.g. collections, teaching materials) over the next few weeks.
On Friday I met with Stuart Gillespie to discuss some possibilities for developing an online resource out of a research project he is currently in the middle of. We had a useful discussion and hopefully this will develop into a great resource if funding can be secured.

The rest of my available time this week was spent on the Hansard materials again. After discussions with Fraser I think I now have a firmer grasp on where the metadata that we require for search purposes is located. I managed to get access to information about speeches from one of the files supplied by Lancaster, and also to the metadata used in the Millbanksystems website relating to constituencies, offices and things like that. The only thing we don't seem to have access to is which party a member belonged to, which is a shame as this would be hugely useful information. Fraser is going to chase this up, but in the meantime I have the bulk of the required information. On Friday I wrote a script to extract the information relating to speeches from the file sent by Lancaster. This will allow us to limit the visualisations by speaker, and hopefully by constituency and office too. I also worked some more with the visualisation, writing a script that created output files for each thematic heading in the two-year sample data I'm using, to enable these to be plugged into the visualisation. I also started to work on facilities to allow a user to specify which thematic headings to search for, but I didn't quite manage to get this working before the end of the day. I'll continue with this next week.
It was another short week for me this week, as I was off on Monday, Tuesday and Wednesday. I had to spend a little time whilst I was off fixing a problem with one of our WordPress sites. Daria alerted me to the fact that the ISAS conference website was displaying nothing but a database connection error, which is obviously a fairly major problem. It turned out that one of the underlying database tables had become corrupted, but thankfully WordPress includes some tools to fix such issues and after figuring that out I managed to get the site up and running again. I’m not sure what caused the problem but hopefully it won’t happen again.
I attended to some further WordPress duties later in the week when I was back at work. I've set up a conference website for Sean Adams in Theology and he wanted a registration page to be added. I was supposed to meet with his RA on Thursday to discuss the website but unfortunately she was ill, so I just had to get things set up without any formal discussions. I investigated a couple of event management plugins for WordPress, but the ones I tried seemed a bit too big and clunky for what we need. All we need is a single registration page for one event, but the plugins provide facilities to publish multiple events, manage payments, different ticket types and all of that stuff. It was all far too complicated, yet at the same time it seemed rather difficult to customise some fairly obvious things such as which fields are included in the registration form. After trying two plugins and being dissatisfied with both of them I just settled for using a contact form that emails Sean and the RA whenever someone registers. It's not the ideal setup, but for a relatively small event that we have very little time to prepare for it should work out ok.
I had some further AHRC review duties to take care of this week, which took up the best part of one of my available days. I also had some more iOS developer account management issues to take care of, which also took up some time. Some people elsewhere in the University are wanting to upload a paid app to the App Store, but in order to do this a further contract needs to be signed with Apple, and this needs some kind of legal approval from the University before we agree to it. I had a couple of telephone conversations with a lawyer working on behalf of the University about the contracts for Apple and also for the Google Play store. I also had email conversations with Megan Coyer and Gavin Miller about the development of their respective online resources and spoke to Magda and Susan regarding some Scots Thesaurus issues.
On Friday morning I had a meeting with Gerry Carruthers and Catriona MacDonald to discuss their ‘People’s Voice’ project, which is due to start in January and for which I had been assigned two months of effort. We had a really useful meeting, going over some initial requirements for the database of songs and poems that they want to put together and thinking about how their anthology of songs will be marked up and managed. We also agreed that Mark Herraghty would do the bulk of the development work for the project. Mark knows a lot more about XML markup than I do which makes him a perfect fit for the project so this is really good news.
This was all I managed to squeeze into my two days of work this week. I didn't get a chance to do any real development work but hopefully next week I'll be able to get stuck back into it.
This week I continued with the projects I'd picked up again last week after launching the three Old English resources. For the Science Fiction and the Medical Humanities project I completed a first draft of all of the management scripts that are required for managing the bibliographic data that will be published through the website. It is now possible to manage all of the information relating to bibliographical items through the WordPress interface, including adding, editing and deleting mediums, themes, people, places and organisations. The only thing it isn't possible to do is to update the list of options that appear in the 'connection' drop-down lists when associating people, places and organisations. But I can very easily update these lists directly through the database and the new information then appears wherever it is required, so this isn't going to be a problem.
Continuing on the Medical Humanities theme, I spent about a day this week starting work on the new Medical Humanities network website and content management system for Megan Coyer. This system is going to be an adaptation of the existing Digital Humanities network system. Most of my time was spent on ‘back end’ stuff like setting up the underlying database, password protecting the subdomain until we’re ready to ‘go live’ and configuring the template scripts. The homepage is in place (but without any content), it is possible to log into the system and the navigation menu is set up, but no other pages are currently in place. I spent a bit of time tidying up the interface, for example adding in more modern looking up and down arrows to the ‘log in’ box, tweaking the breadcrumb layout and updating the way links are styled to bring things more into line with the main University site.
I also spent a bit of time advising staff and undertaking some administrative work. Rhona Brown asked me for some advice on the project she is putting together and it took a little time to formulate a response to her. I was also asked by Wendy and Nikki to complete a staff time allocation survey for them, which also took a bit of time to go through. I also had an email from Adam Zachary Wyner in Aberdeen about a workshop he is putting together and I gave him a couple of suggestions about possible Glasgow participants. I’m also in the process of setting up a conference website for Sean Adams in Theology and have been liaising with the RA who is working on this with him.
Other than these matters the rest of my week was spent on two projects, the Scots Thesaurus and SAMUELS. For the Scots Thesaurus I continued to work on the visualisations. Last week I adapted an earlier visualisation I had created to make it 'dynamic' – i.e. the contents change depending on variables passed to it by the user. This week I set about integrating this with the WordPress interface. I had initially intended to make the visualisations available as a separate tab within the main page, e.g. the standard 'browse' interface would be available and by clicking on the visualisation tab this would be replaced in-page by the visualisation interface. However, I realised that this approach wasn't really going to work due to the limited screen space that we have available within the WordPress interface. As we are using a side panel the amount of usable space is actually quite limited, and for the visualisations we need as much screen width as possible. I decided therefore to place the visualisations within a jQuery modal dialog box which takes up 90% of the screen width and height, and have provided a button from the normal browse view to open this. When clicked, the visualisation now loads in the dialog box, showing the current category in the centre and the full hierarchy from this point downwards spreading out around it. Previously the contents of a category were displayed in a pop-up when the user clicked on a category in the visualisation, but this wasn't ideal as it obscured the visualisation itself. Instead I created an 'infobox' that appears to the right of the visualisation and I've set this up so that it lists the contents of the selected category, including words, sources, links through to the DSL and facilities to centre the visualisation on the currently selected category or to browse up the hierarchy if the central node is selected.
The final thing I added was highlighting of the currently selected node in the visualisation, and facilities to switch back to the textual browse option at the point at which the user is viewing the visualisation. There is still some work to be done on the visualisations, for example adding in the part of speech browser, sorting out the layout and ideally providing some sort of animation between views, but things are coming along nicely.
However, I have figured out that BookwormGUI is based around the Highcharts.js library (see http://www.highcharts.com/demo/line-ajax) and I’m wondering now whether I can just use this library to connect to the Hansard data instead of trying to get Bookworm working, possibly borrowing some of the BookwormGUI code for handling the ‘limit by’ options and the ‘zoom in’ functionality (which I haven’t been able to find in the highcharts examples). I’m going to try this with the two years of Hansard data that I previously managed to extract, specifically this visualisation style: http://www.highcharts.com/stock/demo/compare. If I can get it to work the timeslider along the bottom would work really nicely.
The ISAS (International Society of Anglo-Saxonists) conference took place this week and two projects I have been working on over the past few weeks were launched at this event. The first was A Thesaurus of Old English (http://oldenglishthesaurus.arts.gla.ac.uk/), which went live on Monday. As is usual with these things there were some last minute changes and additions that needed to be made, but overall the launch went very smoothly and I’m particularly pleased with how the ‘search for word in other online resources’ feature works.
The second project that launched was the Old English Metaphor Map (http://mappingmetaphor.arts.gla.ac.uk/old-english/). We were due to launch this on Thursday but due to illness the launch was bumped up to Tuesday instead. Thankfully I had completed everything that needed sorting out before Tuesday so making the resource live was a very straightforward process. I think the map is looking pretty good and it complements the main site nicely.
With these two projects out of the way I had to spend about a day this week on AHRC duties, but once all that was done I could breathe a bit of a sigh of relief and get on with some other projects that I haven’t been able to devote much time to recently due to other commitments. The first of these was Gavin Miller’s Science Fiction and the Medical Humanities project. I’m developing a WordPress based tool for his project to manage a database of sources and this week I continued adding functionality to this tool as follows:
- I removed the error messages that were appearing when there weren’t any errors
- I’ve replaced ‘publisher’ with a new entity named ‘organisation’. This allows the connection the organisation has with the item (e.g. Publisher, Studio) to be selected in the same way as connections to items from places and people are handled.
- I’ve updated the way in which these connections are pulled out of the database to make it much easier to add new connection types. After adding a new connection type to the database this then immediately appears as a selectable option in all relevant places in the system.
- I’ve updated the underlying database so that data can have an ‘active’ or ‘deleted’ state, which will allow entities like people and places to be ‘deleted’ via WordPress but still retained in the underlying database in case they need to be reinstated.
- I’ve begun work on the pages that will allow the management of types and mediums, themes, people, places and organisations. Currently there are new menu items that provide options to list these data types. The lists also include counts of the number of bibliographic items each row is connected to. The next step will be to add in facilities to allow admin users to edit, delete and create types, mediums, themes, people, places and organisations.
The next project I worked on was the Scots Thesaurus project. Magda had emailed me saying she was having problems uploading words via CSV files and also assigning category numbers. I met with Magda on Thursday to discuss these issues and to try and figure out what was going wrong. The CSV issue was being caused by the CSV files created by Excel on Magda's PC being given a rather unexpected MIME type. The upload script was checking the uploaded file for specific CSV MIME types, but Excel was giving the files a MIME type of 'application/vnd.ms-excel'. I have no idea why this was happening, and even more strangely, when Magda emailed me one of her files and I uploaded it on my PC (without re-saving the file) it uploaded fine. I didn't really get to the bottom of this problem, but instead I simply fixed it by allowing files of MIME type 'application/vnd.ms-excel' to be accepted.
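The fix itself was tiny — essentially just widening the whitelist of accepted MIME types. Sketched in Python (the real check lives in the plugin's PHP upload script, and the full list of accepted types here is illustrative):

```python
# Sketch of the fix: accept Excel's unexpected MIME type alongside the
# usual CSV ones. The exact set of types is illustrative only.
ALLOWED_CSV_TYPES = {
    "text/csv",
    "text/plain",
    "application/csv",
    "application/vnd.ms-excel",  # what Excel on Magda's PC reported
}

def is_acceptable_upload(mime_type):
    """Return True if the uploaded file's MIME type is whitelisted."""
    return mime_type in ALLOWED_CSV_TYPES
```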
The issue with certain category numbers not saving was being caused by deleted rows in the system. When creating a new category the system checks to see if there is already a row with the supplied number and part of speech in the system. If there is then the upload fails. However, the check wasn't taking into consideration categories that had been deleted from within WordPress. These rows were being marked as 'trash' in WordPress but still existed in our non-WordPress 'category' table. I updated the check to link the category table to WordPress's posts table and check the status of the category there. Now if a category number exists but it's associated with a WordPress post that is marked as deleted, the upload of a new row can proceed without any problems.
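The shape of the updated check is roughly this (sketched with an in-memory SQLite database — table and column names are stand-ins for the real WordPress/plugin schema):

```python
import sqlite3

# Stand-in schema: a plugin 'category' table linked to WordPress's
# posts table, with one trashed category already present.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, number TEXT, pos TEXT,
                       post_id INTEGER);
CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_status TEXT);
INSERT INTO wp_posts VALUES (10, 'trash');
INSERT INTO category VALUES (1, '01.01', 'n', 10);
""")

def number_in_use(number, pos):
    """True only if a live (non-trashed) category already has this number."""
    row = db.execute("""
        SELECT COUNT(*) FROM category c
        JOIN wp_posts p ON p.ID = c.post_id
        WHERE c.number = ? AND c.pos = ? AND p.post_status != 'trash'
    """, (number, pos)).fetchone()
    return row[0] > 0

# '01.01n' exists only as a trashed post, so a new upload may proceed.
print(number_in_use("01.01", "n"))  # False
```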
In addition to fixing these issues I also continued working on the visualisations for the Scots Thesaurus. Magda will be presenting the thesaurus at a conference next week and she was hoping to be able to show some visualisations of the weather data. We had previously agreed at a meeting with Susan that I would continue to work on the static visualisation I had made for the 'Golf' data using the d3.js 'node-link tree' diagram type (see http://bl.ocks.org/mbostock/4063550). I would make this 'dynamic' (i.e. it would work with any data passed to it from the database and it would be possible to update the central node). Eventually we may choose a completely different visualisation approach but this is the one we will focus on for now. I spent some time adapting my 'Golf' visualisation to work with any thesaurus data passed to it – simply give it a category ID and a part of speech and the thesaurus structure (including subcategories) from this point downwards gets displayed. There's still a lot of work to do on this (e.g. integrating it within WordPress) but I'm happy with the progress I'm making with it.
The last project I worked on this week was the SAMUELS Hansard data, or more specifically trying to get Bookworm set up on the test server I have access to. Previously I had managed to get the underlying database working and the test data (US Congress) installed. I had then installed the Bookworm API but I was having difficulty getting Python scripts to execute. I'm happy to report that I got to the bottom of this. After reading this post (https://www.linux.com/community/blogs/129-servers/757148-configuring-apache2-to-run-python-scripts) I realised that I had not enabled the CGI module of Apache, so even though the cgi-bin directory was now web accessible nothing was getting executed there. The second thing I realised was that I'd installed the API in a subdirectory within cgi-bin, and I needed to add privileges in the Apache configuration file for this subdirectory as well as the parent directory. With that out of the way I could query the API from a web browser, which was quite a relief.
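From memory, the two changes amounted to something like the following (the paths and directory names here are illustrative, and the exact directives will depend on the Apache version in use):

```apache
# Enable the CGI module (Debian/Ubuntu-style Apache):
#   sudo a2enmod cgi && sudo service apache2 restart

# Then grant execute permissions to the API's subdirectory as well as
# cgi-bin itself, e.g. in the site's Apache configuration:
<Directory "/usr/lib/cgi-bin/bookworm">
    Options +ExecCGI
    AddHandler cgi-script .py
    Require all granted
</Directory>
```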
Also this week I had a meeting with Gary Thomas about Jennifer Smith’s Syntactic Atlas of Scots project. Gary is the RA on the project and we met on Thursday to discuss how we should get the technical aspects of the project off the ground. It was a really useful meeting and we already have some ideas about how things will be managed. We’re not going to get started on this until next month, though, due to the availability of the project staff.
I continued working on the new website for the Thesaurus of Old English (TOE) this week, which took up a couple of days in total. Whilst working on the front end I noticed that the structure of TOE is different to that of the Historical Thesaurus of English (HTE) in an important way: there are never any categories with the same number but a different part of speech. With HTE the user can jump 'sideways' in the hierarchy from one part of speech to another and then browse up and down the hierarchy for that part of speech, but with TOE there is no 'sideways' – for example, if there is an adjective category that could be seen as related to a noun category at the same level, these categories are given different numbers. This difference meant that plugging the TOE data into the functions I'd created for the HTE website just didn't work very well, as there were too many holes in the hierarchy when part of speech was taken into consideration.
The solution to the problem was to update the code to ignore part of speech. I checked that there were indeed no main categories with the same number but a different part of speech (a little script I wrote confirmed this to be the case) and then updated all of the functions that generated the hierarchy, the subcategories and other search and browse features to ignore part of speech, but instead to place the part of speech beside the category heading wherever category headings appear (e.g. in the ‘browse down’ section or the list of subcategories). This approach seems to have worked out rather well and the thesaurus hierarchy is now considerably more traversable.
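To illustrate the approach (a Python sketch with invented sample data — the real lookups are database queries in the website code): browsing down now matches children on the category number alone, and the part of speech simply becomes a label beside each heading.

```python
# Sketch of browsing 'down' while ignoring part of speech. Children are
# matched on category number alone; the part of speech is displayed
# beside each heading. Sample data and field names are illustrative.
categories = [
    {"number": "04.01",    "pos": "n",  "heading": "Earth, world"},
    {"number": "04.01.01", "pos": "n",  "heading": "Dry land"},
    {"number": "04.01.02", "pos": "aj", "heading": "Earthly"},
]

def browse_down(parent_number):
    """List immediate children of a category, whatever their part of speech."""
    depth = parent_number.count(".") + 1
    return [f"{c['heading']} ({c['pos']})" for c in categories
            if c["number"].startswith(parent_number + ".")
            and c["number"].count(".") == depth]

print(browse_down("04.01"))  # ['Dry land (n)', 'Earthly (aj)']
```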
I managed to complete a first version of the new website for TOE, with all required functionality in place. This includes both quick and advanced searches, category selection, the view category page and some placeholder ancillary pages. At Fraser’s request I also added in the facility to search with vowel length marks. This required creating another column in the ‘lexeme search words’ table with a stricter collation setting that ensures a search involving a length mark (e.g. ‘sǣd’) only finds words that feature the length mark (e.g. ‘sæd’ would not be found). I added an option to the advanced search field allowing the user to say whether they cared about length marks or not. The default is not, but I’m sure a certain kind of individual will be very keen on searching with length marks. If this option is selected the ‘special characters’ buttons expand to include all of the vowels with length marks, thus enabling the user to construct the required form. It will be useful for people who want to find out (for example) all of the words in the thesaurus that end in ‘*ēn’ (41) as opposed to all of those words that end in ‘*en’ disregarding length marks (1546).
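The real distinction is handled at the database level by the two collation settings, but the comparison logic can be sketched in Python: a 'loose' search strips combining macrons from both sides before comparing, while a strict search compares the forms as-is.

```python
import unicodedata

def strip_length_marks(word):
    """Remove combining macrons so 'sǣd' compares equal to 'sæd'."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize(
        "NFC", "".join(ch for ch in decomposed if ch != "\u0304"))

def matches(search, candidate, strict=False):
    """Compare two word forms, optionally respecting vowel length marks."""
    if strict:
        return search == candidate  # length marks must match exactly
    return strip_length_marks(search) == strip_length_marks(candidate)

print(matches("sæd", "sǣd"))               # True: marks ignored
print(matches("sæd", "sǣd", strict=True))  # False: marks respected
```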
I think we’re well on track to have the new TOE launched before the ISAS conference at the beginning of next month, which is great.
I continued working on the Scots Thesaurus project this week as well. I met with Susan and Magda on Tuesday to talk them through using the WordPress plugin I'd created for managing thesaurus categories and lexemes. Before this meeting I ran through the plugin a few times myself and noted a number of things that needed updating or improving, so I spent some time sorting those things out. The meeting itself went well and I think both Susan and Magda are now familiar enough with the interface to use it. I created a 'to do' list containing outstanding technical tasks for the project and I'll need to work through all of these. For example, a big thing to add will be facilities to enable staff to upload lexemes to a category through the WordPress interface via a spreadsheet. This will really help to populate the thesaurus.
I also spent a little time contributing to a Leverhulme bid application for Carole Hough and did a tiny amount of DSL work as well. I’m still no further with the Hansard visualisations though. Arts Support are going to supply me with a test server on which I should be able to install Bookworm, but I’m not sure when this is going to happen yet. I’ll chase this up on Monday.
I returned to work on Monday this week after being off sick on Thursday and Friday last week. It has been yet another busy week, the highlight of which was undoubtedly the launch of the Mapping Metaphor website. After many long but enjoyable months working on the project it is really wonderful to finally be able to link to the site. So here it is: http://www.glasgow.ac.uk/metaphor
I moved the site to its ‘live’ location on Monday and made a lot of last minute tweaks to the content over the course of the day, with everything done and dusted before the press release went out at midnight. We’ve had lots of great feedback about the site. There was a really great article on the Guardian website (which can currently be found here: http://www.theguardian.com/books/2015/jun/30/metaphor-map-charts-the-images-that-structure-our-thinking) plus we made the front page of the Herald. A couple of (thankfully minor) bugs were spotted after the launch but I managed to get those sorted on Wednesday. It’s been a very successful launch and it has been a wonderful project to have been a part of. I’m really pleased with how everything has turned out.
Other than Mapping Metaphor duties I split my time across a number of different projects. I continued working with the Thesaurus of Old English data and managed to get everything that I needed to do to the data completed. This included writing and executing a nice little script that added in the required UTF-8 length markers over vowels. Previously the data used an underscore after the vowel to note that it was a long one but with UTF-8 we can use proper length marks, so my script found words like ‘sæ_d’ and converted them all to words like ‘sǣd’. Much nicer.
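In outline, the conversion works something like the following sketch (the actual script isn’t reproduced here, so the function name and details are assumed). An underscore after a vowel marks it as long, and each ‘vowel + underscore’ pair becomes the vowel with a combining macron, normalised so that precomposed characters like ‘ǣ’ are used:

```python
import re
import unicodedata

# Old English vowels that can carry a length mark (upper and lower case).
VOWELS = 'aeiouyæAEIOUYÆ'

def add_length_marks(word):
    # Replace each "<vowel>_" pair with the vowel plus a combining macron
    # (U+0304), then normalise to NFC so precomposed characters are used
    # where they exist (e.g. æ + macron becomes the single character ǣ).
    marked = re.sub('([' + VOWELS + '])_', '\\1\u0304', word)
    return unicodedata.normalize('NFC', marked)

print(add_length_marks('sæ_d'))  # -> sǣd
```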
I wrote and executed another script that added in all of the category cross references, and another one that checked all of the words with a ‘ge’ prefix. My final data processing script generated the search terms for the words; for example, it identified word forms with brackets such as ‘eorþe(r)n’ and then generated multiple variant search words, in this case two – ‘eorþen’ and ‘eorþern’. This has resulted in a total of 57,067 search terms for the 51,470 words we have in the database.
Once I’d completed work on the data, I spent a little bit of time on the front end for the new Thesaurus of Old English website. This is going to be structurally the same as the Historical Thesaurus of English website, just with a different colour scheme and logos. I created three different colour scheme mockups and have sent these to Fraser and Marc for consideration, plus I got the homepage working for the new site (still to be kept under wraps for now). This homepage has a working ‘random category’ feature, which shows that the underlying data is working very nicely. Next week I’ll continue with the site and hopefully will get the search and browse facilities completed.
I also returned to working on the Scots Thesaurus project this week. I spent about a day on a number of tasks, including separating out the data that originated in the paper Scots Thesaurus from the data that has been gathered directly from the DSL. I also finally got round to amalgamating the ‘tools’ database with the WordPress database. When I began working for the project I created a tool that enables researchers to bring together the data for the Historical Thesaurus of English and the data from the Dictionary of the Scots Language in order to populate Scots Thesaurus categories with words uncovered from the DSL. Following on from this I made a WordPress plugin through which thesaurus data could be managed and published. But until this week the two systems were using separate databases, which required me to periodically manually migrate the data from the tools database to the WordPress one. I have now brought the two systems together, so it should now be possible to edit categories through the WordPress admin interface and for these updates to be reflected in the ‘tools’ interface. Similarly, any words added to categories through the ‘tools’ interface now automatically appear through the WordPress interface. I still need to fully integrate the ‘tools’ functionality with WordPress so we can get rid of the ‘tools’ system altogether, but it is much better having a unified database, even if there are still two interfaces on top of it. Other than these updates I also made a few tweaks to the public facing Scots Thesaurus website – adding in logos and things like that.
I also spent some time this week working on another WordPress plugin – this time for Gavin Miller’s Science Fiction and the Medical Humanities project. I’m creating the management scripts to allow him and his researchers to assemble a bibliographic database of materials relating to both Science Fiction and the Medical Humanities. I’ve got the underlying database created and the upload form completed. Next week I’ll get the upload form to actually upload its data. One handy thing I figured out whilst developing this plugin is how you can have multiple text areas that have the nice ‘WYSIWYG’ tools above them to enable people to add in formatted text. After lots of hunting around it turned out to be remarkably simple to incorporate, as this page explains: http://codex.wordpress.org/Function_Reference/wp_editor
The ‘scifimedhums’ website itself went live this week, so I can link to it here: http://scifimedhums.glasgow.ac.uk/
I was intending to continue with the Hansard data work this week as well. I had moved my 2 years of sample data (some 13 million rows) to my work PC and was all ready to get Bookworm up and running when I happened to notice that the software will not run on Windows (“Windows is out of the question” says the documentation). I contacted IT support to see if I could get command-line access to a server to get things working but I’m still waiting to see what they might be able to offer me.
I was struck down by a rather unpleasant, feverish throat infection this week. I managed to struggle through Wednesday, even though I should really have been in bed, but then was off sick on Thursday and Friday. It was very frustrating as I am really quite horribly busy at the moment with so many projects on the go and so many people needing advice, and I had to postpone three meetings I’d arranged for Thursday. But it can’t be helped.
I had a couple of meetings this week, one with Carole Hough to help her out with her Cogtop.org site. Whilst I was away on holiday a few weeks ago there were some problems with a multilingual plugin that we use on this site to provide content in English and Danish and the plugin had to be deactivated in order to get content added to the site. I met with Carole to discuss who should be responsible for updating the content of the site and what should be done about the multilingual feature. It turns out Carole will be updating the content herself so I gave her a quick tutorial on managing a WordPress site. I also replaced the multilingual plugin with a newer version that works very well. This plugin is called qTranslate X: https://wordpress.org/plugins/qtranslate-x/ and I would definitely recommend it.
My other meeting was with Gavin Miller, and we discussed the requirements for his bibliography of text relating to Medical Humanities and Science Fiction. I’m going to be creating a little WordPress plugin that he can use to populate the bibliography. We talked through the sorts of data that will need to be managed and Gavin is going to write a document listing the fields and some examples and we’ll take it from there.
I had hoped to be able to continue with the Hansard visualisation stuff on Wednesday this week but I just wasn’t feeling well enough to tackle it. My data extraction script had at least managed to extract frequency data for two whole years of the Commons by Wednesday, though. This may not seem like a lot of data when we have over 200 years to deal with, but it will be enough to test out how the Bookworm system will work with the data. Once I have this test data working and I’m sure that the structure I’ve extracted the data into can be used with Bookworm, we can then think about using Cloud or Grid computing to extract chunks of the data in parallel. If we don’t take this approach it will take another two years to complete the extraction of the data!
Instead of working with Hansard, I spent most of Wednesday working with the Thesaurus of Old English data that Fraser had given to me earlier in the week. I’ll be overhauling the old ‘TOE’ website and database and Fraser has been working to get the data into a consistent format. He gave me the data as a spreadsheet and I spent some time on Wednesday creating the necessary database structure for the data and writing scripts that would be able to process and upload the data. I managed to get all of the data uploaded into the new online database, consisting of almost 22,500 categories and 51,500 lexemes. I still need to do some work on the data, specifically fixing length symbols, which currently appear in the data as underscores after the letter (e.g. eorþri_ce) when what is needed is the modern UTF8 character (e.g. eorþrīce). I also need to create the search terms for variant forms in the data too, which could prove to be a little tricky.
Other tasks I carried out this week included completing the upload of all of the student created data for the Scots Thesaurus project, investigating the creation of the Google Play account for the STELLA apps and updating a lot of the ancillary content for the Mapping Metaphor website ahead of next week’s launch, a task which took a fair amount of time.
I spent a fair amount of time this week working on AHRC duties, conducting reviews and also finishing off the materials I’d been preparing for a workshop on technical plans. This involved writing a sample ‘bad’ plan (or at least a plan with quite a few issues) and then writing comments on each section stating what was wrong with it. It has been enjoyable to prepare these materials. I’ve been meaning to write a list of “dos and don’ts” for technical plans for some time and it was a good opportunity to get all this information out of my head and written down somewhere. It’s likely that a version of these materials will also be published on the Digital Curation Centre website at some point, and it’s good to know that the information will have a life beyond the workshop.
I continued to wrestle with the Hansard data this week after the problems I encountered with the frequency data last week. Rather than running the ‘mrarchive’ script that Lancaster had written in order to split a file into millions of tiny XML files I decided to write my own script that would load each line of the archived file, extract the data and upload it directly to a database instead. Steve Wattam at Lancaster emailed me some instructions and an example shell script that splits the archive files and I set to work adapting this. Each line of the archive file (in this case a 10Gb file containing the frequency data) consists of two parts, each of which is Base64 encoded. The first part is the filename and the second part is the file contents. All I needed to do for each line was split the two parts and decode each part. I would then have the filename, which includes information such as the year, month and day, plus all of the frequency data for the speech the file refers to. The frequency data consisted of a semantic category ID and a count, one per line and separated by a tab so it would be easy to split this information up and then upload each count for each category for each speech into a database table.
It took a little bit of time to get the script running successfully due to some confusion over how the two Base64-encoded parts of each line were separated. In his email, Steve had said that the parts were split by ‘whitespace’, which I took to mean a space character. Unfortunately there didn’t appear to be a space character present, but looking at the encoded lines I could see that each section appeared to be split with an equals sign, so I set my script going using this. I also contacted Steve to check this was right and it turned out that by ‘whitespace’ he’d meant a tab character, and that the equals sign I was using to split the data was a padding character that couldn’t be relied upon to always be present. After hearing this I updated my script and set it off again. However, my script is unfortunately not going to be a suitable way to extract the data as its execution is just too slow for the amount of data we’re dealing with. Having started the process on Wednesday evening it took until Sunday before the script had processed the data for one year. During this time it had extracted more than 7.5 million frequencies relating to tens of thousands of speeches, but at the current rate it will take more than two years to finish processing the data for the 200ish years of data that we have. A more efficient method is going to be required.
Following on from my meeting with Scott Spurlock last week I spent a bit of time researching crowdsourcing tools. I managed to identify three open source tools that might be suitable for Scott’s project (and potentially other projects in future).
First of all is one called PyBossa: http://pybossa.com/. It’s written in the Python programming language, which I’m not massively familiar with but have used a bit. The website links through to some crowdsourcing projects that have been created using the tool and one of them is quite similar to what Scott is wanting to do. The example project is getting people to translate badly printed German text into English, an example of which can be found here: http://crowdsourced.micropasts.org/app/NFPA-SetleyNews2/task/40476. Apparently you can create a project for free via a web interface here: http://crowdcrafting.org/ but I haven’t investigated this.
The second one is a tool called Hive that was written by the New York Times and has been released for anyone to use: https://github.com/nytlabs/hive with an article about it here: http://blog.nytlabs.com/2014/12/09/hive-open-source-crowdsourcing-framework/. This is written in ‘Go’ which I have to say I’d never heard of before so have no experience of. The system is used to power a project to crowdsource historical adverts in the NYT, and you can access this here: http://madison.nytimes.com/contribute/. It deals with categorising content rather than transcribing it, though. I haven’t found any other examples of projects that use the tool as of yet.
The third option is the Zooniverse system, which does appear to be available for download: https://github.com/zooniverse/Scribe. It’s written in Ruby, which I only have a passing knowledge of. I haven’t been able to find any examples of other projects using this software and I’m also not quite sure how the Scribe tool (which says it’s “a framework for crowdsourcing the transcription of text-based documents, particularly documents that are not well suited for Optical Character Recognition”) fits in with other Zooniverse tools, for example Panoptes (https://github.com/zooniverse/Panoptes), which says it’s “The new Zooniverse API for supporting user-created projects.” It could be difficult to get everything set up, but is probably worth investigating further.
I spent a small amount of time this week dealing with App queries from other parts of the University, and I also communicated briefly with Jane Stuart-Smith about a University data centre. I made a few further tweaks for the SciFiMedHums website for Gavin Miller and talked with Megan Coyer about her upcoming project, which is now due to commence in August, if recruitment goes to plan.
What remained of the week after all of the above I mostly spent on Mapping Metaphor duties. Ellen had sent through the text for the website (which is now likely to go live at the end of the month!) and I made the necessary additions and changes. My last task of the week was to begin to process the additional data that Susan’s students had compiled for the Scots Thesaurus project. I’ve so far managed to process two of these files and there are another few still to go, which I’ll get done on Monday.