It was another short week for me this week, as I was off on Monday, Tuesday and Wednesday. I had to spend a little time whilst I was off fixing a problem with one of our WordPress sites. Daria alerted me to the fact that the ISAS conference website was displaying nothing but a database connection error, which is obviously a fairly major problem. It turned out that one of the underlying database tables had become corrupted, but thankfully WordPress includes some tools to fix such issues and after figuring that out I managed to get the site up and running again. I’m not sure what caused the problem but hopefully it won’t happen again.
I attended to some further WordPress duties later in the week when I was back at work. I’ve set up a conference website for Sean Adams in Theology and he wanted a registration page to be added. I was supposed to meet with his RA on Thursday to discuss the website but unfortunately she was ill, so I just had to get things set up without any formal discussions. I investigated a couple of event management plugins for WordPress, but the ones I tried seemed a bit too big and clunky for what we need. All we need is a single registration page for one event, but the plugins provide facilities to publish multiple events, manage payments, handle different ticket types and so on. It was all far too complicated, yet at the same time it seemed rather difficult to customise some fairly obvious things such as which fields are included in the registration form. After trying two plugins and being dissatisfied with both of them I settled for a contact form that emails Sean and the RA whenever someone registers. It’s not the ideal setup, but for a relatively small event with very little set-up time it should work out ok.
I had some further AHRC review duties to take care of this week, which took up the best part of one of my available days. I also had some more iOS developer account management issues to take care of, which also took up some time. Some people elsewhere in the University are wanting to upload a paid app to the App Store, but in order to do this a further contract needs to be signed with Apple, and this needs some kind of legal approval from the University before we agree to it. I had a couple of telephone conversations with a lawyer working on behalf of the University about the contracts for Apple and also for the Google Play store. I also had email conversations with Megan Coyer and Gavin Miller about the development of their respective online resources and spoke to Magda and Susan regarding some Scots Thesaurus issues.
On Friday morning I had a meeting with Gerry Carruthers and Catriona MacDonald to discuss their ‘People’s Voice’ project, which is due to start in January and for which I had been assigned two months of effort. We had a really useful meeting, going over some initial requirements for the database of songs and poems that they want to put together and thinking about how their anthology of songs will be marked up and managed. We also agreed that Mark Herraghty would do the bulk of the development work for the project. Mark knows a lot more about XML markup than I do which makes him a perfect fit for the project so this is really good news.
This was all I managed to squeeze into my two days of work this week. I didn’t get a chance to do any real development work but hopefully next week I’ll be able to get stuck back into it.
After fixing this issue I continued to work with the visualisations. I added in an option to show or hide the words in a category, as the ‘infobox’ was taking up quite a lot of space when viewing a category that contains a lot of words. I also developed a first version of the part of speech selector. This displays the available parts of speech as checkboxes above the visualization and allows the user to select which parts to view. Ticking or unticking a box automatically updates the visualization. The feature is still unfinished and there are some aspects that need sorted. For example, the listed parts of speech only include those present at the current level in the hierarchy, but sometimes a broader range of parts exists lower down and these are not available to choose until the user browses down to that level. I’m still uncertain as to whether multiple parts of speech in one visualisation is going to work very well or whether a simpler switch from one part to another might work better, but we’ll see how it goes.
I also spent a bit of time on the Medical Humanities Network website, continuing to add new features to it and I set up a conference website for Sean Adams in Theology. This is another WordPress powered site but Sean wanted it to look like the University website. A UoG-esque theme for WordPress had been created a few years ago by Dave Beavan and then subsequently tweaked by Matt Barr, but the theme was rather out of date and didn’t look exactly like the current University website so I spent some time updating the theme, which will probably see some use on other websites too. This one, for example.
This week I continued to work on the projects I’d started work on again last week after launching the three Old English resources. For the Science Fiction and the Medical Humanities project I completed a first draft of all of the management scripts that are required for managing the bibliographic data that will be published through the website. It is now possible to manage all of the information relating to bibliographical items through the WordPress interface, including adding, editing and deleting mediums, themes, people, places and organisations. The only thing it isn’t possible to do is to update the list of options that appear in the ‘connection’ drop-down lists when associating people, places and organisations. But I can very easily update these lists directly through the database and the new information then appears wherever it is required so this isn’t going to be a problem.
Continuing on the Medical Humanities theme, I spent about a day this week starting work on the new Medical Humanities network website and content management system for Megan Coyer. This system is going to be an adaptation of the existing Digital Humanities network system. Most of my time was spent on ‘back end’ stuff like setting up the underlying database, password protecting the subdomain until we’re ready to ‘go live’ and configuring the template scripts. The homepage is in place (but without any content), it is possible to log into the system and the navigation menu is set up, but no other pages are currently in place. I spent a bit of time tidying up the interface, for example adding in more modern looking up and down arrows to the ‘log in’ box, tweaking the breadcrumb layout and updating the way links are styled to bring things more into line with the main University site.
I also spent a bit of time advising staff and undertaking some administrative work. Rhona Brown asked me for some advice on the project she is putting together and it took a little time to formulate a response to her. I was also asked by Wendy and Nikki to complete a staff time allocation survey for them, which also took a bit of time to go through. I also had an email from Adam Zachary Wyner in Aberdeen about a workshop he is putting together and I gave him a couple of suggestions about possible Glasgow participants. I’m also in the process of setting up a conference website for Sean Adams in Theology and have been liaising with the RA who is working on this with him.
Other than these matters the rest of my week was spent on two projects, the Scots Thesaurus and SAMUELS. For the Scots Thesaurus I continued to work on the visualizations. Last week I adapted an earlier visualization I had created to make it ‘dynamic’ – i.e. the contents change depending on variables passed to it by the user. This week I set about integrating this with the WordPress interface. I had initially intended to make the visualisations available as a separate tab within the main page. E.g. the standard ‘browse’ interface would be available and by clicking on the visualization tab this would be replaced in-page by the visualization interface. However, I realized that this approach wasn’t really going to work due to the limited screen space that we have available within the WordPress interface. As we are using a side panel the amount of usable space is actually quite limited and for the visualizations we need as much screen width as possible. I decided therefore to place the visualizations within a jQuery modal dialog box which takes up 90% of the screen width and height and have provided a button from the normal browse view to open this. When clicked, the visualization now loads in the dialog box, showing the current category in the centre and the full hierarchy from this point downwards spreading out around it. Previously the contents of a category were displayed in a pop-up when the user clicked on a category in the visualization, but this wasn’t ideal as it obscured the visualization itself. Instead I created an ‘infobox’ that appears to the right of the visualization and I’ve set this up so that it lists the contents of the selected category, including words, sources, links through to the DSL and facilities to centre the visualization on the currently selected category or to browse up the hierarchy if the central node is selected.
The final thing I added in was highlighting of the currently selected node in the visualization and facilities to switch back to the textual browse option at the point at which the user is viewing the visualization. There is still some work to be done on the visualizations, for example adding in the part of speech browser, sorting out the layout and ideally providing some sort of animations between views, but things are coming along nicely.
However, I have figured out that BookwormGUI is based around the Highcharts.js library (see http://www.highcharts.com/demo/line-ajax) and I’m wondering now whether I can just use this library to connect to the Hansard data instead of trying to get Bookworm working, possibly borrowing some of the BookwormGUI code for handling the ‘limit by’ options and the ‘zoom in’ functionality (which I haven’t been able to find in the highcharts examples). I’m going to try this with the two years of Hansard data that I previously managed to extract, specifically this visualisation style: http://www.highcharts.com/stock/demo/compare. If I can get it to work the timeslider along the bottom would work really nicely.
The ISAS (International Society of Anglo-Saxonists) conference took place this week and two projects I have been working on over the past few weeks were launched at this event. The first was A Thesaurus of Old English (http://oldenglishthesaurus.arts.gla.ac.uk/), which went live on Monday. As is usual with these things there were some last minute changes and additions that needed to be made, but overall the launch went very smoothly and I’m particularly pleased with how the ‘search for word in other online resources’ feature works.
The second project that launched was the Old English Metaphor Map (http://mappingmetaphor.arts.gla.ac.uk/old-english/). We were due to launch this on Thursday but due to illness the launch was bumped up to Tuesday instead. Thankfully I had completed everything that needed sorting out before Tuesday so making the resource live was a very straightforward process. I think the map is looking pretty good and it complements the main site nicely.
With these two projects out of the way I had to spend about a day this week on AHRC duties, but once all that was done I could breathe a bit of a sigh of relief and get on with some other projects that I haven’t been able to devote much time to recently due to other commitments. The first of these was Gavin Miller’s Science Fiction and the Medical Humanities project. I’m developing a WordPress based tool for his project to manage a database of sources and this week I continued adding functionality to this tool as follows:
- I removed the error messages that were appearing when there weren’t any errors
- I’ve replaced ‘publisher’ with a new entity named ‘organisation’. This allows the connection the organisation has with the item (e.g. Publisher, Studio) to be selected in the same way as connections to items from places and people are handled.
- I’ve updated the way in which these connections are pulled out of the database to make it much easier to add new connection types. After adding a new connection type to the database this then immediately appears as a selectable option in all relevant places in the system.
- I’ve updated the underlying database so that data can have an ‘active’ or ‘deleted’ state, which will allow entities like people and places to be ‘deleted’ via WordPress but still retained in the underlying database in case they need to be reinstated.
- I’ve begun work on the pages that will allow the management of types and mediums, themes, people, places and organisations. Currently there are new menu items that provide options to list these data types. The lists also include counts of the number of bibliographic items each row is connected to. The next step will be to add in facilities to allow admin users to edit, delete and create types, mediums, themes, people, places and organisations.
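The ‘active’ / ‘deleted’ state mentioned above is the classic soft-delete pattern. Here’s a minimal sketch of the idea using SQLite, with table and column names that are purely illustrative rather than the project’s actual schema:

```python
import sqlite3

# Illustrative soft-delete sketch: rows are never removed, only flagged.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, "
    "status TEXT DEFAULT 'active')"
)
conn.execute("INSERT INTO person (name) VALUES ('Example Author')")

# 'Deleting' via the admin interface just flips the status flag...
conn.execute("UPDATE person SET status = 'deleted' WHERE id = 1")

# ...so listings only show active rows, but the data survives
active = conn.execute(
    "SELECT name FROM person WHERE status = 'active'"
).fetchall()
deleted = conn.execute(
    "SELECT name FROM person WHERE status = 'deleted'"
).fetchall()

# Reinstating a person is just another flag flip
conn.execute("UPDATE person SET status = 'active' WHERE id = 1")
```

The benefit over a real DELETE is exactly the one described above: a person or place removed by mistake can be brought back without re-entering any data.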
The next project I worked on was the Scots Thesaurus project. Magda had emailed me saying she was having problems uploading words via CSV files and also assigning category numbers. I met with Magda on Thursday to discuss these issues and to try and figure out what was going wrong. The CSV issue was being caused by the CSV files created by Excel on Magda’s PC being given a rather unexpected MIME type. The upload script was checking the uploaded file for specific CSV MIME types but Excel was giving them a MIME type of ‘application/vnd.ms-excel’. I have no idea why this was happening, and even more strangely, when Magda emailed me one of her files and I uploaded it on my PC (without re-saving the file) it uploaded fine. I didn’t really get to the bottom of this problem, but instead I simply fixed it by allowing files of MIME type ‘application/vnd.ms-excel’ to be accepted.
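The actual check lives in a PHP WordPress plugin, but the fix amounts to widening a MIME allow-list. A sketch in Python, with the list contents and function name purely illustrative:

```python
# Illustrative MIME allow-list for a CSV upload form. Excel (at least on
# some PCs) reports CSV files as 'application/vnd.ms-excel', so that type
# has to be accepted alongside the standard CSV ones.
ALLOWED_MIME_TYPES = {
    "text/csv",
    "text/plain",
    "application/csv",
    "application/vnd.ms-excel",  # what Excel was reporting for .csv files
}

def is_acceptable_upload(mime_type: str) -> bool:
    # Reject anything not explicitly on the list
    return mime_type in ALLOWED_MIME_TYPES
```

Note that browser-supplied MIME types are unreliable in general (as this episode shows), so a more robust upload script would also sniff the file contents rather than trusting the type alone.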
The issue with certain category numbers not saving was being caused by deleted rows in the system. When creating a new category the system checks to see if there is already a row with the supplied number and part of speech in the system. If there is then the upload fails. However, the check wasn’t taking into consideration categories that had been deleted from within WordPress. These rows were being marked as ‘trash’ in WordPress but still existed in our non-WordPress ‘category’ table. I updated the check to link up the category table to WordPress’s posts table to check the status of the category there. Now if a category number exists but it’s associated with a WordPress post that is marked as deleted then the upload of a new row can proceed without any problems.
In addition to fixing these issues I also continued working on the visualisations for the Scots Thesaurus. Magda will be presenting the thesaurus at a conference next week and she was hoping to be able to show some visualisations of the weather data. We had previously agreed at a meeting with Susan that I would continue to work on the static visualisation I had made for the ‘Golf’ data using the d3.js ‘node-link tree’ diagram type (see http://bl.ocks.org/mbostock/4063550). I would make this ‘dynamic’ (i.e. it would work with any data passed to it from the database and it would be possible to update the central node). Eventually we may choose a completely different visualisation approach but this is the one we will focus on for now. I spent some time adapting my ‘Golf’ visualisation to work with any thesaurus data passed to it – simply give it a category ID and a part of speech and the thesaurus structure (including subcategories) from this point downwards gets displayed. There’s still a lot of work to do on this (e.g. integrating it within WordPress) but I’m happy with the progress I’m making with it.
The last project I worked on this week was the SAMUELS Hansard data, or more specifically trying to get Bookworm set up on the test server I have access to. Previously I had managed to get the underlying database working and the test data (US Congress) installed. I had then installed the Bookworm API but I was having difficulty getting Python scripts to execute. I’m happy to report that I got to the bottom of this. After reading this post (https://www.linux.com/community/blogs/129-servers/757148-configuring-apache2-to-run-python-scripts) I realised that I had not enabled the CGI module of Apache, so even though the cgi-bin directory was now web accessible nothing was getting executed there. The second thing I realised was that I’d installed the API in a subdirectory within cgi-bin and I needed to add privileges in the Apache configuration file for this subdirectory as well as the parent directory. With that out of the way I could query the API from a web browser, which was quite a relief.
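For anyone hitting the same wall, the two missing pieces were roughly the following (an illustrative httpd.conf fragment – the paths and directory names here are examples, not the test server’s actual layout):

```apache
# 1. The CGI module has to be enabled, or nothing in cgi-bin executes
LoadModule cgi_module modules/mod_cgi.so

ScriptAlias /cgi-bin/ /var/www/cgi-bin/

<Directory "/var/www/cgi-bin">
    Options +ExecCGI
    AddHandler cgi-script .py
    Require all granted
</Directory>

# 2. The API lived in a subdirectory of cgi-bin, which needed its own
#    permissions in addition to those on the parent directory
<Directory "/var/www/cgi-bin/bookworm">
    Options +ExecCGI
    Require all granted
</Directory>
```

On Debian/Ubuntu-style setups `a2enmod cgi` followed by an Apache restart does the module-enabling step.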
Also this week I had a meeting with Gary Thomas about Jennifer Smith’s Syntactic Atlas of Scots project. Gary is the RA on the project and we met on Thursday to discuss how we should get the technical aspects of the project off the ground. It was a really useful meeting and we already have some ideas about how things will be managed. We’re not going to get started on this until next month, though, due to the availability of the project staff.
It was a four-day week this week due to the Glasgow Fair holiday. I actually worked the Monday and took the Friday off instead, and this worked out quite well as it gave me a chance to continue development of the Scots Thesaurus before we had our team meeting on the Tuesday morning. I had previously circulated a ‘to do’ list that brought together all of the outstanding technical tasks for the project, with 5 items specifically to do with the management of thesaurus data via the WordPress interface. I’m happy to report that I managed to complete all of these items. This included adding facilities to enable words associated with a category to be deleted from the system (in actual fact the word records are simply marked as ‘inactive’ in the underlying database). This option makes it a lot easier for Magda to manage the category information. I also redeveloped the way sources and URLs are stored in the system. Previously each word could have one single source (either DOST or SND) and a single URL. I’ve updated this to enable a word to have any number of associated sources and URLs, and I’ve expanded the possible source list to include the paper Scots Thesaurus too. I could have made the list of possible sources extensible to any number, but Susan thinks these three will be sufficient. Allowing multiple sources per word actually meant quite a lot of reworking of both the underlying database and the WordPress plugin I’m developing for the project, but it is all now working fine. I also updated the way connections to existing Historical Thesaurus of English categories are handled and added in an option that allows a CSV file containing words to be uploaded to a category via the WordPress admin interface. This last update should prove very useful to the people working on the project as it will enable them to compile lists of words in Excel and then upload them directly from this to a category in the online database.
On Tuesday we had a team meeting for the project and I gave a demonstration of these new features, and Magda is going to start using the system and will let me know if anything needs updated.
I spent a small amount of time this week updating the Burns website to incorporate new features that launched on the anniversary of Burns’ death on the 21st. These are an audio play about Burns forgeries (http://burnsc21.glasgow.ac.uk/burns-forgery/) and an online exhibition about the illustrations to George Thomson’s collections of songs (http://burnsc21.glasgow.ac.uk/the-illustrations-to-george-thomsons-collections/).
I continued working on the SAMUELS project this week, again trying to figure out how to get the Bookworm system working on the test server that Chris has set up for me. The script that imports the congress data into Bookworm that I left running last week successfully completed this week. The amount of data generated for this textual resource is rather large, with one of the tables consisting of over 42 million rows and another taking up 22 million rows. I still need to figure out how this data is actually queried and used to power the Bookworm visualisations, and the next step was to get the Bookworm API installed and running. The API connects to the database and allows the visualisation to query it. It’s written in Python and I spent rather a lot of time just trying to get Python scripts to execute via Apache on the server. This involved setting up a cgi-bin and ensuring Apache knows where it is and has permission to execute scripts stored there. I spent a rather frustrating few hours getting nothing but 403 Forbidden errors before realising that you have to explicitly give Apache rights to the directory in the Apache configuration file as well as updating file permissions. By the end of the week I still hadn’t managed to get Python files actually running – instead the browser just attempts to download the files. I need to continue with this next week, hopefully with the help of Chris McGlashan who was on holiday this week.
I spent the majority of the rest of the week working on the Old English version of the Metaphor Map, which we are intending to launch at the ISAS conference. This is a version of the Metaphor Map that features purely Old English related data and will sit alongside the main Mapping Metaphor website as a stand-alone interface. Here’s a summary of what I managed to complete this week:
- I’ve uploaded OE stage 5 and stage 4 data to new OE specific tables
- I identified some rows that included categories that no longer exist and following feedback from Ellen I deleted these (I think there were only 3 in total)
- I’ve replicated the existing site structure at the new OE URL and I’ve updated how the text for the ancillary pages is stored: it’s all now stored in one single PHP file which is then referenced by both the main and the OE ancillary pages. I’ve also put a check in all of the OE pages to see if OE specific text has been supplied and if so this is used instead of the main text. This should make it easier to manage all of the text.
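The fallback logic for the shared ancillary text is simple but worth spelling out. In the real site this is a single PHP file included by both sets of pages; here’s the same idea sketched in Python, with the page names and strings purely illustrative:

```python
# One central store of ancillary-page text, shared by both sites.
MAIN_TEXT = {
    "about": "About the Metaphor Map...",
    "contact": "Contact the project team...",
}

# OE-specific overrides; anything missing here falls back to the main text.
OE_TEXT = {
    "about": "About the Old English Metaphor Map...",
    # no OE 'contact' entry, so the main contact text is reused
}

def page_text(page: str, oe: bool = False) -> str:
    # Use the OE-specific text if we're on the OE site and it exists
    if oe and page in OE_TEXT:
        return OE_TEXT[page]
    return MAIN_TEXT[page]
```

This way a wording change only has to be made once, and OE-specific text only has to be written where the two sites genuinely differ.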
- I’ve created a new purple colour scheme for the OE site, plus a new purple ‘M’ favicon (unfortunately it isn’t exactly the same as the green one so I might update this)
- I’ve expanded the top bar to incorporate tabs for switching from the OE map to the main one. These are currently positioned to the left of the bar in a similar way to how the Scots Corpus links to CMSW and back work.
- The visualisation / table / card views are all now working with the OE data. Timeline has been removed as this is no longer applicable (all metaphors are OE with no specific date).
- Search and browse are also now working with the OE data.
- All reference to first dates and first lexemes has been removed, e.g. from metaphor cards, columns in the tabular view, the search options
- The metaphor card heading now says ‘OE Metaphor’ and then a number, just in case people notice the same number is used for different connections in the OE / non-OE sites.
- The text ‘(from OE to present day)’ has been added to the lexeme info in the metaphor cards.
- Where a metaphorical connection between two categories also exists in the non-OE data, a link is added to the bottom of the metaphor card with text ‘View this connection in the main metaphor map’. Pressing on this opens the non-OE map in a new tab with the visualisation showing category 1 and the connection to category 2 highlighted. The check for the existence of the connection in the non-OE data ignores strength and presents the non-OE map with both strong and weak visible. This is so that if (for example) the OE connection is weak but the main connection is strong you can still jump from one to the other.
- I’ve updated the category database to add a new column ‘OE categories completed’. The OE categories completed page will list all categories where this is set to ‘y’ (none currently)
- I’ve created staff pages to allow OE data to be managed by project staff.
Next week I’ll receive some further data to upload and after that we should be pretty much ready to launch.
I continued working on the new website for the Thesaurus of Old English (TOE) this week, which took up a couple of days in total. Whilst working on the front end I noticed that the structure of TOE is different to that of the Historical Thesaurus of English (HTE) in an important way: there are never any categories with the same number but a different part of speech. With HTE the user can jump ‘sideways’ in the hierarchy from one part of speech to another and then browse up and down the hierarchy for that part of speech, but with TOE there is no ‘sideways’ – for example if there is an adjective category that could be seen as related to a noun category at the same level these categories are given different numbers. This difference meant that plugging the TOE data into the functions I’d created for the HTE website just didn’t work very well as there were just too many holes in the hierarchy when part of speech was taken into consideration.
The solution to the problem was to update the code to ignore part of speech. I checked that there were indeed no main categories with the same number but a different part of speech (a little script I wrote confirmed this to be the case) and then updated all of the functions that generated the hierarchy, the subcategories and other search and browse features to ignore part of speech, but instead to place the part of speech beside the category heading wherever category headings appear (e.g. in the ‘browse down’ section or the list of subcategories). This approach seems to have worked out rather well and the thesaurus hierarchy is now considerably more traversable.
I managed to complete a first version of the new website for TOE, with all required functionality in place. This includes both quick and advanced searches, category selection, the view category page and some placeholder ancillary pages. At Fraser’s request I also added in the facility to search with vowel length marks. This required creating another column in the ‘lexeme search words’ table with a stricter collation setting that ensures a search involving a length mark (e.g. ‘sǣd’) only finds words that feature the length mark (e.g. ‘sæd’ would not be found). I added an option to the advanced search field allowing the user to say whether they cared about length marks or not. The default is not, but I’m sure a certain kind of individual will be very keen on searching with length marks. If this option is selected the ‘special characters’ buttons expand to include all of the vowels with length marks, thus enabling the user to construct the required form. It will be useful for people who want to find out (for example) all of the words in the thesaurus that end in ‘*ēn’ (41) as opposed to all of those words that end in ‘*en’ disregarding length marks (1546).
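The actual implementation uses an extra MySQL column with a stricter collation, but the two search behaviours can be sketched by deriving a ‘loose’ form of each word with the length marks stripped. A Python sketch (the function names are mine, not the site’s):

```python
import unicodedata

def strip_length_marks(word: str) -> str:
    # NFD decomposition splits e.g. 'ǣ' into 'æ' + combining macron
    # (U+0304); dropping the macron gives the length-mark-free form.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", stripped)

def matches(query: str, word: str, respect_length_marks: bool) -> bool:
    # Strict mode: 'sǣd' only finds 'sǣd'. Loose mode (the default on
    # the site): length marks are ignored on both sides.
    if respect_length_marks:
        return query == word
    return strip_length_marks(query) == strip_length_marks(word)
```

In MySQL terms the loose column uses a case/accent-insensitive collation while the strict column uses one (or a binary comparison) that treats ‘ǣ’ and ‘æ’ as distinct characters.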
I think we’re well on track to have the new TOE launched before the ISAS conference at the beginning of next month, which is great.
I continued working on the Scots Thesaurus project this week as well. I met with Susan and Magda on Tuesday to talk them through using the WordPress plugin I’d created for managing thesaurus categories and lexemes. Before this meeting I ran through the plugin a few times myself and noted a number of things that needed updating or improving so I spent some time sorting those things out. The meeting itself went well and I think both Susan and Magda are now familiar enough with the interface to use it. I created a ‘to do’ list containing outstanding technical tasks for the project and I’ll need to work through all of these. For example, a big thing to add will be facilities to enable staff to upload lexemes to a category through the WordPress interface via a spreadsheet. This will really help to populate the thesaurus.
I also spent a little time contributing to a Leverhulme bid application for Carole Hough and did a tiny amount of DSL work as well. I’m still no further with the Hansard visualisations though. Arts Support are going to supply me with a test server on which I should be able to install Bookworm, but I’m not sure when this is going to happen yet. I’ll chase this up on Monday.
I returned to work on Monday this week after being off sick on Thursday and Friday last week. It has been yet another busy week, the highlight of which was undoubtedly the launch of the Mapping Metaphor website. After many long but enjoyable months working on the project it is really wonderful to finally be able to link to the site. So here it is: http://www.glasgow.ac.uk/metaphor
I moved the site to its ‘live’ location on Monday and made a lot of last minute tweaks to the content over the course of the day, with everything done and dusted before the press release went out at midnight. We’ve had lots of great feedback about the site. There was a really great article on the Guardian website (which can currently be found here: http://www.theguardian.com/books/2015/jun/30/metaphor-map-charts-the-images-that-structure-our-thinking) plus we made the front page of the Herald. A couple of (thankfully minor) bugs were spotted after the launch but I managed to get those sorted on Wednesday. It’s been a very successful launch and it has been a wonderful project to have been a part of. I’m really pleased with how everything has turned out.
Other than Mapping Metaphor duties I split my time across a number of different projects. I continued working with the Thesaurus of Old English data and managed to get everything that I needed to do to the data completed. This included writing and executing a nice little script that added in the required UTF-8 length markers over vowels. Previously the data used an underscore after the vowel to note that it was a long one but with UTF-8 we can use proper length marks, so my script found words like ‘sæ_d’ and converted them all to words like ‘sǣd’. Much nicer.
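The conversion itself is a one-liner with a vowel-to-macron mapping. My actual script was part of the data-processing pipeline, but the gist looks something like this (the exact vowel set is illustrative):

```python
import re

# Precomposed macron forms for the Old English long vowels.
LONG_VOWELS = {
    "a": "ā", "e": "ē", "i": "ī", "o": "ō", "u": "ū", "y": "ȳ", "æ": "ǣ",
    "A": "Ā", "E": "Ē", "I": "Ī", "O": "Ō", "U": "Ū", "Y": "Ȳ", "Æ": "Ǣ",
}

def add_length_marks(word: str) -> str:
    # Replace each 'vowel + underscore' pair with the macron-marked vowel
    return re.sub(
        r"([aeiouyæAEIOUYÆ])_",
        lambda m: LONG_VOWELS[m.group(1)],
        word,
    )
```

Run over the whole word list, this turns every ‘sæ_d’-style form into its proper ‘sǣd’ spelling in one pass.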
I wrote and executed another script that added in all of the category cross references, and another one that checked all of the words with a ‘ge’ prefix. My final data processing script generated the search terms for the words, for example it identified word forms with brackets such as ‘eorþe(r)n’ and then generated multiple variant search words, in this case two – ‘eorþen’ and ‘eorþern’. This has resulted in a total of 57,067 search terms for the 51,470 words we have in the database.
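Each bracketed group marks an optional letter, so each one doubles the number of variant forms. A sketch of the expansion step, written from the description above rather than being the pipeline’s actual code:

```python
import re

def expand_variants(form: str) -> list:
    # Find the first bracketed group, e.g. '(r)' in 'eorþe(r)n'
    match = re.search(r"\(([^)]*)\)", form)
    if match is None:
        return [form]
    # Build the form without the optional letters and the form with them,
    # then recurse in case further bracketed groups remain
    without = form[:match.start()] + form[match.end():]
    with_ = form[:match.start()] + match.group(1) + form[match.end():]
    return expand_variants(without) + expand_variants(with_)
```

So ‘eorþe(r)n’ expands to ‘eorþen’ and ‘eorþern’, and a form with two bracketed groups would produce four search terms, which is how 51,470 words can yield 57,067 search terms.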
Once I’d completed work on the data, I spent a little bit of time on the front end for the new Thesaurus of Old English website. This is going to be structurally the same as the Historical Thesaurus of English website, just with a different colour scheme and logos. I created three different colour scheme mockups and have sent these to Fraser and Marc for consideration, plus I got the homepage working for the new site (still to be kept under wraps for now). This homepage has a working ‘random category’ feature, which shows that the underlying data is working very nicely. Next week I’ll continue with the site and hopefully will get the search and browse facilities completed.
I also returned to working on the Scots Thesaurus project this week. I spent about a day on a number of tasks, including separating out the data that originated in the paper Scots Thesaurus from the data that has been gathered directly from the DSL. I also finally got round to amalgamating the ‘tools’ database with the ‘WordPress’ database. When I began working for the project I created a tool that enables researchers to bring together the data for the Historical Thesaurus of English and the data from the Dictionary of the Scots Language in order to populate Scots Thesaurus categories with words uncovered from the DSL. Following on from this I made a WordPress plugin through which thesaurus data could be managed and published. But until this week the two systems were using separate databases, which required me to periodically manually migrate the data from the tools database to the WordPress one. I have now brought the two systems together, so it should now be possible to edit categories through the WordPress admin interface and for these updates to be reflected in the ‘tools’ interface. Similarly, any words added to categories through the ‘tools’ interface now automatically appear through the WordPress interface. I still need to fully integrate the ‘tools’ functionality with WordPress so we can get rid of the ‘tools’ system altogether, but it is much better having a unified database, even if there are still two interfaces on top of it. Other than these updates I also made a few tweaks to the public facing Scots Thesaurus website – adding in logos and things like that.
I also spent some time this week working on another WordPress plugin – this time for Gavin Miller’s Science Fiction and the Medical Humanities project. I’m creating the management scripts to allow him and his researchers to assemble a bibliographic database of materials relating to both Science Fiction and the Medical Humanities. I’ve got the underlying database created and the upload form completed. Next week I’ll get the upload form to actually upload its data. One handy thing I figured out whilst developing this plugin is how you can have multiple text areas that have the nice ‘WYSIWYG’ tools above them to enable people to add in formatted text. After lots of hunting around it turned out to be remarkably simple to incorporate, as this page explains: http://codex.wordpress.org/Function_Reference/wp_editor
The ‘scifimedhums’ website itself went live this week, so I can link to it here: http://scifimedhums.glasgow.ac.uk/
I was intending to continue with the Hansard data work this week as well. I had moved my 2 years of sample data (some 13 million rows) to my work PC and was all ready to get Bookworm up and running when I happened to notice that the software will not run on Windows (“Windows is out of the question” says the documentation). I contacted IT support to see if I could get command-line access to a server to get things working but I’m still waiting to see what they might be able to offer me.
I was struck down by a rather unpleasant, feverish throat infection this week. I managed to struggle through Wednesday, even though I should really have been in bed, but then was off sick on Thursday and Friday. It was very frustrating as I am really quite horribly busy at the moment with so many projects on the go and so many people needing advice, and I had to postpone three meetings I’d arranged for Thursday. But it can’t be helped.
I had a couple of meetings this week, one with Carole Hough to help her out with her Cogtop.org site. Whilst I was away on holiday a few weeks ago there were some problems with a multilingual plugin that we use on this site to provide content in English and Danish and the plugin had to be deactivated in order to get content added to the site. I met with Carole to discuss who should be responsible for updating the content of the site and what should be done about the multilingual feature. It turns out Carole will be updating the content herself so I gave her a quick tutorial on managing a WordPress site. I also replaced the multilingual plugin with a newer version that works very well. This plugin is called qTranslate X: https://wordpress.org/plugins/qtranslate-x/ and I would definitely recommend it.
My other meeting was with Gavin Miller, and we discussed the requirements for his bibliography of text relating to Medical Humanities and Science Fiction. I’m going to be creating a little WordPress plugin that he can use to populate the bibliography. We talked through the sorts of data that will need to be managed and Gavin is going to write a document listing the fields and some examples and we’ll take it from there.
I had hoped to be able to continue with the Hansard visualisation stuff on Wednesday this week but I just wasn’t feeling well enough to tackle it. My data extraction script had at least managed to extract frequency data for two whole years of the Commons by Wednesday, though. This may not seem like a lot of data when we have over 200 years to deal with, but it will be enough to test out how the Bookworm system will work with the data. Once I have got this test data working and I’m sure that the structure I’ve extracted the data into can be used with Bookworm, we can then think about using Cloud or Grid computing to extract chunks of the data in parallel. If we don’t take this approach it will take another two years to complete the extraction of the data!
Instead of working with Hansard, I spent most of Wednesday working with the Thesaurus of Old English data that Fraser had given to me earlier in the week. I’ll be overhauling the old ‘TOE’ website and database and Fraser has been working to get the data into a consistent format. He gave me the data as a spreadsheet and I spent some time on Wednesday creating the necessary database structure for the data and writing scripts that would be able to process and upload the data. I managed to get all of the data uploaded into the new online database, consisting of almost 22,500 categories and 51,500 lexemes. I still need to do some work on the data, specifically fixing length symbols, which currently appear in the data as underscores after the letter (e.g. eorþri_ce) when what is needed is the modern UTF8 character (e.g. eorþrīce). I also need to create the search terms for variant forms in the data too, which could prove to be a little tricky.
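The length-symbol fix could be handled with a small script along these lines. This is just a sketch in Python; the vowel-to-macron mapping below is an assumption based on the single eorþri_ce example in the data, rather than the full character inventory of the TOE spreadsheet:

```python
# Sketch: convert post-vowel underscores in the TOE data to macron vowels.
# Mapping is illustrative (based on eorþri_ce -> eorþrīce); the real data
# may need further vowel pairs adding.
MACRONS = {
    'a': 'ā', 'e': 'ē', 'i': 'ī', 'o': 'ō', 'u': 'ū', 'y': 'ȳ',
    'A': 'Ā', 'E': 'Ē', 'I': 'Ī', 'O': 'Ō', 'U': 'Ū', 'Y': 'Ȳ',
    'æ': 'ǣ', 'Æ': 'Ǣ',
}

def fix_length_marks(word):
    """Replace each 'vowel + underscore' pair with the macron form."""
    out = []
    for ch in word:
        if ch == '_' and out and out[-1] in MACRONS:
            out[-1] = MACRONS[out[-1]]
        else:
            out.append(ch)
    return ''.join(out)

print(fix_length_marks('eorþri_ce'))  # eorþrīce
```

A pass like this could be run over the lexeme column after the upload, or folded into the upload script itself.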
Other tasks I carried out this week included completing the upload of all of the student-created data for the Scots Thesaurus project, investigating the creation of the Google Play account for the STELLA apps and updating a lot of the ancillary content for the Mapping Metaphor website ahead of next week’s launch, a task which took a fair amount of time.
I spent a fair amount of time this week working on AHRC duties, conducting reviews and also finishing off the materials I’d been preparing for a workshop on technical plans. This involved writing a sample ‘bad’ plan (or at least a plan with quite a few issues with it) and then writing comments on each section stating what was wrong with it. It has been enjoyable to prepare these materials. I’ve been meaning to write a list of “dos and don’ts” for technical plans for some time and it was a good opportunity to get all this information out of my head and written down somewhere. It’s likely that a version of these materials will also be published on the Digital Curation Centre website at some point, and it’s good to know that the information will have a life beyond the workshop.
I continued to wrestle with the Hansard data this week after the problems I encountered with the frequency data last week. Rather than running the ‘mrarchive’ script that Lancaster had written in order to split a file into millions of tiny XML files I decided to write my own script that would load each line of the archived file, extract the data and upload it directly to a database instead. Steve Wattam at Lancaster emailed me some instructions and an example shell script that splits the archive files and I set to work adapting this. Each line of the archive file (in this case a 10Gb file containing the frequency data) consists of two parts, each of which is Base64 encoded. The first part is the filename and the second part is the file contents. All I needed to do for each line was split the two parts and decode each part. I would then have the filename, which includes information such as the year, month and day, plus all of the frequency data for the speech the file refers to. The frequency data consisted of a semantic category ID and a count, one per line and separated by a tab so it would be easy to split this information up and then upload each count for each category for each speech into a database table.
It took a little bit of time to get the script running successfully due to some confusion over how the two base64-encoded parts of each line were separated. In his email, Steve had said that the parts were split by ‘whitespace’, which I took to mean a space character. Unfortunately there didn’t appear to be a space character present, but looking at the encoded lines I could see that each section appeared to be split by an equals sign, so I set my script going using this. I also contacted Steve to check this was right and it turned out that by ‘whitespace’ he’d meant a tab character, and that the equals sign I was using to split the data was a padding character that couldn’t be relied upon to always be present. After hearing this I updated my script and set it off again. However, my script is unfortunately not going to be a suitable way to extract the data as its execution is just too slow for the amount of data we’re dealing with. Having started the process on Wednesday evening, it took until Sunday before the script had processed the data for one year. During this time it had extracted more than 7.5 million frequencies relating to tens of thousands of speeches, but at the current rate it would take more than two years to finish processing the 200 or so years of data that we have. A more efficient method is going to be required.
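Once you know the separator is a tab, the per-line logic described above ends up being quite short. Here’s a rough Python sketch of it; the example filename and category IDs are invented for illustration, and the real field layout may differ:

```python
import base64

# Sketch of the per-line processing: each archive line is assumed to be
# base64(filename) <tab> base64(contents), where the decoded contents
# hold one "category-ID <tab> count" pair per line.
def process_line(line):
    name_b64, data_b64 = line.rstrip('\n').split('\t', 1)
    filename = base64.b64decode(name_b64).decode('utf-8')
    contents = base64.b64decode(data_b64).decode('utf-8')
    frequencies = []
    for row in contents.splitlines():
        if row:
            cat_id, count = row.split('\t')
            frequencies.append((cat_id, int(count)))
    return filename, frequencies

# Hypothetical example line, built the same way the archive is assumed to be:
sample = (base64.b64encode(b'commons/1945/05/08/speech1.txt').decode() + '\t'
          + base64.b64encode(b'S1.1\t3\nB2\t7\n').decode())
print(process_line(sample))
```

Splitting on the tab is safe here because standard base64 output never contains one, whereas the ‘=’ sign is only padding and, as Steve pointed out, can’t be relied upon to be present.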
Following on from my meeting with Scott Spurlock last week I spent a bit of time researching crowdsourcing tools. I managed to identify three open source tools that might be suitable for Scott’s project (and potentially other projects in future).
First of all is one called PyBossa: http://pybossa.com/. It’s written in the Python programming language, which I’m not massively familiar with but have used a bit. The website links through to some crowdsourcing projects that have been created using the tool and one of them is quite similar to what Scott is wanting to do. The example project is getting people to translate badly printed German text into English, an example of which can be found here: http://crowdsourced.micropasts.org/app/NFPA-SetleyNews2/task/40476. Apparently you can create a project for free via a web interface here: http://crowdcrafting.org/ but I haven’t investigated this.
The second one is a tool called Hive that was written by the New York Times and has been released for anyone to use: https://github.com/nytlabs/hive with an article about it here: http://blog.nytlabs.com/2014/12/09/hive-open-source-crowdsourcing-framework/. This is written in ‘Go’ which I have to say I’d never heard of before so have no experience of. The system is used to power a project to crowdsource historical adverts in the NYT, and you can access this here: http://madison.nytimes.com/contribute/. It deals with categorising content rather than transcribing it, though. I haven’t found any other examples of projects that use the tool as of yet.
The third option is the Zooniverse system, which does appear to be available for download: https://github.com/zooniverse/Scribe. It’s written in Ruby, which I only have a passing knowledge of. I haven’t been able to find any examples of other projects using this software and I’m also not quite sure how the Scribe tool (which says it’s “a framework for crowdsourcing the transcription of text-based documents, particularly documents that are not well suited for Optical Character Recognition”) fits in with other Zooniverse tools, for example Panoptes (https://github.com/zooniverse/Panoptes), which says it’s “The new Zooniverse API for supporting user-created projects.” It could be difficult to get everything set up, but is probably worth investigating further.
I spent a small amount of time this week dealing with App queries from other parts of the University, and I also communicated briefly with Jane Stuart-Smith about a University data centre. I made a few further tweaks for the SciFiMedHums website for Gavin Miller and talked with Megan Coyer about her upcoming project, which is now due to commence in August, if recruitment goes to plan.
What remained of the week after all of the above I mostly spent on Mapping Metaphor duties. Ellen had sent through the text for the website (which is now likely to go live at the end of the month!) and I made the necessary additions and changes. My last task of the week was to begin to process the additional data that Susan’s students had compiled for the Scots Thesaurus project. I’ve so far managed to process two of these files and there are another few still to go, which I’ll get done on Monday.
I returned to the office on Monday to check how the Hansard data extraction was going, only to discover that our 2TB external hard drive had been completely filled without the extraction completing. I managed to extract the files that have the semantic tags (commons.tagged.v3.1 and lord.tagged.v3.1) but due to the astonishing storage overheads of the directory structures and millions of tiny XML files, the 2TB external drive just wasn’t big enough to hold the full-text files as well. The commons full-text file (commons.mr) is 36.55Gb, but when the extraction quit after using up all available space this file had already taken up 736Gb. Rather strangely, OSX’s ‘get info’ facility gives completely wrong directory size values. After taking about 15 minutes to check through the directories it reckoned the commons.mr extraction directory was taking up just 38.1Gb of disk space and the data itself was taking up just 25.78Gb. I had to run the ‘du’ command at the command line (du -hcs) to get the accurate figure of 736Gb. It makes me wonder what command ‘get info’ is using.
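The discrepancy comes down to apparent file size versus blocks actually allocated: each tiny file occupies at least one whole filesystem block (often 4KB), which is how a 36.55Gb archive can balloon to 736Gb once split into millions of small XML files, and `du` reports the allocated figure. A minimal illustration, using a made-up demo_dir path as a stand-in for the extraction folder:

```shell
# Create a directory with one tiny file in it.
mkdir -p demo_dir
printf 'tiny' > demo_dir/a.txt    # 4 bytes of actual data
ls -l demo_dir/a.txt              # apparent size: 4 bytes
# -h human-readable, -c grand total, -s summarise (no per-subdir listing):
du -hcs demo_dir                  # blocks actually allocated: far more
```

Multiply that per-file overhead by millions of speech files and the 736Gb figure stops looking so mysterious.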
I had a couple of meetings with Fraser and some chats with Marc about the Hansard data and what it is that we need to do with it, and while we do need access to all of the files I’ve been extracting, it turns out that what we really need for the bookworm visualisations we’re hoping to put together (see a similar example for the US congress here: http://bookworm.culturomics.org/congress/) is the data about the frequency of occurrence for each thematic heading in each speech. This data wasn’t actually located in the files I had previously been extracting but was in a different tar.gz file that we had received from Steve previously. I set to work extracting the data from this file, only to find that the splitting tool kept quitting out during processing.
I had decided to extract the ‘thm’ file first, as this contained the frequencies for the thematic headings, but when the commons file (commons.tagged.v3.1.mr.thm.fql.mr, which is 9.51Gb in size) was passed through the mrarchive script it quit out after processing only 210Mb. I tried this twice and it quit out at the same point each time, having only extracted some of the days from ‘commons 2000-2005’. I then tried to extract the file in Windows rather than on my Mac but encountered the same problem. I tried the other files (HT and sem frequency lists) and ran into the same problem there too. We contacted Steve at Lancaster about this and he’s given me some helpful pointers about how I can create a script that will be able to process the data from the joined file rather than having to split the file up first, and I’m going to try this approach next week.
Other than these rather frustrating Hansard matters I worked on a number of different projects this week. I spent some time on AHRC duties, undertaking more reviews plus writing materials for a workshop on writing Technical Plans that I’m creating in collaboration with colleagues in HATII. I also helped Gavin Miller with his ‘Sci-Fi and the Medical Humanities’ project, creating some graphics for the website, making banners and doing other visual things, which was good fun. I helped Vivien Williams of the Burns project with some issues she’s been having with managing images on the Burns site and I had some admin duties to perform with regards to the University’s Apple Developer account too.
I met with Craig Lamont, a PhD student who is working on a project with Murray Pittock. They are putting together an interactive historical map of Edinburgh with lots of points of interest on it and I helped Craig get this set up. We tried to get the map embedded in the University’s T4 system but unfortunately we didn’t have much success. We have since heard back from the University’s T4 people and it may be possible to embed such content using a different method. I’ll need to try this next time we meet. In the meantime I set up the map interface on Craig’s laptop and showed him how he could add new points to the map, so he will be able to add all the necessary content himself now. I also met with Scott Spurlock on Friday to discuss a project he is putting together involving Kirk records. I can’t really go into detail about this here but I’m going to be helping him to write the bid over the next few weeks.
For the Scots Thesaurus project I had a fair amount of data to format and upload. Magda had sent me some more data late last week before she went away so I added that to our databases. Two students had been working on the project over the past couple of weeks too and they had also produced some datasets which I uploaded. I met with the students and Susan on Thursday to discuss the data and to show them how it was being used in the system.
For the DSL we finally got round to going live with a number of updates, including Boolean searching and the improved search results navigation facilities. There was a scary moment on Thursday morning when the API was broken and it wasn’t possible to access any data, but Peter soon got this sorted and the new facilities are now available for all (see http://www.dsl.ac.uk/advanced-search/).
I was also involved with a few Mapping Metaphor duties this week. After working with the OE data that I had written a script to collate last week, Ellen sent me a version of the data that needed duplicate rows stripped out of it. I passed this file through a script I’d written and this reduced the number of rows from 5732 down to 2864. Ellen then realised that she needed consolidated metaphor codes too (i.e. the code noted for A>B, e.g. ‘metaphor strong’, doesn’t always correspond to the code that is recorded for B>A, e.g. ‘metaphor weak’), so I passed the data through a script that generated these codes too. All in all it’s been another busy week.
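The two passes are simple enough to sketch in Python. The row layout and the ‘strong’/‘weak’ codes below are invented for illustration; the real spreadsheet columns will differ:

```python
# Pass 1: drop exact duplicate rows, keeping first occurrence and order.
def dedupe(rows):
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

# Pass 2: for each (A, B, code) row, attach the code recorded for (B, A),
# since the strength noted for A>B doesn't always match that for B>A.
def consolidate(rows):
    lookup = {(a, b): code for a, b, code in rows}
    return [(a, b, code, lookup.get((b, a), '')) for a, b, code in rows]

rows = [('1A', '2B', 'strong'), ('2B', '1A', 'weak'), ('1A', '2B', 'strong')]
rows = dedupe(rows)        # duplicate ('1A', '2B', 'strong') removed
print(consolidate(rows))
```

Each output row then carries both directions’ codes side by side, which is roughly the consolidated form described above.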