I was on holiday last week and had quite a stack of things to do when I got back in on Monday. This included setting up a new project website for a student in Scottish Literature who had received some Carnegie funding for a project, and preparing for an interview panel that I was on, with the interview taking place on Friday. I also responded to Alison Wiggins about the content management system I’d created for her Mary Queen of Scots letters project and had a discussion with someone in Central IT Services about the App and Play Store accounts that I’ve been managing for several years now. It’s looking like responsibility for these might be moving to IT Services, which I think makes a lot of sense. I also gave some advice to a PhD student about archiving and preserving her website and engaged in a long email discussion with Heather Pagan of the Anglo-Norman Dictionary about sorting out their data and possibly migrating it to a new system. On Wednesday I had a meeting with the SCOSYA team about further developments of the public atlas. We decided on another few requirements and discussed timescales for the completion of the work. They’re hoping to be able to engage in some user testing in the middle of September, so I need to try to get everything completed before then. I had hoped to start on some of this on Thursday, but I was struck down by a really nasty cold that I’ve still not shaken, which made focussing on tricky tasks such as getting questionnaire areas to highlight when clicked rather difficult.
I spent most of the rest of the week working for DSL in various capacities. I’d put in a request to get Apache Solr installed on a new server so we could use this for free-text searching, and thankfully Arts IT Support agreed to do this. A lot of my week was spent preparing the data from both the ‘v2’ version of the DSL (the data outputted from the original API, but with full quotes and everything pre-generated rather than being created on the fly every time an entry is requested) and the ‘v3’ API (data taken from the editing server and outputted by a script written by Thomas Widmann) so that it could be indexed by Solr. Raymond from Arts IT Support set up an instance of Solr on a new server and I created scripts that went through all 90,000 DSL entries in both versions and generated full-text versions of the entries that stripped out the XML tags. For each set I created three versions – one that was ‘full text’, one that was full text without the quotations and one that was just the quotations. The script outputted this data in a format that Solr could work with and I sent this on to Raymond for indexing. The first test version I sent Raymond was just the full text, and Solr managed to index this without incident. However, the other views of the text required working with the XML a bit, and this appears to have brought in some issues with special characters that Solr doesn’t like. I’m still in the middle of sorting this out and will continue to look into it next week, but progress with the free-text searching is definitely being made and it looks like the new API will be able to offer the same level of functionality as the existing API. I also ensured I documented the process of generating all of the data from the XML files outputted by the editing system through to preparing the full text for indexing by Solr, so next time we come to update the data we will know exactly what to do.
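As a rough illustration of the stripping step, here’s a minimal Python sketch of generating the three text views from a single entry. The entry structure and the element name used for quotations (‘cit’) are assumptions for the sake of the example, not the real DSL schema:

```python
import xml.etree.ElementTree as ET

def text_views(entry_xml):
    """Return full text, full text without quotations, and quotations only.

    Assumes quotations live in <cit> elements; the real DSL markup may differ.
    """
    root = ET.fromstring(entry_xml)
    full = " ".join(" ".join(root.itertext()).split())

    # Gather the quotation text on its own
    quotes = []
    for cit in root.iter("cit"):
        quotes.extend(" ".join(cit.itertext()).split())

    # Build the 'no quotations' view by dropping <cit> elements from a fresh parse
    root2 = ET.fromstring(entry_xml)
    for parent in list(root2.iter()):
        for cit in list(parent.findall("cit")):
            parent.remove(cit)
    no_quotes = " ".join(" ".join(root2.itertext()).split())

    return {
        "fulltext": full,
        "fulltext_noquotes": no_quotes,
        "quotes_only": " ".join(quotes),
    }

entry = "<entry><form>hame</form><sense>home<cit>at hame</cit></sense></entry>"
views = text_views(entry)
```

In practice each view would then be written out as a Solr document (e.g. via its JSON update format) rather than returned as a dictionary.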
This is much better than how things previously stood, as the original API is entirely a ‘black box’ with no documentation whatsoever as to how to update the data contained therein.
Also during this time I engaged in an email conversation about managing the dictionary entries and things like cross references with Ann Ferguson and the people who will be handling the new editor software for the dictionary, and helped to migrate the email part of the DSL domain to the control of the DSL’s IT people. We’re definitely making progress with sorting out the DSL’s systems, which is really great.
I’m going to be working for just three days over the next two weeks, and all of these days will be out of the office, so I’ll just need to see how much time I have to continue with the DSL tasks, especially as work for the SCOSYA project is getting rather urgent.
I focussed on two new projects that needed my input this week, as well as working on some more established projects and attending an event on Friday. One of the new projects was the pilot project for Matthew Sangster’s Books and Borrowing project, which I started working on last week. After some manual tweaking of the script I had created to work with the data, and some manual reworking of the data itself, I managed to get 8138 rows out of 8199 uploaded, with most of the rows that failed to upload being those where blank cells were merged together – almost all being rows denoting blank pages or other rows that didn’t contain an actual record. I considered fixing the rows that failed to upload but decided that as this is still just a test version of the data there wasn’t really much point in me spending time doing so, as I will be getting a new dataset from Matthew later on in the summer anyway.
I also created and executed a few scripts that will make the data more suitable for searching and browsing. This has included writing a script that takes the ‘lent’ and ‘returned’ dates and splits these up into separate day, month and year columns, converting the month to a numeric value too. This will make it easier to search by dates, or order results by dates (e.g. grouping by a specific month or searching within a range of time). Note that where the given date doesn’t exactly fit the pattern of ‘dd mmm yyyy’ the new date fields remain unpopulated – e.g. things like ‘[blank]’ or ‘[see next line]’. There are also some dates that don’t have a day, and a few typos (e.g. ’fab’ instead of ‘feb’) that I haven’t done anything about yet.
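The splitting logic can be sketched roughly as follows (this is an illustrative Python version, not the actual script, and the three-letter month abbreviations are an assumption about the data):

```python
import re

# Map three-letter month abbreviations to numeric values
MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def split_date(lent):
    """Split a 'dd mmm yyyy' date into (day, month, year).

    Anything that doesn't fit the pattern – '[blank]', '[see next line]',
    dates without a day, typos like 'fab' – is left unpopulated.
    """
    m = re.fullmatch(r"(\d{1,2}) ([a-z]{3}) (\d{4})", lent.strip().lower())
    if not m or m.group(2) not in MONTHS:
        return (None, None, None)
    return (int(m.group(1)), MONTHS[m.group(2)], int(m.group(3)))
```

Rows that fall through with unpopulated fields can then be listed for manual checking later.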
I’ve also extracted the professor names from the ‘prof1, prof2 and prof3’ columns and have stored them as unique entries in a new ‘profs’ table. So for example ‘Dr Leechman’ only appears once in the table, even though he appears multiple times as either prof 1, 2 or 3 (225 times, in fact). There are 123 distinct profs in this table, although these will need further work as some of these are undoubtedly duplicates with slightly different forms. I’ve also created a joining table that joins each prof to each record. This matches up the record ID with the unique ID for each prof, and also stores whether the prof was listed as prof 1, 2 or 3 for each record, in case this is of any significance.
Similarly, I’ve extracted the unique normalised names from the records and have stored these in a separate table. There are 862 unique student names, and a further linking table associates each of these with one or more records. I will also need to split the student names into forename and surname in order to generate a browse list of students. It might not be possible to do this fully automatically as names like ‘Robert Stirling Junior’ and ‘Robert Stirling Senior’ would then end up with ‘Junior’ and ‘Senior’ as surnames. A list of professors ordered by surname will presumably be needed too.
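A simple way to handle the ‘Junior’ / ‘Senior’ problem is to treat those words as suffixes rather than surnames. Here’s a hypothetical sketch (the suffix list and name shapes are assumptions; real records would no doubt throw up more cases):

```python
# Suffixes that should not be mistaken for surnames – an assumed, minimal list
SUFFIXES = {"junior", "senior"}

def split_name(name):
    """Split a normalised name into (forename, surname, suffix)."""
    parts = name.split()
    suffix = ""
    if parts and parts[-1].lower() in SUFFIXES:
        suffix = parts.pop()
    forename = parts[0] if parts else ""
    surname = " ".join(parts[1:])
    return forename, surname, suffix
```

Names that don’t split cleanly could be flagged for manual review rather than guessed at.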
I have also processed the images of the 3 manuscripts that appear in the records (2, 3 and 6). This involved running a batch script to rotate the images of the manuscripts that are to be read as landscape rather than portrait and renaming all files to remove spaces. I’ve also passed the images through the Zoomify tileset generator so now have tiles at various zoom levels for all of the images. I’m not going to be able to do any further work on this pilot project until the end of July, but it’s good to get some of the groundwork done.
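The renaming part of such a batch job is trivial to sketch; the rotation itself would need an imaging library (Pillow or ImageMagick, say), so only the web-safe renaming is shown here, and the folder layout is assumed:

```python
from pathlib import Path

def normalised_name(filename):
    """Make a filename web-safe by replacing spaces with underscores."""
    return filename.replace(" ", "_")

def rename_images(folder):
    # Rename every JPEG in the folder in place
    for path in Path(folder).glob("*.jpg"):
        path.rename(path.with_name(normalised_name(path.name)))
```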
The second new project I worked on this week was Alison Wiggins’s Account Books project. Alison had sent me an Access database containing records relating to the letters of Mary, Queen of Scots and she wanted me to create an online database and content management system out of this, to enable several researchers to work on the data together. Alison wanted this to be ready to use in July, which meant I had to try and get the system up and running this week. I spent about two days importing the data and setting up the content management system, split across 13 related database tables. Facilities are now in place for Alison to create staff accounts and to add, edit, delete and associate information about archives, documents, editions, pages, people and places.
On Wednesday this week I had a further meeting with Marc and Fraser about the HT / OED data linking task. In preparation for the meeting I spent some time creating some queries that generated statistics about the data. There are 223,250 matched categories, and of these there are 161,378 where the number of HT and OED words are the same and 100% of words match. There are 19,990 categories where the number of HT and OED words are the same but not all words match. There are 12,796 categories where the number of HT words is greater than the number of OED words and 100% of OED words match, and 5,878 categories where the number of HT words is greater than the number of OED words and less than 100% of OED words match. There are 16,077 categories where the number of HT words is less than the number of OED words and 100% of HT words match, and in these categories there are 18,909 unmatched OED lexemes that are ‘revised’. Finally, there are 7,131 categories where the number of HT words is less than the number of OED words and less than 100% of HT words match, and in these categories there are 19,733 unmatched OED lexemes that are ‘revised’. Hopefully these statistics will help when it comes to deciding which dates to adopt from the OED data.
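The six buckets above partition the 223,250 matched categories, and the classification each query performs boils down to something like this (field names are illustrative, not the real schema):

```python
def bucket(ht_count, oed_count, matched):
    """Classify a matched category pair by word counts and overlap.

    'matched' is the number of words identical in both categories.
    """
    if ht_count == oed_count:
        return "equal_all" if matched == ht_count else "equal_partial"
    if ht_count > oed_count:
        # All OED words matched, or only some of them
        return "ht_more_all" if matched == oed_count else "ht_more_partial"
    # HT has fewer words: did all of them match?
    return "ht_fewer_all" if matched == ht_count else "ht_fewer_partial"
```

Running this over every matched pair and tallying the buckets reproduces the kind of breakdown listed above.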
On Friday I went down to Lancaster University for the Encyclopedia of Shakespeare’s Language Symposium. This was a launch event for the project and there were sessions on the new resources that the project is going to make available, specifically the Enhanced Shakespearean Corpus. This corpus consists of three parts. The ‘First Folio Plus’ is a corpus of the first folio of plays from 1623, plus a few extra plays. It has additional layers of annotation and tagging – tagging such as for part of speech and annotation such as social annotation (e.g. who was speaking to whom), gender and a social ranking. Part of speech was tagged using an adapted version of the CLAWS tagger and spelling variation was regularised using VARD2.
The second part is a corpus of comparative plays. This is a collection of plays by other authors from the same time period, which allows the comparison of Shakespeare’s language to that of his contemporaries. The ‘first folio’ has 38 plays from 1589-1613 while the comparative plays corpus has 46 plays from 1584-1626. Both are just over 1 million words in size and look at similar genres (comedy, tragedy, history) and a similar mixture of verse and prose.
The third part is the EEBO-TCP segment, which is about 300 million words over 5700 texts, which is about a quarter of EEBO. It doesn’t include Shakespeare’s texts but includes texts from five broad domains (literary, religious, administrative, instructional and informational) and many genres and allows researchers to tap into the meanings triggered in the minds of Elizabethan audiences.
The corpus uses CQPWeb, which was developed by Andrew Hardie at Lancaster, and who spoke about the resource at the event. As we use this software for some projects at Glasgow it was good to see it demonstrated and get some ideas as to how it is being used for this new project. There were also several short papers that demonstrated the sorts of research that can be undertaken using the new corpus and the software. It was an interesting event and I’m glad I attended it.
Also this week I engaged in several email discussions with DSL people about working with the DSL data, advised someone who is helping Thomas Clancy get his proposal together, scheduled an ArtsLab session on research data, provided the RNSN people with some information they need for some printed materials they’re preparing and spoke to Gerry McKeever about an interactive map he wants me to create for his project.
I seem to be heading through a somewhat busy patch at the moment, and had to focus my efforts on five major projects and several other smaller bits of work this week. The major projects were SCOSYA, Books and Borrowing, DSL, HT and Bess of Hardwick’s Account books. For SCOSYA I continued to implement the public atlas, this week focussing on the highlighting of groups. I had hoped that this would be a relatively straightforward feature to implement, as I had already created facilities to create and view groups in the atlas I’d made for the content management system. However, it proved to be much trickier than I’d anticipated as I’d rewritten much of the atlas code in order to incorporate the GeoJSON areas as well as purely point-based data, plus I needed to integrate the selection of groups and the loading of group locations with the API. My existing code for finding the markers for a specified group and adding a coloured border was just not working, and I spent a frustratingly long amount of time debugging the code to find out what had changed to stop the selection from finding anything. It turned out that in my new code I was reinstantiating the variable I was using to hold all of the point data within a function, meaning that the scope of the variable containing the data was limited to that function rather than being available to other functions. Once I figured this out it was a simple fix to make the data available to the parts of the code that needed to find and highlight relevant markers and I then managed to make groups of markers highlight or ‘unhighlight’ at the press of a button, as the following screenshot demonstrates:
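The atlas code is JavaScript, but the same pitfall is easy to show in Python terms: re-declaring the data variable inside a function creates a new local, so every other function keeps seeing the old, empty value. A toy analogue (names invented for illustration):

```python
marker_data = []

def load_markers_buggy():
    # Assignment creates a NEW local variable; the outer list is never updated
    marker_data = [{"location": "Aberdeen"}]

def load_markers_fixed():
    # Explicitly target the outer variable instead
    global marker_data
    marker_data = [{"location": "Aberdeen"}]

load_markers_buggy()
buggy_result = list(marker_data)   # still empty

load_markers_fixed()
fixed_result = list(marker_data)   # now populated
```

In the JavaScript case the equivalent fix was simply not re-declaring the variable (with `var`/`let`) inside the loading function.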
You can now select one or more groups and the markers in the group are highlighted in green. Press a group button a second time to remove the highlighting. However, there is still a lot to be done. For one thing, only the markers highlight, not the areas. It’s proving to be rather complicated to get the areas highlighted as these GeoJSON shapes are handled quite differently to markers. I spent a long time trying to get the areas to highlight without success and will need to return to this another week. I also need to implement highlighting in different colours, so each group you choose to highlight is given a different colour to the last. Also, I need to find a way to make the selected groups be remembered as you change from points to areas to both, and change speaker type, and also possibly as you change between examples. Currently the group selection resets but the selected group buttons remain highlighted, which is not ideal.
I also spent time this week on the pilot project for Matthew Sangster’s Books and Borrowing project, which is looking at University student (and possibly staff) borrowing records from the 18th century. Matthew has compiled a spreadsheet that he wants me to turn into a searchable / browsable online resource, and my first task was to extract the data from the spreadsheet, create an online database and write a script to migrate the data to this database. I’ve done this sort of task many times before, but unfortunately things are rather more complicated this time because Matthew has included formatting within the spreadsheet that needs to be retained in the online version. This includes superscript text throughout the more than 8000 records, and simply saving the spreadsheet as a CSV file and writing a script to go through each cell and upload the data won’t work, as the superscript style would be lost in the conversion to CSV. phpMyAdmin also includes a facility to import a spreadsheet in the OpenDocument format, but unfortunately this not only removes the superscript formatting but also the text that is specified as superscript.
Therefore I had to investigate other ways of getting the data out of the spreadsheet while somehow retaining the superscript formatting. The only means of doing so that I could think of was to save the spreadsheet as an HTML document, which would convert Excel’s superscript formatting into HTML superscript tags, which is what we’d need for displaying the data on a website anyway. Unfortunately the HTML generated by Excel is absolutely awful and filled with lots of unnecessary junk that I then needed to strip out manually. I managed to write a script that extracted the data (including the formatting for superscript) and import this into the online database for about 8000 of the 8200 rows, but the remainder had problems that prevented the insertion from taking place. I’ll need to think about creating multiple passes for the data when I return to it next week.
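The clean-up step amounts to walking the exported HTML, keeping the text and the `<sup>` tags, and throwing away everything else Excel adds (fonts, classes, spans and so on). A minimal Python sketch of that idea, using the standard library parser rather than whatever the actual script used:

```python
from html.parser import HTMLParser

class SupKeepingParser(HTMLParser):
    """Collect cell text, preserving only <sup>…</sup> markup."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.out.append("<sup>")

    def handle_endtag(self, tag):
        if tag == "sup":
            self.out.append("</sup>")

    def handle_data(self, data):
        self.out.append(data)

def clean_cell(html):
    p = SupKeepingParser()
    p.feed(html)
    return "".join(p.out)
```

The cleaned strings can then be inserted into the database with the superscript markup intact, ready for display.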
For the DSL this week I spent rather a lot of time engaged in email conversations with Rhona Alcorn about the tasks required to sort out the data that the team have been working on for several years and which now needs to be extracted from older systems and migrated to a new system, plus the API that I am working on. It looked like there would be a lot of work for me to do with this, but thankfully midway through the week it became apparent that the company who are supplying the new system for managing the DSL’s data have a member of staff who is expecting to do a lot of the tasks that had previously been assigned to me. This is really good news as I was beginning to worry about the amount of work I would have to do for the DSL and how I would fit this in around other work commitments. We’ll just need to see how this all pans out.
I also spent some time implementing a Boolean search for the new DSL API. I now have this in place and working for headword searches, which can be performed via the ‘quick search’ box on the test sites I’ve created. It’s possible to use Boolean AND, OR and NOT (all must be entered upper case to be picked up) and a search can be used in combination with wildcards, and speech-marks can now be used to specify an exact search. So, for example, if you want to find all the headwords beginning with ‘chang’ but wish to exclude results for ‘change’ and ‘chang’ you can enter ‘chang* NOT “change” NOT “chang”’.
OR searches are likely to bring back lots of results and at the moment I’ve not put a limit on the results, but I will do so before things go live. Also, while there are no limits on the number of Booleans that can be added to a query, results when using multiple Booleans are likely to get a little weird due to there being multiple ways a query could be interpreted. E.g. ‘Ran* OR run* NOT rancet’ still brings back ‘rancet’ because the query is interpreted as ‘get all the ‘ran*’ results OR all the ‘run*’ results so long as they don’t include ‘rancet’ – so ran* OR (run* NOT rancet). But without complicating things horribly with brackets or something similar there’s no way of preventing such ambiguity when multiple different Booleans are used.
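A toy model makes the grouping behaviour concrete. If the query is split on OR first and the AND/NOT operators are applied within each clause, ‘rancet’ survives exactly as described. This is an illustrative sketch, not the actual API code, and the headword set is invented:

```python
headwords = {"ran", "rancet", "run", "running"}

def match(pattern, words):
    """Resolve a single term, with trailing * as a wildcard."""
    if pattern.endswith("*"):
        return {w for w in words if w.startswith(pattern[:-1])}
    return {w for w in words if w == pattern}

def evaluate(query, words):
    """Split on OR first, then apply AND/NOT within each clause."""
    result = set()
    for clause in query.split(" OR "):
        tokens = clause.split()
        clause_result = match(tokens[0], words)
        i = 1
        while i < len(tokens):
            op, term = tokens[i], match(tokens[i + 1], words)
            if op == "AND":
                clause_result &= term
            elif op == "NOT":
                clause_result -= term
            i += 2
        result |= clause_result
    return result
```

Here ‘ran* OR run* NOT rancet’ keeps ‘rancet’ because the NOT only ever acts on the ‘run*’ clause; only bracketed grouping would let a user say otherwise.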
For the Historical Thesaurus I met with Marc and Fraser on Monday to discuss our progress with the HT / OED linking and afterwards continued with a number of tasks that were either ongoing or had been suggested at the meeting. This included ticking off some matches from a monosemous script, creating a new script that brings back up to 1000 random unmatched lexemes at a time for spot-checking and creating an updated Levenshtein script for lexemes, which is potentially going to match a further 5000 lexemes. I also wrote a document detailing how I think that full dates should be handled in the HT, to replace the rather messy way dates are currently recorded. We will need to decide on a method in order to get the updated dates from the OED into a comparable format.
Also this week I returned to Alison Wiggins’s Account Books project, or rather a related output about the letters of Mary, Queen of Scots. Alison had sent me a database containing a catalogue of letters and I need to create a content management system to allow her and other team members to work on this together. I’ve requested a new subdomain for this system and have begun to look at the data and will get properly stuck into this next week, all being well.
Other than these main projects I also gave feedback on Thomas Clancy’s Iona project proposal, including making some changes to the Data Management Plan, helped sort out access to logo files for the Seeing Speech project, sorted out an issue with the Editing Burns blog that was displaying no content since the server upgrade (it turns out it was using a very old plugin that was not compatible with the newer version of PHP on the server) and helped sort out some app issues. All in all a very busy week.
This was a week of many different projects, most of which required fairly small jobs, but some of which took up most of my time. I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app. I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on. I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and I had a chat with Rachel Macdonald about some further updates to the SPADE website. I also made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies. I also had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly. I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley, and made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.
I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website. There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through. I also need to look into developing an overarching timeline with contextual events.
I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock. Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image. I updated the site so that users can no longer right click on an image to save or view it. This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images. If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image. I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.
I also spent a bit of time continuing to work on the Bilingual Thesaurus. I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously. This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’. I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me. I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”. My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”). I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one. If needs be I’ll change this. Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus. I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this. I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout. Next week I’ll see about starting to integrate the data itself.
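The de-duplication rule my script applies can be sketched quite compactly: keep each language once, and where both a certain and an uncertain (‘?’-prefixed) form of the same language appear, keep the uncertain one. A Python illustration of that rule:

```python
def dedupe_languages(raw):
    """Collapse a pipe-delimited language list, preferring uncertain forms.

    e.g. 'Old English|?Old English|Middle Dutch|Middle Dutch|Old English'
    becomes ['?Old English', 'Middle Dutch'].
    """
    kept = {}
    for lang in raw.split("|"):
        base = lang.lstrip("?")
        # First occurrence wins, unless a later uncertain form overrides it
        if base not in kept or lang.startswith("?"):
            kept[base] = lang
    return list(kept.values())
```

If Louise decides the certain form should win instead, flipping the preference is a one-line change.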
This just leaves the big project of the week to discuss: the ongoing work to align the HT and OED datasets. I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week. Last week I ‘dematched’ 2412 categories that don’t have a perfect lexeme-count match and have the same parent category. I created a further script that checks how many lexemes in these potentially matched categories are the same. This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped). A percentage of the number of HT words that are matched is also displayed. If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green. If the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1 this is also considered a match. If the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1 this is also considered a match. The total matches given are 1154 out of 2412.
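That acceptance rule (perfect match, or all words on one side matched with the other side out by at most one) condenses to a small predicate, sketched here in Python with illustrative parameter names:

```python
def is_match(ht_count, oed_count, matched):
    """Accept a category pair under the counting rules described above."""
    # Perfect match: same word counts and every word matched
    if ht_count == oed_count == matched:
        return True
    # All HT words matched and the OED count is out by exactly one
    if matched == ht_count and abs(oed_count - ht_count) == 1:
        return True
    # All OED words matched and the HT count is out by exactly one
    if matched == oed_count and abs(ht_count - oed_count) == 1:
        return True
    return False
```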
I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process. There are 1407 manual matches in the system. Of these:
- 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1 and 100% of HT words match OED words, or the categories are empty)
- There are 205 rows where all words match or the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1, or the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1
- There are 122 rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
- There are 18 part of speech mismatches
- There are 267 rows where nothing matches
I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches. The following patterns were attempted:
- inhabitant of the -> inhabitant
- inhabitant of -> inhabitant
- relating to -> pertaining to
- spec. -> specific
- spec -> specific
- specific -> specifically
- assoc. -> associated
- esp. -> especially
- north -> n.
- south -> s.
- january -> jan.
- march -> mar.
- august -> aug.
- september -> sept.
- october -> oct.
- november -> nov.
- december -> dec.
- Levenshtein difference of 1
- Adding ‘ing’ onto the end
The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum. Where there is a matching category number of lexemes / last lexeme / total matched lexeme checks as above are applied and rows are colour coded accordingly.
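The shape of the pattern-matching pass is roughly as follows: try the substitutions listed above, then fall back to a Levenshtein distance of 1 or an ‘ing’ suffix. This Python sketch includes only a handful of the patterns by way of illustration:

```python
# A few of the heading substitutions listed above (not the full set)
PATTERNS = [("inhabitant of the", "inhabitant"), ("inhabitant of", "inhabitant"),
            ("relating to", "pertaining to"), ("esp.", "especially")]

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def headings_match(oed, ht):
    if oed == ht:
        return True
    for old, new in PATTERNS:
        if oed.replace(old, new) == ht:
            return True
    # Fallbacks: one edit apart, or differing only by an 'ing' suffix
    return levenshtein(oed, ht) == 1 or oed + "ing" == ht
```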
On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week. It feels like real progress is being made.
I continued to work with the data for the Bess of Hardwick account book project this week. I had intended to work on a couple of other projects that are just starting up, but there have been some delays in people getting back to me so instead I used the time to experiment with the account book data. Last week I exported the data from the original Access database into a MySQL database, and this week I set about creating an initial online resource that would enable users to browse through the data.
I took one of my Bootstrap powered prototype interfaces for the new ‘Seeing Speech’ website and adapted this as an initial interface, changing the colours and using a section of an image of one of the account book pages as a background to the header. It didn’t take long to set up, but I think it looks pretty good as a starting point.
I created ‘browse’ features that allow users to access the account book entries in a number of different ways. The ‘Entries’ page provides access to the data in ‘book’ format. It allows users to select a document, view a list of folios, then select a folio in order to view the entries found on it. The ‘Entry modes’ page lists the entry modes (‘bill’, ‘wages’ etc), along with a count of the number of entries that have the mode. Users can then click on an entry mode to view the entries that have this mode. The ‘Entry types’ page is the same but for entry types (‘money in’, ‘money out’ etc) rather than modes. The ‘Entities’ page lists the entity categories (e.g. ‘clothing’, ‘jewellery’) and the number of entities found in each. Clicking on a category allows the user to view its entities (e.g. ‘eggs’, ‘gloves’) together with a count of the number of entries each entity appears in. Users can then click on an entity to view the entries. The ‘Parties’ page lists the party status types (‘card player’, ‘borrower’ etc) and the number of parties that have been associated with the type (e.g. ‘Sir William’, ‘Anne Dalton’). Users can click on a status to view a list of the parties, together with a count of the entries they appear in, and then click on a party name to view the associated entries. The ‘Places’ page lists places together with a count of the entries these appear in, while the ‘Times’ page does something similar for times.
When viewing entries, each entry contains all of the information recorded about the entry, such as the cost in pounds, shillings and pence, the cost converted purely to pence, the main text of the entry, associated people, places, entities etc. Where something in the entry can be browsed for it appears as a link – e.g. you can click on an ‘entry type’ to see all of the other entries that have this type. I also added in a ‘total cost’ at the bottom of a page of entries, plus options to order entries by their sequence number or by their cost.
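The ‘converted to pence’ field and the per-page total both rest on pre-decimal arithmetic: 20 shillings to the pound and 12 pence to the shilling. A quick sketch of that conversion (illustrative, not the project’s actual code):

```python
def to_pence(pounds, shillings, pence):
    """Convert a pre-decimal £sd amount to pence (20s = £1, 12d = 1s)."""
    return (pounds * 20 + shillings) * 12 + pence

def total_cost(entries):
    """Sum a page of (pounds, shillings, pence) tuples in pence."""
    return sum(to_pence(*e) for e in entries)
```

For display, the total in pence can be converted back to £sd by reversing the divisions.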
On Wednesday I met with Alison Wiggins to discuss the project and the system I’d created and she seemed pretty pleased with how things are developing so far. There are still lots of things to do for the project, though, such as adding in some search functionality and some visualisations. It should be fun to get it all working.
I dealt with relatively minor issues for a number of other projects this week. This included setting up hosting for the crowdsourcing project for Scott Spurlock, making some tweaks to the SPADE website, upgrading all of the WordPress sites I manage to the latest version of WordPress, responding to a query Wendy Anderson had received relating to the Mapping Metaphor data, setting up hosting for our new thesaurus.ac.uk domain, setting up hosting for Thomas Clancy’s place-names of Kirkcudbright project and replying to an email from him about the Iona project proposal that’s still in development, and setting up a new page URL for Eleanor Lawson to use to promote the Seeing Speech website.
The rest of my week was spent on Historical Thesaurus duties. I met with Fraser on Tuesday to help him to set up a local copy of the HT database on his laptop. I’d managed to get a dump of the database from Chris, and after a little bit of time figuring out where MySQL is located on a Mac, and what the default user details are, we got all of the data uploaded and working in Fraser’s local copy of phpMyAdmin.
On Friday I had a very long but useful meeting with Marc and Fraser to discuss future updates to the HT data and the website. The meeting lasted pretty much all morning, but we discussed an awful lot, including a new thesaurus that has been developed elsewhere that we might be hosting. Marc sent the data on to me and I spent some time after the meeting looking through it and figuring out how it is structured. We also discussed moving some of my test projects that are currently located on old desktop PCs in my office onto the old HT server and how we might use this server to set up a new corpus resource. We talked about what we would host on the new thesaurus.ac.uk domain, and some conferences we might go to next year. We spent some time planning the proposal for a new thesaurus that Fraser is putting together at the moment (I can’t go into too much detail about this for now) and how we might develop an actual content management system for managing updates to the HT database, with workflows that would allow contributors to make changes and for these to then be passed to the editor for potential inclusion into the live system, and we discussed the ongoing work to join up the OED and the HT data. Following the meeting I made my updated ‘category selection’ page live. This page includes timelines and the main timeline visualisation popup, as you can see here: https://ht.ac.uk/category-selection/?qsearch=wolf
We’re meeting again next week to discuss the OED / HT data joining in more detail. I hope we can finally get this task completed sometime soon.
I spent the majority of this week working on two specific projects: Mapping Metaphor and Bess of Hardwick. For Mapping Metaphor I spent some more time thinking about the requirements and how these might translate into a usable interface to the data at the various levels that are required. I created a series of PowerPoint-based mock-ups of the map interface as follows: a ‘top level’ map where all metaphor connections would be aggregated into the three ‘general sections’ of the Historical Thesaurus, a second-level map where these three general sections are broken up into their constituent subsections and metaphor data is still aggregated into these sections, and a metaphor category level map, where all metaphor categories within an HT subsection are displayed, enabling connections from one or more metaphor categories to be explored through map interfaces similar to those I had previously created mock-ups for. I also gave further thought to the issue of time, incorporating a double-ended slider into the maps, thus enabling users to specify a range of dates that should be represented on the map. I also came up with a few possible ways in which we could make the maps more interactive – encouraging users to create metaphor groupings and save / share these, for example. I also attended a fairly lengthy meeting of the Metaphor project team, which had some useful outputs, although a lot more discussion is needed before the requirements for an actual working system can be detailed. There will be a follow-on meeting next week involving Wendy, Ellen, Marc and me and hopefully some of these details can be worked out then.
For the Bess project I reworked my previous mockup of a mobile interface for the project website based on the now finalised interface. The test site had moved URLs and I was unaware of this until the end of last week. The site at the new URL is quite different to the older one I had based my previous mobile interface on, so there was quite a bit of work to be done, updating CSS files, adding in new jQuery code to handle new situations and replacing some of the icons I had previously used with ones that were now in use on the main site. Kathy, the developer at HRI who is producing the main site, had decided to use the OpenLayers approach to the images of the letters, as opposed to Zoomify, and this decision was a real help for the mobile interface: if the Flash-based Zoomify had been chosen, the majority of mobile devices would not have been able to access the images. I spent some time reworking the OpenLayers-powered image page so it would work nicely with touchscreens too and on Thursday I was in a position to email the files required to create the mobile interface to Kathy. Hopefully it will be possible to get the mobile version of the site launched alongside the main interface.
On Friday I met with a post-graduate student in English Language who is putting together an online survey involving listening to sound clips and answering some questions about them. I was able to give her a number of suggestions for improvements and I think we had a really useful meeting. On Friday afternoon I attended a meeting to discuss the redevelopment of the Historical Thesaurus website, with Marc, Jean, Flora and Christian. It was a useful meeting and we agreed on an overall site structure and that I would base the interface for the new site on the colour scheme and logo of the old site. I’m hoping to make a start on the redevelopment of the website next week, although we still need to have further discussions about the sort of search and browse functionality that is required.
Last weekend was Easter, and I took a few additional days off to recharge the old batteries. Because of this there was no weekly update last week, and this week’s is going to be relatively short too, as I only worked Wednesday to Friday. My Easter Egg count was four this year, with only one still intact.
I spent quite a bit of time this week working on the requirements for the Mapping Metaphor website, in preparation for next week’s team meeting. Wendy emailed a first draft of a requirements document to the team last week and on Wednesday I went through this in quite some detail, picking out suggestions and questions. This took up most of the day and my document had more than 50 questions in it, which I hoped wasn’t too overwhelming or disheartening. I emailed the document and arranged to meet Wendy the following day to discuss things. We had a really useful meeting where we went through each of my questions / suggestions. In a lot of cases Wendy was able to clarify things very well and my understanding of the project increased considerably. In other cases Wendy decided that my questions needed further discussion amongst the wider group and these questions will be emailed to the team before next week’s meeting. Our meeting took about two hours and was pretty exhausting but was hugely useful. It will probably take another few meetings like this with various people before we get a more concrete set of requirements that can be used as the basis for the development of the online tool.
Also this week I revisited the website I’ve set up for Carole’s ICOS2014 conference. Daria emailed me with some further suggestions and updates and I managed to get them all implemented. An interesting one was to add multilingual support to the site. I ended up using qTranslate (http://www.qianqin.de/qtranslate/) which was really easy to set up and works very well – you just select the languages you want to be represented and then your pages / posts have different title and content boxes for each language. A simple flag based language picker works well in the front end and loads the appropriate content, adding the two letter language abbreviation to the page URL too. It’s a very nice solution.
I was also contacted this week by Patricia Iolana, who is putting together an AHRC bid. She wanted me to give feedback on the Technical Plan for the bid, and I spent some time going through the bid information and commenting on the plan.
Using the options down the left-hand side you can view the metaphors related to light, plus only those that have been categorised as ‘strong’ or ‘weak’. You can also view the combined metaphors for beauty and light – either showing all, strong, weak or only those metaphors that relate to both beauty and light.
The graph itself can be scrolled around and zoomed in and out of like Google Maps: click and hold and move the mouse to scroll, and use the scroll wheel to zoom in and out. Brighter lines and bigger dots indicate ‘strong’ metaphors. If you hover over a node you can see its ID plus the number of connections. Click on a node to view connection details in the right-hand column. If you click and hold on a node you can drag it about the screen – useful for grouping nodes or simply moving some out of the way to make room. Note that you can do this on the central node too, which you’ll almost certainly have to do on the ‘connections to both beauty and light’ graph.
I think there would be some pretty major benefits to using this script for the project:
2: The data used is in the JSON format, which can easily be constructed from database queries or CSV files – I made a simple PHP script to convert Ellen’s CSV files to the necessary format (an example of one of the source files can be viewed here: http://www.arts.gla.ac.uk/STELLA/briantest/mm/Jit/Examples/ForceDirected/light-strong.json)
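The CSV-to-JSON step that the PHP script performs could be sketched roughly as below (shown here in JavaScript for illustration; the column layout of Ellen’s files is a guess, and the node/adjacency shape follows the JIT’s force-directed examples as I remember them, so treat both as assumptions):

```javascript
// Hypothetical sketch: each CSV row holds a source category, a target
// category and a strength ('strong' or 'weak'). We build one node object
// per category and record each connection as an adjacency on its source,
// in the id/name/adjacencies shape the JIT's force-directed graph expects.
function csvToGraph(csvText) {
  const rows = csvText.trim().split('\n').map(line => line.split(','));
  const nodes = {};
  for (const [source, target, strength] of rows) {
    nodes[source] = nodes[source] || { id: source, name: source, adjacencies: [] };
    nodes[target] = nodes[target] || { id: target, name: target, adjacencies: [] };
    nodes[source].adjacencies.push({ nodeTo: target, data: { strength } });
  }
  // The graph library takes an array of node objects.
  return Object.values(nodes);
}
```

Because the output is plain JSON, the same structure can just as easily be generated from database queries later on.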
3: It would be pretty straightforward to make the graphs more interactive by taking user input and generating JSON files based on this
4: Updating the code shouldn’t be too tricky – for example in addition to showing connections in the right-hand column when a secondary node linked to both beauty and light (e.g. ‘love’) is clicked on, we can provide options to make this node the centre of a new graph, or add it as a new ‘primary node’ to display in addition to beauty and light. Another example: users could remove nodes they are not interested in to ‘declutter’ the graph.
There are some possible downsides too:
1: People might want something that looks a bit fancier (having said that, it is possible to customise all elements of the look and feel)
2: It probably won’t scale very well if you need to include a lot more data than these examples show
3: It doesn’t appear to be possible to manually define the length of certain lines (e.g. to make ‘strong’ connections appear in one circle, ‘weak’ ones further out).
4: The appearance of the graph is random each time it loads – sometimes the layout of the nodes is much nicer than other times.
5: All processing (other than the generation of the JSON source files) takes place on the client side, so low-powered devices (e.g. tablets, netbooks, old PCs) will possibly struggle with the graphs
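The ‘declutter’ idea mentioned above – letting users remove nodes they are not interested in – could be implemented as a pure transformation of the JSON data before the graph is redrawn. A minimal sketch, assuming the id/adjacencies node layout used in the JIT’s force-directed examples (the helper name is hypothetical):

```javascript
// Remove a node, and any edges pointing at it, from the graph data.
// Returns a new array so the original data is left untouched.
function removeNode(graph, id) {
  return graph
    .filter(node => node.id !== id)                              // drop the node itself
    .map(node => ({
      ...node,
      adjacencies: node.adjacencies.filter(adj => adj.nodeTo !== id), // drop edges to it
    }));
}
```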
On Wednesday I completed the required updates to ARIES, specifically adding in the ‘no highlight’ script to all exercises to avoid exercise contents being highlighted when users quickly click the exercise boxes. I also added in a facility to enable users to view the correct answers in stage 2 of the monstrous ‘further punctuation’ exercises. If you check your answers once and don’t manage to get everything right a link now appears that when clicked on highlights all the required capital letters in bold, green text and places all the punctuation in the right places.
I spent a bit of time continuing to work on the technical plan for the Bess of Hardwick follow-on project, but I’m not making particularly good progress with it. I think it’s because the deadline for getting the bid together is now the summer and it’s more difficult to complete things when there’s no imminent deadline! I will try to get this done soon though.
I returned to the ‘Grammar’ app this week and finally managed to complete all the sections of the ‘book’. I’ve now started on the exercises, but haven’t got very far with them as yet. I also started work on the Burns Timeline, after Pauline sent me the sample content during the week. I should have something to show next week.
I was on holiday on Monday and Tuesday this week – spent a lovely couple of days at a hotel on Loch Lomondside with gloriously sunny weather. On Wednesday I worked from home as I usually do, and I spent most of the day updating Exercise 1 of the ‘New Words for Old’ page of ARIES. Previously this exercise asked the user to get a friend to read out some commonly mis-spelled words but last week I recorded Mike MacMahon reading out the words with the aim of integrating these sound clips into the exercise. I completed the reworking of the exercise, using the very handy HTML5 <audio> tag to place an audio player within the web page. The <audio> tag is wonderfully simple to use and allows sound files to be played in a web page without requiring any horrible plugin such as Quicktime. It really is a massive leap forwards. Of course different browsers support (or I should say don’t support) different sound formats, so it does mean sound files need to be stored in multiple formats (MP3 and OGG cover all major browsers) but as we only have 12 very short sound clips this duplication is inconsequential.
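The MP3/OGG duplication is handled by listing both formats as `<source>` children of a single `<audio>` element – the browser plays the first format it supports, so no plugin or browser sniffing is needed. A small sketch that builds such markup (the file name is hypothetical):

```javascript
// Build an <audio> tag with MP3 and OGG fallbacks. The browser picks
// the first <source> whose type it can play, so between them these two
// formats cover all the major browsers.
function audioTag(baseName) {
  return [
    '<audio controls>',
    `  <source src="${baseName}.mp3" type="audio/mpeg">`,
    `  <source src="${baseName}.ogg" type="audio/ogg">`,
    '</audio>',
  ].join('\n');
}
```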
Originally I had intended for the exercise to have a sound player and then a simple text box where users could enter their spelling using their device’s default keyboard. However, I realised that this wouldn’t work as web browsers and smartphone onscreen keyboards tend to have inbuilt spell-checkers that would auto-correct or highlight any mis-spelled words, thus defeating the purpose of the exercise. Instead I created my own onscreen keyboard for the exercise. Users have to press on a letter and then it appears in the ‘answer’ section of the page. It’s not as swish as a smartphone’s inbuilt onscreen keyboard and it is a bit slow at registering key presses, but I think for the exercise it should be sufficient. You can try out the ‘app’ version of the exercise here: http://www.arts.gla.ac.uk/STELLA/briantest/aries/spelling-5-new-words-for-old.html
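The core of a custom onscreen keyboard like this is just a little state machine over an array of letters, with the check done in code rather than by the device (which is what keeps the spell-checker out of the loop). A minimal sketch, with hypothetical function names:

```javascript
// Tiny model of the exercise's answer state: on-screen key presses
// append letters, and checking compares against the target spelling
// without ever invoking the device's (spell-checking) keyboard.
function makeAnswer() {
  let letters = [];
  return {
    press: letter => letters.push(letter),      // a key on the custom keyboard
    backspace: () => letters.pop(),             // remove the last letter
    text: () => letters.join(''),               // what is shown in the answer box
    check: target => letters.join('').toLowerCase() === target.toLowerCase(),
  };
}
```

Each on-screen button would simply call `press()` with its letter and re-render the answer section.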
On Thursday morning I attended a symposium on ‘Video Games and Learning’ (see http://gameslearning.eventbrite.co.uk/) that my HATII colleague Matthew Barr had organised. It was a really excellent event, featuring three engaging speakers with quite different backgrounds and perspectives on the use of video game technology to motivate and educate learners. I managed to pick up quite a few good pieces of advice for developing interactive educational tools that could be very useful when developing future STELLA applications.
For the rest of Thursday I had a brief look at the Mapping Metaphor data that Ellen had sent me, and emailed Mike Pidd at Sheffield about making my ‘mobile Bess’ interface available through the main Bess of Hardwick site as Sheffield begin the final push towards launching the site. I spent the remainder of Thursday and a fair amount of Friday working on the technical plan for the bid for the follow-on Bess of Hardwick project. Writing the plan has been quite slow going, as in writing it I am having to think through a lot of the technical issues that will affect the project as a whole. I made some good progress though and I hope to have a first draft of the plan completed next week.
My final task of the week was to try and figure out why certain computers are giving Firewall warnings when users attempt to play the SCOTS Corpus sound clips (for example this one: http://www.scottishcorpus.ac.uk/corpus/search/document.php?documentid=1448). Marc encountered the problem on a PC in a lecture room and as he didn’t have admin rights on the PC he couldn’t accept the Firewall exception and therefore couldn’t play the sound clips. I’ve discovered that there must be an issue with Quicktime or the .mov files themselves as the Firewall warning still pops up even when you save the sound file to the desktop and play it directly through Quicktime rather than through the browser.
Rather strangely I downloaded a sample .mov file from somewhere else and it works fine, which does lead me to believe there may be an issue with a codec. I’ve asked Arts Support to check whether Quicktime on the PC needs an update, although its version number suggests that this isn’t the case. I’ve also looked through the SCOTS documentation to see if there is any mention of codecs but there’s no indication that anything unusual was used. I will continue to investigate this next week.
I seemed to work a little bit on many different projects this week. For Burns I wrote up my notes from last week’s meeting and did a bit more investigation into timelines and Timeglider in particular. I also noticed that there is already a Burns timeline on the new ‘Robert Burns Birthplace Museum’ website: http://www.burnsmuseum.org.uk/collections/timeline. It looks very nice, with a sort of ‘parallax scrolling’ effect in the background. It is however just a nice looking web page rather than being a more interactive timeline allowing users to search or focus on specific themes.
I spent a bit more time this week working on the technical plan for the follow-on project for Bess of Hardwick, although Alison is now wanting to submit this in July rather than ASAP so we have the luxury of time in which to really think some ideas through. I’m still hoping to get an initial version of the plan completed by the end of next week, however. I also spent a little time going over the mobile Bess interface I made as it looks like Sheffield might be about ready to implement my updates as they launch the main Bess site.
I also worked a little bit more on the ICOS2014 website and responsive web design. I’ve got a design I’m pretty happy with now that works on a wide variety of different screen sizes. I still need some banner images to play around with but things are looking promising.
Once I realised what was causing the problem I could replicate it on my PC and tackle the issues. Even though only 1% of web users still use IE7 I wanted to ensure ARIES worked on this older browser. It took some time but I managed to update all the exercises so they work in both old and new browsers.
Also for ARIES this week I recorded Mike MacMahon speaking some words that I will use in an exercise in the spelling section of ARIES. Users will be able to play sound clips to hear a word being spoken and then they will be asked to type the word as they think it should be spelled. It was my first experience of the Sound Studio and Rachel Smith very kindly offered to show me how everything worked. On Friday we did the recordings and everything went pretty smoothly. Now I need to make the exercise and embed the sound files!
Also this week I attended a HATII developers meeting. This was a chance for the techy people involved in digital humanities projects to get together to discuss their projects and the technologies they are using. Chris from Arts Support was also there and it was really useful to hear from the other developers. It is hoped that these meetings will become a regular occurrence and will be expanded out to all developers working in DH across the university. We should also be getting a mailing list set up for DH developers, and also a wiki or other such collaborative environment. I also pointed people in the direction of the new DH at Glasgow website and asked people to sign up to this as technical experts.
Finally this week I did a little bit more work with the data Susan Rennie sent me for the Scots Glossary project that we are putting together. I made some updates to the technical documentation I had previously sent her and mapped out in more detail a data schema for the project.
I’m on holiday on Monday and Tuesday next week but will be back to it next Wednesday.