This week I mainly worked on three projects: the Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network. For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets. Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday. By Wednesday I had ticked off most of the items, which I’ll summarise here.
Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only advanced searches for words, so I investigated and fixed this. I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to’ and ‘the’). This worked well and has bumped matches up to better colour levels, but it has resulted in some words getting matched multiple times: removing ‘to’, ‘of’ etc. can leave a form that then appears more than once. For example, in one category the OED has ‘bless’ twice (presumably an error?) and the HT has ‘bless’ and ‘bless to’. With ‘to’ removed there then appear to be more matches than there should be. However, this is not an issue when dates are also taken into consideration. I also updated the script so that categories where there are 3 matches and at least 66% of words match are promoted from orange to yellow.
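To illustrate the multiple-match issue, here is a minimal sketch of stoplist-based lexeme comparison. The function names are mine and the stoplist is only a sample; the project’s real matching runs in its own scripts.

```python
# Illustrative sketch of stoplist-based lexeme normalisation (the real
# matching lives in the project's own scripts; names here are invented).
STOPLIST = {"to", "the", "of"}

def normalise(lexeme):
    """Strip stoplist words before comparison, e.g. 'bless to' -> 'bless'."""
    return " ".join(w for w in lexeme.split() if w not in STOPLIST)

def matches(oed_words, ht_words):
    """Return the OED lexemes whose normalised form appears in the HT category."""
    ht_forms = {normalise(w) for w in ht_words}
    return [w for w in oed_words if normalise(w) in ht_forms]

# 'bless' and 'bless to' collapse to the same form, so 'bless' listed
# twice on the OED side appears to match both HT lexemes:
print(matches(["bless", "bless"], ["bless", "bless to"]))
```

This shows why comparing first dates as well as forms resolves the ambiguity: the collapsed forms are identical, but their attestation dates generally are not.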
When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off, and whether category heading pattern matching had worked properly. After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is that they contain too few words to have met the criteria for ticking off.
Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but different numbers. I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria). There are currently 19 green, 17 lime green and 25 yellow matches that could be ticked off. I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ I ticked off the few matches that there were. Most were empty categories and there were fewer than 15 in total.
Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms. We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories. It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms. This has helped to more accurately identify matched categories. I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.
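The distinction the monosemous script now makes can be sketched as follows: rather than counting a form’s senses across the whole dataset, it counts only within the unmatched categories. Category IDs and words below are invented examples.

```python
from collections import Counter

# Sketch: find forms that are monosemous *within the unmatched categories
# only*, rather than across the entire dataset. Data is illustrative.
unmatched = {
    101: ["sluice", "weir"],
    102: ["weir", "dam"],
    103: ["spillway"],
}

word_counts = Counter(w for words in unmatched.values() for w in words)
monosemous = {w for w, n in word_counts.items() if n == 1}

# 'weir' appears in two unmatched categories, so only the others qualify:
print(sorted(monosemous))  # ['dam', 'sluice', 'spillway']
```

A form that is polysemous in the full dataset can still be monosemous within the unchecked data, which is why restricting the scope identified further forms.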
I also worked on the date fingerprinting script. This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data. I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified. I ticked off all green (1556), lime green (22) and yellow (123) matches.
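The ‘fingerprint’ idea here is just the multiset of start dates in a category: two categories are candidate matches when the same dates appear the same number of times, regardless of order. A minimal sketch (data and names are illustrative, not the script’s actual code):

```python
from collections import Counter

# A category's date fingerprint: each lexeme start date with a count of
# how many times it occurs. Order of lexemes is irrelevant.
def fingerprint(start_dates):
    return Counter(start_dates)

oed_cat = [1386, 1481, 1481, 1605]
ht_cat  = [1481, 1386, 1605, 1481]

print(fingerprint(oed_cat) == fingerprint(ht_cat))  # True
```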
Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year. The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process. For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier). There are 7148 OED categories that are currently unmatched but were matched in V1. Almost 4000 of these are empty categories. There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match. But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words. It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2. For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02. There are possibly patterns in the 1504 orange matches that could be exploited too.
Finally, I updated the stats page to include information about main and subcats. Here are the current unmatched figures:
Unmatched (with POS): 8629
Unmatched (with POS and not empty): 3414
Unmatched Main Categories (with POS): 5036
Unmatched Main Categories (with POS and not empty): 1661
Unmatched Subcategories (with POS): 3573
Unmatched Subcategories (with POS and not empty): 1753
So we are getting there!
For the Bilingual Thesaurus I completed an initial version of the website this week. I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise. This might be changed again, but for now here is an example of how the resource looks:
The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters. As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’ but ‘bread’ won’t match anything because there are no words or categories with this exact text. You need to use an asterisk wildcard to find text within word or category text: ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.
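The matching rules described above can be sketched in a few lines. This is an illustrative reimplementation, not the site’s actual code, and the helper names are mine; accents are stripped via Unicode decomposition and the asterisk is translated into a regular-expression wildcard.

```python
import re
import unicodedata

# Sketch of the quick-search rules: exact match by default, '*' as the
# only wildcard, accents ignored. Hypothetical helper names.
def strip_accents(s):
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def quick_match(term, candidate):
    pattern = "^" + ".*".join(re.escape(p) for p in term.split("*")) + "$"
    return re.match(pattern, strip_accents(candidate.lower())) is not None

print(quick_match("ale", "ale"))            # exact match succeeds
print(quick_match("bread", "bread-corn"))   # no wildcard: no match
print(quick_match("bread*", "bread-corn"))  # leading match
print(quick_match("*bread", "gingerbread")) # trailing match
print(quick_match("ale", "alé"))            # accents ignored
```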
The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates and languages of origin and citation. Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period. E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.
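The ‘active in period’ rule is a standard interval-overlap check: a word matches if its attested span overlaps the searched range at all, not only if it falls entirely inside it. A minimal sketch:

```python
# A word is 'active' in the searched period if the two year ranges overlap.
def active_in(word_start, word_end, search_start, search_end):
    return word_start <= search_end and word_end >= search_start

# 'Edifier' (1100-1350) overlaps a 1330-1360 search, so it is returned:
print(active_in(1100, 1350, 1330, 1360))  # True
print(active_in(1100, 1320, 1330, 1360))  # False: fell out of use earlier
```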
As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned. Where search types allow multiple options to be selected (i.e. part of speech and languages of origin and citation), multiple selections within each list are joined by ‘OR’. E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages of origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
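The boolean logic can be sketched like this: OR within each multi-select list, AND across the different boxes, with an empty selection meaning ‘no restriction’. Field names and the example word are illustrative only.

```python
# Sketch of the search logic: OR within a list, AND across boxes.
def matches_search(word, pos_selected, lang_selected):
    pos_ok  = not pos_selected  or word["pos"] in pos_selected       # OR within list
    lang_ok = not lang_selected or word["language"] in lang_selected # OR within list
    return pos_ok and lang_ok                                        # AND across boxes

word = {"headword": "brew", "pos": "verb", "language": "Dutch"}
print(matches_search(word, {"noun", "verb"}, {"Dutch", "Flemish", "Italian"}))  # True
print(matches_search(word, {"noun"}, {"Dutch"}))  # False: part of speech fails the AND
```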
For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on PowerPoint presentations that had been sent to me. This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations; formatting the text as HTML; reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site); uploading the audio files; adding in the HTML5 audio tags to get the audio files to play; and creating the individual pages for each timeline entry / storymap entry. It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 PowerPoint slides. Still, the end result works really well, so I think it’s worth putting the effort in.
In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app. I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create. The app would be interactive and highly dependent on logging user interactions as accurately as possible. I looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, which the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough. In fact, on touchscreens more often than not a ‘tap’ wasn’t even being registered. D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are, so unfortunately it looks like I won’t be able to help out with this project. Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in that sort of timescale.
Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server. I think that’s just about everything to report.
I spent a fair amount of time this week working on the REELS project, which began last week. I set up a basic WordPress powered project website and got some network drive space set up and then on Wednesday we had a long meeting where we went over some of the technical aspects of the project. We discussed the structure of the project website and also the structure of the database that the project will require in order to record the required place-name data. I spent the best part of Thursday writing a specification document for the database and content management system which I sent to the rest of the project team for comment on Thursday evening. Next week I will update this document based on the team’s comments and will hopefully find the time to start working on the database itself.
I met with a PhD student this week to discuss online survey tools that might be suitable for the research data she was hoping to gather. I heard this week from Bryony Randall in English Literature that an AHRC proposal that I’d given her some technical advice on had been granted funding, which is great news. I had a brief meeting with the SCOSYA team this week too, mainly to discuss development of the project website. We’re still waiting on the domain being activated, but we’re also waiting for a designer to finish work on a logo for the project so we can’t do much about the interface for the project website until we get this anyway.
I also attended the ‘showcase’ session for the Digging into Data conference that was taking place at Glasgow this week. The showcase was an evening session where projects had stalls and could speak to attendees about their work. I was there with the Mapping Metaphor project, along with Wendy, Ellen and Rachael. We had some interesting and at times pretty in-depth discussions with some of the attendees and it was a good opportunity to see the sorts of outputs other projects have created with their data.
Before the event I went through the website to remind myself of how it all worked and managed to uncover a bug in the top-level visualisation: when you click on a category, yellow circles appear at the categories the one you’ve clicked on has a connection to. These circles represent the number of metaphorical connections between the two categories. What I noticed was that the size of the circles was not taking into consideration the metaphor strength that had been selected, which was giving confusing results. E.g. if there are 14 connections but only one of these is ‘strong’ and you’ve selected to view only strong metaphors, the circle size was still being based on 14 connections rather than one. Thankfully I managed to track down the cause of the error and I fixed it before the event.
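The fix amounts to counting only the connections that survive the strength filter before sizing the circle. The real code is d3.js; this Python sketch with invented data just illustrates the counting logic.

```python
# Sketch of the circle-sizing fix: the count feeding the circle size must
# respect the selected metaphor strength, not the full connection set.
connections = [{"target": "Light", "strength": "strong"}] + \
              [{"target": "Light", "strength": "weak"}] * 13

def circle_count(conns, strength_filter):
    if strength_filter == "all":
        return len(conns)
    return sum(1 for c in conns if c["strength"] == strength_filter)

print(circle_count(connections, "all"))     # 14
print(circle_count(connections, "strong"))  # 1, not 14
```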
I also spent a little bit of time further investigating the problems with the Curious Travellers server, which for some reason is blocking external network connections. I was hoping to install a ‘captcha’ on the contact form to cut down on the amount of spam that was being submitted, and the Contact Form 7 plugin has a facility to integrate Google’s ‘reCaptcha’ service. This looked like it was working very well, but for some reason when ‘reCaptcha’ was added to forms these forms failed to submit, instead giving error messages in a yellow box. The Contact Form 7 documentation suggests that a yellow box means the content has been marked as spam and therefore won’t send, but my message wasn’t spam. Removing ‘reCaptcha’ from the form allowed it to submit without any issue. I tried to find out what was causing this but have been unable to find an answer. I can only assume it is something to do with the server blocking external connections and somehow failing to receive a ‘message is not spam’ notification from the service. I think we’re going to have to look at moving the site to a different server unless Chris can figure out what’s different about the settings on the current one.
My final project this week was SAMUELS, for which I am continuing to work on the extraction of the Hansard data. Last week I figured out how to run a test job on the Grid and I split the gigantic Hansard text file into 5000 line chunks for processing. This week I started writing a shell script that will be able to process these chunks. The script needs to do the same tasks as my initial PHP script, but because of the setup of the Grid I need to write a script that will run directly in the Bash shell. I’ve never done much with shell scripting so it’s taken me some time to figure out how to write such a script. So far I have managed to write a script that takes a file as an input, goes through it one line at a time, splits each line into two sections based on the tab character, base64 decodes each section and then extracts the parts of the first section into variables. The second section is proving to be a little trickier as the decoded content includes line breaks, which seem to be ignored. Once I’ve figured out how to work with the line breaks I should then be able to isolate each tag / frequency pair, write the necessary SQL insert statement and then write this to an output file. Hopefully I’ll get this sorted next week.
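The Grid script itself is written for the Bash shell; the per-line logic can be sketched in Python for clarity. The field layout, table name and column names below are assumptions for illustration, not the actual Hansard schema.

```python
import base64

# Sketch of the per-line processing: each input line is two base64
# sections separated by a tab; the second decodes to tag/frequency
# pairs, one per embedded line. Schema and field names are invented.
def process_line(line):
    meta_b64, tags_b64 = line.rstrip("\n").split("\t")
    meta = base64.b64decode(meta_b64).decode("utf-8").split("|")
    statements = []
    for pair in base64.b64decode(tags_b64).decode("utf-8").splitlines():
        tag, freq = pair.split("\t")
        statements.append(
            "INSERT INTO hansard_tags (item_id, tag, frequency) "
            f"VALUES ('{meta[0]}', '{tag}', {int(freq)});")
    return statements

meta = base64.b64encode(b"id123|1803|commons").decode()
tags = base64.b64encode(b"AE.01\t4\nBD.02\t1").decode()
for sql in process_line(meta + "\t" + tags):
    print(sql)
```

The embedded line breaks that were being lost in the shell version are handled here by `splitlines()`; in Bash the equivalent problem is word-splitting eating the newlines unless the decoded content is quoted or read line by line.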
This week is a three day week for me as Monday was May Day and I’m off on holiday on Friday (I’ll be away all next week). When I got into work on Tuesday the first thing I set about tackling was an issue Vivien had raised with the Burns Choral website. A colleague had tried to access the site using IE and was presented with nothing but a blank white screen. I’d tested the site out previously with the current version of IE (version 11) and with it set to ‘compatibility mode’ and in the former case the site worked perfectly and in the latter case there were a couple of quirks with the menu but it was perfectly possible to access the content. I made use of a free 30-minute trial of http://www.browserstack.com/, which provides access to older versions of IE via your browser, and discovered that the blank screen issue was limited to IE versions 7 and 8, both of which were released years ago, are no longer supported by Microsoft and have a combined market share of less than 3%. However, I still wanted to get to the bottom of the problem as I didn’t like the idea of a website displaying a blank screen. I eventually managed to track down the cause, which is a bug in IE 7 and 8: when using a CSS file with an @font-face declaration the entire webpage disappears. A couple of tweaks later and IE7 and 8 users could view the site again.
Also on Tuesday I attended a project meeting for Mapping Metaphor, the first meeting since the colloquium. Things seem to be progressing quite well, and we arranged some further meetings to discuss the feedback from the test version of the visualisations and the poster for DH2014. I’m probably not going to do much more development for the project until September due to other work commitments, but will focus back in on it then. I also spent a little time this week doing some further investigations into the new SLD project that Jean would like me to tackle. I can’t really say much about it at this stage but I spent about half a day working with the data and figuring out what might be possible to do in the available time.
On Thursday I answered a query from a PhD student about online questionnaires and gave some advice on the options that are available. Although I’ve previously used and recommended SurveyMonkey I would now recommend using Google Forms instead due to it being free and the way it integrates with Google Docs to give a nice spreadsheet of results. The rest of Thursday I devoted to further DSL development, which mostly boiled down to getting text highlighting working in the entry page. Basically, if the user searches for a word or phrase then this string should be highlighted anywhere it appears within the entry. As mentioned last week, I had hoped to be able to use XSLT for this task, but what on the face of it looked very straightforward proved to be rather tricky to implement due to the limitations of XSLT version 1, which is the only version supported by PHP. Instead I decided to process highlighting using jQuery, and I managed to find a very neat little plugin that would highlight text in any element you specify (the plugin can be found here).
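What the plugin does client-side amounts to wrapping each case-insensitive occurrence of the search string in a styled element. A sketch of that logic (the class name is an assumption, not the plugin’s actual markup):

```python
import re

# Sketch of search-term highlighting: wrap every case-insensitive
# occurrence of the term in a highlight span. Class name is invented.
def highlight(text, term):
    return re.sub(f"({re.escape(term)})",
                  r'<span class="highlight">\1</span>',
                  text, flags=re.IGNORECASE)

print(highlight("A Dreich day is a dreich day.", "dreich"))
```

Note the capture group preserves the original casing of each match, which matters when highlighting within real entry text.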
That’s all for this week, which proved to be pretty busy despite being only three days long. I will return to work the week beginning the 19th.
I only worked Monday and Tuesday this week as I took some additional time off over the Easter weekend, so there’s not a massive amount to report. Other than providing some technical help to a postgrad student in English Language I spent all of my available time working on DSL related matters. The biggest thing I achieved was to get a first version of the advanced search working. This doesn’t include all options, only those that are currently available through the API, so for example users can’t select a part of speech to refine their search with. But you can search for a word / phrase with wildcards, specify the search type (headwords, quotations or full text), specify the match type if it’s a headword search (whole / word / partial) and select a source (snd / dost / both). There are pop-ups for help text for each of these options, currently containing placeholder text.
A quotation or full text search displays results using the ‘highlights’ field from the API, displaying every snippet for each headword (there can be multiple snippets) with the search word highlighted in yellow. If a specific source is selected then the results take up the full width of the page rather than half the width. Advanced search options are ‘remembered’, allowing the user to return to the search form to refine their search. There is also an option to start a new advanced search.
I made a number of other smaller tweaks too, e.g. removing the ‘search results’ box from an entry when it is the only search result, and adding in pop-ups for DOST and SND in the results page (click on the links in the headings above each set of results to see these). I’ve also added in placeholder pages for the bibliography pages linked to from entries, with facilities to return to the entry from the bibliography page.
Also this week I received the shiny new iPad that I had ordered and I spent a little time getting to grips with how it works. I think it’s going to be very useful for testing websites!
This was an important week, as the new version of the Historical Thesaurus went live! I spent most of Monday moving the new site across to its proper URL, testing things, updating links, validating the HTML, putting in redirects from the old site and ensuring everything was working smoothly and the final result was unveiled at the Samuels lecture on Tuesday evening. You can find the new version here:
It was also the week that the new version of the SCOTS corpus and CMSW went live too, and I spent a lot of time on Tuesday and Wednesday working on setting up these new versions, undertaking the same tasks as I did for the HT. The new version of SCOTS and CMSW can be accessed here:
I had to spend some extra time following the relaunch of SCOTS updating the ‘Thomas Crawford’s diary’ microsite. This is a WordPress powered site and I updated the header and footer template files so as to give each page of the microsite the same header and footer as the rest of the CMSW site, which looks a lot better than the old version, which didn’t have any link back to the main CMSW site.
I had a couple of meetings this week, the first with a PhD student who is wanting to OCR some 19th century Scottish court records. I received some very helpful pointers on this from Jenny Bann, who was previously involved with the massive amount of OCR work that went on in the CMSW project and was able to give some good advice on how to improve the success rate of OCR on historical documents.
My second meeting was with a member of staff in English Literature who is putting together an AHRC bid. Her project will involve the use of Google Books and the Ngram interface, plus developing some visualisations of links between themes in novels. We had a good meeting and should hopefully proceed further with the writing of the bid in future weeks. Having not used Ngrams much I spent a bit of time researching it and playing around with it. It’s a pretty amazing system and has lots of potential for research due to the size of the corpus and also the sophisticated query tools that are on offer.
I spent the majority of this week working on two specific projects: Mapping Metaphor and Bess of Hardwick. For Mapping Metaphor I spent some more time thinking about the requirements and how these might translate into a usable interface to the data at the various levels that are required. I created a series of PowerPoint-based mock-ups of the map interface as follows: a ‘Top level’ map where all metaphor connections would be aggregated into the three ‘general sections’ of the Historical Thesaurus; a second level map where these three general sections are broken up into their constituent subsections and metaphor data is still aggregated into these sections; and a Metaphor category level map, where all Metaphor categories within an HT Subsection are displayed, enabling connections from one or more Metaphor categories to be explored through map interfaces similar to those I had previously created mock-ups for. I also gave further thought to the issue of time, incorporating a double-ended slider into the maps, thus enabling users to specify a range of dates that should be represented on the map. And I also came up with a few possible ways in which we could make the maps more interactive – encouraging users to create metaphor groupings and save / share these, for example. I also attended a fairly lengthy meeting of the Metaphor project team, which had some useful outputs, although a lot more discussion is needed before the requirements for an actual working system can be detailed. There will be a follow-on meeting next week involving Wendy, Ellen, Marc and me and hopefully some of these details can be worked out then.
For the Bess project I reworked my previous mockup of a mobile interface for the project website based on the now finalised interface. The test site had moved URLs and I was unaware of this until the end of last week. The site at the new URL is quite different to the older one I had based my previous mobile interface on, so there was quite a bit of work to be done, updating CSS files, adding in new jQuery code to handle new situations and replacing some of the icons I had previously used with ones that were now in use on the main site. Kathy, the developer at HRI who is producing the main site, had decided to use the OpenLayers approach to the images of the letters, as opposed to Zoomify, and this decision was a real help for the mobile interface: if the Flash-based Zoomify had been chosen it would not have been possible for the majority of mobile devices to access the images. I spent some time reworking the OpenLayers powered image page so it would work nicely with touchscreens too and on Thursday I was in a position to email the files required to create the mobile interface to Kathy. Hopefully it will be possible to get the mobile version of the site launched alongside the main interface.
On Friday I met with a post-graduate student in English Language who is putting together an online survey involving listening to sound clips and answering some questions about them. I was able to give her a number of suggestions for improvements and I think we had a really useful meeting. On Friday afternoon I attended a meeting to discuss the redevelopment of the Historical Thesaurus website, with Marc, Jean, Flora and Christian. It was a useful meeting and we agreed on an overall site structure and that I would base the interface for the new site on the colour scheme and logo of the old site. I’m hoping to make a start on the redevelopment of the website next week, although we still need to have further discussions about the sort of search and browse functionality that is required.
I spent a lot of time this week on corpus related matters. I managed to get a test corpus (well, in reality just one sentence) uploaded successfully through the CQPweb interface to our test instance of the Open Corpus Workbench, which felt like a real achievement. The corpus appears to work successfully through the front end, including word frequency lists, searches and KWIC displays. There is currently no access to the full text though and this still needs to be investigated. I had a meeting this week with Stephen Barrett and Marc regarding corpus matters, which proved to be very useful. We went through a few of the existing online corpora that utilise CWB / CQPweb which gave us some good ideas and we also discovered that I’d somehow managed to check out an older version of CQPweb from the Subversion repository. After the meeting I rectified this and reinstalled the front end. I also managed to get a couple of Stephen’s Celtic texts which he’d been having trouble with installed, and by the end of the week Stephen had managed to get his complete corpus uploaded, which really is progress.
I still haven’t got access to the SCOTS corpus server and I’ve chased up IT support about this. Marc suggested that I might be able to get the necessary details from Flora and I’ll ask her next week if I’ve not heard anything further back from Chris.
Also this week I continued working on the mock-ups of the redevelopment of the STELLA applications. This week I completed mock-ups for the ‘Essentials of Old English’ and I began looking into ‘Readings in Early English’. The exercises for the Old English application threw up a number of conundrums for implementation, most notably how to deal with Old English characters ‘æ’ and ‘þ’ when users are required to input these. For the app version we will be reliant on a mobile device’s built-in touchscreen keyboard that will overlay and obscure the web page. These characters will not be available through this keyboard. I’m still pondering how best to deal with this but it might be better for the sake of simplicity if we could just let users input ‘ae’ and ‘th’ instead of these characters and have the app transform these into ‘æ’ and ‘þ’ for display.
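The substitution idea at the end of the paragraph is simple to sketch. Note it is lossy in the other direction (genuine ‘ae’ and ‘th’ sequences exist in some answers), so in practice the answer-checking would need to normalise both the user’s input and the expected answer the same way. The function and mapping below are my illustration of the proposal, not existing app code.

```python
# Sketch of the proposed input simplification: users type 'ae' and 'th'
# on a standard touchscreen keyboard and the app transforms them for
# display/comparison. Mapping and names are illustrative.
SUBSTITUTIONS = [("ae", "æ"), ("th", "þ")]

def normalise_oe(answer):
    for plain, oe in SUBSTITUTIONS:
        answer = answer.replace(plain, oe)
    return answer

print(normalise_oe("thaet"))  # þæt
```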
I had a couple of further meetings this week, one with Jeremy and then the general DROOG meeting. Both were very fruitful. I managed to speak to Jeremy about the ‘Essentials of Old English’ application as he created the content for this. He would like to rewrite a lot of the content, although as he’s a very busy man I’m not sure when this might be completed. He suggested prioritising ‘Readings in Early English’ initially. At the DROOG meeting Marc suggested that ‘ARIES’ might be the best app to develop first as the online version of this tool continues to be used widely. I think this is a good idea.
I demonstrated the mock-ups as they currently stand at the DROOG meeting and everyone seemed quite enthusiastic about them, which is encouraging. I also had further contact with Alison Wiggins about a mobile / tablet version of the Bess of Hardwick site and it looks like we are going to take this forward. We are hoping initially to make a mobile-friendly interface for the existing website (which shouldn’t take long) and we are hoping to put in a bid to develop a stand-alone app version of the letters after this.
Also this week I completed the migration of the Disability Studies Network from WordPress to Glasgow and that all seems to have worked out very smoothly. I also had my first contact with the Burns people, providing some help on the addition of sound files to the Burns blog. I’m meeting with the Burns project next week so I’ll find out more about my involvement then.
I’m still ensconced in the HATII attic this week, although I did peek once again into my new office. It’s all pretty much finished now, but there is still no furniture and I’ve been told it might take a week or two to get furniture delivered even after I’ve been to the stores to select suitable pieces. I emailed the estates people this week to see if there is any further news on a possible date of entry but I haven’t heard anything back yet. I did think I might have flown the HATII nest by now.
I split this week primarily between four projects: The SCOTS Corpus, the Open Corpus Workbench, the STELLA desktop applications and the Disability Studies Network, a project being run by two PhD students in English Literature.
A couple of weeks ago Marc and I met with Stephen Barrett who is doing some corpus work for a project in Celtic. An outcome of this was to ask IT Support for a test server where we could install the Open Corpus Workbench (http://cwb.sourceforge.net/) and this week IT Support delivered. I spent quite a bit of time this week installing the Open Corpus Workbench software and its dependencies on this server. Setting up the back end software for the corpus was relatively straightforward once dependencies such as Perl packages and a C compiler had been installed, but some issues were encountered when installing the PHP based front end for the workbench, which took some time to investigate. It turned out to be a database privileges problem – the system requires ‘Grant all’ privileges for the database it will use and this had not been selected for the database in question. After that was resolved the front end appeared to work. However, there is still a lot to be done in terms of customisation and configuration. I attempted to get the Dickens test corpus working through the front end but attempting to install it resulted in errors, specifically “Pre-indexed corpora require s-attributes text and text_id!!”, which is a bit odd as I would have expected the test corpus provided through the CWB website to be in a format that would allow it to work in the front end without any further tweaking. I still need to investigate this further.
After completing some mock-ups of an App and a Website version of the STELLA resource ‘ARIES’ last week, this week I created mock-ups for English Grammar: An Introduction. The exercises in this desktop application are very complex and text heavy and it took quite a while to come up with an app solution that would not require the user to type in lots of text. I’m quite pleased with the mock-ups I’ve created and I think they should cover most of the eventualities the original desktop based application throws up.
My final main project of the week was the Disability Studies Network. I met with Christine Ferguson and two of her PhD students who currently run a WordPress hosted blog for the Disability Studies Network. They are wanting the blog to be migrated to the University of Glasgow domain to give it a more official feel and I ran through a few options with them in collaboration with Matthew Barr from the School of Humanities, who had spoken to them initially. I agreed to set up a blog within the University domain, to apply the UoG WordPress theme that David Beavan had previously created (and modify it where necessary) and to migrate all of the content across. I am about half-way through this task and aim to have everything ready for the students to use by the end of next week.