It’s been a rather full-on week this week. It always tends to get busy for me at this time of year as staff tend to return to work after their holidays itching to get things done before the new term starts, which tends to mean lots of requests come my way. I also needed to get things done before the end of the week as I’m attending the ICEHL conference (http://www.conferences.cahss.ed.ac.uk/icehl20/) in Edinburgh all next week so won’t be able to do much regular work then.
I spent quite a bit of time on Historical Thesaurus related duties this week. This included reading through the new proposal Fraser is working on and commenting on it, and continuing the email discussion about the new thesaurus that we’re hoping to host. I met with Marc and Fraser on Tuesday morning to continue our discussions about the integration of the HT and the OED data, and we made a bit more progress on the linking up of the categories. I have to say I wasn’t feeling all that great on Tuesday, Wednesday and Thursday this week and it was a bit of a struggle to make it through these days, so I don’t feel like I contributed as much to this meeting as I would normally have done. However, struggle through the days I did, and by Friday I was thankfully back to normal again.
Fraser is presenting a session about the HT visualisations at a workshop at ICEHL next week and I was asked to write some slides about the visualisations for him to use. When it came down to writing these I figured that a few slides in isolation without background material and context would not be much use, so instead I ended up writing a document about the four types of visualisations that are currently part of the HT website. This took some time, and ended up being almost 3000 words in length, but I felt it was useful to document how and why the visualisations were created – for future reference for me if not for Fraser to use next week!
I had my PDR session on Tuesday afternoon, which took up a fair bit of time, both attending the meeting and acting on some of the outcomes from the meeting. Overall I think it went really well, though.
Gerry Carruthers emailed me this week about a new proposal he is in the middle of writing. He wanted me to supply a few sentences about the technical aspects of the proposal, so after reading through his document and sending a few questions his way I spent a bit of time writing the required sections.
I also met with Matthew Sangster from English Literature and Katie Halsey from Stirling, who are in the middle of putting a proposal together. I’m going to write the Data Management Plan for the proposal and will also undertake the development work. I can’t really go into any details about the project at this stage, but it seems like just the sort of project I enjoy being involved with, and I managed to give some suggestions and feedback on their existing documentation.
Bryony Randall from English Literature was also in touch this week, asking if I would like to participate in a workshop she’s hoping to run in the Autumn. I said I would help out and she introduced me by email to Ronan Crowley, who will also be involved in the workshop. We had an email conversation about what the workshop should contain and other such matters – a conversation that will no doubt continue over the coming weeks.
As mentioned earlier, I’ll be at the ICEHL conference next week, so the next report will be more of a conference review than a summary of work done.
I spent a fair amount of time this week working on Historical Thesaurus duties following our team meeting on Friday last week. At the meeting Marc had mentioned another thesaurus that we are potentially going to host at Glasgow, the Bilingual Thesaurus of Medieval England. Marc gave me access to the project’s data and I spent some time looking through it and figuring out how we might be able to develop an online resource for it that would be comparable to the thesauri we currently host. I met with Marc and Fraser again on Tuesday to discuss the ongoing issue of matching up the HT and OED datasets, and prior to the meeting I spent some time getting back up to speed with the issues involved and figuring out where we’d left off. I also created some CSV files containing the unmatched data for us to use.
The meeting itself was pretty useful and I came out of it with a list of several new things to do, which I focussed on for much of the remainder of the week. This included writing a script that goes through each unmatched HT category, brings back the non-OE words and compares these with the ght_lemma field of all the words in unmatched OED categories. The script outputs a table featuring information about the categories as well as the words, and I think the output will be useful for identifying unmatched categories as well as words contained therein. Also at the meeting we’d noticed that if you perform a search on the front-end that contains an apostrophe the search itself works, but following a link in the search results to a word that also contains an apostrophe wasn’t working. I added in a bit of urlencoding magic and that sorted the issue.
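The core of that comparison script can be sketched as follows. This is a minimal stand-in: the real version reads from our MySQL tables and outputs an HTML table, and all the category IDs and word forms below are invented for illustration.

```python
# Sketch of the matching idea: for each unmatched HT category, compare its
# non-OE word forms against the ght_lemma values of words in unmatched OED
# categories, and report any overlap as a candidate match.

def find_candidate_matches(ht_categories, oed_categories):
    """ht_categories maps category id -> set of non-OE word forms;
    oed_categories maps category id -> set of ght_lemma values.
    Returns (ht_id, oed_id, shared_words) tuples where forms overlap."""
    matches = []
    for ht_id, ht_words in ht_categories.items():
        for oed_id, lemmas in oed_categories.items():
            shared = ht_words & lemmas
            if shared:
                matches.append((ht_id, oed_id, shared))
    return matches

# Illustrative data only
ht = {"01.02.03": {"wolf", "hound"}, "01.02.04": {"sparrow"}}
oed = {"117482": {"wolf", "fox"}, "117490": {"owl"}}
print(find_candidate_matches(ht, oed))  # [('01.02.03', '117482', {'wolf'})]
```

Matching on shared word forms like this produces candidate pairs for manual review rather than definitive matches.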
I also created a few more scripts aimed at identifying categories and words to match (or at identifying things that will have no matches). This included a script to display unmatched HT and OED categories that have non-alphanumeric characters in them, a script creating a CSV output of HT words that excludes OE words (as the OED does not include OE words), and another script that identifies categories that have ‘pertaining to’ in their headings.
I also created a script that generated the full hierarchical pathway for each unmatched HT and OED category and then ran a Levenshtein test to figure out which OED path was the closest to which HT path (in the same part of speech). It took the best part of a morning to write the script, and the script itself took about 30 minutes to run, but unfortunately the output is not going to be much use in identifying potential matches.
For every unmatched HT category the script currently displays the OED category with the lowest Levenshtein score when comparing the full hierarchy of each. There’s very little in the way of matches that are of any value, but things might improve with some tweaking. As it stands the script generates the full HT hierarchy within the chosen POS, meaning for non-nouns the hierarchy generally doesn’t go all the way to the top. I could potentially use the noun hierarchy instead. Similarly, for the OED data I’ve kept within the POS, which means it hasn’t taken into consideration the top level OED categories that have no POS. Also, rather than generating the full hierarchy we might have more luck if we just looked at a smaller slice, for example two levels up from the current main cat, plus the full subcat hierarchy. But even this might still produce some useless results – e.g. the HT adverb ‘>South>most’ currently has as its closest match the OED adverb ‘>four>>four’ with a Levenshtein score of 6. But clearly it’s not a valid match.
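For reference, the path comparison boils down to something like the sketch below. The path strings (other than the ‘>South>most’ example above) are made up; the real script builds each path from the database within the chosen part of speech.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# The spurious match mentioned above: a low score, but clearly not a real match.
print(levenshtein(">South>most", ">four>>four"))  # 6

# Picking the closest OED path for each HT path (path strings invented)
ht_paths = {"ht1": ">animal>mammal>dog"}
oed_paths = {"oed1": ">animal>mammal>hound", "oed2": ">four>>four"}
for ht_id, hp in ht_paths.items():
    best = min(oed_paths, key=lambda o: levenshtein(hp, oed_paths[o]))
    print(ht_id, "->", best)
```

As the ‘>South>most’ example shows, a low raw distance between short paths doesn’t imply a real match, which is why the output needs weighting or filtering before it’s of much use.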
My final script was one that identifies empty HT categories (or those that only include OE words). I figured that a lot of these probably don’t need to match up to an OED category. I also included any empty OED categories (not including the ‘top level’ OED categories that have no part of speech and are empty). Out of the 12,034 unmatched HT cats 4,977 are empty or only contain OE words. Out of the 6,648 unmatched OED categories that have a POS there are 1,918 that are empty. Hopefully we can do something about ticking these off as checked at some point.
While going through this data I made a slightly worrying discovery. At the meeting we’d found an OED word that referenced an OED category ID that didn’t exist in our database, which seemed rather odd. The next day I discovered another, and I figured out what was going on: when uploading the OED data from their XML files to our database, any OED category or word that included an apostrophe silently failed to upload. This unfortunately is not good news, as it means many potential matches that should have been spotted by the countless sweeps through the data that we’ve already done have been missed due to the corresponding OED data simply not being there. I ran the XML through another script to count the OED categories and words that include apostrophes and there are 1,843 categories and 26,729 words (the latter because apostrophes in word definitions also caused words to fail to upload). It’s something we’re going to have to investigate next week. However, it does mean we should be able to match up more HT categories and words than we had previously matched, which is at least a silver lining of sorts.
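This failure pattern is consistent with SQL statements being built by string concatenation, where an apostrophe in a heading or definition breaks the statement. A parameterised insert avoids the problem entirely; here’s a small sketch using sqlite3 standing in for MySQL, with an invented table and column names.

```python
# Parameterised inserts let the database driver handle apostrophe escaping,
# so rows like "shepherd's crook" no longer fail silently.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oed_category (id INTEGER, heading TEXT)")

rows = [(1, "shepherd's crook"), (2, "bird's-eye view")]
# Placeholders (?) escape the values; building the SQL with string
# formatting would break on the embedded apostrophes.
conn.executemany("INSERT INTO oed_category VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM oed_category").fetchone()[0]
print(count)  # 2 – both rows upload despite the apostrophes
```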
Other than HT duties I did small bits of work for a number of different projects. I generated some data for Carole for the REELS project from the underlying database, and investigated a possible issue with the certainty levels for place-names (which thankfully turned out not to be an issue at all). I also responded to a couple of queries from Thomas Widmann of SLD, started to think about the new Galloway Glens place-name project and updated the images and image credits that appear on Matthew Creasy’s Decadence and Translation website.
I also spent the best part of a day preparing for this year’s P&DR process, ahead of my meeting next week.
I continued to work with the data for the Bess of Hardwick account book project this week. I had intended to work on a couple of other projects that are just starting up, but there have been some delays in people getting back to me so instead I used the time to experiment with the account book data. Last week I exported the data from the original Access database into a MySQL database, and this week I set about creating an initial online resource that would enable users to browse through the data.
I took one of my Bootstrap powered prototype interfaces for the new ‘Seeing Speech’ website and adapted this as an initial interface, changing the colours and using a section of an image of one of the account book pages as a background to the header. It didn’t take long to set up, but I think it looks pretty good as a starting point.
I created ‘browse’ features that allow users to access the account book entries in a number of different ways. The ‘Entries’ page provides access to the data in ‘book’ format. It allows users to select a document, view a list of folios, then select a folio in order to view the entries found on it. The ‘Entry modes’ page lists the entry modes (‘bill’, ‘wages’ etc), along with a count of the number of entries that have the mode. Users can then click on an entry mode to view the entries that have this mode. The ‘Entry types’ page is the same but for entry types (‘money in’, ‘money out’ etc) rather than modes. The ‘Entities’ page lists the entity categories (e.g. ‘clothing’, ‘jewellery’) and the number of entities found in each. Clicking on a category allows the user to view its entities (e.g. ‘eggs’, ‘gloves’) together with a count of the number of entries this entity appears in. Users can then click on an entity to view the entries. The ‘Parties’ page lists the party status types (‘card player’, ‘borrower’ etc) and the number of parties that have been associated with the type (e.g. ‘Sir William’, ‘Anne Dalton’). Users can click on a status to view a list of the parties, together with a count of the entries they appear in, and then click on a party name to view the associated entries. The ‘Places’ page lists places together with a count of the entries these appear in, while the ‘Times’ page does something similar for times.
When viewing entries, each entry contains all of the information recorded about the entry, such as the cost in pounds, shillings and pence, the cost converted purely to pence, the main text of the entry, associated people, places, entities etc. Where something in the entry can be browsed for it appears as a link – e.g. you can click on an ‘entry type’ to see all of the other entries that have this type. I also added in a ‘total cost’ at the bottom of a page of entries, plus options to order entries by their sequence number or by their cost.
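The ‘cost converted purely to pence’ field follows the pre-decimal rules of 20 shillings to the pound and 12 pence to the shilling, so the conversion (the function name is mine) is simply:

```python
# Convert a pre-decimal pounds/shillings/pence amount to pence:
# 1 pound = 20 shillings = 240 pence; 1 shilling = 12 pence.
def to_pence(pounds, shillings, pence):
    return pounds * 240 + shillings * 12 + pence

# e.g. an entry of £2 3s 6d
print(to_pence(2, 3, 6))  # 522
```

Storing this single pence figure alongside the original £sd values is what makes totals and ordering by cost straightforward.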
On Wednesday I met with Alison Wiggins to discuss the project and the system I’d created and she seemed pretty pleased with how things are developing so far. There are still lots of things to do for the project, though, such as adding in some search functionality and some visualisations. It should be fun to get it all working.
I dealt with relatively minor issues for a number of other projects this week. This included setting up hosting for the crowdsourcing project for Scott Spurlock, making some tweaks to the SPADE website, upgrading all of the WordPress sites I manage to the latest version of WordPress, responding to a query Wendy Anderson had received relating to the Mapping Metaphor data, setting up hosting for our new thesaurus.ac.uk domain, setting up hosting for Thomas Clancy’s place-names of Kirkcudbright project and replying to an email from him about the Iona project proposal that’s still in development, and setting up a new page URL for Eleanor Lawson to use to promote the Seeing Speech website.
The rest of my week was spent on Historical Thesaurus duties. I met with Fraser on Tuesday to help him to set up a local copy of the HT database on his laptop. I’d managed to get a dump of the database from Chris and after a little bit of time figuring out where MySQL is located on a Mac, and what the default user details are, we managed to get all of the data uploaded and working in Fraser’s local copy of phpMyAdmin.
On Friday I had a very long but useful meeting with Marc and Fraser to discuss future updates to the HT data and the website. The meeting lasted pretty much all morning, but we discussed an awful lot, including a new thesaurus that has been developed elsewhere that we might be hosting. Marc sent the data on to me and I spent some time after the meeting looking through it and figuring out how it is structured. We also discussed moving some of my test projects that are currently located on old desktop PCs in my office onto the old HT server and how we might use this server to set up a new corpus resource. We talked about what we would host on the new thesaurus.ac.uk domain, and some conferences we might go to next year. We spent some time planning the proposal for a new thesaurus that Fraser is putting together at the moment (I can’t go into too much detail about this for now), and we considered how we might develop an actual content management system for managing updates to the HT database, with workflows that would allow contributors to make changes and for these to then be passed to the editor for potential inclusion in the live system. We also discussed the ongoing work to join up the OED and the HT data. Following the meeting I made my updated ‘category selection’ page live. This page includes timelines and the main timeline visualisation popup, as you can see here: https://ht.ac.uk/category-selection/?qsearch=wolf
We’re meeting again next week to discuss the OED / HT data joining in more detail. I hope we can finally get this task completed sometime soon.
I was on holiday from Monday to Wednesday this week, but I still managed to pack a fair amount into my two days of work. We’ve started to get some feedback from members of the REELS advisory board about the online resource, so I spent a bit of time looking through that. I’m not going to address any of it until next month, though, as other people may still be trying out the site. I also fixed an issue with the CMS: When the ‘uncertain’ element had been added to a place it was then impossible to remove it through the ‘manage elements’ page. This was because the element had no assigned language, and the query expected all elements to have an associated language. I added the language ‘unknown’ to the element and this fixed the issue.
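The ‘manage elements’ bug is a common shape of problem: if the page’s query joins elements to languages with an INNER JOIN, an element with no language simply vanishes from the results. A minimal illustration of the likely cause, with an invented schema and sqlite3 standing in for the real database:

```python
# An INNER JOIN drops rows whose join key is NULL; a LEFT JOIN keeps them.
# This is why the language-less 'uncertain' element was invisible to the page.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE element (id INTEGER, name TEXT, lang_id INTEGER)")
db.execute("CREATE TABLE language (id INTEGER, name TEXT)")
db.execute("INSERT INTO language VALUES (1, 'Scots')")
db.execute("INSERT INTO element VALUES (1, 'kirk', 1), (2, 'uncertain', NULL)")

inner = db.execute(
    "SELECT e.name FROM element e JOIN language l ON e.lang_id = l.id"
).fetchall()
outer = db.execute(
    "SELECT e.name FROM element e LEFT JOIN language l ON e.lang_id = l.id"
).fetchall()
print(len(inner), len(outer))  # the INNER JOIN loses the 'uncertain' element
```

Assigning the ‘unknown’ language fixed the data so the existing query works, though switching the query to a LEFT JOIN would guard against the same thing happening again.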
I also checked through some images Matthew Creasy had sent me to be used on his new ‘Decadence and Translation’ website and read through some materials Matthew Sangster sent me relating to a proposal he’s putting together. I also responded to a query from a user of the Thesaurus of Old English that Fraser had forwarded on to me. The user suggested that the category search could be improved and I’ve thought through how the improvement might be implemented in future. In the meantime the use of asterisk wildcards should solve the user’s problem.
Marc contacted me this week to say that the Google Analytics stats for the Historical Thesaurus seem to have stopped working either since the move to HTTPS or since the move to the new domain. A bit of research suggested we need to update the site URL in ‘property settings’ in the admin interface, which I did. However, this did not immediately fix the issue and I’m going to have to keep an eye on this. Regarding new domains, I also requested some new web space for our new thesaurus.ac.uk domain, so hopefully we’ll have the beginnings of a new resource in place soon.
I met with Fraser on Friday to go through the Historical Thesaurus database with him, to give him a few pointers to running SQL queries. We went through some examples, which included exporting a lot of the data he was needing to work with anyway. We also talked about a new thesaurus related proposal that he’s putting together.
I responded to a query from Thomas Widmann relating to the structure of the database for DSL. The XML for the entries doesn’t include bibliographical reference IDs, even though these are listed (and work as links) on the website. After looking at the database it turns out that these references are stored in a separate table, so I exported the data and sent it on to Thomas.
Earlier in the week Alison Wiggins had emailed me the data she has been compiling about the Bess of Hardwick account books. She has been adding data into an Access database and there is now a fairly large number of records. We’re going to meet next week to discuss the data and how this should be presented via an online resource, and I spent some time on Friday exporting the data from Access into a MySQL database that I will then use online.
I continued with the group statistics feature for the SCOSYA project this week. Last week Gary had let me know that he was experiencing issues when using the feature with a large group he had created, so I did some checking of functionality. I created a group with 140 locations in it and tried out the feature with a variety of searches on a variety of devices, operating systems and browsers but didn’t encounter any issues. Thankfully it turned out that Gary needed to clear his browser’s cache, and with that done the feature worked perfectly for him. Gary had also reported an issue with the data export facility I created a while back for the project team to use. It was working fine if limits on the returned data were included, but gave nothing but a blank page when all the data was requested. After a bit of investigation I reached the conclusion that it must be some kind of limit imposed on the server, and a quick check with Chris revealed that when the script returned all of the data it was exceeding a memory limit. When Chris increased the limit the script began to work perfectly again.
In addition to these investigations I added a couple of new pieces of functionality to the group statistics feature. I added in the option to show or hide locations that are not part of your selected group, allowing the user to cut down on the clutter and focus on the locations that they are particularly interested in. I also added in an option to download the data relating specifically to the user’s selected locations, rather than for all locations. This meant updating the project’s API to allow any number of locations to be included in the GET request sent to the server. Unfortunately this uncovered another server setting that was preventing certain requests working. With many locations selected the URL sent to the API is very long, and in such cases the request was not fully getting through to my API scripts but was instead getting blocked by the server. Rather than processing the request, the server displayed the API’s default index page, without the CSS file properly loading. With shorter URLs the request got through fine. I checked with Chris and a setting on the server was limiting URL parameters to 512 characters in length. Chris increased this and the request got through and returned the required data. With this issue out of the way the ‘download group data’ feature worked properly. I had been making these changes on a temporary version of the atlas in the CMS, but with everything in place I moved my temporary version over to the main atlas, and all seems to be working well.
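To give a sense of scale (the parameter name and ID format here are my own invention, not the project’s actual API): a GET query string naming each of 140 selected locations comfortably blows past a 512-character limit.

```python
# Rough illustration of the long-URL problem: one four-digit id per selected
# location, joined into a single GET parameter. urlencode percent-encodes the
# commas, which makes the string even longer.
from urllib.parse import urlencode

location_ids = [f"{n:04d}" for n in range(1, 141)]  # 140 selected locations
query = urlencode({"locations": ",".join(location_ids)})
print(len(query))  # well over the 512-character limit
```

An alternative to raising the server limit would be sending the selection in a POST body instead, which sidesteps URL length restrictions altogether.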
I had a few meetings this week. The first was with someone from a start-up company who are wanting to develop some kind of transcription service. We talked about the SCOTS corpus and its time-aligned transcriptions of audio files. I’m not sure how much help I really was, however, as what they really need is a tool to create such transcriptions rather than publish them, and the SCOTS project used a different tool called Praat to do this. The guy is going to meet with Jane Stuart-Smith who should be able to give more information on this side of things, and also with Wendy Anderson who knows a bit more about the history of the SCOTS project than I do, so maybe these subsequent meetings will be more useful. I also met with Ewa Wanat, a PhD student in English Language, who is wanting to put together an app about rhythm and metre in English. I gave her some advice about the sorts of tools she could use to develop the app and showed her the ‘English Metre’ app I created last year. She already has a technical partner in mind for her project so probably won’t need me to do the actual work, but I think I was able to give her some useful advice. I also met with Scott Spurlock from Theology, for whom I will be creating a crowdsourcing tool that will be used to transcribe some records of the Church of Scotland. There has been a bit of a delay in getting the images for the project, and Scott hasn’t decided what URL he would like for the project, but once these things are sorted I’ll be able to start work on developing the tool, hopefully using some existing technologies.
Before I went away on holiday the SLD people were in touch to say that the Android version of the Scots Dictionary for Schools app had been taken down, and the person with the account details had retired without passing the account details on. We tried various approaches to get access to the account but in the end it looked like the only thing to do would be to create a new account and republish the app. Thomas Widmann set up the account just before I went away and I said I’d sort out the technical side of things when I got back to the office. On Friday this week I tackled this task. As I suspected, it took rather a long time to get all of the technologies up to date again. I don’t develop apps all that often and it seems that every time I come to develop a new one (or create a new version of an old one) the software and methodologies needed to publish an app have all changed. It took most of the morning to install the necessary software updates, and a fair bit of the afternoon to figure out how the new workflow for publishing an app would work. However, I got there in the end and by the end of the day the new version was available for download (for free) via the Google Play store. You can access the dictionary app here: https://play.google.com/store/apps/details?id=com.sld.ssd2
I’m on holiday on Monday to Wednesday next week, so next week’s report should be rather shorter.