Week Beginning 1st October 2018

This was a week of many different projects, most of which required fairly small jobs doing, but some of which required most of my time.  I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app.  I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on.  I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and I had a chat with Rachel Macdonald about some further updates to the SPADE website.  I also made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies.  I also had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly.  I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley.  I also made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.

I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website.  There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through.  I also need to look into developing an overarching timeline with contextual events.

I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock.  Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image.  I updated the site so that users can no longer right click on an image to save or view it.  This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images.  If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image.  I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.

The crowdsourcing platform uses the Scripto tool, a plugin for Omeka, and unfortunately allowing multiple transcriptions to be made for each image isn’t something the Scripto tool is set up to do.  All the examples I’ve seen have allowed multiple users to collaborate in the preparation of a single transcriptions for each page, with all edits to the transcription tracked.  However, I had a look at the source code to see how I might update the system to allow two versions of each page to be stored.  It turns out that it’s not going to be a straightforward process to implement this.  I probably could get such an update working, but it’s going to mean rewriting many different parts of the tool – not just the transcription page, but the admin pages, the ‘watch’ pages, the ‘history’ pages, the scripts that process edits and the protecting of transcriptions.  It’s a major rewrite of the code, which might break the system or introduce bugs or security issues.  Suggested to Scott that his pilot should focus on single transcriptions, as I am uncertain he will get enough transcribers to fully transcribe each page twice.  There are huge amounts of text on each page and even completing a single transcription is going to be a massive amount of effort.  I pointed out that other projects that have used Scripto for single transcriptions on much shorter texts are still ongoing, years after launching, for example http://letters1916.maynoothuniversity.ie/diyhistory/ and https://publications.newberry.org/digital/mms-transcribe/index.  However, after further thought I did come up with a way of handling multiple transcriptions purely in the front end via JavaScript.  The system architecture wouldn’t need to be updated, but I could use a marker hidden in the transcription text to denote where the first transcription ends and the second begins.  The JavaScript would then use this marker to split the single transcription up into two separate text areas in the front-end and admin interfaces.  I’ll see how Scott would prefer to proceed with this next time we speak.

I also spent a bit of time continuing to work on the Bilingual Thesaurus.  I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously.  This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’.  I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me.  I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”.  My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”).  I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one.  If needs be I’ll change this.  Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus.  I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this.  I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout.  Next week I’ll see about starting to integrate the data itself.

This just leaves the big project of the week to discuss:  the ongoing work to align the HT and OED datasets.  I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week.  Last week I ‘dematched’ 2412 categories that don’t have a perfect number of lexemes match and have the same parent category.  I created a further script that checks how many lexemes in these potentially matched categories are the same.  This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped).  A percentage of the number of HT words that are matched is also displayed.  If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green.  If the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1 this is also considered a match.  If the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1 this is also considered a match.  The total matches given are 1154 out of 2412.

I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process.  There are 1407 manual matches in the system.  Of these:

  1. 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1 and 100% of HT words match OED words, or the categories are empty)
  2. There are 205 rows where all words match or the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1, or the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1
  3. There are 122 rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
  4. There are 18 part of speech mismatches
  5. There are 267 rows where nothing matches

I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches.  The following patterns were attempted:

  • inhabitant of the -> inhabitant
  • inhabitant of -> inhabitant
  • relating to -> pertaining to
  • spec. -> specific
  • spec -> specific
  • specific -> specifically
  • assoc. -> associated
  • esp. -> especially
  • north -> n.
  • south -> s.
  • january -> jan.
  • march -> mar.
  • august -> aug.
  • september -> sept.
  • october -> oct.
  • november -> nov.
  • december -> dec.
  • Levenshtein difference of 1
  • Adding ‘ing’ onto the end

The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum.  Where there is a matching category number of lexemes / last lexeme / total matched lexeme checks as above are applied and rows are colour coded accordingly.

On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week.  It feels like real progress is being made.