I took Friday off this week, in the run-up to Christmas, and spent the remaining four days trying to finish some of my outstanding tasks before the holidays begin. This included finishing the ‘song story’ I’d left half finished last week for the RNSN project, starting and completing the other ‘song story’ that I had in my ‘to do’ list and updating two other stories to add in audio files. All this required a lot of adapting images, uploading and linking to files, writing HTML and other such trivial but rather time-consuming tasks. It probably took the best part of two days to get it all done, but things are looking great and I reckon the bulk of the work on these song stories is now complete. We’re hoping to launch them early next year, at which point I’ll be able to share the URLs.
I also continued to help Joanna Kopaczyk with the proposal she’s putting together. This included having a long email conversation with both her and Luca, plus meeting in person with Joanna to go through some of the technical aspects that still needed a bit of thought. Things seem to be coming together well now and hopefully Joanna will be able to submit the proposal in the new year.
Bryony Randall is also working on a proposal, this time a follow-on funding bid. She’d set up a project website on WordPress.com, but it was filled with horribly intrusive adverts, and I thought it would give a better impression to reviewers if we migrated the site to a Glasgow server. I started this process last week and completed it this week. The new website can be found here: https://newmodernistediting.glasgow.ac.uk/
I also spent a couple of hours on the Bilingual Thesaurus, changing the way the selection of languages of origin and citation are handled on the search page. There are so many languages and it had been suggested that higher-level groupings could help ensure users selected all of the appropriate options. So, for example, a new top level group would be ‘Celtic’ and then within this there would be Irish, Old Irish, Scots Gaelic etc. Each group has a checkbox and if you click on it then everything within the group is checked. Clicking again deselects everything, as the screenshot below demonstrates. I think it works pretty well.
I could hardly let a week pass without continuing to work on the HT / OED category linking task, and therefore I spent several further hours working on this. I completed a script that compares all lexemes and their search terms to try and find matches. For this to work I also had to execute a script to generate suitable search terms for the OED data (e.g. variants with / without brackets). The comparison script takes ages to run as it has to compare every word in every unmatched OED category to every word in every unmatched HT category. The script has identified a number of new potential matches that will hopefully be of some use. It also unfortunately identified many OED lexemes that just don’t have any match in the HT data, despite having a ‘GHT date’, which means there should be a matching HT word somewhere. It looks like some erroneous matches might have crept into our matching processes. In some cases the issue is merely that the OED have changed the lexeme so it no longer matches (e.g. making a word plural). But in other cases things look a little weird.
For example, OED 231036 ‘relating to palindrome’ isn’t matched and contains 3 words, none of which are found in the remaining unmatched HT categories (palindrome, palindromic, palindromical). I double-checked this in the database. The corresponding HT category is 219358 ‘pertaining to palindrome’, which contains four words (palinedrome, cancrine, palindromic, palindromical). This has been matched to OED category 194476 ‘crab-like, can be read backwards and forwards’, which contains the words ‘palinedrome, cancrine, palindromic, palindromical’. On further investigation I’d say OED category 194476 ‘crab-like, can be read backwards and forwards’ should actually match HT category 91942 ‘having same form in both directions’, which contains a single word, ‘palindromic’. I get the feeling the final matching stages are going to get messy. But this is something to think about next year. That’s all from me for 2018. I wish anyone who is reading this a very merry Christmas.
I continued to work with the Hansard dataset this week, working with Chris McGlashan to get the dataset onto a server. Once it was there I could access the data, but as there are more than 682 million rows of frequency data things were a little slow to query, especially as no indexes were included in the dump. As I don’t have command-line access to the server I needed to ask Chris to run the commands to create indexes, as each index takes several hours to compile. He set one going that indexed the data by year, and after a few hours it had completed, resulting in an 11GB index file. With that in place I could much more swiftly retrieve the data for each year. I’ve let Marc know that this data is now available again, and I just need to wait to hear back from him to see exactly what he wants to do with the dataset.
I spent a fair amount of time this week advising staff on technical aspects of research proposals. It’s the time of year when the students are all away and staff have time to think about such proposals, meaning things get rather busy for me. I created a Data Management Plan for a follow-on project that Bryony Randall in English Literature is putting together. I also started to migrate a project website she had previously set up through WordPress.com onto an instance of WordPress hosted at Glasgow. Her site on WordPress.com was full of horribly intrusive adverts that did not give a good impression and really got in the way, and moving to hosting at Glasgow will stop this, and give the site a more official looking URL. It will also ensure the site can continue to be hosted in future, as free commercial hosting is generally not very reliable. I hope to finish the migration next week. I also responded to a query about equipment from Joanna Kopaczyk, discussed a couple of timescale issues with Thomas Clancy and gave some advice to Karen Lury from TFTS about video formats and storage requirements. I also met with Clara Cohen to discuss her Data Management Plan.
Also this week I sorted out my travel arrangements for the DH2019 conference and updated the site layout slightly for the DSL website, and on Wednesday I attended the English Language and Linguistics Christmas lunch, which was lovely. I also continued with my work on the HT / OED category linking, ticking off another batch of matches, which takes us down to 1894 unmatched OED categories that have words and a part of speech.
I also spent about a day continuing to work on the Bilingual Thesaurus. Last week I’d updated the ‘category’ box on the search page to make it an ‘autocomplete’ box that lists matching categories as you type. However, I’d noticed that this was often not helpful as the same title is used for multiple categories (like the three ‘used in buildings’ categories mentioned in last week’s post). I therefore implemented a solution that I think works pretty well. When you type into the ‘category’ box the top two levels of the hierarchy to which a matching category belongs now appear in addition to the category name. If the category is more than two hierarchical levels down this is represented by an ellipsis. Listed categories are now ordered by their ID rather than alphabetically too, so categories in the same part of the tree appear together. So now, for example, if you type in ‘processes’ the list contains ‘Building > Processes’, ‘Building > Processes > … > Other processes’ etc. Hopefully this will make the search much easier to use. I also updated the search results page so the hierarchy is shown in the ‘you searched for’ box too, and I fixed a bug that was preventing the search results page displaying results if you searched for a category, then followed a link through to the category page, then pressed the ‘Back to search results’ button.
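As a rough illustration of the labelling rule, here is a minimal Python sketch. It is not the site’s actual code: the function name `autocomplete_label` and the idea of passing in the category’s path as a list of headings are assumptions made for the example, as is the choice to show a three-level path in full because nothing is omitted from it.

```python
def autocomplete_label(path: list[str]) -> str:
    """Build the text shown in the autocomplete list for one category.

    `path` is a hypothetical list of headings from the top of the
    hierarchy down to the category itself,
    e.g. ["Building", "Processes", "Other processes"].
    """
    if len(path) <= 3:
        # Assumption: nothing needs to be hidden, so show the whole path.
        parts = path
    else:
        # Keep the top two levels and the category name, and stand in
        # for everything in between with an ellipsis.
        parts = path[:2] + ["…"] + path[-1:]
    return " > ".join(parts)


if __name__ == "__main__":
    print(autocomplete_label(["Building", "Processes"]))
    print(autocomplete_label(
        ["Building", "Processes", "Preparation", "Other processes"]
    ))
```

The intermediate heading ‘Preparation’ in the second call is invented purely to give the path four levels.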
Louise had noticed that there were two ‘processes’ categories within ‘Building’ so I amalgamated these. I also changed ‘Advanced Search’ back to plain old ‘Search’ again in all locations, and I created a new menu item and page for ‘How to use the thesaurus’.
As the Bilingual Thesaurus is almost ready to go live and it ‘hangs off’ the thesaurus.ac.uk domain I added some content to the homepage of the domain, as you can see in the screenshot below:
It currently just has boxes for the three thesauruses featuring a blurb and a link, with box colours taken from each site’s colour schemes. I did think about adding in the ‘sample category’ feature for each thesaurus here too, but as it might make the top row boxes rather long (if it’s a big category) I decided to keep things simple. I added the tagline ‘The home of academic thesauri’ (‘thesauruses’ seemed a bit clumsy here) just to give visitors a sense of what the site is. I’ll need some feedback from Marc and Fraser before this officially goes live.
Finally this week I spent some time working on some new song stories for the Romantic National Song Network. I managed to create about one and a half, which took several hours to do. I’ll hopefully manage to get the remaining half and maybe even a third one done next week.
As with previous weeks recently, I spent quite a bit of time this week on the HT / OED category linking issue. One of the big things was to look into using the search terms for matching. The HT lexemes have a number of variant forms hidden in the background for search purposes, such as alternative spellings, forms with bracketed text removed or included, and text either side of slashes split up into different terms. Marc wondered whether we could use these to try and match up lexemes with OED lexemes, which would also mean generating similar terms for the OED lexemes too. For the HT I can get variants with or without any bracketed text easily enough, but slashes are not going to be straightforward. The search terms for HT lexemes were generated using multiple passes through the data, which would be very slow to do on the fly when comparing the contents of every category. An option might be to use the existing search terms for the HT and generate a similar set for the OED, but as things stand the HT search terms contain rows that would be too broad for us to use for matching purposes. For example, ‘sway (the sceptre/sword)’ has ‘sword’ on its own as one of the search terms and we wouldn’t want to use this for matching purposes.
Slashes in the HT are used to mean so many different things that it’s really hard to generate an accurate list of possible forms, and this is made even more tricky when brackets are added into the mix. Simple forms would be easy, e.g. for ‘Aimak/Aymag’ just split the form on the slash and treat the before and after parts as separate. This is also the case for some phrases too, e.g. ‘it is (a) wonder/wonder it is’. But then elsewhere the parts on either side of the slash are alternatives that should then be combined with the rest of the term after the word after the slash – e.g. ‘set/start the ball rolling’, or combined with the rest of the term before the word before the slash – e.g. ‘sway (the sceptre/sword)’, or combined with both the beginning and the end of the term while switching stuff out in the middle – e.g. ‘of a/the same suit’. In other places an ‘etc’ appears that shouldn’t be combined with any resulting form – e.g. ‘bear (rule/sway, etc.)’. Then there is a further group where the slash means there’s an alternative ending to the word before the slash – e.g. ‘connecter/-or’. But in other forms the bits after the slash should be added on rather than replacing the final letters – e.g. ‘radiogoniometric/-al’. Sometimes there are multiple slashes that might be treated in one or more of the above ways, e.g. ‘lie of/on/upon’. Then there are multiple slashes in different parts of the same form, e.g. ‘throw/cast a stone/stones’.
It’s a horrible mess and even after several passes to generate the search terms I don’t think we managed to generate all legitimate search terms, while we certainly did generate a lot of incorrect terms. The thinking at the time was that the weird forms didn’t matter as no-one would search for them anyway and they’d never appear on the site. But we should be wary about using them for comparison, as the ‘sword’ example demonstrates.
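To make the ambiguity concrete, here is a minimal Python sketch of the most naive rule possible: treat a slash inside a word as a list of alternatives for that word only, and combine the alternatives with the rest of the form. This is not the multi-pass approach that was used to build the HT search terms; the function name `expand_word_slashes` is made up for illustration.

```python
from itertools import product


def expand_word_slashes(form: str) -> list[str]:
    """Expand slashes word by word (a deliberately naive rule).

    Each whitespace-separated word is split on '/', and every
    combination of the resulting alternatives is returned.
    """
    options_per_word = [word.split("/") for word in form.split()]
    return [" ".join(choice) for choice in product(*options_per_word)]


if __name__ == "__main__":
    for example in (
        "Aimak/Aymag",
        "set/start the ball rolling",
        "throw/cast a stone/stones",
        "connecter/-or",
        "sway (the sceptre/sword)",
    ):
        print(example, "->", expand_word_slashes(example))
```

The first three examples come out as you’d hope, but ‘connecter/-or’ yields the useless term ‘-or’ and ‘sway (the sceptre/sword)’ yields forms with unbalanced brackets, which is exactly the sort of incorrect output described above.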
Thankfully the OED lexemes very rarely include slashes. There are only 16 OED lexemes that include a slash, and these are things like ‘AC/DC’, so I could generate some search terms for the OED data without too much risk of forms being incorrect, but the HT data is pretty horrible and is going to be an issue when it comes to matching lexemes too.
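For the bracket side of things, a minimal Python sketch of generating the ‘with’ and ‘without’ variants might look like the following. The function name `bracket_variants` and the whitespace tidying are assumptions for the example, not a description of the actual script.

```python
import re


def bracket_variants(lexeme: str) -> set[str]:
    """Return the forms of a lexeme with bracketed text removed and included.

    'it is (a) wonder' gives 'it is wonder' (brackets and contents
    dropped) and 'it is a wonder' (brackets dropped, contents kept).
    A lexeme with no brackets gives back a single form.
    """
    # Variant 1: drop each bracketed chunk plus any space before it.
    without_text = re.sub(r"\s*\([^)]*\)", "", lexeme)
    # Variant 2: keep the bracketed text but lose the brackets themselves.
    with_text = lexeme.replace("(", "").replace(")", "")
    # Collapse any doubled or leading/trailing spaces left behind.
    return {" ".join(form.split()) for form in (without_text, with_text)}


if __name__ == "__main__":
    print(bracket_variants("it is (a) wonder"))
    print(bracket_variants("AC/DC"))
```

Using a set means a lexeme without brackets doesn’t end up with the same search term listed twice.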
I met with Marc on Tuesday and we discussed the situation and agreed that we’d just use the existing search terms, and I’d generate a similar set for the OED and we’d just see how much use these might be. I didn’t have time to implement this during the week, but hopefully will do next week. Other HT tasks I tackled this week included adding in a new column to lots of our matching scripts that lists the Levenshtein score between the HT and OED path and subcats. This will help us to spot categories that have moved around a lot. I also updated the sibling matching script so that categories with multiple potential matches are separated out into a separate table.
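For anyone unfamiliar with it, the Levenshtein score is the smallest number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal Python sketch of the standard dynamic-programming calculation, applied to a pair of category numbers quoted elsewhere in these posts. It is an illustration only; the matching scripts themselves may well use a built-in function instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard two-row algorithm)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    oed_path = "01.02.06.01|03"
    ht_path = "01.02.06.01.01|03"
    print(levenshtein(oed_path, ht_path))
```

A low score suggests a category has stayed more or less where it was, while a high score flags one that has moved a long way.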
I then rearranged the advanced search form to make the choice of language more prominent (i.e. whether ‘Anglo Norman’, ‘Middle English’ or ‘Both’). I used the label ‘Headword Language’ as opposed to ‘Section’ as it seemed to be an accurate description and we needed some sort of label to attach the help icon to. Language choice is now handled by radio buttons rather than a drop-down list so it’s easier to see what the options are.
The thing that took the longest to implement was changing the way ‘category’ works in a search. Whereas before you entered some text and your search was then limited to any individual categories that featured this text in their headings, now as you start typing into the category box a list of matching categories appears, using the jQuery UI AutoComplete widget. You can then select a category from the list and your search is then limited to any categories from this point downwards in the hierarchy. Working out the code for grabbing all ‘descendant’ categories from a specified category took quite some time to do, as every branch of the tree from that point downwards needs to be traversed and its ID and child categories returned. E.g. if you start typing in ‘build’ and select ‘builder (n.)’ from the list and then limit your search to Anglo Norman headwords your results will display AN words from ‘builder (n.)’ and categories within this, such as ‘Plasterer/rough-caster’. Unfortunately I can’t really squeeze the full path into the list of categories that appears as you type into the category box, as that would be too much text, and it’s not possible to style the list using the AutoComplete plugin (e.g. to make the path information smaller than the category heading). This means some category headings are unclear due to a lack of context (e.g. there are 3 ‘Used in building’ categories that appear with nothing to differentiate them). However, the limit by category is a lot more useful now.
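The traversal itself is simple enough to sketch in a few lines of Python. This is only an illustration of the idea and not the code the site uses: the `children` dictionary mapping each category ID to the IDs of its child categories is a hypothetical structure, and the IDs in the example are invented.

```python
def descendant_ids(children: dict[int, list[int]], root: int) -> list[int]:
    """Return the ID of `root` and of every category beneath it.

    `children` is a hypothetical lookup of category ID -> child IDs.
    An explicit stack is used rather than recursion so that a very
    deep tree can't hit Python's recursion limit.
    """
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        found.append(current)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(children.get(current, [])))
    return found


if __name__ == "__main__":
    # Invented IDs: 1 has children 2 and 3, and 2 has a child 4.
    tree = {1: [2, 3], 2: [4]}
    print(descendant_ids(tree, 1))
```

The list of IDs that comes back can then be used to limit the search to just those categories.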
On Wednesday I gave a talk about AHRC Data Management Plans at an ArtsLab workshop. This was basically a repeat of the session I was involved with a month or so ago, and it all went pretty smoothly. I also sent a couple of sample data management plans to Mary Donaldson of the University’s Research Data Management team, as she’d asked whether I had any I could let her see. It was rather a busy week for data management plans, as I also had to spend some time writing an updated plan for a place-names project for Thomas Clancy and gave feedback and suggested updates to a plan for an ESRC project that Clara Cohen is putting together. I also spoke to Bryony Randall about a further plan she needs me to write for a proposal she’s putting together, but I didn’t have time to work on that plan this week.
Also this week I met with Andrew from Scriptate, who I’d previously met to discuss transcription services using an approach similar to the synchronised audio / text facilities that the SCOTS Corpus offers. Andrew has since been working with students in Computing Science to develop some prototypes for this and a corpus of Shakespeare adaptations and he showed me some of the facilities they have been developing. It looks like they are making excellent progress with both the functionality and the front-end.
I also had a further chat with Valentina Busin in MVLS about an app she’s wanting to put together and I spoke to Rhona Alcorn of SLD about the Scots School Dictionary app I’d created about four years ago. Rhona wanted to know a bit about the history of the app (the content originally came from the CD-ROM made in the 90s) and how it was put together. It looks like SLD are going to be creating a new version of the app in the near future, although I don’t know at this stage whether this will involve me.
I also spoke to Gavin Miller about a project I’m named on that recently got funded. I can’t say much more about it for now, but will be starting on this in January. I also started to arrange travel and things for the DH2019 conference I’ll be attending next year, and rounded off the week by looking at retrieving the semantically tagged Hansard dataset that Marc wants to be able to access for a paper he’s writing. Thankfully I managed to track down this data, inside a 13GB tar.gz file, which I have now extracted into a 67GB MySQL dataset. I just need to figure out where to stick this so we can query it.
I continued to work on the outstanding ‘stories’ for the Romantic National Song Network this week, completing work on another story using the storymap.js library. I have now completed seven of these stories, which is more than half of the total the project intends to make.
On Monday I met with E Jamieson, the new RA on the SCOSYA project, to discuss the maps we are going to make available to the public. We discussed various ways in which the point-based data might be extrapolated, such as heat maps and Voronoi diagrams. I found a nice example of a Leaflet.js / D3.js based Voronoi diagram that I think could work very well for the project (see https://chriszetter.com/voronoi-map/examples/uk-supermarkets/) so I might start to investigate using such an approach. I think we’d want to be able to colour-code the cells, and other D3.js examples of Voronoi diagrams suggest that this is possible. We also discussed how the more general public views of the data (as opposed to the expert view) might work. The project team like the interface offered by this site: https://ygdp.yale.edu/phenomena/done-my-homework, although we want something that presents more of the explanatory information (including maybe videos) via the map interface itself. It looks like the storymap.js library (https://storymap.knightlab.com/) I’m using for RNSN might actually work very well for this. For RNSN I’m using the library with images rather than maps, but it was primarily designed for use with maps, and could hopefully be adapted to work with a map showing data points or even Voronoi layers.
I spent a further couple of days this week working on the HT / OED data linking task. This included reading through and giving feedback on Fraser’s abstract for DH2019 and updating the v1 / v2 comparison script to add in an additional column to show whether the v1 match was handled automatically or manually. I also created a new script to look at the siblings of categories that contain monosemous forms to see whether any of these might have matches at the same level. This script takes all of the monosemous matches as listed in the monosemous QA script and for each OED and HT category finds their unmatched siblings that don’t otherwise also appear in the list. The script then iterates through the OED siblings and for each of these compares the contents to the contents of each of the HT siblings. If there is a match (matches for this script being anything that’s green, lime green, yellow or orange) the row is displayed on screen. Where there are multiple monosemous categories at the same level the siblings will be analysed for each of the categories, so there is some duplication. E.g. the first monosemous link is ‘OED category 2797 01.02.06.01|03 (n) deep place or part matches HT category 1017 01.02.06.01.01|03 (n) deep place/part’ and there are two unmatched OED siblings (‘shallow place’ and ‘accumulation of water behind barrier’), so these are analysed. But the next monosemous category (OED category 2803 01.02.06.01|07 (n) bed of matches HT category 1024 01.02.06.01.01|07 (n) bed of) is at the same level, so the two siblings are analysed again. This happens quite a lot, but even so there are still some matches that this script finds that wouldn’t otherwise have been found due to changes in category number. I’ve made a count of the total unique matches (all colours) and it’s 162. I fear we are getting to the point where the time it takes to write scripts to identify matches is longer than the time it would take to identify the matches manually, though: it took several hours to write this script for 162 potential matches.
I also created a script that lists all of the non-matched OED and HT categories, split into various smaller lists, such as main categories or sub-categories, and on Wednesday I attended a meeting with Marc and Fraser to discuss our next steps. I came out of the meeting with another long list of items to try and tackle, and I spent some of the rest of the week going through the list. I ticked off the outstanding green, lime green and yellow matches on the lexeme pattern matching, sibling matching and monosemous matching scripts.
I then updated the sibling matching script to look for matches at any subcat level, but unfortunately this didn’t really uncover much new, at least initially. It found just one extra green and three yellows, although the 86 oranges look like they would mostly be ok too, with manual checking. I went over my script and it was definitely doing what I expected it to do, namely: get all of the unmatched OED cats (e.g. 03.05.01.02.01.02|05.04 (vt)); for subcats get all of the unmatched HT subcats of the maincat in the same POS (e.g. all the unmatched subcats of 03.05.01.02.01.02 that are vt); list all of the subcats; if one of the stripped headings matches or has a Levenshtein score of 1 then this is highlighted in green and its contents are compared.
I then updated the script so that it didn’t compare category headings at all, but instead only looked at the contents. In this script each possible match appears in its own row (e.g. cat 120031 appears 4 times, once as an orange, 3 times as purple). It has brought back 8 greens, 1 lime green, 4 yellows and 1617 oranges.
I then updated the monosemous QA script to identify categories where the monosemous form has dates that match and one further date matches, the idea being if these criteria are met the category match is likely to be legitimate. This was actually really difficult to implement and took most of a day to do. This is because the identification of monosemous forms was done at a completely different point (and actually by a completely different script) to the listing and comparing of the full category contents. I had to rewrite large parts of the function that gets and compares lexemes in order to integrate the monosemous forms. The script now makes all monosemous forms in the OED word list for each category bold and compares these forms and their dates to all of the HT words in the category. A count of all of the monosemous forms that match an HT form in terms of stripped / pattern matched content and start date is stored. If this count is 1 or more and the count of ‘Matched stripped lexemes (including dates)’ is 2 or more then the match is bumped up to yellow. This has identified 512 categories, which is about a sixth of the total OED unmatched categories with words, which is pretty good.
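The rule boils down to two counts and a threshold, which can be sketched in Python as follows. This is a simplification rather than the real script: the `Word` class, the function name `should_bump_to_yellow` and the example data are all invented, and ‘matching’ is reduced here to an identical stripped form plus an identical start date.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str                 # assumed to be stripped already
    start_date: int
    monosemous: bool = False  # only meaningful for OED words here


def should_bump_to_yellow(oed_words: list[Word], ht_words: list[Word]) -> bool:
    """Apply the 'one monosemous match plus one further match' rule."""
    ht_keys = {(w.form, w.start_date) for w in ht_words}
    # OED words whose form and start date both appear in the HT category.
    matched = [w for w in oed_words if (w.form, w.start_date) in ht_keys]
    monosemous_matched = sum(1 for w in matched if w.monosemous)
    return monosemous_matched >= 1 and len(matched) >= 2


if __name__ == "__main__":
    # Invented words and dates, purely to exercise the rule.
    oed = [Word("deepness", 1400, monosemous=True), Word("deep", 1000)]
    ht = [Word("deepness", 1400), Word("deep", 1000), Word("depth", 1526)]
    print(should_bump_to_yellow(oed, ht))
```

Because the monosemous form counts towards the total as well, a total of two means the monosemous form plus one other word with a matching date.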
Other tasks this week included creating a new (and possibly final) blog post for the REELS project, dealing with some App related questions from someone in MVLS, having a brief meeting with Clara Cohen from English Language to discuss the technical aspects of a proposal she’s putting together and making a few further tweaks to the Bilingual Thesaurus website.
On Friday I attended the Corpus Linguistics in Scotland event at Edinburgh University. There were 12 different talks over the course of the day on a broad selection of subjects. As I’m primarily interested in the technologies used rather than the actual subject matter, here are some technical details. One presenter used Juxta (https://www.juxtaeditions.com/) to identify variation in manuscripts. Another used TEI to mark up pre-modern manuscripts for lexicographical use (looking at abbreviations, scribes, parts of speech, gaps, occurrences of particular headwords). Another speaker had created a plugin for the text editor Emacs that allows you to look at things like word frequencies, n-grams and collocations. A further speaker handled OCR using Google Cloud Vision (https://cloud.google.com/vision/) that can take images and analyse them in lots of ways, including extracting the text. A couple of speakers used AntConc (http://www.laurenceanthony.net/software/antconc/) and another couple used the newspaper collections available through LexisNexis (https://www.lexisnexis.com/ap/academic/form_news_wires.asp) as source data. Other speakers used Wordsmith tools (https://www.lexically.net/wordsmith/), Sketch Engine (https://www.sketchengine.eu) and WMatrix (http://ucrel.lancs.ac.uk/wmatrix/). It was very interesting to learn about the approaches taken by the speakers.