As in recent weeks, I spent quite a bit of time this week on the HT / OED category linking issue. One of the big things was to look into using the search terms for matching. The HT lexemes have a number of variant forms hidden in the background for search purposes, such as alternative spellings, forms with bracketed text removed or included, and text either side of slashes split up into different terms. Marc wondered whether we could use these to try to match up HT lexemes with OED lexemes, which would also mean generating similar terms for the OED lexemes. For the HT I can get variants with or without any bracketed text easily enough, but slashes are not going to be straightforward. The search terms for HT lexemes were generated using multiple passes through the data, which would be very slow to do on the fly when comparing the contents of every category. An option might be to use the existing search terms for the HT and generate a similar set for the OED, but as things stand the HT search terms contain rows that would be too broad for us to use for matching purposes. For example, ‘sway (the sceptre/sword)’ has ‘sword’ on its own as one of the search terms, and we wouldn’t want to match on that.
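To illustrate the easy part mentioned above (variants with or without bracketed text), here is a minimal Python sketch. It is not the code used for the HT; the function name `bracket_variants` is made up for this example, and it assumes brackets are never nested.

```python
import re


def bracket_variants(form):
    """Return two variants of a form: one with bracketed text removed,
    one with the brackets dropped but their contents kept.

    Hypothetical helper for illustration; assumes brackets are not nested.
    """
    # Variant 1: remove each bracketed chunk and any whitespace before it.
    without_text = re.sub(r"\s*\([^)]*\)", "", form)
    # Variant 2: keep the bracketed text but drop the bracket characters.
    with_text = re.sub(r"[()]", "", form)
    # Tidy up any doubled or leading/trailing spaces left behind.
    cleaned = {re.sub(r"\s+", " ", v).strip() for v in (without_text, with_text)}
    return cleaned


print(bracket_variants("it is (a) wonder"))
```

A form with no brackets simply comes back as a single variant, since both substitutions leave it unchanged.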
Slashes in the HT are used to mean so many different things that it’s really hard to generate an accurate list of possible forms, and this is made even trickier when brackets are added into the mix. Simple forms would be easy, e.g. for ‘Aimak/Aymag’ just split the form on the slash and treat the before and after parts as separate. This is also the case for some phrases, e.g. ‘it is (a) wonder/wonder it is’. But elsewhere the parts on either side of the slash are alternatives that should be combined with the rest of the term after the word after the slash – e.g. ‘set/start the ball rolling’, or combined with the rest of the term before the word before the slash – e.g. ‘sway (the sceptre/sword)’, or combined with both the beginning and the end of the term while switching stuff out in the middle – e.g. ‘of a/the same suit’. In other places an ‘etc’ appears that shouldn’t be combined with any resulting form – e.g. ‘bear (rule/sway, etc.)’. Then there is a further group where the slash means there’s an alternative ending to the word before the slash – e.g. ‘connecter/-or’. But in other forms the bit after the slash should be added on rather than replacing the final letters – e.g. ‘radiogoniometric/-al’. Sometimes there are multiple slashes that might be treated in one or more of the above ways, e.g. ‘lie of/on/upon’. Then there are multiple slashes in different parts of the same form, e.g. ‘throw/cast a stone/stones’.
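To show why this is awkward, here is a rough Python sketch of just one of the interpretations above: treating a slash inside a single word as a list of alternatives for that word and combining every choice with the rest of the term. This is an illustration only, not the multi-pass code that generated the HT search terms, and the function name `expand_word_slashes` is invented for the example.

```python
from itertools import product


def expand_word_slashes(form):
    """Expand slashes word by word: each whitespace-separated token is split
    on '/', and every combination of alternatives is joined back together.

    Hypothetical sketch: it ignores brackets, 'etc.' and '-or' style endings.
    """
    tokens = form.split()
    # Each token becomes a list of one or more alternatives.
    options = [token.split("/") for token in tokens]
    # The Cartesian product gives every combination of alternatives.
    return [" ".join(choice) for choice in product(*options)]


print(expand_word_slashes("set/start the ball rolling"))
print(expand_word_slashes("throw/cast a stone/stones"))
```

This copes with ‘set/start the ball rolling’, ‘of a/the same suit’ and ‘lie of/on/upon’, but for ‘throw/cast a stone/stones’ it also produces ‘throw a stones’, and it gets ‘it is (a) wonder/wonder it is’ wrong entirely, because there the slash separates two whole phrases. Any single rule like this one will generate incorrect forms for some of the data.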
It’s a horrible mess, and even after several passes to generate the search terms I don’t think we managed to generate all of the legitimate search terms. We certainly did generate a lot of incorrect terms, the thinking at the time being that the weird forms didn’t matter as no-one would search for them anyway and they’d never appear on the site. But we should be wary about using them for comparison, as the ‘sword’ example demonstrates.
Thankfully the OED lexemes almost never include slashes. There are only 16 OED lexemes that include a slash, and these are things like ‘AC/DC’, so I could generate some search terms for the OED data without too much risk of forms being incorrect. The HT data, however, is pretty horrible and is going to be an issue when it comes to matching lexemes too.
I met with Marc on Tuesday and we discussed the situation and agreed that we’d just use the existing search terms, that I’d generate a similar set for the OED, and that we’d see how much use these might be. I didn’t have time to implement this during the week, but hopefully will do next week. Other HT tasks I tackled this week included adding a new column to lots of our matching scripts that lists the Levenshtein score between the HT and OED paths and subcats. This will help us to spot categories that have moved around a lot. I also updated the sibling matching script so that categories with multiple potential matches are split out into a separate table.
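For anyone unfamiliar with it, the Levenshtein score is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal Python sketch of the standard dynamic-programming calculation. It is not taken from the matching scripts, and the path strings in the example are invented purely for illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Made-up path strings, not real HT or OED data.
print(levenshtein("01.02.03", "01.02.04"))
```

A low score suggests a category has stayed in roughly the same place, while a high score flags one that has moved a long way.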
I then rearranged the advanced search form to make the choice of language more prominent (i.e. whether ‘Anglo Norman’, ‘Middle English’ or ‘Both’). I used the label ‘Headword Language’ as opposed to ‘Section’ as it seemed to be an accurate description and we needed some sort of label to attach the help icon to. Language choice is now handled by radio buttons rather than a drop-down list so it’s easier to see what the options are.
The thing that took the longest to implement was changing the way ‘category’ works in a search. Whereas before you entered some text and your search was then limited to any individual categories that featured this text in their headings, now as you start typing into the category box a list of matching categories appears, using the jQuery UI AutoComplete widget. You can then select a category from the list and your search is then limited to any categories from this point downwards in the hierarchy. Working out the code for grabbing all ‘descendant’ categories from a specified category took quite some time to do, as every branch of the tree from that point downwards needs to be traversed and its ID and child categories returned. E.g. if you start typing in ‘build’ and select ‘builder (n.)’ from the list and then limit your search to Anglo Norman headwords your results will display AN words from ‘builder (n.)’ and categories within this, such as ‘Plasterer/rough-caster’. Unfortunately I can’t really squeeze the full path into the list of categories that appears as you type into the category box, as that would be too much text, and it’s not possible to style the list using the AutoComplete plugin (e.g. to make the path information smaller than the category heading). This means some category headings are unclear due to a lack of context (e.g. there are 3 ‘Used in building’ categories that appear with nothing to differentiate them). However, the limit by category is a lot more useful now.
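The ‘descendant’ lookup described above boils down to a tree traversal. Here is a minimal Python sketch of the idea, assuming the hierarchy is available as a mapping from each category ID to the IDs of its direct children. That structure, the function name `descendant_ids` and the IDs in the example are all assumptions made for illustration, not the site’s actual code or data.

```python
def descendant_ids(children_by_parent, root_id):
    """Return root_id followed by the IDs of every category beneath it.

    children_by_parent is a hypothetical dict: category ID -> list of
    direct child IDs. Categories with no children may be missing from it.
    """
    found = []
    stack = [root_id]
    while stack:
        category_id = stack.pop()
        found.append(category_id)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(children_by_parent.get(category_id, [])))
    return found


# Invented IDs: 1 has children 2 and 3, and 2 has child 4.
hierarchy = {1: [2, 3], 2: [4]}
print(descendant_ids(hierarchy, 1))
```

The resulting list of IDs can then be used to restrict the search to the selected category and everything below it. An explicit stack is used rather than recursion so that deep hierarchies don’t hit a recursion limit.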
On Wednesday I gave a talk about AHRC Data Management Plans at an ArtsLab workshop. This was basically a repeat of the session I was involved with a month or so ago, and it all went pretty smoothly. I also sent a couple of sample data management plans to Mary Donaldson of the University’s Research Data Management team, as she’d asked whether I had any I could let her see. It was rather a busy week for data management plans, as I also had to spend some time writing an updated plan for a place-names project for Thomas Clancy and gave feedback and suggested updates to a plan for an ESRC project that Clara Cohen is putting together. I also spoke to Bryony Randall about a further plan she needs me to write for a proposal she’s putting together, but I didn’t have time to work on that plan this week.
Also this week I met with Andrew from Scriptate, who I’d previously met to discuss transcription services using an approach similar to the synchronised audio / text facilities that the SCOTS Corpus offers. Andrew has since been working with students in Computing Science to develop some prototypes for this and a corpus of Shakespeare adaptations, and he showed me some of the facilities they have been developing. It looks like they are making excellent progress with both the functionality and the front-end.
I also had a further chat with Valentina Busin in MVLS about an app she’s wanting to put together and I spoke to Rhona Alcorn of SLD about the Scots School Dictionary app I’d created about four years ago. Rhona wanted to know a bit about the history of the app (the content originally came from the CD-ROM made in the 90s) and how it was put together. It looks like SLD are going to be creating a new version of the app in the near future, although I don’t know at this stage whether this will involve me.
I also spoke to Gavin Miller about a project I’m named on that recently got funded. I can’t say much more about it for now, but will be starting on this in January. I also started to arrange travel and things for the DH2019 conference I’ll be attending next year, and rounded off the week by looking at retrieving the semantically tagged Hansard dataset that Marc wants to be able to access for a paper he’s writing. Thankfully I managed to track down this data, inside a 13GB tar.gz file, which I have now extracted into a 67GB MySQL dataset. I just need to figure out where to stick this so we can query it.