I continued to work on some Historical Thesaurus / OED data linking issues this week, working through my ‘to do’ list from the last meeting. The first thing I tackled was the creation of a script that tries to link up unmatched words in matched categories using a Levenshtein distance of 1. The script grabs the unmatched HT and OED words in every matched category and, for each HT word, compares its stripped form to the stripped form of every unmatched OED word in the same category. Any that have a Levenshtein distance of 1 (or less) then appear in bold after the word in the ‘HT unmatched lexemes’ column. The same process is repeated for unmatched OED lexemes, comparing each to every unmatched HT word.
There are quite a lot of occasions where there are multiple potential matches. For example, in 01.02 there are two HT words (a world, world < woruld) that match one OED word (world). Some forms that look as though they ought to match are not getting picked up, e.g. ‘palæogeography’ and ‘palaeogeography’, because ‘æ’ versus ‘ae’ counts as a two-character difference. Some category contents seem rather odd; for example, I’m not sure what’s going on with ‘North-east’, as it looks like the OED repeats the word ‘north-east’ four times. There are also some instances where comparing the ‘stripped’ form with a Levenshtein distance of 1 gives a false positive, e.g. HT ‘chingle’ and ‘shingle’ both match OED ‘shingle’. However, I reckon there is a lot of potential for matching here if we apply some additional limits.
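The comparison at the heart of this script can be sketched as follows. This is a minimal Python illustration rather than the actual matching code, and the first-character restriction at the end is just a hypothetical example of the kind of additional limit that would rule out pairs like ‘chingle’ / ‘shingle’:

```python
def within_distance_one(a: str, b: str) -> bool:
    """True if the Levenshtein distance between a and b is 1 or less."""
    if abs(len(a) - len(b)) > 1:
        return False
    # Walk to the first mismatching character...
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # ...then the remainders must agree after one substitution,
    # insertion or deletion (or the strings were already identical)
    return a[i + 1:] == b[i + 1:] or a[i:] == b[i + 1:] or a[i + 1:] == b[i:]


def candidate_matches(ht_words, oed_words):
    """For each unmatched HT stripped form, list the unmatched OED
    stripped forms in the same category that are within distance 1.
    The plain-list data shapes here are hypothetical."""
    return {ht: [oed for oed in oed_words if within_distance_one(ht, oed)]
            for ht in ht_words}


def stricter_match(ht: str, oed: str) -> bool:
    """A hypothetical additional limit: also require the first character
    to agree, which filters out false positives such as
    'chingle' / 'shingle' while keeping e.g. trailing-letter variants."""
    if not ht or not oed or ht[0] != oed[0]:
        return False
    return within_distance_one(ht, oed)
```

Note that ‘palæogeography’ / ‘palaeogeography’ still fails under this check, since ‘æ’ versus ‘ae’ needs two edits; catching that pair would need a normalisation step rather than a bigger distance.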
I then wrote a further script that attempts to identify new OED words. It brings back all matched categories and counts the number of unmatched words in the HT and OED categories. If there are more unmatched OED words than HT words, the category information is displayed, including the HT words for reference and the OED words complete with GHT dates, OED dates and whether the record was revised. There are 23,212 matched categories with more unmatched OED words than HT words, although we should probably try to tick off more lexeme matches before we do too much with this script, as many of the listed unmatched words clearly match. I also updated the ‘monosemous’ script I created last week to colour-code the output based on Fraser’s suggestions, which will help in deciding which candidate matches to tick off.
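The counting step amounts to something like the following. This is a sketch with hypothetical record shapes and field names, not the actual script:

```python
def categories_with_extra_oed(categories):
    """Return the matched categories where unmatched OED words outnumber
    unmatched HT words -- the candidates for genuinely new OED words.
    Each category here is a hypothetical dict with lists of the
    unmatched lexemes on each side."""
    return [c for c in categories
            if len(c["unmatched_oed"]) > len(c["unmatched_ht"])]
```

Anything this returns still needs eyeballing, since (as noted above) many of the listed unmatched words are really matches that just haven’t been ticked off yet.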
On Wednesday Marc, Fraser and I had a meeting to discuss the current situation and decide on further steps. Following on from this I made some further tweaks to existing scripts. There appeared to be a bug in the ‘monosemous’ script whereby some already matched OED words were being picked out as candidate monosemous matches. It turned out to be a rather large bug in terms of impact. The part of the script that checks through the potential OED matches to pick out those that are monosemous amongst the ‘not yet ticked off’ OED words was correctly identifying when a word was monosemous within the unticked OED words for a part of speech. However, the script needed to check through all occurrences of a word, and unfortunately it was set to use the last occurrence it reached rather than the one that was the actual unticked monosemous form. For example, ‘Terrestrious’ as an Aj appears 4 times in the OED data. 3 of them have already been ticked off, so the remaining form is monosemous within Aj. But when checking through the 4 forms, the one that hasn’t yet been ticked off was looked at second. The script correctly noted that the form only appeared once in the unticked Aj set, but then used the last form it had checked, one that was already ticked off in category ‘having earth-like qualities’, rather than the unticked form in ‘occurring on’. In many cases this meant the wrong OED category and words were displayed, leading to many words being classed as non-matches when in actual fact they were matches. I’ve updated the script to fix the bug.
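The fix boils down to keeping hold of the matching occurrence itself rather than whatever occurrence the loop happened to end on. A Python sketch, with hypothetical record shapes (only the two category names from the ‘Terrestrious’ example are real; the rest are placeholders):

```python
def find_unticked_monosemous(occurrences):
    """occurrences: all occurrences of one word within one part of
    speech, each a hypothetical dict with 'ticked' (bool) and
    'category' (str).  Return the single unticked occurrence if
    exactly one exists, else None.  The bug was the equivalent of
    returning the last occurrence iterated over instead."""
    unticked = [o for o in occurrences if not o["ticked"]]
    if len(unticked) == 1:
        return unticked[0]  # the actual monosemous form, wherever it sat in the list
    return None
```

With the ‘Terrestrious’ data, the unticked occurrence in ‘occurring on’ is returned even though it was only second in the list and a ticked-off form came after it.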
I also made some updates to an existing category matching script and created a further script to list all of the unmatched words in matched categories that appear to match up based purely on the ‘stripped’ word form, without a date check.
On Monday afternoon I attended the ‘Technician Commitment Launch Event’, which is aimed at promoting the role of technicians across the University. It was a busy event, with at least a couple of hundred people attending, and talks from technicians and senior management from within and beyond the University. It’s a promising initiative and I hope it’s a success.
I was contacted this week by College research admin staff who asked if I would write a Data Management Plan for a researcher who is based in the School of Culture and Creative Arts. As I’m employed specifically by the School of Critical Studies this really should not be my responsibility, especially as College recently appointed someone in a similar role to mine to do such things. Despite my pointing this out, and despite their not even bothering to contact this person, I was somehow still landed with the job of writing the DMP, which I’m not best pleased about. I spent most of Friday working on the plan, and it’s still not complete. I should have it finished early next week, but it has meant I have been unable to work on the SCOSYA project this week and will have less time to work on it next week too.
Other than these tasks, and speaking to Carole Hough about some conference pages of hers that have gone missing from T4, and to Wendy Anderson about some issues with the advanced search maps in the SCOTS corpus, I spent the remainder of the week on DSL duties. This included beginning to write a script that will generate sections of entries for different types of advanced search (e.g. with quotes, without quotes, only quotes) and fixing some layout issues with the new version of the DSL website when viewed in portrait mode on mobile phones. The biggest task I focussed on was writing a script to go through the DOST and SND data that had been exported from scripts on the DSL’s test server, split up the XML and pick out the relevant information to update entries in the online DSL database. I started with the DOST file and this mostly went pretty well, although I ended up with a number of questions that I needed to send on to the DSL team. I also attempted to migrate the SND data, but unfortunately the file output by the script on the test server is not valid XML, so something must have gone wrong with it. This means my script is unable to parse the file, and I’ll need to figure out what went wrong. Further jobs for next week.
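The parsing step, and the kind of check that flags a problem like the SND file, could be sketched like this in Python. The ‘entry’ element name is an assumption about the export format, not the real schema:

```python
import xml.etree.ElementTree as ET


def load_entries(path):
    """Parse an exported DSL XML file and return its entry elements.

    A file that is not well-formed XML (as happened with the SND
    export) raises a ValueError rather than failing obscurely later.
    """
    try:
        tree = ET.parse(path)
    except ET.ParseError as err:
        # err.position is a (line, column) pair pointing at the first
        # malformed spot, which helps track down a broken export
        raise ValueError(f"{path} is not well-formed XML: {err}") from err
    return tree.getroot().findall(".//entry")
```

Running the broken SND file through a check like this at least pins down where the export first goes wrong, which should make next week’s debugging quicker.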