I spent quite a bit of time this week helping members of staff with research proposals. Last week I met with Ophira Gamliel in Theology to discuss a proposal she’s putting together and this week I wrote an initial version of a Data Management Plan for her project, which took a fair amount of time as it’s a rather multi-faceted project. I also met with Kirsteen McCue in Scottish Literature to discuss a proposal she’s putting together, and I spent some time after our meeting looking through some of the technical and legal issues that the project is going to encounter.
I also added three new pages to Matthew Creasey’s transcription / translation case study for his Decadence and Translation project (available here: https://dandtnetwork.glasgow.ac.uk/recreations-postales/), sorted out some user account issues for the Place-names of Kirkcudbrightshire project, and prepared an initial version of my presentation for the conference I’m speaking at in Bergamo the week after next.
I also helped Fraser to get some data for the new Scots Thesaurus project he’s running. This is going to involve linking data from the DSL to the OED via the Historical Thesaurus, so we’re exploring ways of linking up DSL headwords to HT lexemes initially, as this will then give us a pathway to specific OED headwords once we’ve completed the HT/OED linking process.
My first task was to create a script that returned all of the monosemous forms in the DSL, which Fraser suggested would be words that only have one ‘sense’ in their entries. The script I wrote goes through the DSL data and picks out all of the entries that have exactly one <sense> tag in their XML. For each of these it then generates a ‘stripped’ form using the same algorithm that I created for the HT stripped fields (e.g. removing non-alphanumeric characters). It then looks through the HT lexemes for an exact match on the HT lexeme ‘stripped’ field. If there is exactly one match then data about the DSL word and the matching HT word is added to the output table.
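As a rough illustration of the steps described above, here is a minimal Python sketch. It is not the actual script: the function names, the shape of the input data (`dsl_entries` as headword / XML pairs, `ht_lexemes` as dictionaries with `id` and `stripped` keys) and the exact stripping rule (lowercase, keep only alphanumeric characters) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def strip_form(word):
    # Assumed stripping rule: lowercase and keep only alphanumeric characters.
    return "".join(ch for ch in word.lower() if ch.isalnum())


def count_senses(entry_xml):
    # Count every <sense> element anywhere in the entry, nested ones included.
    root = ET.fromstring(entry_xml)
    return sum(1 for _ in root.iter("sense"))


def monosemous_matches(dsl_entries, ht_lexemes):
    # Index the HT lexemes by their stripped form for quick exact lookups.
    ht_index = defaultdict(list)
    for lexeme in ht_lexemes:
        ht_index[lexeme["stripped"]].append(lexeme)

    rows = []
    for headword, entry_xml in dsl_entries:
        if count_senses(entry_xml) != 1:
            continue  # only entries with exactly one sense are candidates
        matches = ht_index.get(strip_form(headword), [])
        if len(matches) == 1:  # keep the row only if the HT match is unique
            rows.append((headword, matches[0]["id"]))
    return rows


# Hypothetical sample data, purely for illustration.
dsl_entries = [
    ("Aiblins", "<entry><sense>perhaps</sense></entry>"),
    ("bile", "<entry><sense>a</sense><sense>b</sense></entry>"),
    ("to-name", "<entry><sense>x</sense></entry>"),
]
ht_lexemes = [
    {"id": 1, "stripped": "aiblins"},
    {"id": 2, "stripped": "toname"},
    {"id": 3, "stripped": "toname"},
]
print(monosemous_matches(dsl_entries, ht_lexemes))  # [('Aiblins', 1)]
```

In the sample, ‘bile’ is skipped because it has two senses and ‘to-name’ is skipped because its stripped form matches two HT lexemes, so only one row is returned.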
For DOST there are 42177 words with one sense, and of these 2782 are monosemous in the HT. For SND there are 24085 words with one sense, and of these 1541 are monosemous in the HT. However, there are a couple of things to note. Firstly, I have not added in a check for part of speech, as the DSL POS field is rather inconsistent: it often doesn’t contain any data, and where there are multiple POSes there is no consistent way to split them up. Sometimes a comma is used, sometimes a space. A POS generally ends with a full stop, but not in forms like ‘n.1’ and ‘n.2’. Also, the DSL uses very different terms to the HT for POS, so without lots of extra work mapping out which corresponds to which it’s not possible to automatically match up an HT and a DSL POS. But as there are only a few thousand rows it should be possible to manually pick out the good ones.
Secondly, a word might have one sense but have two completely separate entries in the same POS, so as things currently stand the returned rows are not necessarily ‘monosemous’. See for example ‘bile’ (http://dsl.ac.uk/results/bile) which has four separate entries in SND that are nouns, plus three supplemental entries, so even though an individual entry for ‘bile’ contains one sense it is clearly not monosemous. After further discussions with Fraser I updated my script to count the number of times a DSL headword with one sense appears as a separate headword in the data. If the word is a DOST word and it appears more than once in DOST this number is highlighted in red. If it appears at all in SND the number is highlighted in red. For SND words it’s the same but reversed. There is rather a lot of red in the output, so I’m not sure how useful the data is going to be, but it’s a start. I also generated lists of DSL entries that contain the text ‘comb.’ and ‘attrb.’ as these will need to be handled differently.
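The counting and highlighting rule described above could be sketched along the following lines. This is only an illustration under assumptions: the `(source, headword)` tuples, the function name and the output dictionary keys are hypothetical, and the real script produces highlighted table output rather than boolean flags.

```python
from collections import Counter


def flag_headwords(candidates, all_entries):
    # candidates: (source, headword) pairs for one-sense entries.
    # all_entries: (source, headword) pairs for every entry in the data.
    # source is assumed to be either "DOST" or "SND".
    counts = Counter(all_entries)
    rows = []
    for source, headword in candidates:
        other = "SND" if source == "DOST" else "DOST"
        same_count = counts[(source, headword)]
        other_count = counts[(other, headword)]
        rows.append({
            "source": source,
            "headword": headword,
            "same_count": same_count,
            "other_count": other_count,
            # 'Red' if the headword appears more than once in its own dictionary.
            "flag_same": same_count > 1,
            # 'Red' if the headword appears at all in the other dictionary.
            "flag_other": other_count > 0,
        })
    return rows


# Hypothetical sample data: four SND entries for 'bile', one in DOST.
all_entries = [("SND", "bile")] * 4 + [("DOST", "bile"), ("DOST", "abak")]
candidates = [("DOST", "abak"), ("DOST", "bile"), ("SND", "bile")]
for row in flag_headwords(candidates, all_entries):
    print(row)
```

A row with neither flag set is the only kind that could safely be treated as monosemous on this measure.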
All of the above took up most of the week, but I did have a bit of time to devote to HT/OED linking issues, including writing up my notes and listing action items following last Friday’s meeting and beginning to tick off a few of the items from this list. Pretty much all I managed to do was linked to the issue of HT lexemes with identical details appearing in multiple categories, and updating the output of an existing script to make it more useful.
Point 2 on my list was “I will create a new version of the non-unique HT words (where a word with the same ‘word’, ‘startd’ and ‘endd’ appears in multiple categories) to display how many of these are linked to OED words and how many aren’t”. I updated the script to add in a yes/no column for where there are links. I’ve also added in additional columns that display the linked OED lexeme’s details. Of the 154428 non-unique words 129813 (about 84%) are linked.
Point 3 was “I will also create a version of the script that just looks at the word form and ignores dates”. I’ve decided against doing this as just looking at word form without dates is going to lead to lots of connections being made where they shouldn’t really exist (e.g. all the many forms of ‘strike’).
Point 4 was “I will also create a version of the script that notes where one of the words with the same details is matched and the other isn’t, to see whether the non-matched one can be ticked off” and this has proved both tricky to implement and pretty useful. Tricky because a script can’t just compare the outputted forms sequentially: each identical form needs to be compared with every other. But as I say, it’s given some good results. There are 9056 words that aren’t matched but probably should be, which could potentially be ticked off. Of course, this isn’t going to affect the OED ‘ticked off’ stats, but rather the HT stats. I’ve also realised that this script currently doesn’t take POS into consideration (it just looks at word form, firstd and lastd), so this might need further work.
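One way to avoid a sequential comparison is to group the lexemes by their shared details first and then look inside each group. The Python sketch below does that rather than comparing every pair explicitly, so it is an alternative technique, not necessarily how the script works. The field names `oed_id` and `id`, the function name and the sample data are assumptions. Like the script, it ignores POS.

```python
from collections import defaultdict


def unmatched_with_matched_twin(lexemes):
    # Group lexemes that share the same word form and dates.
    groups = defaultdict(list)
    for lexeme in lexemes:
        key = (lexeme["word"], lexeme["firstd"], lexeme["lastd"])
        groups[key].append(lexeme)

    candidates = []
    for group in groups.values():
        # A group is interesting if at least one member is linked to the OED...
        if any(lexeme["oed_id"] is not None for lexeme in group):
            # ...in which case its unlinked members might be ticked off too.
            candidates.extend(l for l in group if l["oed_id"] is None)
    return candidates


# Hypothetical sample data; oed_id is None where there is no OED link.
lexemes = [
    {"id": 1, "word": "strike", "firstd": 1300, "lastd": 1500, "oed_id": 77},
    {"id": 2, "word": "strike", "firstd": 1300, "lastd": 1500, "oed_id": None},
    {"id": 3, "word": "strike", "firstd": 1600, "lastd": 1700, "oed_id": None},
    {"id": 4, "word": "blench", "firstd": 1200, "lastd": 1400, "oed_id": None},
]
print([l["id"] for l in unmatched_with_matched_twin(lexemes)])  # [2]
```

Adding POS to the grouping key would be the obvious next step if the results turn out to include too many false positives.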
I’m going to be on holiday next week and away at a conference for most of the following week, so this is all from me for a while.