I returned to work on Monday after being off last week. As usual there were a number of things waiting for me to sort out when I got back, so most of Monday was spent catching up. This included replying to Scott Spurlock about his Crowdsourcing project, responding to a couple of DSL related issues, updating access restrictions on the SPADE website, reading through the final versions of the DMP and other documentation for Matt Sangster and Katie Halsey’s project, updating some details on the Medical Humanities Network website, responding to a query about the use of the Thesaurus of Old English, and speaking to Thomas Clancy about his Iona proposal.
With all that out of the way I returned to the OED / HT data linking issues for the Historical Thesaurus. In my absence last week Marc and Fraser had made some further progress with the linking, and had suggested which strategies I should attempt to implement next. Before I left I was very much in the middle of working on a script that matched words and dates, and I hadn’t had time to figure out why this script was bringing back no matches. It turns out the HT ‘fulldate’ field was using long dashes, whereas I was joining the OED GHT dates with a short dash, so every comparison failed. I replaced the long dashes with short ones and the script then displayed 2733 ‘full matches’ (where every stripped lexeme and its dates match) and 99 ‘partial matches’ (where there are more than six lexemes and at least 80% of them match on both dates and stripped lexeme text). I also added in a new column that counts the number of matches when dates are not taken into account.
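To illustrate the dash problem, here’s a minimal sketch in Python. The field names and helper are hypothetical – the real scripts work directly against the HT and OED database tables – but the mechanism is the same:

```python
# A minimal sketch of the dash mismatch, assuming hypothetical field names.
# The HT 'fulldate' strings used long dashes (en/em dashes), while the
# OED GHT start and end dates were joined with a plain hyphen, so a
# direct string comparison could never succeed.

def normalise_dashes(date_str: str) -> str:
    """Replace en and em dashes with a plain hyphen."""
    return date_str.replace('\u2013', '-').replace('\u2014', '-')

def full_dates_match(ht_fulldate: str, ght_start: str, ght_end: str) -> bool:
    # Build the OED date string the same way the matching script does,
    # then compare against the normalised HT date.
    oed_fulldate = f'{ght_start}-{ght_end}'
    return normalise_dashes(ht_fulldate) == oed_fulldate

# e.g. full_dates_match('1535\u20131600', '1535', '1600') -> True
```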
Marc had alerted me to an issue where the number of OED matches was coming back as more than 100%, so I then spent some time trying to figure out what was going on here. I updated both the ‘with dates’ and ‘no date check’ versions of the lexeme pattern matching scripts to add the text ‘perc error’ to any percentage that’s greater than 100, to make it easier to search for all occurrences. There are none to be found in the script with dates, as matches are only added to the percentage score if their dates match too. On the ‘no date check’ script there are several of these ‘perc error’ rows, and they’re caused for the most part by a stripped form of a word being identical to an existing non-stripped form. E.g. there are separate lexemes ‘she’ and ‘she-’ in the HT data, and the dash gets stripped, so ‘she’ in the OED data ends up matching two HT words. There are some other cases that look like errors in the original data, though. E.g. in OED catid 91505 (severity) there are the HT words ‘hard (OE-)’ and ‘hard (c1205-)’, and we surely shouldn’t have this word twice. Finally, there are some forms where stripping out words results in duplicates – e.g. ‘pro and con’ and ‘pro or con’ both end up as ‘pro con’ in both the OED and HT lexemes, leading to 4 matches where there should only be 2. There are no doubt situations where the total percentage is pushed over the 80% threshold or up to 100% by a duplicate match – any duplicate matches where the percentage doesn’t go over 100 are not currently noted in the output. This might need some further work. Or, as I previously said, with the date check incorporated the duplicates are already filtered out, so it might not be so much of an issue.
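As a toy illustration of how these duplicates push a percentage over 100 (the real stripping logic is more involved; this just shows the mechanism):

```python
# Toy illustration only: two distinct HT lexemes collapse to the same
# stripped form, so one OED word scores two matches and the percentage
# of matched OED words comes out above 100.

def strip_form(lexeme: str) -> str:
    """Simplified stripping: drop dashes and collapse whitespace."""
    return ' '.join(lexeme.replace('-', ' ').split())

ht_lexemes = ['she', 'she-']   # two separate entries in the HT data
oed_lexemes = ['she']          # one entry in the OED data

matches = sum(1 for oed in oed_lexemes
                for ht in ht_lexemes
                if strip_form(oed) == strip_form(ht))

percentage = matches / len(oed_lexemes) * 100
print(percentage)  # 200.0 -- this is what gets flagged as 'perc error'
```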
I also then moved on to a new script that looks at monosemous forms. This script gets all of the unmatched OED categories that have a POS and at least one word, and for each of these categories it retrieves all of the OED words. For each word the script queries the OED lexeme table to get a count of the number of times the word appears. Note that this is the full word, not the ‘stripped’ form, as the latter might end up with erroneous duplicates, as mentioned above. Each word is then listed together with its OED date, its GHT dates (in square brackets) and a count of the number of times it appears in the OED lexeme table. If an OED word only appears once (i.e. is monosemous) it appears in bold text. For each of these monosemous words the script then queries the HT data to find out where and how many times each word appears in the unmatched HT categories. All queries keep to the same POS but otherwise look at all unmatched categories, including those without an OEDmaincat. Four different checks are done, with results appearing in different columns: HT words where the full word (not the stripped variety) matches and the GHT start date matches the HT start date; failing that, HT words where the full word matches but the dates don’t; failing either of these, HT words where the stripped forms of the words match and the dates match; failing all of these, HT words where the stripped forms match but the dates don’t. For each of these the HT catid, OEDmaincat (or the text ‘No Maincat’ if there isn’t one), subcat, POS, heading, lexeme and fulldate are displayed. There are lots of monosemous words that just don’t appear in the HT data; these might be new additions, or we might need to try pattern matching. Also, sometimes words that are monosemous in the OED data are polysemous in the HT data. These are marked with a red background in the data (as opposed to green for unique matches). Examples include ‘sedimental’, ‘meteorologically’ and ‘of age’. Any category that has a monosemous OED word that is polysemous in the HT has a red border. I also added some stats below the table: in our unmatched OED categories there are 24184 monosemous forms; there are 8086 OED categories that have at least one monosemous form that matches exactly one HT form; and there are 220 OED monosemous forms that are polysemous in the HT. Now we just need to decide how to use this data.
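In outline, the fall-through logic of the four checks looks something like this – a sketch only, with assumed dictionary structures standing in for the actual database queries:

```python
# A sketch of the four-tier check for one monosemous OED word, using
# assumed record structures in place of the real database queries.
# Each tier is only tried if every previous tier found nothing,
# mirroring the 'failing that' cascade described above.

def find_ht_matches(oed_word, ht_candidates, strip_form):
    tiers = [
        # 1. full word matches and GHT start date matches HT start date
        lambda ht: ht['word'] == oed_word['word']
                   and ht['start'] == oed_word['ght_start'],
        # 2. full word matches but the dates don't
        lambda ht: ht['word'] == oed_word['word'],
        # 3. stripped forms match and the dates match
        lambda ht: strip_form(ht['word']) == strip_form(oed_word['word'])
                   and ht['start'] == oed_word['ght_start'],
        # 4. stripped forms match but the dates don't
        lambda ht: strip_form(ht['word']) == strip_form(oed_word['word']),
    ]
    for tier_number, test in enumerate(tiers, start=1):
        matches = [ht for ht in ht_candidates if test(ht)]
        if matches:
            return tier_number, matches
    return None, []
```

A word that comes back with exactly one HT match at some tier would then get the green treatment, while a monosemous OED word with more than one HT match gets the red.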
Also this week I looked into an issue one of the REELS team was having when accessing the content management system (it turns out that some anti-virus software was mislabelling the site as containing some kind of phishing software), and responded to a query about the Decadence and Translation Network website I’d set up. I also started to look at sourcing some Data Management Plans for an Arts Lab workshop that Dauvit Broun has asked me to help with next week, and began preparing my presentation for the Digital Editions workshop next week, which took a fair amount of time. On Friday morning I met with Jennifer Smith and a new member of the SCOSYA project team to discuss the project and to show the new member of staff how the content management system works. It looks like my involvement with this project might be starting up again fairly soon.
On Tuesday Jeremy Smith contacted me to ask me to help out with a very last-minute proposal that he is putting together. I can’t say much about the proposal, but it had a very tight deadline and required rather a lot of my time from the middle of the week onwards (and even into the weekend). This involved lots of email exchanges, time spent reading documentation, meeting with Luca, who might be doing the technical work for the project if it gets funded, and writing a Data Management Plan for the project. All of this meant that I was unable to spend time on other projects I’d hoped to work on this week, such as the Bilingual Thesaurus. Hopefully I’ll have time to get back into it next week, once the workshops are out of the way.