I spent a fair amount of time this week working on Historical Thesaurus duties following our team meeting on Friday last week. At the meeting Marc had mentioned another thesaurus that we are potentially going to host at Glasgow, the Bilingual Thesaurus of Medieval England. Marc gave me access to the project’s data and I spent some time looking through it and figuring out how we might be able to develop an online resource for it that would be comparable to the thesauri we currently host. I met with Marc and Fraser again on Tuesday to discuss the ongoing issue of matching up the HT and OED datasets, and prior to the meeting I spent some time getting back up to speed with the issues involved and figuring out where we’d left off. I also created some CSV files containing the unmatched data for us to use.
The meeting itself was pretty useful and I came out of it with a list of several new things to do, which I focussed on for much of the remainder of the week. This included writing a script that goes through each unmatched HT category, brings back the non-OE words and compares these with the ght_lemma field of all the words in unmatched OED categories. The script outputs a table featuring information about the categories as well as the words, and I think the output will be useful for identifying unmatched categories as well as words contained therein. Also at the meeting we’d noticed that if you performed a search on the front-end that contained an apostrophe the search itself worked, but following a link in the search results to a word that also contained an apostrophe didn’t work. I added in a bit of urlencoding magic and that sorted the issue.
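As a rough illustration of the apostrophe fix (a sketch in Python rather than the site's actual code, with a made-up URL path and a hypothetical `word` parameter), percent-encoding the word before it goes into the link keeps the apostrophe from breaking the URL:

```python
from urllib.parse import quote, unquote


def word_link(base_url: str, word: str) -> str:
    """Build a link to a word page, percent-encoding the word.

    base_url and the 'word' query parameter are hypothetical names
    used only for this illustration.
    """
    # quote() turns the apostrophe into %27 and the space into %20
    return f"{base_url}?word={quote(word, safe='')}"


link = word_link("/search/result", "fool's gold")
decoded = unquote("fool%27s%20gold")  # what the receiving page would decode
```

Here `link` comes out as `/search/result?word=fool%27s%20gold`, and decoding the parameter on the receiving page gives back the original word.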
I also created a few more scripts aimed at identifying categories and words to match (or at identifying things that would have no matches). These included a script to display unmatched HT and OED categories that have non-alphanumeric characters in them, a script that creates a CSV output of HT words that don’t include OE words (as the OED does not include OE words) and another script that identifies categories that have ‘pertaining to’ in their headings.
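The two heading checks could be sketched along these lines. This is only an illustration in Python: the function names are invented, and treating whitespace as acceptable in the non-alphanumeric check is an assumption rather than something the real scripts are known to do.

```python
import re

# Assumption: letters, digits and whitespace count as "ordinary" characters
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]")


def has_non_alphanumeric(heading: str) -> bool:
    """True if the heading contains punctuation or other symbols."""
    return bool(NON_ALNUM.search(heading))


def is_pertaining_to(heading: str) -> bool:
    """True if the heading contains 'pertaining to' (case-insensitive)."""
    return "pertaining to" in heading.lower()


# Hypothetical headings, purely for demonstration
headings = ["rock/stone", "rock stone", "Pertaining to the sea"]
flagged = [h for h in headings if has_non_alphanumeric(h)]
pertaining = [h for h in headings if is_pertaining_to(h)]
```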
I also created a script that generated the full hierarchical pathway for each unmatched HT and OED category and then ran a Levenshtein test to figure out which OED path was the closest to which HT path (in the same part of speech). It took the best part of a morning to write the script, and the script itself took about 30 minutes to run, but unfortunately the output is not going to be much use in identifying potential matches.
For every unmatched HT category the script currently displays the OED category with the lowest Levenshtein score when comparing the full hierarchy of each. There’s very little in the way of matches that are of any value, but things might improve with some tweaking. As it stands the script generates the full HT hierarchy within the chosen POS, meaning for non-nouns the hierarchy generally doesn’t go all the way to the top. I could potentially use the noun hierarchy instead. Similarly, for the OED data I’ve kept within the POS, which means it hasn’t taken into consideration the top level OED categories that have no POS. Also, rather than generating the full hierarchy we might have more luck if we just looked at a smaller slice, for example two levels up from the current main cat, plus full subcat hierarchy. But even this might result in some useless results – e.g. the HT adverb ‘>South>most’ currently has as its closest match the OED adverb ‘>four>>four’ with a Levenshtein score of 6. But clearly it’s not a valid match.
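To make the comparison concrete, here is a minimal sketch of the approach in Python. It is not the script itself: the `(pos, path)` tuples and the function names are assumptions made for the example, and the real data comes from the database rather than hard-coded lists.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (case-sensitive)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_paths(ht_paths, oed_paths):
    """For each (pos, path) HT entry, find the OED path in the same
    part of speech with the lowest Levenshtein score."""
    results = []
    for pos, ht_path in ht_paths:
        candidates = [p for oed_pos, p in oed_paths if oed_pos == pos]
        if not candidates:
            results.append((pos, ht_path, None, None))
            continue
        best = min(candidates, key=lambda p: levenshtein(ht_path, p))
        results.append((pos, ht_path, best, levenshtein(ht_path, best)))
    return results


# Hypothetical input: the adverb example from above plus a noun decoy
ht = [("adv", ">South>most")]
oed = [("adv", ">four>>four"), ("n", ">South>most")]
matches = closest_paths(ht, oed)
```

With this input `matches` is `[("adv", ">South>most", ">four>>four", 6)]`, which reproduces the score of 6 and shows the weakness: the lowest score within a part of speech is still returned even when the paths share nothing meaningful. Comparing every HT path against every OED path is also quadratic, which goes some way to explaining a long run time.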
My final script was one that identifies empty HT categories (or those that only include OE words). I figured that a lot of these probably don’t need to match up to an OED category. I also included any empty OED categories (not including the ‘top level’ OED categories that have no part of speech and are empty). Out of the 12034 unmatched HT cats 4977 are empty or only contain OE words. Out of the 6648 unmatched OED categories that have a POS there are 1918 that are empty. Hopefully we can do something about ticking these off as checked at some point.
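The test for an 'empty' category is simple enough to sketch. The following Python is an illustration only: the `Word` and `Category` classes and the `is_oe` flag are hypothetical stand-ins for however the database actually records this.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    lemma: str
    is_oe: bool = False  # hypothetical flag marking Old English words


@dataclass
class Category:
    cat_id: int
    heading: str
    words: list = field(default_factory=list)


def is_effectively_empty(cat: Category) -> bool:
    """True if the category has no words, or only OE words.

    all() over an empty list is True, so truly empty categories
    are caught by the same test.
    """
    return all(w.is_oe for w in cat.words)


# Hypothetical sample data
cats = [
    Category(1, "empty one"),
    Category(2, "OE only", [Word("hlaf", is_oe=True)]),
    Category(3, "mixed", [Word("hlaf", is_oe=True), Word("loaf")]),
]
empty_ids = [c.cat_id for c in cats if is_effectively_empty(c)]
```

Here `empty_ids` is `[1, 2]`: the category with a non-OE word is left out.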
While going through this data I made a slightly worrying discovery. At the meeting we’d found an OED word that referenced an OED category ID that didn’t exist in our database, which seemed rather odd. The next day I discovered another, and I figured out what was going on. It would appear that when the OED data was uploaded from their XML files to our database, any OED category or word that included an apostrophe silently failed to upload. This means many potential matches that should have been spotted by the countless sweeps through the data that we’ve already done have been missed, as the corresponding OED data simply wasn’t there. I ran the XML through another script to count the OED categories and words that include apostrophes and there are 1,843 categories and 26,729 words (the latter figure is so high because apostrophes in word definitions also caused words to fail to upload). This is not good news and it’s something we’re going to have to investigate next week. However, it does mean we should be able to match up more HT categories and words than we had previously matched, which is at least a sort of silver lining.
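A count like this could be sketched as follows. This is an assumption-heavy illustration in Python: the element and attribute names (`category`, `heading`, `word`, `lemma`, `definition`) are invented, as the structure of the OED XML isn't described here, and checking for both straight and curly apostrophes is a guess at what would be needed.

```python
import xml.etree.ElementTree as ET

# Assumption: both the straight and the typographic apostrophe matter
APOSTROPHES = ("'", "\u2019")


def has_apostrophe(text) -> bool:
    """True if text is present and contains an apostrophe."""
    return text is not None and any(ch in text for ch in APOSTROPHES)


def count_apostrophes(xml_text: str):
    """Return (categories, words) containing apostrophes.

    A word counts if either its lemma or its definition has one.
    """
    root = ET.fromstring(xml_text)
    cats = sum(1 for c in root.iter("category")
               if has_apostrophe(c.get("heading")))
    words = sum(1 for w in root.iter("word")
                if has_apostrophe(w.findtext("lemma"))
                or has_apostrophe(w.findtext("definition")))
    return cats, words


# Tiny hypothetical sample in the invented structure
SAMPLE = """
<thesaurus>
  <category heading="fool's errand">
    <word><lemma>fool's errand</lemma><definition>a pointless task</definition></word>
    <word><lemma>goose chase</lemma><definition>a hunter's futile pursuit</definition></word>
  </category>
  <category heading="travel">
    <word><lemma>journey</lemma><definition>an act of travelling</definition></word>
  </category>
</thesaurus>
"""
counts = count_apostrophes(SAMPLE)
```

For the sample, `counts` is `(1, 2)`: one category heading, plus two words (one via its lemma, one via its definition). If the upload failure turns out to be caused by unescaped apostrophes in SQL statements, which is only a guess at this stage, parameterised queries would be the usual fix.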
Other than HT duties I did small bits of work for a number of different projects. I generated some data for Carole for the REELS project from the underlying database, and investigated a possible issue with the certainty levels for place-names (which thankfully turned out not to be an issue at all). I also responded to a couple of queries from Thomas Widmann of SLD, started to think about the new Galloway Glens place-name project and updated the images and image credits that appear on Matthew Creasy’s Decadence and Translation website.
I also spent the best part of a day preparing for this year’s P&DR process, ahead of my meeting next week.