As with the past few weeks, I spent a fair amount of time this week on the HT / OED data linking issue. I updated the ‘duplicate lexemes’ tables to add some additional information. For HT categories the catid now links through to the category on the HT website, and each listed word has an [OED] link after it that performs a search for the word on the OED website, as currently happens with words on the HT website. For OED categories the [OED] link leads directly to the sense on the OED website, using a combination of ‘refentry’ and ‘refid’.
I then created a new script that lists HT / OED categories where all the words match (HT and OED stripped forms are the same and HT startdate matches OED GHT1 date) or where all HT words match and there are additional OED forms (hopefully ‘new’ words), with the latter appearing in red after the matched words. Quite a large percentage of categories either have all their words matching or have everything matching except a few additional OED words (note that ‘OE’ words are not included in the HT figures):
For 01: 82300 out of 114872 categories (72%) are ‘full’ matches; 335195 out of 388189 HT words match (86%); 335196 out of 375787 OED words match (89%).
For 02: 20295 out of 29062 categories (70%) are ‘full’ matches; 106845 out of 123694 HT words match (86%); 106842 out of 119877 OED words match (89%).
For 03: 57620 out of 79248 categories (73%) are ‘full’ matches; 193817 out of 223972 HT words match (87%); 193186 out of 217771 OED words match (89%).
It’s interesting how consistent the level of matching is across all three branches of the thesaurus.
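The category-level comparison described above can be sketched roughly as follows. This is a simplification, not the actual script: I'm representing each word as a (stripped form, date) tuple, where the date is the HT startdate on one side and the OED GHT1 date on the other, and the function name and labels are illustrative.

```python
# Hypothetical sketch of the category-level match rule: a category is a
# 'full' match when the HT and OED word sets are identical, and an
# 'ht-subset' match when every HT word matches but the OED side has
# additional forms (the ones shown in red in the script's output).
def match_category(ht_words, oed_words):
    """ht_words / oed_words: sets of (stripped_form, date) tuples."""
    if ht_words == oed_words:
        return "full"
    if ht_words <= oed_words:
        return "ht-subset"  # all HT words match; extra OED forms remain
    return "partial"

print(match_category({("run", 1000)}, {("run", 1000)}))                  # full
print(match_category({("run", 1000)}, {("run", 1000), ("jog", 1565)}))   # ht-subset
```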
I also received a new batch of XML data from the OED, which will need to replace the existing OED data that we’re working with. Thankfully I have set things up so that the linking of OED and HT data takes place in the HT tables; for example, the link between an HT and an OED category is established by storing the primary key of the OED category as a foreign key in the corresponding row of the HT category table. This means that swapping out the OED data should (or at least I thought it should) be pretty straightforward.
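The linking arrangement can be illustrated with a tiny sketch. The table and column names here (ht_category, oed_category, oedcid) are stand-ins rather than the real schema, and I'm using SQLite purely for illustration; the point is that because the link lives on the HT side, the OED table can be emptied and reloaded from a new export without touching the links, provided the CIDs stay stable:

```python
import sqlite3

# Illustrative schema: the HT category row carries the OED CID as a
# foreign key, so the OED table is disposable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE oed_category (cid INTEGER PRIMARY KEY, path TEXT, heading TEXT)")
cur.execute("CREATE TABLE ht_category (catid INTEGER PRIMARY KEY, heading TEXT, "
            "oedcid INTEGER REFERENCES oed_category(cid))")
cur.execute("INSERT INTO oed_category VALUES (139993, '02.03.02', 'Ancient Greek philosophy')")
cur.execute("INSERT INTO ht_category VALUES (231136, 'Ancient Greek philosophy', 139993)")

# Swap out the OED data: drop the old rows, load the new export. The path
# has changed but the CID has not, so the link survives.
cur.execute("DELETE FROM oed_category")
cur.execute("INSERT INTO oed_category VALUES (139993, '02.01.15.02', 'Ancient Greek philosophy')")
row = cur.execute("SELECT h.catid, o.heading FROM ht_category h "
                  "JOIN oed_category o ON h.oedcid = o.cid").fetchone()
print(row)  # (231136, 'Ancient Greek philosophy')
```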
I ran the new dataset through the script I’d previously created that goes through all of the OED XML, extracts category and lexeme data and inserts it into SQL tables. As was expected, the new data contains more categories than the old data. There are 238697 categories in the new data and 237734 categories in the old data, so it looks like 963 new categories. However, I think it’s likely to be more complicated than that. Thankfully the OED categories have a unique ID (called ‘CID’ in our database). In the old data this increments from 1 to 237734 with no gaps. In the new data there are lots of new categories that start with an ID greater than 900000. In fact, there are 1219 categories with such IDs. These are presumably new categories, but note that there are more categories with these new IDs than there are ‘new’ categories in the new data, meaning some existing categories must have been deleted. There are 237478 categories with an ID less than 900000, meaning 256 categories have been deleted. We’re going to have to work out what to do with these deleted categories and any lexemes contained within them (which presumably might have been moved to other categories).
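The arithmetic above comes out of a straightforward comparison of the two sets of CIDs. A minimal sketch, with small stand-in sets in place of the real 237734 old and 238697 new CIDs:

```python
# Stand-in CID sets; in reality these would be loaded from the old and
# new SQL tables.
old_cids = {1, 2, 3, 4}
new_cids = {1, 2, 4, 900001, 900002}

added = new_cids - old_cids     # categories only in the new data (the 9xxxxx block)
deleted = old_cids - new_cids   # categories dropped from the new data
high_ids = {c for c in new_cids if c >= 900000}

# The net change in category count equals additions minus deletions,
# which is how 1219 new IDs and 256 deletions give only 963 'new' categories.
print(len(added), len(deleted), len(new_cids) - len(old_cids))
```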
Another complication is that the ‘Path’ field in the new OED data has been reordered to make way for changes to categories. For example, the OED category with the path ‘02.03.02’ and POS ‘n’ in the old data is 139993 ‘Ancient Greek philosophy’. In the new OED data the category with the path ‘02.03.02’ and POS ‘n’ is 911699 ‘badness or evil’, while ‘Ancient Greek philosophy’ now appears as ‘02.01.15.02’. Thankfully the CID field does not appear to have been changed; for example, CID 139993 in the new data is still ‘Ancient Greek philosophy’ and still therefore links to the HT catid 231136 ‘Ancient Greek philosophy’, which has the ‘oedmainat’ of 02.03.02. I note that our current ‘t’ number for this category is actually ‘02.01.15.02’, so perhaps the updates to the OED’s ‘path’ field bring it into line with the HT’s current numbering. I’m guessing that the situation won’t be quite as simple as that in all cases, though.
Moving on to lexemes, there are 751156 lexemes in the new OED data and 715546 in the old OED data, meaning there are some 35,610 ‘new’ lexemes. As with categories, I’m guessing it’s not quite as simple as that, as some old lexemes may have been deleted too. Unfortunately, the OED does not have a unique identifier for lexemes in its data. I generate an auto-incrementing ID when I import the data, but as the order of the lexemes has changed between datasets, the ID in the ‘old’ set does not correspond to the ID in the ‘new’ set. For example, the last lexeme in the ‘old’ set has an ID of 715546 and is ‘line’ in the category 237601. In the new set the lexeme with the ID 715546 is ‘melodica’ in the category 226870.
The OED lexeme data has two fields which sort of look like unique identifiers: ‘refentry’ and ‘refid’. The former is the ID for a dictionary entry while the latter is the ID for the sense. So for example refentry 85205 is the dictionary entry for ‘Heaven’ and refid 1922174 is the second sense, allowing links to individual senses, as follows: http://www.oed.com/view/Entry/85205#eid1922174. Unfortunately in the OED lexeme table neither of these IDs is unique, either on its own or in combination. For example, the lexeme ‘abaca’ has a refentry of 37 and a refid of 8725393, but there are three lexemes with these IDs in the data, associated with categories 22927, 24826 and 215239.
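The URL pattern used for those sense links is simple enough to capture in a one-line helper (the function name here is my own, but the URL structure is exactly the one shown above):

```python
# Build an OED sense URL from a dictionary entry ID (refentry) and a
# sense ID (refid), following the pattern seen in the 'Heaven' example.
def oed_sense_url(refentry, refid):
    return f"http://www.oed.com/view/Entry/{refentry}#eid{refid}"

print(oed_sense_url(85205, 1922174))
# http://www.oed.com/view/Entry/85205#eid1922174
```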
I was hoping that the combination of refentry, refid and category ID would be unique and serve as a primary key, and I therefore wrote a script to check for this. Unfortunately this script demonstrated that these three fields are not sufficient to uniquely identify a lexeme in the OED data. There are 5586 cases where the same refentry and refid combination appears more than once within a category. Even more strangely, these occurrences frequently have different lemmas and different dates associated with them. For example: ‘Ecliptic circle’ (1678-1712) and ‘ecliptic way’ (1712-1712) both have 59369 as refentry and 5963672 as refid.
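The check itself boils down to counting how often each (cid, refentry, refid) triple occurs; any count above one rules the triple out as a primary key. A minimal sketch with illustrative rows (the cid of 100 for the ‘ecliptic’ pair is made up for the example):

```python
from collections import Counter

# Count occurrences of each (cid, refentry, refid) triple; duplicates
# mean the triple cannot uniquely identify a lexeme.
lexemes = [
    {"cid": 100, "refentry": 59369, "refid": 5963672, "lemma": "Ecliptic circle"},
    {"cid": 100, "refentry": 59369, "refid": 5963672, "lemma": "ecliptic way"},
    {"cid": 101, "refentry": 37,    "refid": 8725393, "lemma": "abaca"},
]
counts = Counter((l["cid"], l["refentry"], l["refid"]) for l in lexemes)
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {(100, 59369, 5963672): 2}
```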
While some other entries are clearly erroneous duplicates (e.g. half-world (1615-2013) and 3472: half-world (1615-2013), which share the same refentry (83400) and refid (1221624180)), the above example and others like it are (I guess) legitimate and would not be fixed by removing duplicates, so we can’t rely on a combination of cid, refentry and refid to uniquely identify a lexeme.
Based on the data we’d been given by the OED, uniquely identifying an OED lexeme would require including the actual ‘lemma’ field and/or the date fields. We can’t introduce our own unique identifier, as it would be redefined every time new OED data is imported, so we will have to rely on a combination of OED fields to uniquely identify a row in order to link one OED lexeme to one HT lexeme. But if we rely on the ‘lemma’ or date fields, the risk is that these might change between OED versions, and the link would then break.
To try and find a resolution to this issue I contacted James McCracken, who is the technical guy at the OED. I asked him whether there is some other field that the OED uses to uniquely identify a lexeme that was perhaps not represented in the dataset we had been given. James was extremely helpful and got back to me very quickly, stating that the combination of ‘refentry’ and ‘refid’ uniquely identifies the dictionary sense, but that a sense can contain several different lemmas, each of which may generate a distinct item in the thesaurus, and these distinct items may co-occur in the same thesaurus category. He did, however, note that in the source data there’s also a pointer to the lemma (‘lemmaid’), which wasn’t included in the data we had been given. James pointed out that this field is only included when a lemma appears more than once in a category, but that we should therefore be able to use CID, refentry, refid and (where present) lemmaid to uniquely identify a lexeme. James very helpfully regenerated the data so that it included this field.
Once I received the updated data I updated my database structure to add in a new ‘lemmaid’ field and ran the new data through a slightly updated version of my migration script. The new data contains the same number of categories and lexemes as the dataset I’d been sent earlier in the week, so that all looks good. Of the lexemes there are 33283 that now have a lemmaid, and I also updated my script that looks for duplicate words in categories to check the combination of refentry, refid and lemmaid.
After adding in the new lemmaid field, the number of listed duplicates has decreased from 5586 to 1154. Rows such as ‘Ecliptic way’ and ‘Ecliptic circle’ have now been removed, which is great. There are still a number of duplicates listed that are presumably erroneous, for example ‘cock and hen (1785-2006)’ appears twice in CID 9178 and neither form has a lemmaid. Interestingly, the ‘half-world’ erroneous(?) duplicate example I gave previously has been removed as one of these has a ‘lemmaid’.
Unfortunately there are still rather a lot of what look like legitimate lemmas that have the same refentry and refid but no lemmaid. Although these point to the same dictionary sense, they generally have different word forms and in many cases different dates. E.g. in CID 24296: poor man’s treacle (1611-1866) [Lemmaid 0] and countryman’s treacle (1745-1866) [Lemmaid 0] have the same refentry (205337, 205337) and refid (17724000, 17724000). We will need to continue to think about what to do with these next week, as we really need to be able to identify individual lexemes in order to match things up properly with the HT lexemes. So this is a ‘to be continued’.
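To illustrate why lemmaid helps but doesn’t fully solve the problem, here is the same duplicate check extended to the four-part key, using the ‘treacle’ pair from above (field layout illustrative; lemmaid taken as 0 where absent):

```python
from collections import Counter

# Extend the key to (cid, refentry, refid, lemmaid). Rows with distinct
# lemmaids now separate, but pairs like the 'treacle' lemmas, where both
# have lemmaid 0, still collide.
lexemes = [
    {"cid": 24296, "refentry": 205337, "refid": 17724000, "lemmaid": 0,
     "lemma": "poor man's treacle"},
    {"cid": 24296, "refentry": 205337, "refid": 17724000, "lemmaid": 0,
     "lemma": "countryman's treacle"},
]
counts = Counter((l["cid"], l["refentry"], l["refid"], l["lemmaid"]) for l in lexemes)
print({key: n for key, n in counts.items() if n > 1})
# {(24296, 205337, 17724000, 0): 2}
```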
Also this week I spent some time in communication with the DSL people about issues relating to extracting their work-in-progress dictionary data and updating the ‘live’ DSL data. I can’t really go into detail about this yet, but I’ve arranged to visit the DSL offices next week to explore this further. I also made some tweaks to the DSL website (including creating a new version of the homepage) and spoke to Ann about the still-in-development WordPress version of the website and a long list of changes that she had sent me to implement.
I also tracked down a bug in the REELS system that was resulting in place-name element descriptions being overwritten with blanks in some situations. It would appear to only occur when associating place-name elements with a place when the ‘description’ field had carriage returns in it. When you select an element by typing characters into the ‘element’ box to bring up a list of matching elements and then select an element from the list, a request is sent to the server to bring back all the information about the element in order to populate the various boxes in the form relating to the element. However, literal line-break characters (newlines and carriage returns) are not valid inside JSON strings unless they are escaped. When an element description contained such characters, the returned file couldn’t be parsed properly by the script. Form elements up to the description field were getting automatically filled in, but the description field was being left blank. Then when the user pressed the ‘update’ button the script assumed the description field had been updated (to clear the contents) and deleted the text in the database. Once I identified this issue I updated the script that grabs the information about an element so that the characters that break JSON files are removed, so hopefully this will not happen again.
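The fix can be sketched like this (the real system isn’t Python, and the field name is illustrative, but the idea is the same: strip literal line breaks from the description before it goes into the JSON response; a proper JSON encoder such as json.dumps would escape them instead):

```python
import json
import re

# Replace literal carriage returns / newlines with a single space before
# the description is written into the JSON response.
def sanitise(description):
    return re.sub(r"[\r\n]+", " ", description).strip()

desc = "A burn or stream.\r\nAlso used of small rivers."
print(json.dumps({"description": sanitise(desc)}))
```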
Also this week I updated the transcription case study on the Decadence and Translation website to tweak a couple of things that were raised during a demonstration of the system and I created a further timeline for the RNSN project, which took most of Friday afternoon.