I was off on Monday this week, and managed to break a rib whilst coming down a water slide over the weekend so it’s been a bit of a painful week. Thankfully it’s not had any impact on the work I’ve managed to do, though. As with previous weeks recently, I spent a lot of the week continuing with the process of matching up the HT and OED datasets. I had another useful meeting with Marc and Fraser on Tuesday and created or executed a number of scripts to bring our total of unmatched categories ever closer to zero. I ran my ‘parent category matches’ script to tick off all the matches where the OED and HT subcats are definite matches (one to one). I also ticked off matches where there are multiple possible HT subcat matches (same stripped text) but one of the HT subcat numbers is exactly the same as the OED subcat number. After running the script the number of unmatched OED categories that have a part of speech went down from 5092 to 3318. A side effect of running this script was that it also ticked off almost all of the potential matches that had been suggested by another script I’d created, which then listed only 22 matches instead of hundreds, of which 15 were definite. I ticked these off too, and made some changes to a couple of other scripts that Marc and Fraser wanted to work with, such as removing categories without a part of speech from a script that identifies gaps in the OED catid number sequence.
After this process we are down to 3303 unmatched OED categories that have a part of speech, and of these 426 are main categories, which is something that is confusing Fraser somewhat. To help reduce confusion I updated the ‘parent category matches’ script so the output is now tabular. Where an OED subcat has a parent that matches an HT category that has no unmatched subcats I now check the OED subcats with cids before and after the OED subcat in question. If these are matched then this is noted in green in the final column. I’ve also added a count of these underneath the table in green (there are 133 such categories but note that if the OED subcat is the first or last in the category then the cid before or after will be of a maincat).
I’ve also created a new script that lists the OE only categories that are matched. This lists the words found in the HT and OED categories. There are 344 OE only categories that match an OED category. Of these, only 51 of the matched OED categories contain words.
Later on in the week I returned to the matching issue and played around with some other possible methods of matching up the remaining OED categories. I created a new script that lists the unmatched OED categories that have a POS and looks for unmatched HT categories that have the same stripped heading and POS while ignoring the catnum / subcat. It finds 1613 potential matches, although of these 240 have multiple possible matches (e.g. ‘specific’ has 22 possible matches in the HT). Of the ones that don’t match I think I’ll be able to create some rules to try and find other matches. Of the multiples it might be possible to automatically deduce the closest catnum to suggest a match.
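To give a flavour of this kind of matching, here is a minimal sketch in Python (used purely for illustration). The field names, the `strip_heading` normalisation and the sample rows are all invented for the example and are not the real HT / OED data or the real stripping rules. It indexes the unmatched HT categories by stripped heading and POS, ignoring catnum and subcat, and then looks up each unmatched OED category:

```python
from collections import defaultdict

def strip_heading(heading):
    """Hypothetical normalisation: lower-case and keep only letters and digits."""
    return "".join(ch for ch in heading.lower() if ch.isalnum())

def find_candidates(oed_cats, ht_cats):
    """Map each OED category ID to the HT IDs sharing its stripped heading and POS."""
    index = defaultdict(list)
    for ht in ht_cats:
        index[(strip_heading(ht["heading"]), ht["pos"])].append(ht["id"])
    return {
        oed["id"]: index.get((strip_heading(oed["heading"]), oed["pos"]), [])
        for oed in oed_cats
    }

# Invented sample rows, for illustration only.
oed_unmatched = [
    {"id": 1, "heading": "Specific", "pos": "n"},
    {"id": 2, "heading": "après-ski", "pos": "n"},
    {"id": 3, "heading": "lonely", "pos": "aj"},
]
ht_unmatched = [
    {"id": 10, "heading": "specific", "pos": "n"},
    {"id": 11, "heading": "specific", "pos": "n"},
    {"id": 12, "heading": "Après ski", "pos": "n"},
    {"id": 13, "heading": "specific", "pos": "aj"},
]

candidates = find_candidates(oed_unmatched, ht_unmatched)
single = [oed_id for oed_id, hts in candidates.items() if len(hts) == 1]
multiple = [oed_id for oed_id, hts in candidates.items() if len(hts) > 1]
```

Splitting the results into `single` and `multiple` mirrors the distinction above between straightforward potential matches and headings such as ‘specific’ that have many possible HT matches and need a further rule to pick between them.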
I then tweaked the script to add in a count of words in the OED and HT categories, and last words too. This should make it easier to check whether a potential match is likely. Lots of them are looking encouraging. I’ve also noted that out of the 3303 unmatched OED categories that have a POS, 858 have no words in them so presumably are not so important to match up.
I then updated the script further so that maincats and subcats appear in two separate tables, with maincats listed first. It looks like some of the unmatched maincats are empty categories that have been created for a POS so they sit alongside categories with other POSes. E.g. if you find ‘eleven to ninety-nine’ in the page you’ll see that there are four OED categories for VT, AV, AJ, and N. Of these the HT has ones for AV, AJ and N but not VT. All have no words in them in OED. Note that it looks like the reason these haven’t previously been matched is because of an erroneous HT catnum: 01.07.04.012
Whilst working on this I uncovered a matched category that is incorrect, which is rather worrying. OED category 235880 ‘après-ski’ (03.11.04.13.12.01|17 (n)) is unmatched, but HT category 223643 ‘après-ski’ (oedmaincat 03.11.04.13.12.01|18 (n)) is matched. However, it’s matched to OED category 235870 ‘parts or attachments’ (03.11.04.13.12.01|08.02 (n)). I’m not sure how this has happened, and it made me realise that I needed to create a script that would actually check that the matched data is correct. I decided to write a script that checks the stripped headings of all matched categories and lists those that have a Levenshtein score of more than a certain number, starting with 8. There are 11,666 of these, but the majority are not errors – e.g. ‘Promontory’ and ‘promontory, headland, or cape’. There are some that are definitely errors, though – e.g. ‘seed of’ and ‘turnip plant’.
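As an illustration of the Levenshtein check, here is a small self-contained sketch in Python. The function names are my own for this example and this is not the actual script. It computes the edit distance with the standard dynamic-programming approach and lists any matched pair of stripped headings whose distance is above a threshold (8, as above):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def suspicious_matches(pairs, threshold=8):
    """Return the (HT heading, OED heading) pairs whose distance exceeds the threshold."""
    return [(ht, oed) for ht, oed in pairs if levenshtein(ht, oed) > threshold]

# Heading pairs taken from the examples above, plus one identical pair.
matched = [
    ("promontory", "promontory, headland, or cape"),
    ("seed of", "turnip plant"),
    ("farming", "farming"),
]
flagged = suspicious_matches(matched)
```

Both the legitimate ‘promontory’ pair and the erroneous ‘seed of’ pair end up in `flagged`, which is exactly the weakness of a pure distance test: it finds headings that differ a lot, not headings that are wrong.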
I then made some further updates to the script, adding in category details and also reverting to an exact match of the stripped heading fields rather than using a Levenshtein test, just to be sure. I also excluded any categories where the catnum, subcat and pos are the same for HT and OED but the heading is different, as it seemed like these were all correct. What this leaves is 1403 possible errors. A lot of these are not errors at all, but are legitimate differences in headings (e.g. ‘Resembling animal/bird sounds’ and ‘sounds like animal or bird sounds’) but I’m afraid a lot of them are genuine errors. A lot of them seem to be where there is no ‘oedmaincat’, but not all of them. I think we’re going to have to get someone to go through the list and figure out which are real errors, which shouldn’t take more than an hour or so. I added in a new ‘checktype’ column to the output so we could see whether the errors appeared in the manual or automatically matched data. Most were through the automatic processes.
Marc was concerned that for the incorrectly matched categories there might be a bunch of incorrectly matched HT categories that my script isn’t picking up – e.g. HT ‘foraging equipment’ is set to match OED ‘casting equipment’, which is wrong. But what is HT ‘casting equipment’ matched to then? However, it would appear that the correct match on the HT side is just sitting there unconnected to any OED category. E.g. ‘casting equipment’ in the HT is not connected to any OED category yet. So once the erroneous matches are ‘dematched’ hopefully most of them can be matched up to the correct (and so far unmatched) category.
Marc also wanted to check whether any duplicate matches exist in the system – where more than one HT category points to the same OED category. A quick query of the database showed that there are a few duplicates in the system. Of the 226133 HT categories that have an OEDcatid, only 226025 of the OEDcatids are unique. So there are 108 surplus references, i.e. up to 108 OED categories that are referenced by more than one HT category. Thankfully a tiny number, but something that will need fixed. I created a script to list these and we’ll need to discuss this at the next meeting.
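A duplicate check of this sort can be sketched with a `GROUP BY ... HAVING` query. The example below uses Python’s built-in `sqlite3` module with an in-memory database. The table name `ht_category`, the column name `oedcatid` and the rows are assumptions made up for the illustration, not the real schema or data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ht_category (id INTEGER PRIMARY KEY, oedcatid INTEGER)")
# Invented rows: OED category 101 is referenced twice, 102 three times.
conn.executemany(
    "INSERT INTO ht_category (id, oedcatid) VALUES (?, ?)",
    [(1, 100), (2, 101), (3, 101), (4, None), (5, 102), (6, 102), (7, 102)],
)

# COUNT(column) ignores NULLs, so unmatched HT categories are not counted.
total, distinct = conn.execute(
    "SELECT COUNT(oedcatid), COUNT(DISTINCT oedcatid) FROM ht_category"
).fetchone()

# List each OED category that more than one HT category points to.
duplicates = conn.execute(
    "SELECT oedcatid, COUNT(*) FROM ht_category "
    "WHERE oedcatid IS NOT NULL "
    "GROUP BY oedcatid HAVING COUNT(*) > 1 "
    "ORDER BY oedcatid"
).fetchall()

surplus = total - distinct
```

Note that `total - distinct` counts surplus references rather than duplicated categories: in this toy data the difference is 3 but only 2 OED categories are affected, because one of them is referenced three times. The grouped query gives the actual list to work through.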
Other than working on the HT / OED linking I split my time mostly between two projects: The redesign of the Seeing Speech / Dynamic Dialects websites and the development of the Bilingual Thesaurus. For the former I added in content for all of the remaining ancillary pages. This took a fair amount of time to do as there was lots of working with raw HTML, adding in links, checking them, creating new images and such things. It’s pretty tedious stuff but it’s really worth doing as the new website works so much better than the old one. I also split the homepage up into shorter chunks, with lots of the text getting moved to a new ‘about the project’ page, and shortened the excessively long citation on the Dynamic Dialects site. I also added in a nice ‘top’ button that appears when you scroll down the page, and added in a ‘cite’ option to individual video overlays. I think we’re just about there now. Just the carousel images to update and a questionnaire about the new site to design and implement.
For the Bilingual Thesaurus I began working with the data I’d previously been sent, in JSON format. The file was pretty well structured, although I did have some questions relating to dates and languages. My initial task was to create a single MySQL table into which I would import the JSON data, and a simple PHP script that would go through each object in the JSON data, extract the individual variables and insert these into the table. After a bit of experimentation, I managed to get the data uploaded, resulting in 4779 rows. My next task was to rationalise the data into a relational database structure. For example, the original data had two language types (language of origin and language of citation), which were stored in an array in the JSON file. Each time a language appears its full text is listed, and sometimes the text has an initial question mark to denote uncertainty. Instead of this I created a ‘language’ table where each language (ignoring question marks) is listed once and is given a unique ID. There are 39 different languages in the data. Then I created a joining table that joins a headword entry with however many languages are needed. This table includes a field for the type of join (i.e. whether the language is ‘origin’ or ‘citation’) and a further field noting whether the join is uncertain (for those question marks). It’s a system that will allow much more flexible queries to be performed. I took a similar approach for dates and dictionary links too.
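The language normalisation can be sketched roughly as follows. This is in Python for illustration only (the import script itself was PHP), and the JSON field names (`id`, `origin`, `citation`) and the sample values are assumptions invented for the example rather than the real structure of the file:

```python
import json

def normalise_languages(entries):
    """Build a language lookup and a list of join rows from raw entries."""
    language_ids = {}  # language name -> unique ID
    joins = []         # (headword_id, language_id, join_type, uncertain)
    for entry in entries:
        for join_type in ("origin", "citation"):
            for raw in entry.get(join_type, []):
                uncertain = raw.startswith("?")   # leading '?' marks uncertainty
                name = raw.lstrip("?").strip()
                # Reuse the existing ID, or assign the next one for a new language.
                language_id = language_ids.setdefault(name, len(language_ids) + 1)
                joins.append((entry["id"], language_id, join_type, uncertain))
    return language_ids, joins

# Invented sample data in an assumed JSON shape.
sample = json.loads("""
[
  {"id": 1, "origin": ["?Latin"], "citation": ["Anglo-Norman"]},
  {"id": 2, "origin": ["Latin"], "citation": ["Anglo-Norman", "?Latin"]}
]
""")

languages, language_joins = normalise_languages(sample)
```

Here `languages` plays the role of the ‘language’ table and `language_joins` the joining table, with the question mark turned into a separate true / false field instead of being part of the language name.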
I then set about splitting up the ‘path’ field, which was similarly stored as an array in the JSON file, with each part of the hierarchical path appearing every time it was required for a headword with no unique ID or any other information. This is a lot of duplication of data, and it also means it’s impossible to search for a particular part of the hierarchy, as the same names are used multiple times to represent very different parts of the hierarchy.
I wrote a nice little script that I’m rather pleased with that went through the paths of each headword, extracted each part of the path, checked whether it already existed in my new ‘category’ database, associated the existing entry if there was one and created a new entry and associated that if there wasn’t one. Each part of the path is now listed just once with its own unique identifier and the ID of its parent category. Using this it will then be possible to generate a tree interface to the data.
I wrote a little test script that displays each headword, its original ‘path’ and then the ID and name of each hierarchical level (from bottom to top) in my new database, with the full hierarchy then listed underneath to check the original and generated forms match (which thankfully they all do). For example, ‘Farming’ has been given the ID 431, and each time it appears it’s this unique ‘farming’ category that is displayed. I’m very pleased with how this is all working out so far.
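The path splitting and the check described above could look something like the sketch below, written in Python for illustration. The names and sample paths are invented, and keying each category on its parent ID plus its name is my assumption about how to keep identically named parts of different branches separate, not necessarily how the real script does it:

```python
def build_categories(headwords):
    """Turn repeated path arrays into unique category rows with parent IDs."""
    ids = {}    # (parent_id, name) -> category ID
    rows = {}   # category ID -> (name, parent_id)
    links = {}  # headword -> ID of its lowest-level category
    for headword, path in headwords:
        parent_id = None
        for name in path:
            key = (parent_id, name)
            if key not in ids:           # create the category only once
                ids[key] = len(ids) + 1
                rows[ids[key]] = (name, parent_id)
            parent_id = ids[key]         # descend to the next level
        links[headword] = parent_id
    return rows, links

def full_path(rows, category_id):
    """Walk from a category up to the top and return the path top-down."""
    parts = []
    while category_id is not None:
        name, category_id = rows[category_id]
        parts.append(name)
    return list(reversed(parts))

# Invented sample headwords and paths.
headwords = [
    ("plough", ["Farming", "Tools"]),
    ("sickle", ["Farming", "Tools"]),
    ("hammer", ["Building", "Tools"]),
]
rows, links = build_categories(headwords)

# The regenerated hierarchy should equal the original path for every headword.
all_match = all(full_path(rows, links[hw]) == path for hw, path in headwords)
```

In this toy data ‘Tools’ appears under both ‘Farming’ and ‘Building’ and gets two different IDs, while ‘plough’ and ‘sickle’ share the same one, which is the behaviour needed for a tree interface.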
Other than these tasks I responded to some queries from other members of staff, for example Simon Taylor who wanted advice on a proposal he’s writing, Ronnie Young who wanted me to update some content on his Burns Paper Database website, Brianna Robertson-Kirkland who wanted to know the copyright implications of embedding YouTube videos, and Valentina Busin, for whom I created a Google Play store listing, basic app details and user accounts for a new app. I also began updating the interface to Rob Maslen’s Fantasy blog on Friday afternoon, but the server went weird and blocked me as I was halfway through working on things. I’ll have to sort this out first thing on Monday.