I continued to work on the HT / OED data alignment for a lot of this week. I updated the matching scripts I had previously created so that all matches based on last lexeme were removed and instead replaced by a ‘6 matches or more and 80% of words in total match’ check. This was a lot more effective than purely comparing the last word in each category and helped match up a lot more categories. I also created a QA script to check the manual matches that were made during our first phase of matching. There are 1407 manual matches in the system. The script also listed all the words in each potential matched category to make it easier to tell where any potential difficulties were. I also updated the ‘pattern matching’ script I’d created last week to list all words and include the ‘6 matches and 80%’ check and changed the layout so that separate groupings now appear in different tables rather than being all mixed up in one table. It took quite a long time to sort this out, but it’s going to be much more useful for manual checking.
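As a rough illustration, the ‘6 matches and 80%’ check could be sketched in Python along the following lines. This is not the actual script: the function name, the lower-casing of words and especially the choice to measure the 80% against the larger of the two categories are assumptions made for the example.

```python
def is_probable_match(ht_words, oed_words, min_matches=6, min_ratio=0.8):
    """Return True if two categories share enough words to be a likely match.

    Hypothetical helper: the thresholds come from the '6 matches and 80%'
    rule, but how the 80% is measured is an assumption (here it is the
    share of the larger category's words that appear in both).
    """
    # Normalise so that case and stray whitespace do not block a match.
    ht = {w.strip().lower() for w in ht_words}
    oed = {w.strip().lower() for w in oed_words}
    if not ht or not oed:
        return False

    shared = ht & oed
    ratio = len(shared) / max(len(ht), len(oed))
    return len(shared) >= min_matches and ratio >= min_ratio
```

Two categories sharing six words out of seven would pass, while a category of five words that matches completely would not, as it falls below the six-word minimum.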
I then moved on to writing a new ‘sibling matching’ script. This script goes through all unmatched OED categories (this includes all that appear in other scripts such as the pattern matching one) and retrieves all sibling categories of the same POS. E.g. if the category is ‘01.01.01|03 (n)’ then the script brings back all HT noun subcats of ‘01.01.01’ that are ‘level 1’ subcats and compares their headings. It then looks to see if there is a sibling category that has the same heading – i.e. looking for when a category has been renumbered within the same level of the thesaurus. This has uncovered several hundred such potential matches, which will hopefully be very helpful. I also then created a further script that compares non-noun headings to noun headings at the same level, as it looked like a number of times the OED kept the noun heading for other parts of speech while the HT renamed them. This identified a further 65 possible matches, which isn’t too bad.
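The sibling comparison described above can be sketched roughly as follows. This is a minimal Python illustration rather than the real script: the `Category` structure, its field names and the way the subcat level is derived from the subcat number are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    # Hypothetical record; the field names are illustrative only.
    catnum: str   # main category number, e.g. "01.01.01"
    subcat: str   # subcat number, e.g. "03" ("" for a main category)
    pos: str      # part of speech, e.g. "n"
    heading: str


def subcat_level(subcat):
    """Assumed convention: '03' is level 1, '03.01' is level 2, '' is 0."""
    return len(subcat.split(".")) if subcat else 0


def find_sibling_matches(oed_unmatched, ht_categories):
    """Pair each unmatched OED category with HT siblings sharing its heading."""
    # Group HT categories by parent number, POS and subcat level.
    siblings = {}
    for ht in ht_categories:
        key = (ht.catnum, ht.pos, subcat_level(ht.subcat))
        siblings.setdefault(key, []).append(ht)

    matches = []
    for oed in oed_unmatched:
        key = (oed.catnum, oed.pos, subcat_level(oed.subcat))
        for ht in siblings.get(key, []):
            # Same heading at the same level suggests a renumbered category.
            if ht.heading.strip().lower() == oed.heading.strip().lower():
                matches.append((oed, ht))
    return matches
```

Each pair returned is only a potential match and would still need manual checking.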
I met with Marc and Fraser on Wednesday to discuss the recent updates I’d made, after which I managed to tick off 2614 matched categories, taking our total of unmatched OED categories that have a part of speech and are not empty down to 10,854. I then made a start on a new script that looks at pattern matching for category contents (i.e. words), but I didn’t have enough time to make a huge amount of progress with this.
to try and get things working but the callbacks were never being initiated – i.e. data wasn’t getting through to Google. Thankfully Stack Overflow had an answer that worked (after trying several that didn’t):
I’ve updated this so that pageviews rather than events are sent and now everything seems to be working again.
I spent a bit more time this week working on the Bilingual Thesaurus project, focussing on getting the front end for the thesaurus working. I’ve reworked the code for the HT’s browse facility to work with the project’s data. This required quite a lot of work as structurally the datasets are quite different – the HT relies on its ‘tier’ numbers for parent / child / sibling category relationships, and also has different categories for parts of speech and nested subcategories. The BTH data is much simpler (which is great) as it just has parent and child categories, with things like part of speech handled at word level. This meant I had to strip a lot of stuff out of the code and rework things. I’m also taking the opportunity to move to a new interface library (Bootstrap) so had to rework the page layout to take this into consideration too. I’ve now managed to get an initial version of the browse facility working, which works in much the same way as the main HT site: clicking on a heading allows you to view its words and clicking on a ‘plus’ sign allows you to view the child categories. As with the HT you can link directly to a category too. I do still need to work on the formatting of the category contents, though. Currently words are just listed all together, with their type (AN or ME) listed first, then the word, then the POS in brackets, then dates (if available). I haven’t included data about languages of source or citation yet, or URLs. I’m also going to try and get the timeline visualisations working as well. I’ll probably split the AN and ME words into separate tabs, and maybe split the list up by POS too. I’m also wondering whether the full category hierarchy should be represented above the selected category (the right pane), as unlike the HT there’s no category number to show your position in the thesaurus.
Also, as a lot of the categories are empty I’m thinking of making the ones with words in them bold in the tree, or even possibly adding a count of words in brackets after the category heading. I’ve also updated the project’s homepage to include the ‘sample category’ feature, allowing you to press the ‘reload’ icon to load a new random category.
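To give an idea of how the two display options mentioned above (a word count after each heading in the tree, and the full hierarchy above the selected category) might work with simple parent / child data, here is a small Python sketch. The data layout (a dictionary of category ID to parent ID and heading) and the function names are assumptions for illustration and are not taken from the site’s actual code.

```python
from collections import defaultdict


def build_children(categories):
    """Map each parent ID to its child IDs (None is the parent of top-level categories)."""
    children = defaultdict(list)
    for cat_id, (parent_id, _heading) in categories.items():
        children[parent_id].append(cat_id)
    return children


def render_tree(categories, word_counts, parent_id=None, depth=0, children=None):
    """Return indented tree lines, adding a word count to non-empty categories."""
    if children is None:
        children = build_children(categories)
    lines = []
    for cat_id in children.get(parent_id, []):
        heading = categories[cat_id][1]
        count = word_counts.get(cat_id, 0)
        # Empty categories are shown without a count.
        label = f"{heading} ({count})" if count else heading
        lines.append("  " * depth + label)
        lines.extend(render_tree(categories, word_counts, cat_id, depth + 1, children))
    return lines


def breadcrumb(categories, cat_id):
    """Return the headings from the top-level category down to cat_id."""
    path = []
    while cat_id is not None:
        parent_id, heading = categories[cat_id]
        path.append(heading)
        cat_id = parent_id
    return list(reversed(path))
```

With a category ‘Trade’ containing the children ‘Markets’ (four words) and ‘Coinage’ (no words), the tree would list ‘Markets (4)’ and a plain ‘Coinage’ beneath ‘Trade’, and the breadcrumb for ‘Markets’ would be ‘Trade’ followed by ‘Markets’.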
On Friday I spent most of the day working on the RNSN project, adding direct links to the ‘nation’ introductions to the main navigation menu and creating new ‘storymap’ stories based on PowerPoint presentations that had been sent to me. This is actually quite a time-consuming process as it involves grabbing images from the PPT, reformatting them, uploading them to WordPress, linking to them from the Storymap pages, creating Zoomified versions of the image or images that will be used as the ‘map’ for the story, extracting audio files from the PPT and uploading them, grabbing all of the text and formatting it for display and other such tasks. However, despite being a long process the end result is definitely worth it as the storymaps work very nicely. I managed to get two such stories completed today, and now I’ve re-familiarised myself with the process it should be quicker when the next set get sent to me.
I’m going to be on holiday next week so there won’t be another report from me until the week after that.