
Month: November 2018
Week Beginning 19th November 2018
This week I mainly worked on three projects: The Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network. For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets. Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday. By Wednesday I had ticked off most of the items, which I’ll summarise here.
Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only for advanced word searches, so I investigated and fixed this. I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to ‘ and ‘the’). This worked well and has bumped matches up to better colour levels, but it has resulted in some words being matched multiple times: removing ‘to’, ‘of’ etc. can produce a form that then appears more than once. For example, in one category the OED has ‘bless’ twice (presumably an error?) and the HT has ‘bless’ and ‘bless to’. With ‘to’ removed there appear to be more matches than there should be. However, this is not an issue when dates are also taken into consideration. I also updated the script so that categories where there are 3 matches and at least 66% of words match are now promoted from orange to yellow.
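To give a rough idea of how the stoplist is applied before lexemes are compared, here is a minimal Python sketch. The stoplist entries and the word-by-word normalisation are illustrative assumptions rather than the actual script, which runs against the database.

```python
# Illustrative stoplist only; the real list was agreed at last week's meeting.
STOPWORDS = {'to', 'the', 'of', 'a'}

def normalise(lexeme):
    """Lower-case a lexeme and drop stoplist words before comparison."""
    return ' '.join(w for w in lexeme.lower().split() if w not in STOPWORDS)

def matched_lexemes(oed_words, ht_words):
    """Return the normalised forms that appear in both category word lists."""
    return {normalise(w) for w in oed_words} & {normalise(w) for w in ht_words}

# 'bless' and 'bless to' collapse to the same form once 'to' is removed, which is
# why first dates also need to be compared to avoid counting spurious extra matches.
print(matched_lexemes(['bless', 'bless'], ['bless', 'bless to']))  # {'bless'}
```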
When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off, and whether category heading pattern matching had worked properly. After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is that they contain too few words to meet the criteria for ticking off.
Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but with different numbers. I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria). There are currently 19, 17 and 25 green, lime green and yellow matches that could be ticked off. I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ script I ticked off the few matches that there were. Most were empty categories and there were fewer than 15 in total.
Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms. We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories. It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms. This has helped to more accurately identify matched categories. I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.
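In outline, the monosemous matching works something like the following sketch. The data structures and IDs are toy stand-ins for the real database queries, so treat this as an approximation rather than the actual script.

```python
from collections import defaultdict

# Toy stand-ins for the unmatched OED and HT categories (category ID -> word list).
unmatched_oed = {1001: ['alpha', 'beta'], 1002: ['gamma']}
unmatched_ht = {2001: ['alpha', 'beta', 'delta'], 2002: ['gamma', 'epsilon']}

def monosemous_words(categories):
    """Words that occur in exactly one category within the given (unchecked) data."""
    locations = defaultdict(list)
    for catid, words in categories.items():
        for word in set(words):
            locations[word].append(catid)
    return {word: cats[0] for word, cats in locations.items() if len(cats) == 1}

oed_mono = monosemous_words(unmatched_oed)
ht_mono = monosemous_words(unmatched_ht)

# Propose a match wherever a monosemous OED word is also monosemous in the HT data.
proposed = {oed_mono[w]: ht_mono[w] for w in oed_mono if w in ht_mono}
print(proposed)  # {1001: 2001, 1002: 2002}
```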
I also worked on the date fingerprinting script. This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data. I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified. I ticked off all green (1556), lime green (22) and yellow (123) matches.
Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year. The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process. For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier). There are 7148 OED categories that are currently unmatched but were matched in V1. Almost 4000 of these are empty categories. There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match. But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words. It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2. For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02. There are possibly patterns in the 1504 orange matches that could be exploited too.
Finally, I updated the stats page to include information about main and subcats. Here are the current unmatched figures:
Unmatched (with POS): 8629
Unmatched (with POS and not empty): 3414
Unmatched Main Categories (with POS): 5036
Unmatched Main Categories (with POS and not empty): 1661
Unmatched Subcategories (with POS): 3573
Unmatched Subcategories (with POS and not empty): 1753
So we are getting there!
For the Bilingual Thesaurus I completed an initial version of the website this week. I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise. This might be changed again, but for now here is an example of how the resource looks:
The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters. As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’ but ‘bread’ won’t match anything because there are no words or categories with this exact text. You need to use an asterisk wildcard to find text within word or category text: ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.
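Behind the scenes the wildcard handling amounts to turning the user’s term into a database pattern. The sketch below is only an illustration of the general idea, not the actual site code.

```python
def wildcard_to_like(term):
    """Convert a user term with '*' wildcards into an SQL LIKE pattern.
    Without a wildcard the term is matched exactly, as in the HT quick search."""
    escaped = term.replace('%', r'\%').replace('_', r'\_')  # keep literal % and _ literal
    return escaped.replace('*', '%')

# The resulting pattern would then be bound to a parameterised LIKE query
# against the headword and category heading fields.
print(wildcard_to_like('bread*'))   # 'bread%'  -> starts with 'bread'
print(wildcard_to_like('*bread'))   # '%bread'  -> ends with 'bread'
print(wildcard_to_like('*bread*'))  # '%bread%' -> 'bread' anywhere
print(wildcard_to_like('ale'))      # 'ale'     -> exact match only
```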
The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates and languages of origin and citation. Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period. E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.
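The ‘active in the period’ behaviour is essentially a date-range overlap test, something along these lines (a sketch only, not the actual query):

```python
def active_in_period(word_start, word_end, search_start, search_end):
    """True if a word's date range overlaps the searched-for range at all."""
    return word_start <= search_end and word_end >= search_start

# 'Edifier' (1100-1350) overlaps a search for 1330-1360, so it is returned.
print(active_in_period(1100, 1350, 1330, 1360))  # True
```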
As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned. Where search types allow multiple options to be selected (i.e. part of speech and languages of origin and citation), multiple selections within a list are joined by ‘OR’. E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages of origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
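In query-building terms this AND / OR behaviour might look roughly like the following (column names are invented for illustration; the real search code differs):

```python
def build_where(pos=None, origins=None, section=None):
    """Different search fields are joined with AND; multiple selections within
    one field (part of speech, languages of origin) are joined by OR via IN."""
    clauses, params = [], []
    if pos:
        clauses.append('pos IN (%s)' % ', '.join(['%s'] * len(pos)))
        params.extend(pos)
    if origins:
        clauses.append('language_of_origin IN (%s)' % ', '.join(['%s'] * len(origins)))
        params.extend(origins)
    if section:
        clauses.append('section = %s')
        params.append(section)
    return ' AND '.join(clauses), params

# (noun OR verb) AND (Dutch OR Flemish OR Italian)
print(build_where(pos=['noun', 'verb'], origins=['Dutch', 'Flemish', 'Italian']))
```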
Search results display the full hierarchy leading to the category, the category name, plus the headword, section, POS and dates (if the result is a ‘word’ result rather than a ‘category’ result). Clicking through to the category highlights the word. I also added in a ‘cite’ option to the category page and updated the ‘About’ page to add a sentence about the current website. The footer still needs some work (e.g. maybe including logos for the University of Westminster and Leverhulme) and there’s a ‘terms of use’ page linked to from the homepage that currently doesn’t have any content, but other than that I think most of my work here is done.
For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on PowerPoint presentations that had been sent to me. This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations, formatting the text as HTML, reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site), uploading the audio files, adding in the HTML5 audio tags to get the audio files to play, and creating the individual pages for each timeline / storymap entry. It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 PowerPoint slides. Still, the end result works really well, so I think it’s worth putting the effort in.
In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app. I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create. The app would be interactive and highly dependent on logging user interactions as accurately as possible. I looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, which the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough. In fact, on touchscreens more often than not a ‘tap’ wasn’t even being registered. D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are made in this way, so unfortunately it looks like I won’t be able to help out with this project. Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in this sort of timescale.
Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server. I think that’s just about everything to report.
Week Beginning 12th November 2018
I spent most of my time this week split between three projects: the HT / OED category linking, the REELS project and the Bilingual Thesaurus. For the HT I continued to work on scripts to try to match up the HT and OED categories. This week I updated all the scripts currently in use so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data. Doing this has significantly improved the matching on the first date lexeme matching script. Greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788. I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists, and included two new columns. The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match, while the second is this figure as a percentage of the total number of OED lexemes.
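The date handling boils down to reducing each date field to a single four-digit year; a rough Python equivalent is below. The ‘OE’ rule follows what’s described above, while the rest of the parsing is an assumption for illustration.

```python
import re

def first_year(date_string):
    """Extract the first four numeric characters from a date field,
    converting 'OE' to 1000 first, as in the matching scripts."""
    if not date_string:
        return None
    if date_string.strip().startswith('OE'):
        return 1000
    digits = re.sub(r'\D', '', date_string)
    return int(digits[:4]) if len(digits) >= 4 else None

print(first_year('OE-1450'))  # 1000
print(first_year('a1398'))    # 1398
print(first_year('1611-'))    # 1611
```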
However, taking dates in isolation currently isn’t working very well, as if a date appears multiple times it generates multiple matches. So, for example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date. But lots of these have the same dates, meaning a total count of matched dates is 99, or 152% of the number of OED words. Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times each one appears in a category as its ‘date fingerprint’.
I created a new script to look at ‘date fingerprints’. The script generates arrays of categories for HT and OED unmatched categories. The dates of each word (or each word with a GHT date in the case of OED) in every category are extracted and a count of these is created (e.g. if the OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]). I ran this against the HT database to see what matches.
The script takes about half an hour to process. It grabs each unmatched OED category that contains words, picks out those that have GHT dates, gets the first four numerical figures of each and counts how many times this appears in the category. It does the same for all unmatched HT categories and their ‘fulldate’ column too. The script then goes through each OED category and for each goes through every HT category to find any that have not just the same dates, but the same number of times each date appears too. If everything matches the information about the matched categories is displayed.
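The comparison itself is essentially a comparison of multisets of first dates. Here’s a minimal sketch with toy data (the real script works across the whole database, which is why it takes about half an hour):

```python
from collections import Counter

def fingerprint(first_dates):
    """A category's 'date fingerprint': each first date and how often it occurs."""
    return Counter(d for d in first_dates if d is not None)

# Toy unmatched categories mapping category IDs to lists of lexeme first dates.
oed_categories = {5678: [1000, 1000, 1000, 1234]}
ht_categories = {42: [1234, 1000, 1000, 1000], 43: [1000, 1234]}

oed_prints = {cid: fingerprint(dates) for cid, dates in oed_categories.items()}
ht_prints = {cid: fingerprint(dates) for cid, dates in ht_categories.items()}

# A potential match requires the same dates appearing the same number of times.
for oed_id, oed_fp in oed_prints.items():
    for ht_id, ht_fp in ht_prints.items():
        if oed_fp == ht_fp:
            print(oed_id, 'potentially matches', ht_id)  # 5678 matches 42 but not 43
```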
The output has the same layout as the other scripts, but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories. This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique. For an example of this, search for our old favourite ‘extra-terrestrial’ and you’ll see that as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches. Nothing other than the dates is used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy. But I think this script is going to be very useful.
I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success. It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are being complicated by polysemous HT forms in different parts of speech.
With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off. Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off. There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off. That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones. I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words. I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong. 1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388. On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.
For the REELS project I continued going through my list of things to do before the project launch. This included reworking the Advanced Search layout, adding in tooltip text, updating the start date browse (which was including ‘inactive’ data in its count), creating some further icons for combinations of classification codes, adding in Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that allows you to specify whether your element search looks at current forms, historical forms or both, adding in Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live. You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names)
I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a historical form from 1797. The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’. The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source. Similarly, if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though the only historical form before 1500 is ‘(lands of) Westmaynis’. This is because, again, the search is ‘get me all of the place-names with a historical form before 1500 that also have any historical form including the text N. mains’. We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
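To make that behaviour concrete, each criterion is effectively tested against the place-name as a whole rather than against the same individual historical form. The sketch below uses invented values (the dates and sources are placeholders, not the real record) purely to illustrate the logic:

```python
# Toy record: an overall earliest date plus historical forms with their own data.
place_names = [{
    'name': 'Hassington West Mains',
    'earliest_date': 1450,  # placeholder value; the real earliest form is simply pre-1500
    'forms': [
        {'form': '(lands of) Westmaynis', 'date': 1450, 'source': 'other', 'elements': []},
        {'form': 'N. mains', 'date': 1797, 'source': 'Roy', 'elements': ['pn']},
    ],
}]

def search(places, element=None, before=None):
    """Current behaviour: criteria are applied to the place-name as a whole,
    not to the same individual historical form."""
    results = []
    for place in places:
        has_element = element is None or any(element in f['elements'] for f in place['forms'])
        early_enough = before is None or place['earliest_date'] < before
        if has_element and early_enough:
            results.append(place['name'])
    return results

# 'pn' plus '<1500' returns the place-name even though the 'pn' form is from 1797.
print(search(place_names, element='pn', before=1500))
```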
On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website. Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.
For the Bilingual Thesaurus I continued to implement the search facilities for the resource. This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work. After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things. I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working. I then made a start on the advanced search form.
Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.
Week Beginning 5th November 2018
After a rather hectic couple of weeks this was a return to a more regular sort of week, which was a relief. I still had more work to do than there was time to complete, but it feels like the backlog is getting smaller at least. As with previous weeks, I continued with the HT / OED category linking process this week, following on from the meeting Marc, Fraser and I had the Friday before. For the lexeme / date matching script I separated out categories with zero matches that have words from the orange list into a new list with a purple background. So orange now only contains categories where at least one word and its start date match. The ones now listed in purple are almost certainly incorrect matches. I also changed the ordering of results so that categories are listed by the largest number of matches, to make it easier to spot matches that are likely ok.
I also updated the ‘monosemous’ script, so that the output only contains OED categories that feature a monosemous word, and it is split into three tables (with links to each at the top of the page). The first table features 4455 OED categories that include a monosemous word that has a comparable form in the HT data. Where there are multiple monosemous forms they each correspond to the same category in the HT data. The second table features 158 OED categories where the linked HT forms appear in more than one category. This might be because the word is not monosemous in the HT data and appears in two different categories (these are marked with the text ‘red|’ so they can be searched for in the page). An OED category can also appear in this table even if there are no red forms if (for example) one of the matched HT words is in a different category to all of the others (see OED catid 45524, where the word ‘Puncican’ is found in a different HT category to the other words). The final table contains those OED categories that feature monosemous words that have no match in the HT data. There are 1232 of these. I also created a QA script for the 4455 matched monosemous categories, which applies the same colour coding and lexeme matching as the other QA scripts I’ve created. On Friday we had another meeting to discuss the findings and plan our next steps, which I will continue with next week.
Also this week I wrote an initial version of a Data Management Plan for Thomas Clancy’s Iona project, and commented on the DMP assessment guidelines that someone from the University’s Data Management people had put together. I can’t really say much more about these activities, but it took at least a day to get all of this done. I also did some app management duties, setting up an account for a new developer, and made the new Seeing Speech and Dynamic Dialects websites live. These can now be viewed here: https://www.seeingspeech.ac.uk/ and here: https://www.dynamicdialects.ac.uk/. I also had an email conversation with Rhona Alcorn about Google Analytics for the DSL site.
With the REELS project’s official launch approaching, I spent a bit of time this week going through the 23 point ‘to do’ list I’d created last week. In fact, I added another three items to it. I’m going to tackle the majority of the outstanding issues next week, but this week I investigated and fixed an issue with the ‘export’ script in the Content Management System. The script is very memory intensive and it was exceeding the server’s memory limits, so asking Chris to increase this limit sorted the issue. I also updated the ‘browse place-names’ feature of the CMS, adding a new column and ordering facility to make it clearer which place-names actually appear on the website. I also updated the front-end so that it ‘remembers’ whether you prefer the map or the text view of the data using HTML5 local storage and added in information about the Creative Commons license to the site and the API. I investigated the issue of parish boundary labels appearing on top of icons, but as of yet I’ve not found a way to address this. I might return to it before the launch if there’s time, but it’s not a massive issue. I moved all of the place-name information on the record page above the map, other than purely map-based data such as grid reference. I also removed the option to search the ‘analysis’ field from the advanced search and updated the element ‘auto-complete’ feature so that it only now matches the starting letters of an element rather than any letters. I also noticed that the combination of ‘relief’ and ‘water’ classifications didn’t have an icon on the map, so I created one for it.
I also continued to work on the Bilingual Thesaurus website this week. I updated the way in which source links work. Links to dictionary sources now appear as buttons in the page, rather than in a separate pop-up. They feature the abbreviation (AND / MED / OED) and the magnifying glass icon, and if you hover over a button the non-abbreviated form appears. For OED links I’ve also added the text ‘subscription required’ to the hover-over text. I also updated the word record so that where the language of origin is ‘unknown’ it no longer gets displayed, and I made the headword text a bit bigger so it stands out more. I also added the full hierarchy above the category heading in the category section of the browse page, to make it easier to see exactly where you are. This will be especially useful for people using the site on narrow screens, as the tree appears beneath the category section so is not immediately visible. You can click on any of the parts of the hierarchy here to jump to that point.
I then began to work on the search facility, and realised I needed to implement a ‘search words’ list that features variants. I did this for the Historical Thesaurus and it’s really useful. What I’ve done so far is generate alternatives for words that have brackets and dashes. For example, the headword ‘Bond(e)-man’ has the following search terms: Bond(e)-man, Bond-man, Bonde-man, Bond(e) man, Bond man, Bonde man, Bond(e)man, Bondman, Bondeman. None of these variants will ever appear on the website, but they will instead be used to find the word when people search. I’ll need some feedback as to whether these options will suffice, but for now I’ve uploaded the variants to a table and begun to get the quick search working. It’s not entirely there yet, but I should get this working next week. I also need to know what should be done about accented characters for search purposes. The simplest way to handle them would be to just treat them as non-accented characters – e.g. searching for ‘alue’ will find ‘alué’. However, this does mean you won’t be able to specifically search for words that include accented characters – e.g. a search for all the words featuring an ‘é’ will just bring back all words with an ‘e’ in them.
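For anyone curious, the variant generation and the accent folding can be sketched roughly as follows. This is an approximation of what the script does (and of how accent-insensitive matching might work), not the actual code:

```python
import re
import unicodedata

def strip_accents(text):
    """Fold accented characters to plain ones, so a search for 'alue' finds 'alué'."""
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def search_terms(headword):
    """Generate search variants for a headword containing brackets and/or dashes:
    bracketed letters kept or dropped, crossed with the dash kept, spaced or removed."""
    bracket_variants = {headword,
                        re.sub(r'[()]', '', headword),     # keep the bracketed letters
                        re.sub(r'\(.*?\)', '', headword)}  # drop the bracketed letters
    variants = set()
    for v in bracket_variants:
        variants.update({v, v.replace('-', ' '), v.replace('-', '')})
    return variants

print(sorted(search_terms('Bond(e)-man')))  # the nine forms listed above
print(strip_accents('alué'))                # 'alue'
```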
I was intending to add a count of the number of words in each hierarchical level to the browse, or at least to make hierarchical levels that include words bold in the browse, so as to let users know whether it’s worthwhile clicking on a category to view the words at this level. However, I’ve realised that this will just confuse users as levels that have no words in them but include child categories that do have words in them would be listed with a zero or not in bold, giving the impression that there is no content lower down the hierarchy.
My last task for the week was to create a new timeline for the RNSN project based on data that had been given to me. I think this is looking pretty good, but unfortunately making these timelines and related storymaps is very time-intensive, as I need to extract and edit the images, upload them to WordPress, extract the text and convert it into HTML and fill out the template with all of the necessary fields. It took about 2 and a half hours to make this timeline. However, hopefully the end result will be worth it.
Week Beginning 29th October 2018
This was a slightly unusual week for me, as I don’t often speak at events but I had sessions at workshops on Tuesday and Wednesday. The first one was an ArtsLab event about AHRC Data Management Plans while the second one was a workshop organised by Bryony Randall about digital editions. I think both workshops went well, and my sessions went pretty smoothly. It does take time to prepare for these sorts of things, though, especially when the material needs to be written from scratch, so most of the start of the week was spent preparing for and attending these events.
I also had a REELS project meeting on Tuesday morning where we discussed the feedback we’d received about the online resource and made a plan for what still needs to be finalised before the resource goes live at an event on the 17th of November. There are 23 items on the plan I drew up, so there’s rather a lot to get sorted in the next couple of weeks. Also relating to place-name studies, I made the new, Leaflet powered maps for Thomas Clancy’s Saints Places website live this week. I made the new maps for this legacy resource to replace older Google-based maps that were no longer working due to Google now requiring credit card details to use their mapping services. An example of one of the new maps can be found here: https://saintsplaces.gla.ac.uk/saint.php?id=64.
Also this week I updated the ‘support us’ page of the DSL to include new information and a new structure (http://dsl.ac.uk/support-us/), arranged to meet Matthew Creasy to discuss future work on his Decadence and Translation project, and responded to a few more requests from Jeremy Smith about the last-minute bid he was putting together, which he managed to submit on Tuesday. I also spoke to Scott Spurlock about his crowdsourcing project and spoke to Jane Stuart-Smith about the questionnaire for the new Seeing Speech / Dynamic Dialects websites which are nearing completion. I set up a Google Play / App Store account for someone in MVLS who wanted to keep track of the stats for one of their apps and I spoke to Kirsteen McCue about timelines for her RNSN project.
By Thursday I managed to get settled back into my more regular work routine, and returned to work on the Bilingual Thesaurus for the first time in a few weeks. Louise Sylvester had supplied me with some text for the homepage and the about page, so I added that in. I also fixed the date for ‘Galiot’, which was previously only recorded with an end date, and changed the ‘there are no words in this category’ text to ‘there are no words at this level of the hierarchy’, which is hopefully less confusing.
I also split the list of words for each category into two separate lists, one for Anglo Norman and one for Middle English. Originally I was thinking of having these as separate tabs, but as there are generally not very many words in a category it seemed a little unnecessary, and would have made it harder for a user to compare AN and ME words at the same time. So instead the words are split into two sections of one list. I also added in the language of origin and language of citation text. This information currently appears underneath the line containing the headword, POS and dates. Finally, I added in the links to the source dictionaries. To retain the look of the HT site and to reduce clutter these appear in a pop-up that’s opened when you click on a ‘search’ icon to the right of the word (tooltip text appears if you hover over the search icon too). These might be replaced with in-page links for each word instead, though. Here’s a screenshot of how things currently look, but note that the colour scheme is likely to change as Louise has specified a preference for blue and red. I’ll probably reuse the colours below for the main ‘Thesaurus’ portal page.
I spent the rest of the week working through the HT / OED category linking issues. This included ticking off 6621 matches that were identified by the lexeme / first date matching script, ticking off 78 further matches that Fraser had checked manually, and creating a script that matches up 1424 categories within the category ‘Thing heard’ whose category numbers had been altered in ways that prevented them from being paired up by previous scripts. I haven’t ticked these off yet as Marc wanted to QA them first, so I created a further script to help with this process. I also wrote a script to fix the category numbers of some of the HT categories where an erroneous zero appears in the number – e.g. ‘016’ is used rather than ‘16’. There were 1355 of these errors, which have now been fixed, which should mean the previous matching scripts will be able to match up at least some of these. Marc, Fraser and I met on Friday to discuss the process, and unfortunately one of the scripts we looked at still had its ‘update’ code active, meaning the newly fixed ‘erroneous zero’ categories were passed through it and ticked off. After the meeting I deactivated the ‘update’ code and identified which rows had been ticked off, creating a script to help QA these, so no real damage was done.
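The ‘erroneous zero’ fix itself is a simple transformation of the affected parts of the category number, along these lines. I’m assuming dot-separated, normally two-digit tier numbers here; the real fix was run as a database update.

```python
def fix_erroneous_zero(catnum):
    """Remove the extra leading zero from any three-digit part such as '016',
    leaving normal parts like '01' or '16' untouched."""
    parts = []
    for part in catnum.split('.'):
        if len(part) == 3 and part.startswith('0'):
            part = part[1:]
        parts.append(part)
    return '.'.join(parts)

print(fix_erroneous_zero('01.02.016.01'))  # '01.02.16.01'
```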
I also realised that the page I’d created to list statistics about matched / unmatched categories was showing an incorrect figure for unmatched categories that are not empty. Rather than having 2931 unmatched OED categories that have a POS and are not empty the figure is actually 10594. The stats page was subtracting the total matched figure (currently 213,553) from the total number of categories that have a POS and are not empty (216,484). I’m afraid I hadn’t included a count of matched categories that have a POS and are not empty (currently 205,890), which is what should have been used rather than the total matched figure. So unfortunately we have more matches to deal with than we thought.
I also made a tweak to the lexeme / first date matching script, removing ‘to ‘ from the start of lexemes in order to match them. This helped bump a number of categories up into our thresholds for potential matches. I also changed the thresholds and added in a new grouping. The criteria for potential matches have been reduced by one word, to 5 matching words and a total of 80% matching words. I also created a new grouping for categories that don’t meet this threshold but still have 4 matching words. I’ll continue with this next week.
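Pulling the current rules together, the grouping of a potential match looks roughly like this. The mapping of groups to the actual colours is simplified and partly assumed, so this is only an illustration of the thresholds described above:

```python
def colour_code(matched_words, total_oed_words):
    """Rough illustration of the grouping rules: 5+ matches and 80%+ of words
    matched is a potential match, 4 matches is the new grouping, any match at
    all is orange, and zero matches in a non-empty category is purple."""
    if total_oed_words == 0:
        return 'empty'
    if matched_words == 0:
        return 'purple'  # almost certainly an incorrect match
    percent = matched_words / total_oed_words * 100
    if matched_words >= 5 and percent >= 80:
        return 'potential match'
    if matched_words == 4:
        return 'new grouping'
    return 'orange'

print(colour_code(5, 6))   # 'potential match' (83% of words matched)
print(colour_code(4, 10))  # 'new grouping'
print(colour_code(1, 12))  # 'orange'
print(colour_code(0, 7))   # 'purple'
```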