I spent most of my time this week split between three projects: the HT / OED category linking, the REELS project and the Bilingual Thesaurus. For the HT I continued to work on scripts to try and match up the HT and OED categories. This week I updated all the scripts currently in use so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data. Doing this has significantly improved the matching on the first date lexeme matching script: greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788. I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists, and included two new columns. The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match, while the second is this figure as a percentage of the total number of OED lexemes.
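The date-extraction step described above can be sketched as follows. This is a minimal illustration in Python, not the project’s actual code: the field names and edge-case handling are assumptions.

```python
import re

def extract_year(date_str):
    """Extract the first four numeric characters of a date field as an int.

    'OE' (Old English) is normalised to 1000 before extraction, mirroring
    the behaviour described above for 'GHT_date1' / 'fulldate' values.
    Returns None when fewer than four digits are present (an assumption
    for illustration -- the real scripts may handle this differently).
    """
    if date_str is None:
        return None
    # Treat Old English attestations as the year 1000
    date_str = date_str.replace("OE", "1000")
    # Keep numeric characters only, then take the first four
    digits = re.sub(r"\D", "", date_str)
    if len(digits) < 4:
        return None
    return int(digits[:4])
```

So a value like ‘c1380–1425’ yields 1380, and a plain ‘OE’ yields 1000.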
However, taking dates in isolation currently isn’t working very well, because if a date appears multiple times it generates multiple matches. For example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date. But lots of these have the same dates, meaning the total count of matched dates is 99, or 152% of the number of OED words. Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times it appears in a category as the category’s ‘date fingerprint’.
I created a new script to look at ‘date fingerprints’. The script generates arrays of unmatched categories for the HT and the OED. The dates of each word (or each word with a GHT date, in the case of the OED) in every category are extracted and a count of these is created (e.g. if OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]). I ran this against the HT database to see what matches.
The script takes about half an hour to process. It grabs each unmatched OED category that contains words, picks out those words that have GHT dates, gets the first four numerical figures of each and counts how many times each appears in the category. It does the same for all unmatched HT categories and their ‘fulldate’ column. The script then goes through each OED category and, for each, goes through every HT category to find any that have not just the same dates, but the same number of occurrences of each date too. If everything matches, the information about the matched categories is displayed.
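The fingerprint-matching idea above amounts to comparing multisets of dates. Here is a small sketch of that logic, assuming dates have already been extracted; the miniature category data is invented purely for illustration.

```python
from collections import Counter

def date_fingerprint(years):
    """A category's 'date fingerprint': each first-cited year mapped to
    the number of words in the category with that year."""
    return Counter(y for y in years if y is not None)

def fingerprint_matches(oed_cats, ht_cats):
    """For each OED category, return every HT category whose words have
    the same dates AND the same number of occurrences of each date.
    A fingerprint need not be unique, so one category can match several."""
    matches = {}
    for oed_id, oed_years in oed_cats.items():
        fp = date_fingerprint(oed_years)
        matches[oed_id] = [ht_id for ht_id, ht_years in ht_cats.items()
                           if date_fingerprint(ht_years) == fp]
    return matches

# Hypothetical data: category id -> list of extracted first-cited years.
oed_categories = {94551: [1000, 1000, 1000, 1234]}
ht_categories = {5678: [1234, 1000, 1000, 1000],
                 9999: [1000, 1234]}
```

Because `Counter` equality ignores ordering but not multiplicity, category 5678 matches here while 9999 (same dates, different counts) does not, which is exactly the distinction the fingerprint is meant to capture.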
The output has the same layout as the other scripts, but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories. This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique. For an example of this, search for our old favourite ‘extra-terrestrial’ and you’ll see that, as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches. Nothing other than the dates is used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy. But I think this script is going to be very useful.
I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success. It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are now being complicated by polysemous HT forms in different parts of speech.
With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off. Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off. There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off. That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones. I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words. I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong. 1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388. On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.
For the REELS project I continued going through my list of things to do before the project launch. This included reworking the Advanced Search layout, adding tooltip text, updating the start date browse (which was including ‘inactive’ data in its count), creating some further icons for combinations of classification codes, adding Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that lets you specify whether your element search looks at current forms, historical forms or both, adding Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live. You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names).
I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a historical form from 1797. The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’. The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source. Similarly, if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though its only historical form before 1500 is ‘(lands of) Westmaynis’. This is because, again, the search is ‘get me all of the place-names with a historical form before 1500 that have any historical form including the text ‘N. mains’’. We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
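The behaviour above can be summed up as: each criterion is tested against the place-name record as a whole, rather than against a single historical form that satisfies all criteria at once. A minimal sketch of that logic, with an invented record (the 1462 date and field names are hypothetical, purely for illustration):

```python
# Hypothetical record: the pre-1500 form and the 'pn'-bearing form differ.
place = {
    "name": "Hassington West Mains",
    "historical_forms": [
        {"form": "(lands of) Westmaynis", "year": 1462, "elements": []},
        {"form": "N. mains", "year": 1797, "elements": ["pn"]},
    ],
}

def matches(place, element=None, before=None):
    """True if the record's EARLIEST form predates `before` and ANY form
    (not necessarily the same one) uses the given element. This mirrors
    the per-record semantics described above, not per-form semantics."""
    earliest = min(f["year"] for f in place["historical_forms"])
    if before is not None and earliest >= before:
        return False
    if element is not None and not any(element in f["elements"]
                                       for f in place["historical_forms"]):
        return False
    return True
```

Because the date check and the element check are independent, searching for ‘pn’ before 1500 returns this record even though the ‘pn’ element only appears on the 1797 form.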
On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website. Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.
For the Bilingual Thesaurus I continued to implement the search facilities for the resource. This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work. After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things. I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working. I then made a start on the advanced search form.
Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.