Week Beginning 22nd October 2018

I returned to work on Monday after being off last week.  As usual there were a bunch of things waiting for me to sort out when I got back, so most of Monday was spent catching up with things.  This included replying to Scott Spurlock about his Crowdsourcing project, responding to a couple of DSL related issues, updating access restrictions on the SPADE website, reading through the final versions of the DMP and other documentation for Matt Sangster and Katie Halsey’s project, updating some details on the Medical Humanities Network website, responding to a query about the use of the Thesaurus of Old English and speaking to Thomas Clancy about his Iona proposal.

With all that out of the way I returned to the OED / HT data linking issues for the Historical Thesaurus.  In my absence last week Marc and Fraser had made some further progress with the linking, and had made further suggestions as to what strategies I should attempt to implement next.  Before I left I was very much in the middle of working on a script that matched words and dates, and I hadn’t had time to figure out why this script was bringing back no matches.  It turns out the HT ‘fulldate’ field was using long dashes, whereas I was joining the OED GHT dates with a short dash.  So all matches failed.  I replaced the long dashes with short ones and the script then displayed 2733 ‘full matches’ (where every stripped lexeme and its dates match) and 99 ‘partial matches’ (where more than 6 and 80% match both dates and stripped lexeme text).  I also added in a new column that counts the number of matches not including dates.
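The root cause was the classic mismatch between the en dash and the plain ASCII hyphen: two visually similar strings that never compare equal. A minimal sketch of the fix (field and variable names here are illustrative, not the actual script):

```python
def normalise_dashes(text):
    """Replace en and em dashes with a plain hyphen so that joined
    date ranges compare equal whichever dash was used originally."""
    return text.replace("\u2013", "-").replace("\u2014", "-")

# An HT 'fulldate' stored with an en dash versus an OED GHT range
# joined with a plain hyphen: without normalisation every comparison fails.
ht_fulldate = "1597\u20131645"              # "1597–1645"
oed_fulldate = "-".join(["1597", "1645"])   # "1597-1645"
match = normalise_dashes(ht_fulldate) == oed_fulldate
```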

Marc had alerted me to an issue where the number of OED matches was coming back as more than 100% so I then spent some time trying to figure out what was going on here.  I updated both the ‘with dates’ and ‘no date check’ versions of the lexeme pattern matching scripts to add in the text ‘perc error’ to any percentage that’s greater than 100, to more easily search for all occurrences.  There are none to be found in the script with dates, as matches are only added to the percentage score if their dates match too.  On the ‘no date check’ script there are several of these ‘perc error’ rows and they’re caused for the most part by a stripped form of the word being identical to an existing non-stripped form.  E.g. there are separate lexemes ‘she’ and ‘she-‘ in the HT data, and the dash gets stripped, so ‘she’ in the OED data ends up matching two HT words.  There are some other cases that look like errors in the original data, though.  E.g. in OED catid 91505 severity there’s the HT word ‘hard (OE-)’ and ‘hard (c1205-)’ and we surely shouldn’t have this word twice.  Finally there are some forms where stripping out words results in duplicates – e.g. ‘pro and con’ and ‘pro or con’ both end up as ‘pro con’ in both OED and HT lexemes, leading to 4 matches where there should only be 2.  There are no doubt situations where the total percentage is pushed over the 80% threshold or to 100% by a duplicate match – any duplicate matches where the percentage doesn’t get over 100 are not currently noted in the output.  This might need some further work. Or, as I previously said, with the date check incorporated the duplicates are already filtered out, so it might not be so much of an issue.
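The over-100% problem can be reproduced in miniature: if the percentage is computed by crediting an OED word once for every HT form it matches, a stripped duplicate lets a single word count twice. The stripping rule and names below are illustrative:

```python
from collections import Counter

def strip_form(form):
    # illustrative stripping: drop hyphens and lower-case
    return form.replace("-", "").lower()

ht_lexemes = ["she", "she-"]   # both strip to 'she'
oed_lexemes = ["she"]

ht_counts = Counter(strip_form(w) for w in ht_lexemes)
# each OED word is credited once per HT form it matches
matches = sum(ht_counts[strip_form(w)] for w in oed_lexemes)
percentage = matches / len(oed_lexemes) * 100  # 200.0: the 'perc error' case
```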

I also then moved on to a new script that looks at monosemous forms.  This script gets all of the unmatched OED categories that have a POS and at least one word and for each of these categories it retrieves all of the OED words.  For each word the script queries the OED lexeme table to get a count of the number of times the word appears.  Note that this is the full word, not the ‘stripped’ form, as the latter might end up with erroneous duplicates, as mentioned above.  Each word, together with its OED date and GHT dates (in square brackets) and a count of the number of times it appears in the OED lexeme table is then listed.  If an OED word only appears once (i.e. is monosemous) it appears in bold text.  For each of these monosemous words the script then queries the HT data to find out where and how many times each of these words appears in the unmatched HT categories.  All queries keep to the same POS but otherwise look at all unmatched categories, including those without an OEDmaincat.  Four different checks are done, with results appearing in different columns: HT words where full word (not the stripped variety) matches and the GHT start date matches the HT start date; failing that, HT words where the full word matches but the dates don’t; failing either of these, HT words where the stripped forms of the words match and the dates match; failing all these, HT words where the stripped forms match but the dates don’t.  For each of these the HT catid, OEDmaincat (or the text ‘No Maincat’ if there isn’t one), subcat, POS, heading, lexeme and fulldate are displayed.  There are lots of monosemous words that just don’t appear in the HT data.  These might be new additions or we might need to try pattern matching.  Also, sometimes words that are monosemous in the OED data are polysemous in the HT data.  These are marked with a red background in the data (as opposed to green for unique matches).  Examples of these are ‘sedimental’, ‘meteorologically’, ‘of age’.  
Any category that has a monosemous OED word that is polysemous in the HT has a red border.  I also added in some stats below the table.  In our unmatched OED categories there are 24184 monosemous forms.  There are 8086 OED categories that have at least one monosemous form that matches exactly one HT form.  There are 220 OED monosemous forms that are polysemous in the HT.  Now we just need to decide how to use this data.
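The monosemy logic described above boils down to two frequency counts. A sketch with toy data (the real script works against the full OED and HT lexeme tables, with POS and category restrictions):

```python
from collections import Counter

def classify_monosemous(oed_words, ht_words):
    """For each word occurring exactly once in the OED data, report
    whether it is unique, polysemous or absent in the HT data."""
    oed_counts = Counter(oed_words)
    ht_counts = Counter(ht_words)
    result = {}
    for word, n in oed_counts.items():
        if n != 1:
            continue  # not monosemous in the OED data
        ht_n = ht_counts.get(word, 0)
        result[word] = ("unique" if ht_n == 1 else
                        "polysemous" if ht_n > 1 else "absent")
    return result

classify_monosemous(["sedimental", "hard", "hard", "of age"],
                    ["sedimental", "sedimental", "of age"])
# -> {'sedimental': 'polysemous', 'of age': 'unique'}
```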

Also this week I looked into an issue one of the REELS team was having when accessing the content management system (it turns out that some anti-virus software was mislabelling the site as having some kind of phishing software in it), and responded to a query about the Decadence and Translation Network website I’d set up.  I also started to look at sourcing some Data Management Plans for an Arts Lab workshop that Dauvit Broun has asked me to help with next week, and began preparing my presentation for the Digital Editions workshop next week, which took a fair amount of time.  I also met with Jennifer Smith and a new member of the SCOSYA project team on Friday morning to discuss the project and to show the new member of staff how the content management system works.  It looks like my involvement with this project might be starting up again fairly soon.


On Tuesday Jeremy Smith contacted me to ask me to help out with a very last minute proposal that he is putting together.  I can’t say much about the proposal, but it had a very tight deadline and required rather a lot of my time from the middle of the week onwards (and even into the weekend).  This involved lots of email exchanges, time spent reading documentation, meeting with Luca, who might be doing the technical work for the project if it gets funded, and writing a Data Management Plan for the project.  This all meant that I was unable to spend time working on other projects I’d hoped to work on this week, such as the Bilingual Thesaurus.  Hopefully I’ll have time to get back into this next week, once the workshops are out of the way.

Week Beginning 8th October 2018

I continued to work on the HT / OED data alignment for a lot of this week.  I updated the matching scripts I had previously created so that all matches based on last lexeme were removed and instead replaced by a ‘6 matches or more and 80% of words in total match’ check.  This was a lot more effective than purely comparing the last word in each category and helped match up a lot more categories.  I also created a QA script to check the manual matches that were made during our first phase of matching.  There are 1407 manual matches in the system.  The script also listed all the words in each potentially matched category to make it easier to tell where any difficulties were.  I also updated the ‘pattern matching’ script I’d created last week to list all words and include the ‘6 matches and 80%’ check, and changed the layout so that separate groupings now appear in different tables rather than being all mixed up in one table.  It took quite a long time to sort this out, but it’s going to be much more useful for manual checking.

I then moved on to writing a new ‘sibling matching’ script.  This script goes through all unmatched OED categories (this includes all that appear in other scripts such as the pattern matching one) and retrieves all sibling categories of the same POS.  E.g. if the category is ‘01.01.01|03 (n)’ then the script brings back all HT noun subcats of ’01.01.01’ that are ‘level 1’ subcats and compares their headings.  It then looks to see if there is a sibling category that has the same heading – i.e. looking for when a category has been renumbered within the same level of the thesaurus.  This has uncovered several hundred such potential matches, which will hopefully be very helpful. I also then created a further script that compares non-noun headings to noun headings at the same level, as it looked like a number of times the OED kept the noun heading for other parts of speech while the HT renamed them.  This identified a further 65 possible matches, which isn’t too bad.
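The core of the sibling check is a straightforward heading comparison among same-parent, same-POS subcats. A sketch with hypothetical category numbers:

```python
def find_renumbered_sibling(oed_heading, ht_siblings):
    """Look among HT subcats that share the OED category's parent and
    part of speech for one with an identical heading, i.e. a category
    that has been renumbered within the same level of the thesaurus.
    ht_siblings is a list of (catnum, heading) pairs."""
    target = oed_heading.strip().lower()
    for catnum, heading in ht_siblings:
        if heading.strip().lower() == target:
            return catnum
    return None

siblings = [("01.01.01|01", "types of"), ("01.01.01|04", "specific")]
found = find_renumbered_sibling("Specific", siblings)  # -> '01.01.01|04'
```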

I met with Marc and Fraser on Wednesday to discuss the recent updates I’d made, after which I managed to tick off 2614 matched categories, taking our total of unmatched OED categories that have a part of speech and are not empty down to 10,854.  I then made a start on a new script that looks at pattern matching for category contents (i.e. words), but I didn’t have enough time to make a huge amount of progress with this.

I also fixed an issue with the HT’s Google Analytics not working properly.  It looks like the code stopped working around the time we shifted domains in the summer, but it was a bit of a weird one as everything was apparently set up correctly – the ‘Send test traffic’ option on the .JS tracking code section successfully navigated through to the site and the tracking code was correct, but nothing was getting through.  However, I replaced the existing GA JavaScript that we had on our page with a new snippet from the JS section of Google Analytics and this seems to have done the trick.

However, we also have some calls to GA in our JavaScript so that loading parts of the tree, changing parts of speech, selecting subcats etc. are considered ‘page hits’ and reported to GA.  None of these worked after I changed the code.  I followed some guidelines here:


to try and get things working but the callbacks were never being initiated – i.e. data wasn’t getting through to Google.  Thankfully Stack Overflow had an answer that worked (after trying several that didn’t):


I’ve updated this so that pageviews rather than events are sent and now everything seems to be working again.

I spent a bit more time this week working on the Bilingual Thesaurus project, focussing on getting the front end for the thesaurus working.  I’ve reworked the code for the HT’s browse facility to work with the project’s data.  This required quite a lot of work as structurally the datasets are quite different – the HT relies on its ‘tier’ numbers for parent / child / sibling category relationships, and also has different categories for parts of speech and nested subcategories.  The BTH data is much simpler (which is great) as it just has parent and child categories, with things like part of speech handled at word level.  This meant I had to strip a lot of stuff out of the code and rework things.  I’m also taking the opportunity to move to a new interface library (Bootstrap) so had to rework the page layout to take this into consideration too.  I’ve managed to get an initial version of the browse facility working now, which works in much the same way as the main HT site: clicking on a heading allows you to view its words and clicking on a ‘plus’ sign allows you to view the child categories.  As with the HT you can link directly to a category too.  I do still need to work on the formatting of the category contents, though.  Currently words are just listed all together, with their type (AN or ME) listed first, then the word, then the POS in brackets, then dates (if available).  I haven’t included data about languages of source or citation yet, or URLs.  I’m also going to try and get the timeline visualisations working as well.  I’ll probably split the AN and ME words into separate tabs, and maybe split the list up by POS too.  I’m also wondering whether the full category hierarchy should be represented above the selected category (the right pane), as unlike the HT there’s no category number to show your position in the thesaurus.
Also, as a lot of the categories are empty I’m thinking of making the ones with words in them bold in the tree, or even possibly adding a count of words in brackets after the category heading.  I’ve also updated the project’s homepage to include the ‘sample category’ feature, allowing you to press the ‘reload’ icon to load a new random category.

I also made some further tweaks to the Seeing Speech data (fixing some of the video titles and descriptions) and had a chat with Thomas Clancy about his Iona proposal, which is starting to come together again after several months of inactivity.  Also for the ‘Books and Borrowing’ proposal I replied to a request for more information on the technical side of things from Stirling’s IT Services, who will be hosting the website and database for the project.  I met with Luca this week as well, to discuss how best to grab XML data via an AJAX query and process it using client-side JavaScript.  This is something I had already tackled as part of the ‘New Modernist Editing’ project, so was able to give Luca some (hopefully) useful advice.  I also continued an email conversation with Bryony Randall and Ronan Crowley about the workshop we’re running on digital editions later in the month.

On Friday I spent most of the day working on the RNSN project, adding direct links to the ‘nation’ introductions to the main navigation menu and creating new ‘storymap’ stories based on Powerpoint presentations that had been sent to me.  This is actually quite a time-consuming process as it involves grabbing images from the PPT, reformatting them, uploading them to WordPress, linking to them from the Storymap pages, creating Zoomified versions of the image or images that will be used as the ‘map’ for the story, extracting audio files from the PPT and uploading them, grabbing all of the text and formatting it for display and other such tasks.  However, despite being a long process the end result is definitely worth it as the storymaps work very nicely.  I managed to get two such stories completed today, and now I’ve re-familiarised myself with the process it should be quicker when the next set get sent to me.

I’m going to be on holiday next week so there won’t be another report from me until the week after that.

Week Beginning 1st October 2018

This was a week of many different projects, most of which needed only fairly small jobs doing, though some took up the bulk of my time.  I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app.  I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on.  I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and had a chat with Rachel Macdonald about some further updates to the SPADE website.  I made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies, had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly.  I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley, and made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.

I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website.  There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through.  I also need to look into developing an overarching timeline with contextual events.

I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock.  Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image.  I updated the site so that users can no longer right click on an image to save or view it.  This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images.  If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image.  I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.

The crowdsourcing platform uses the Scripto tool, a plugin for Omeka, and unfortunately allowing multiple transcriptions to be made for each image isn’t something the Scripto tool is set up to do.  All the examples I’ve seen have allowed multiple users to collaborate in the preparation of a single transcription for each page, with all edits to the transcription tracked.  However, I had a look at the source code to see how I might update the system to allow two versions of each page to be stored.  It turns out that it’s not going to be a straightforward process to implement this.  I probably could get such an update working, but it’s going to mean rewriting many different parts of the tool – not just the transcription page, but the admin pages, the ‘watch’ pages, the ‘history’ pages, the scripts that process edits and the protecting of transcriptions.  It’s a major rewrite of the code, which might break the system or introduce bugs or security issues.  I suggested to Scott that his pilot should focus on single transcriptions, as I am uncertain he will get enough transcribers to fully transcribe each page twice.  There are huge amounts of text on each page and even completing a single transcription is going to be a massive amount of effort.  I pointed out that other projects that have used Scripto for single transcriptions on much shorter texts are still ongoing, years after launching, for example http://letters1916.maynoothuniversity.ie/diyhistory/ and https://publications.newberry.org/digital/mms-transcribe/index.  However, after further thought I did come up with a way of handling multiple transcriptions purely in the front end via JavaScript.  The system architecture wouldn’t need to be updated, but I could use a marker hidden in the transcription text to denote where the first transcription ends and the second begins.  The JavaScript would then use this marker to split the single transcription up into two separate text areas in the front-end and admin interfaces.
 I’ll see how Scott would prefer to proceed with this next time we speak.
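The marker idea is simple enough to sketch; I’ve written it in Python here for brevity (the real version would live in the site’s JavaScript, and the marker string is purely hypothetical):

```python
MARKER = "[[===TRANSCRIPTION-2===]]"  # hypothetical hidden delimiter

def split_transcriptions(stored_text):
    """Split the single stored transcription into the two versions
    shown in separate text areas; if the marker is absent, the
    second transcription is simply empty."""
    first, _, second = stored_text.partition(MARKER)
    return first, second

def join_transcriptions(first, second):
    """Recombine the two text areas into the single string that the
    unmodified Scripto back end stores and versions."""
    return first + MARKER + second

a, b = split_transcriptions(join_transcriptions("first pass", "second pass"))
```

The appeal of this design is that Scripto’s storage, versioning and protection logic never sees anything but one transcription string per page.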

I also spent a bit of time continuing to work on the Bilingual Thesaurus.  I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously.  This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’.  I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me.  I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”.  My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”).  I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one.  If needs be I’ll change this.  Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus.  I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this.  I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout.  Next week I’ll see about starting to integrate the data itself.
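The de-duplication rule can be sketched in a few lines (assuming, as in the example above, that languages arrive as a pipe-delimited string):

```python
def dedupe_languages(pipe_delimited):
    """Collapse duplicate language links for a word; where both an
    uncertain ('?Old English') and a certain ('Old English') form of
    the same language occur, keep only the uncertain one."""
    kept = {}
    for lang in pipe_delimited.split("|"):
        base = lang.lstrip("?")
        # an uncertain entry replaces a certain one, never vice versa
        if base not in kept or lang.startswith("?"):
            kept[base] = lang
    return list(kept.values())

raw = "Old English|?Old English|Middle Dutch|Middle Dutch|Old English"
dedupe_languages(raw)  # -> ['?Old English', 'Middle Dutch']
```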

This just leaves the big project of the week to discuss: the ongoing work to align the HT and OED datasets.  I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week.  Last week I ‘dematched’ 2412 categories that don’t have a perfect lexeme-count match but do share the same parent category.  I created a further script that checks how many lexemes in these potentially matched categories are the same.  This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped).  A percentage of the number of HT words that are matched is also displayed.  If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green.  If the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1 this is also considered a match, and likewise if the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1.  The total is 1154 matches out of 2412.
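The three tolerance rules reduce to a short function; a sketch (the stripping rule here is illustrative):

```python
def categories_match(ht_words, oed_words):
    """Apply the matching rules described above to two candidate
    categories: a full match, or all of one side matched with the
    other side's word count out by at most one."""
    strip = lambda w: w.replace("-", "").lower()
    oed_stripped = {strip(w) for w in oed_words}
    matched = sum(1 for w in ht_words if strip(w) in oed_stripped)
    if matched == len(ht_words) == len(oed_words):
        return True   # everything accounted for on both sides
    if matched == len(ht_words) and abs(len(oed_words) - len(ht_words)) == 1:
        return True   # all HT words matched, OED count out by one
    if matched == len(oed_words) and abs(len(ht_words) - len(oed_words)) == 1:
        return True   # all OED words matched, HT count out by one
    return False
```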

I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process.  There are 1407 manual matches in the system.  Of these:

  1. 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1, and 100% of HT words match OED words, or the categories are empty)
  2. 205 are rows where all words match, or where the number of HT words equals the total number of matches and the OED word count differs from it by 1, or the number of OED words equals the total number of matches and the HT word count differs from it by 1
  3. 122 are rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
  4. 18 are part of speech mismatches
  5. 267 are rows where nothing matches

I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches.  The following patterns were attempted:

  • inhabitant of the -> inhabitant
  • inhabitant of -> inhabitant
  • relating to -> pertaining to
  • spec. -> specific
  • spec -> specific
  • specific -> specifically
  • assoc. -> associated
  • esp. -> especially
  • north -> n.
  • south -> s.
  • january -> jan.
  • march -> mar.
  • august -> aug.
  • september -> sept.
  • october -> oct.
  • november -> nov.
  • december -> dec.
  • Levenshtein difference of 1
  • Adding ‘ing’ onto the end

The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum.  Where there is a matching category number, the lexeme-count / last-lexeme / total-matched-lexeme checks described above are applied and rows are colour coded accordingly.
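The pattern check amounts to generating candidate rewritten headings and comparing them; a condensed sketch using a handful of the patterns above and a plain dynamic-programming edit distance (the full script applies the whole pattern list):

```python
# A few of the heading patterns above plus the Levenshtein and '-ing'
# checks, sketched together.
PATTERNS = [
    ("inhabitant of the", "inhabitant"),
    ("inhabitant of", "inhabitant"),
    ("relating to", "pertaining to"),
    ("esp.", "especially"),
    ("north", "n."),
]

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def heading_variants(heading):
    """Generate candidate rewritten headings for an OED category."""
    for old, new in PATTERNS:
        if old in heading:
            yield heading.replace(old, new)
    yield heading + "ing"  # the 'adding ing' check

def headings_match(oed_heading, ht_heading):
    return (ht_heading in heading_variants(oed_heading)
            or levenshtein(oed_heading, ht_heading) == 1)
```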

On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week.  It feels like real progress is being made.

Week Beginning 24th September 2018

Having left Rob Maslen’s Fantasy blog in a somewhat unfinished state last Friday due to server access issues, I jumped straight into completing this work on Monday morning.  Thankfully I could access the server again and after spending an hour or so tweaking header images, choosing colour schemes and fonts, reinstating widgets, menus and such things I managed to get the site fully working again, with a fully responsive theme: http://fantasy.glasgow.ac.uk/.  I also updated some content on the Burns Paper Database website for Ronnie Young, completed my PDR, responded to a query about TheGlasgowStory and met with Matt Barr in Computing Science to discuss some possible future developments.  I also made some further tweaks to the Seeing Speech and Dynamic Dialects website upgrades that are still ongoing.  Eleanor had created new versions of some of the videos, so I uploaded them, and also updated all of the images in the image carousels for both sites.

I spent a fair amount of time this week updating the maps on the ‘Saints in Scottish Place-Names’ website.  As mentioned in a previous post, the maps on this site all use Google maps, and Google now blocks access to their maps API unless you connect via an account that has a credit card associated with it.  This is not very good for legacy research projects such as this one, so the plan was that I’d migrate the maps from Google to the free and open source Leaflet.js mapping library.  Another advantage of Leaflet is that the scripts are all stored on the same server as the rest of the resource – we’re no longer reliant on a third-party server so there should be less risk of the maps becoming unavailable in future.  Of course the map layers themselves are all stored on other third-party servers, but the ones I’ve chosen (based on the ones I selected for the REELS project) are all free to use, and another benefit of Leaflet is that it’s very simple to switch out one map layer for another – so if one tileset becomes unavailable I can replace it very quickly with another.

I created a new Leaflet powered version of the website in a subdirectory so I could test things out without messing up the live site.  As far as I could tell there were four pages that featured maps, each using them in different ways.  I migrated all of them over to the Leaflet mapping library and incorporated base maps and other features from the REELS and KCB map interface, namely:

  1. A map ‘display options’ button in the top left of the map that opens a panel through which you can change the base map.
  2. A choice of 6 base maps, as with REELS and KCB:
    1. A default topographical map
    2. A satellite map
    3. A satellite map with things like roads, rivers and settlements marked on it
    4. A modern OS map
    5. A historical OS map from 1840-1888
    6. A historical OS map from 1920-1933
  3. An ‘Attribution and copyright’ popup linked to from the bottom right of the map, which I adapted from REELS.
  4. A ‘full screen’ button in the bottom right of the map that allows you to view any map full screen. I’ve removed the ‘view larger map’ option on the Saints page as I didn’t think this was really necessary when the ‘full screen’ option is available anyway.
  5. A map scale (metric and imperial) appears in the bottom left of the map.


Here’s some information about the four map types that I updated:


  1. Place map

This is the simplest map and displays a marker showing the location of the place.  Hover over the marker to view the place-name.

  2. Saint map

This map colour codes the markers based on ‘certainty’.  I used the same coloured markers as found on the original map.  I also added a map legend to the top right that shows you what the colours represent.  You can turn any of the layers on or off to make it easier to see the markers you’re interested in (e.g. hide all ‘certain’ markers).  I removed the legend section that appeared underneath the original map as this is no longer needed due to the in-map version.

  3. Search map

As with the original version, when you zoom in on an area any place-names found in the vicinity appear as red dots.  I updated the functionality slightly so that as you pan round the map at one zoom level new markers continue to load (with the previous version you had to change the zoom level to initiate the loading of new markers).  Now as you pan around new red spots appear all over the place like measles.

  4. Search results map

I couldn’t get the original version of this map to work at all, so I think there must have been some problem with it in addition to the Google Maps issue.  Anyway, the new version displays the search results on a map, and if the search included a saint then the results are categorised by ‘certainty’ as with the saint map.  You can turn certainty levels on or off.  You can also open the marker pop-ups to link through to the place-name record and the saint record too.

There will no doubt be a few further tweaks required before I replace the live site with the new version I’ve been working on, but I reckon the bulk of the work is now done.

I also continued with the Bilingual Thesaurus project, although I didn’t have as much time as I had hoped to work on this.  However, I updated the ‘language of origin’ data for the 1829 headwords that had no language of origin, assigning ‘uncertain’ to all of them.  I also noticed that 15 headwords have no ‘date of citation’ and I asked Louise whether this was ok.  I also updated the way I’m storing dates.  Previously I had set up a separate table where any number of date fields could be associated with a headword.  Instead I have now added two new columns to the main ‘lexeme’ table: startdate and enddate.  I then wrote a script that went through the originally supplied dates (e.g. [1230,1450]), adding the first date to the startdate column and the second date to the enddate column.  Where an enddate is not supplied or is ‘0’ I’ve added the startdate to this column, just to make it clearer that this is a single year.  Louise had mentioned that some dates would have ‘1450+’ as a second date but I’ve checked the original JSON file I was given and no dates have the plus sign, so I’ve checked with her in case this data has somehow been lost.  I also discovered that there are 16 headwords that have an enddate but no startdate (e.g. the date in the original JSON file is something like [0,1436]) and have asked what should happen to these.  Finally, I made a start on the front-end for the resource.  There is very little in place yet, but I’ve started to create a ‘Bootstrap’ based interface using elements from the other thesaurus websites (e.g. logo, fonts).  Once a basic structure is in place I’ll get the required search and browse facilities up and running and we can then think about things such as colour schemes and site text.
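The date-conversion rule described above is small but worth pinning down; a sketch (pairs with no start date, like [0,1436], are deliberately left untouched pending Louise’s answer):

```python
def split_dates(pair):
    """Map an original [start, end] date pair onto the new startdate /
    enddate columns; a missing or zero end date is replaced with the
    start date to make single years explicit.  Pairs with no start
    date are returned unchanged, flagged for editorial decision."""
    start, end = pair
    if start and not end:
        end = start
    return start, end

split_dates([1230, 1450])  # -> (1230, 1450)
split_dates([1230, 0])     # -> (1230, 1230)
split_dates([0, 1436])     # -> (0, 1436), left for checking
```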

I spent the rest of the week on the somewhat Sisyphean task of matching up the HT and OED category data.  This is a task that Fraser and I have been working on, on and off, for over a year now, and it seemed like the end was in sight, as we were down to just a few thousand OED categories that were unmatched.  However, last week I noticed some errors in the category matching, and some further occasions where an OED category had been connected to multiple HT categories.  Marc, Fraser and I met on Monday to discuss the process and Marc suggested we start the matching process from scratch again.  I was rather taken aback by this as there appeared to be only a few thousand erroneous matches out of more than 220,000 and it seemed like a shame to abandon all our previous work.  However, I’ve since realised this is something that needed to be done, mainly because the previous process wasn’t very well documented and could not be easily replicated.  It’s a process that Fraser and I could only focus on between other commitments, and progress was generally tracked via email conversations and a few Word documents.  It was all very experimental, and we often ran a script, which matched a group of categories, then altered the script and ran it again, often several times in succession.  We also approached the matching from what I realise now is the wrong angle – starting with the HT categories and trying to match these to the OED categories.  However, it’s the OED categories that need to be matched and it doesn’t really matter if HT categories are left unmatched (as plenty will be, since they are more recent additions or are empty place-holder categories).  We’ve also learned a lot from the initial process and have identified certain scripts and processes that we know are the most likely to result in matches.

It was a bit of a wrench, but we have now abandoned our first stab at category matching and are starting over again.  Of course, I haven’t deleted the previous matches so no data has been lost.  Instead I’ve created new ‘v2’ matching fields and I’m being much more rigorous in documenting the processes that we’re putting the data through and ensuring every script is retained exactly as it was when it performed a specific task rather than tweaking and reusing scripts.

I then ran an initial matching script that looked for identical matches – where the maincat, subcat, part of speech and ‘stripped’ heading were all identical.  This matched 202,030 OED categories, leaving just 27,295 unmatched.  However, it is possible that not all of these 202,030 matches are actually correct.  This is because quite often a category heading is reused – e.g. there are lots of subcats that have the heading ‘pertaining to’ – so it’s possible that a category might look identical but in actual fact be something completely different.  To check for this I ran a script that identifies where the combination of the stripped heading and the part of speech appears in more than one category.  There are 166,096 matched categories where this happens.  For these the script then compares the total number of words and the last word in each match to see whether the match looks valid.  There were 12,640 where the number of words or the last word was not the same, and I created a further script that then checked whether these had identical parent category headings.  This then identified 2,414 that didn’t.  These will need further checking.
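The ambiguity check described above might be sketched like this. It is only an approximation of the logic in my own terms – the field names and the `flag_ambiguous_matches` function are hypothetical – but it shows the idea: matches whose (stripped heading, part of speech) key is shared by more than one category are re-examined, and any where the HT and OED headings differ in word count or last word get flagged for further checking.

```python
from collections import defaultdict

def flag_ambiguous_matches(matches):
    """Flag matches whose (stripped, pos) key is not unique and whose HT and
    OED headings differ in total word count or in their last word."""
    groups = defaultdict(list)
    for m in matches:
        groups[(m["stripped"], m["pos"])].append(m)
    flagged = []
    for key, ms in groups.items():
        if len(ms) < 2:
            continue  # key is unique: the identical match is accepted as-is
        for m in ms:
            ht_words = m["ht_heading"].split()
            oed_words = m["oed_heading"].split()
            if len(ht_words) != len(oed_words) or ht_words[-1] != oed_words[-1]:
                flagged.append(m)
    return flagged


matches = [
    {"stripped": "pertaining to", "pos": "aj",
     "ht_heading": "pertaining to", "oed_heading": "pertaining to"},
    {"stripped": "pertaining to", "pos": "aj",
     "ht_heading": "pertaining to law", "oed_heading": "pertaining to medicine"},
]
# Only the second match is flagged: same key, but the last words differ.
print(flag_ambiguous_matches(matches))
```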

I also noticed that a small number of HT categories had a parent whose combination of ‘oedmaincat’, ‘subcat’ and ‘pos’ information was not unique.  This is an error and I created a further script to list all such categories.  Thankfully there are only 98 and Fraser is going to look at these.  I also created a new stats page for our V2 matching process, and I hope to continue making good progress with the matching next week.
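The uniqueness check mentioned above amounts to counting how often each (oedmaincat, subcat, pos) combination occurs and listing those that occur more than once. A minimal sketch, with hypothetical field names mirroring the database columns:

```python
from collections import Counter

def non_unique_parent_keys(categories):
    """Return the (oedmaincat, subcat, pos) combinations that occur more
    than once -- the error condition that should never arise, since this
    combination is meant to identify a category uniquely."""
    counts = Counter(
        (c["oedmaincat"], c["subcat"], c["pos"]) for c in categories
    )
    return {key for key, n in counts.items() if n > 1}


cats = [
    {"oedmaincat": "01.02", "subcat": "01", "pos": "n"},
    {"oedmaincat": "01.02", "subcat": "01", "pos": "n"},   # duplicate key
    {"oedmaincat": "01.03", "subcat": "", "pos": "aj"},
]
print(non_unique_parent_keys(cats))
```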