I spent a fair amount of this week overhauling some of the STELLA resources and migrating them to the University’s T4 system. This has been pretty tedious and time-consuming, but it’s something that will only have to be done once and if I don’t do it no-one else will. I completed the migration of the pages about Jane Stuart-Smith’s ‘Accent Change in Glaswegian’ project (which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/accent-change-in-glaswegian/). I ran into some issues with linking to images in the T4 media library and had to ask the web team to manually approve some of the images. It would appear that linking to images before the system has approved them, by guessing what their filenames will be, somehow blocks their approval, so I’ll need to make sure I’m not being too clever in future. I also worked my way through the old STELLA resource ‘A Bibliography of Scottish Literature’ but I haven’t quite finished this yet. I have one section left to do, so hopefully I’ll be able to make this ‘live’ before Christmas.
Other than the legacy STELLA work I spent some time on another AHRC review that I’d been given, made another few tweaks to Carolyn Jess-Cooke’s project website and had an email conversation with Alice Jenkins about a project she is putting together. I’m going to meet with her in the first week of January to discuss this further. I also had some App management duties to attend to, namely giving some staff in MVLS access to app analytics.
Other than these tasks, I spent some time working on the Historical Thesaurus, as Fraser and I are still trying to figure out the best strategy for incorporating the new data from the OED. I created a new script that attempts to work out which categories in the two datasets match up based on their names. First of all it picks out all of the noun categories that match between the HT and the OED. ‘Match’ means that our ‘oedmaincat’ field (combined with ‘subcat’ where appropriate) matches the OED’s ‘path’ field (combined with ‘sub’ where appropriate). Our ‘oedmaincat’ field is the ‘v1maincat’ field that has had some additional reworking done to it based on the document of changes Fraser had previously sent to me.
These categories can be split into three groups:
- 1. Ones where the HT and OED headings are identical (case insensitive)
- 2. Ones where the HT and OED headings are not identical (case insensitive)
- 3. Ones where there is no matching OED category for the HT category (likely due to our ‘empty categories’)
For our current purposes we’re most interested in number 2 in this list. I therefore created a version of the script that only displayed these categories, outputting a table containing the columns Fraser had requested. I also put the category heading string that was actually searched for in brackets after the heading as it appears in the database.
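As a rough sketch of the grouping described above (written in Python purely for illustration, not the actual script): the field names ‘oedmaincat’, ‘subcat’, ‘path’ and ‘sub’ come from the description, but the row structure, the `id` and `heading` keys, the function names and the sample data are all assumptions made up for the example.

```python
from typing import Optional


def classify(ht_heading: str, oed_heading: Optional[str]) -> int:
    """Return 1, 2 or 3 for the three groups in the list above."""
    if oed_heading is None:
        return 3  # no matching OED category for the HT category
    if ht_heading.casefold() == oed_heading.casefold():
        return 1  # headings identical, ignoring case
    return 2      # category matched, but headings differ


def group_categories(ht_rows, oed_rows):
    """Group HT category IDs by how their heading compares to the OED one."""
    # Index the OED categories by (path, sub) so each HT row is one lookup.
    oed_by_key = {(r["path"], r["sub"]): r["heading"] for r in oed_rows}
    groups = {1: [], 2: [], 3: []}
    for r in ht_rows:
        oed_heading = oed_by_key.get((r["oedmaincat"], r["subcat"]))
        groups[classify(r["heading"], oed_heading)].append(r["id"])
    return groups


# Made-up sample rows (not real HT or OED data).
ht_rows = [
    {"id": 1, "oedmaincat": "01.01", "subcat": "", "heading": "Sea"},
    {"id": 2, "oedmaincat": "01.02", "subcat": "01", "heading": "score-book"},
    {"id": 3, "oedmaincat": "01.03", "subcat": "", "heading": "unmatched"},
]
oed_rows = [
    {"path": "01.01", "sub": "", "heading": "sea"},
    {"path": "01.02", "sub": "01", "heading": "score book"},
]

print(group_categories(ht_rows, oed_rows))
```

Only the IDs that land in group 2 would be listed in the table version of the output.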
At the bottom of the script I also outputted some statistics: how many noun categories there are in total (124355), how many there are that don’t match (21109) and how many HT noun categories don’t have a corresponding OED category (6334). I also created a version of the script that outputs all categories rather than just number 2 in the list above, and made a further version that strips out punctuation when comparing headings too. This converts dashes to spaces, removes commas, full stops and apostrophes and replaces a slash with ‘ or ’. This has rather a good effect on the categories that don’t match, reducing the number from 21109 to 5770. At least some of these can be ‘fixed’ by further rules – e.g. a bunch starting at ID 40807 that have the format ‘native/inhabitant’ can be matched by ensuring ‘of’ is added after ‘inhabitant’.
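The punctuation rules above could be sketched roughly as follows, again in Python for illustration only. The function name `normalise` is hypothetical, and treating the curly apostrophe the same as the straight one is an assumption rather than something the rules state.

```python
def normalise(heading: str) -> str:
    """Apply the comparison rules described above to one heading."""
    heading = heading.casefold()            # comparison is case insensitive
    heading = heading.replace("-", " ")     # dashes become spaces
    heading = heading.replace("/", " or ")  # a slash becomes ' or '
    # Remove commas, full stops and apostrophes
    # (assumption: both straight and curly apostrophes are removed).
    for ch in (",", ".", "'", "\u2019"):
        heading = heading.replace(ch, "")
    return heading


print(normalise("score-book") == normalise("score book"))
print(normalise("native/inhabitant"))
```

The ‘of’ rule for the ‘native/inhabitant’ headings is not included here, as it would need to be a separate, more specific rule.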
Fraser wanted to run some sort of Levenshtein test on the remaining categories to see which ones are closely matched and which ones are clearly very different. I was looking at this page about Levenshtein tests: http://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Fall2006/Assignments/editdistance/Levenshtein%20Distance.htm which includes a handy algorithm for testing the similarity or difference of two strings. The page doesn’t include a PHP version of the algorithm, but the Java version looks fairly straightforward to migrate to PHP. The algorithm compares two strings and returns a number reflecting how similar or different they are, based on how many changes would be required to convert one string into the other. E.g. a score of zero means the strings are identical. A score of 2 means two changes would be required to turn the first string into the second one (each change being the substitution, insertion or deletion of a single character).
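For reference, here is a minimal sketch of the standard dynamic-programming way of calculating this score, in Python for illustration. It is not the Java code from the page above, and the function name is just a placeholder.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into '' takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


print(levenshtein("score-book", "score book"))  # 1
print(levenshtein("kitten", "sitting"))         # 3
```

Only two rows of the table are kept in memory at any time, which is plenty for short strings such as category headings.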
I could incorporate the algorithm on this page into my script, running the 5770 heading pairings through it. We could then set a threshold where we consider the headings to be ‘the same’ or not. E.g. ID 224446 ‘score-book’ and ‘score book’ would give a score of 1 and could therefore be considered ‘the same’, while ID 145656 would give a very high score as the HT heading is ‘as a belief’ while the OED heading is ‘maintains what is disputed or denied’(!).
I met with Fraser on Wednesday and we agreed that I would update my script accordingly. I will allow the user (i.e. Fraser) to pass a threshold number to the script that will then display only those categories that are above or below this threshold (depending on what is selected). I’m going to try and complete this next week.
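The planned threshold option might work along these lines. This is only a sketch in Python under some assumptions: the function and parameter names are invented, and whether ‘below’ should include headings that score exactly the threshold is a guess (here it does). The two sample pairs are the ones mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def filter_by_threshold(pairs, threshold: int, mode: str = "below"):
    """Return the IDs of heading pairs above or below the threshold.

    pairs is a list of (id, ht_heading, oed_heading) tuples.
    Assumption: 'below' means a score <= threshold, 'above' a score > it.
    """
    if mode not in ("above", "below"):
        raise ValueError("mode must be 'above' or 'below'")
    selected = []
    for cat_id, ht_heading, oed_heading in pairs:
        score = levenshtein(ht_heading, oed_heading)
        if (score <= threshold) == (mode == "below"):
            selected.append(cat_id)
    return selected


pairs = [
    (224446, "score-book", "score book"),
    (145656, "as a belief", "maintains what is disputed or denied"),
]
print(filter_by_threshold(pairs, 2, "below"))  # [224446]
print(filter_by_threshold(pairs, 2, "above"))  # [145656]
```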