
Month: September 2018
Week Beginning 17th September 2018
I was off on Monday this week, and managed to break a rib whilst coming down a water slide over the weekend, so it’s been a bit of a painful week. Thankfully it’s not had any impact on the work I’ve managed to do, though. As in recent weeks, I spent a lot of the week continuing with the process of matching up the HT and OED datasets. I had another useful meeting with Marc and Fraser on Tuesday and created or executed a number of scripts to bring our total of unmatched categories ever closer to zero. I ran my ‘parent category matches’ script to tick off all the matches where the OED and HT subcats are definite matches (one to one). I also ticked off matches where there are multiple possible HT subcat matches (same stripped text) but one of the HT subcat numbers is exactly the same as the OED subcat number. After running the script the number of unmatched OED categories that have a part of speech went down from 5092 to 3318. However, running this script also ticked off almost all of the potential matches that had been suggested by another script I’d created, which then listed only 22 matches instead of hundreds, of which 15 were definite. I ticked these off too, and made some changes to a couple of other scripts that Marc and Fraser wanted to work with, such as removing categories without a part of speech from a script that identifies gaps in the OED catid number sequence.
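The core of the tick-off boils down to something like the sketch below, though the table and column names here (oed_category, ht_category, parent_catid, oedcatid and so on) are just placeholders rather than the real HT database structure:
```php
<?php
// Rough sketch of the subcat tick-off, not the actual script.
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');

// Unmatched OED subcats (with a POS) whose parent category is already
// matched to an HT category.
$sql = "SELECT o.catid AS oed_id, o.stripped, o.pos, hp.catid AS ht_parent
          FROM oed_category o
          JOIN ht_category hp ON hp.oedcatid = o.parent_catid
         WHERE o.pos <> ''
           AND NOT EXISTS (SELECT 1 FROM ht_category m WHERE m.oedcatid = o.catid)";

// Unmatched HT subcats under the matched parent with the same stripped heading and POS.
$cands = $db->prepare(
    "SELECT catid FROM ht_category
      WHERE parent_catid = :parent AND oedcatid IS NULL
        AND stripped = :stripped AND pos = :pos");
$tick = $db->prepare("UPDATE ht_category SET oedcatid = :oed WHERE catid = :ht");

foreach ($db->query($sql) as $oed) {
    $cands->execute([
        ':parent'   => $oed['ht_parent'],
        ':stripped' => $oed['stripped'],
        ':pos'      => $oed['pos'],
    ]);
    $htIds = $cands->fetchAll(PDO::FETCH_COLUMN);
    // Only tick off definite one-to-one matches; multiples are left for manual review.
    if (count($htIds) === 1) {
        $tick->execute([':oed' => $oed['oed_id'], ':ht' => $htIds[0]]);
    }
}
```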
After this process we are down to 3303 unmatched OED categories that have a part of speech, and of these 426 are main categories, which is something that is confusing Fraser somewhat. To help reduce confusion I updated the ‘parent category matches’ script so the output is now tabular. Where an OED subcat has a parent that matches an HT category that has no unmatched subcats, I now check the OED subcats with cids before and after the OED subcat in question. If these are matched then this is noted in green in the final column. I’ve also added a count of these underneath the table in green (there are 133 such categories, but note that if the OED subcat is the first or last in the category then the cid before or after will be that of a maincat).
I’ve also created a new script that lists the OE only categories that are matched. This lists the words found in the HT and OED categories. There are 344 OE only categories that match an OED category. Of these, only 51 of the matched OED categories contain words.
Later on in the week I returned to the matching issue and played around with some other possible methods of matching up the remaining OED categories. I created a new script that lists the unmatched OED categories that have a POS and looks for unmatched HT categories that have the same stripped heading and POS while ignoring the catnum / subcat. It finds 1613 potential matches, although of these 240 have multiple possible matches (e.g. ‘specific’ has 22 possible matches in the HT). For the ones where no match is found I think I’ll be able to create some rules to try and find other matches. For the multiples it might be possible to automatically deduce the closest catnum to suggest a match.
I then tweaked the script to add in a count of words in the OED and HT categories, and last words too. This should make it easier to check whether a potential match is likely. Lots of them are looking encouraging. I’ve also noted that out of the 3303 unmatched OED categories that have a POS, 858 have no words in them so presumably are not so important to match up.
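The candidate listing itself is essentially a join on the ‘stripped’ heading and POS, something along these lines (table and column names are again placeholders, the word counts assume hypothetical oed_lexeme / ht_lexeme tables, and the last-word comparison is left out for brevity):
```php
<?php
// Sketch of the "same stripped heading and POS, ignoring catnum/subcat" listing.
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');
$rows = $db->query(
    "SELECT o.catid AS oed_id, o.heading AS oed_heading, o.pos,
            h.catid AS ht_id, h.heading AS ht_heading,
            (SELECT COUNT(*) FROM oed_lexeme ol WHERE ol.catid = o.catid) AS oed_words,
            (SELECT COUNT(*) FROM ht_lexeme hl WHERE hl.catid = h.catid) AS ht_words
       FROM oed_category o
       JOIN ht_category h ON h.stripped = o.stripped AND h.pos = o.pos
      WHERE o.pos <> ''
        AND h.oedcatid IS NULL
        AND NOT EXISTS (SELECT 1 FROM ht_category m WHERE m.oedcatid = o.catid)
      ORDER BY o.catid"
)->fetchAll(PDO::FETCH_ASSOC);

// Group candidate HT matches under each OED category so the multiples stand out.
$byOed = [];
foreach ($rows as $r) {
    $byOed[$r['oed_id']][] = $r;
}
foreach ($byOed as $oedId => $cands) {
    $flag = count($cands) > 1 ? ' [' . count($cands) . ' possible matches]' : '';
    echo $oedId . ' ' . $cands[0]['oed_heading'] . ' (' . $cands[0]['pos'] . ')' . $flag . "\n";
    foreach ($cands as $c) {
        echo "  -> {$c['ht_id']} {$c['ht_heading']} "
           . "[HT words: {$c['ht_words']}, OED words: {$c['oed_words']}]\n";
    }
}
```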
I then updated the script further so that maincats and subcats appear in two separate tables, with maincats listed first. It looks like some of the unmatched maincats are empty categories that have been created for a POS so they sit alongside categories with other POSes. E.g. if you find ‘eleven to ninety-nine’ in the page you’ll see that there are four OED categories, for VT, AV, AJ, and N. Of these the HT has ones for AV, AJ and N but not VT. All four have no words in them in the OED data. Note that it looks like the reason these haven’t previously been matched is because of an erroneous HT catnum: 01.07.04.012
Whilst working on this I uncovered a matched category that is incorrect, which is rather worrying. OED category 235880 ‘après-ski’ (03.11.04.13.12.01|17 (n)) is unmatched, but HT category 223643 ‘après-ski’ (oedmaincat 03.11.04.13.12.01|18 (n)) is matched. However, it’s matched to OED category 235870 ‘parts or attachments’ (03.11.04.13.12.01|08.02 (n)). I’m not sure how this has happened, and it made me realise that I needed to create a script that would actually check that the matched data is correct. I decided to write a script that checks the stripped headings of all matched categories and lists those that have a Levenshtein score of more than a certain number, starting with 8. There are 11,666 of these, but the majority are not errors – e.g. ‘Promontory’ and ‘promontory, headland, or cape’. There are some that are definitely errors, though – e.g. ‘seed of’ and ‘turnip plant’.
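The checking script is essentially just PHP’s built-in levenshtein() run over the stripped headings of every matched pair, roughly as follows (with placeholder table and column names):
```php
<?php
// List matched HT/OED pairs whose stripped headings differ by more than a threshold.
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');
$threshold = 8;

$pairs = $db->query(
    "SELECT h.catid AS ht_id, h.stripped AS ht_stripped, h.heading AS ht_heading,
            o.catid AS oed_id, o.stripped AS oed_stripped, o.heading AS oed_heading
       FROM ht_category h
       JOIN oed_category o ON o.catid = h.oedcatid"
);

foreach ($pairs as $p) {
    // levenshtein() works on bytes, which is good enough for a rough flag like this.
    $distance = levenshtein($p['ht_stripped'], $p['oed_stripped']);
    if ($distance > $threshold) {
        // Potentially mismatched pair; needs a human eye.
        echo "$distance\tHT {$p['ht_id']} '{$p['ht_heading']}'\t"
           . "OED {$p['oed_id']} '{$p['oed_heading']}'\n";
    }
}
```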
I then made some further updates to the script, adding in category details and also reverting to an exact match of the stripped heading fields rather than using a Levenshtein test, just to be sure. I also excluded any categories where the catnum, subcat and pos are the same for HT and OED but the heading is different, as it seemed like these were all correct. What this leaves is 1403 possible errors. A lot of these are not errors at all, but are legitimate differences in headings (e.g. ‘Resembling animal/bird sounds’ and ‘sounds like animal or bird sounds’) but I’m afraid a lot of them are genuine errors. A lot of them seem to be where there is no ‘oedmaincat’, but not all of them. I think we’re going to have to get someone to go through the list and figure out which are real errors, which shouldn’t take more than an hour or so. I added in a new ‘checktype’ column to the output so we could see whether the errors appeared in the manual or automatically matched data. Most were through the automatic processes.
Marc was concerned that for the incorrectly matched categories there might be a bunch of incorrectly matched HT categories that my script isn’t picking up – e.g. HT ‘foraging equipment’ is set to match OED ‘casting equipment’, which is wrong. But what is HT ‘casting equipment’ matched to then? However, it would appear that the correct match on the HT side is just sitting there unconnected to any OED category. E.g. ‘casting equipment’ in the HT is not connected to any OED category yet. So once the erroneous matches are ‘dematched’ hopefully most of them can be matched up to the correct (and so far unmatched) category.
Marc also wanted to check whether any duplicate matches exist in the system – where one HT category points to multiple OED categories. A quick query of the database showed that there are a few duplicates in the system. Of the 226133 HT categories that have an OEDcatid, 226025 of them are unique. So there are 108 OED categories that are referenced in multiple HT categories. Thankfully a tiny number, but something that will need to be fixed. I created a script to list these and we’ll need to discuss this at the next meeting.
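The duplicate check itself is just a GROUP BY / HAVING query, roughly like this (again with placeholder names):
```php
<?php
// List OED categories that more than one HT category points to.
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');
$dupes = $db->query(
    "SELECT oedcatid, COUNT(*) AS ht_count,
            GROUP_CONCAT(catid ORDER BY catid) AS ht_catids
       FROM ht_category
      WHERE oedcatid IS NOT NULL
      GROUP BY oedcatid
     HAVING COUNT(*) > 1
      ORDER BY ht_count DESC"
);
foreach ($dupes as $d) {
    echo "OED {$d['oedcatid']} is referenced by {$d['ht_count']} HT categories: {$d['ht_catids']}\n";
}
```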
Other than working on the HT / OED linking I split my time mostly between two projects: the redesign of the Seeing Speech / Dynamic Dialects websites and the development of the Bilingual Thesaurus. For the former I added in content for all of the remaining ancillary pages. This took a fair amount of time to do as there was lots of working with raw HTML, adding in links, checking them, creating new images and such things. It’s pretty tedious stuff but it’s really worth doing as the new website works so much better than the old one. I also split the homepage up into shorter chunks, with lots of the text getting moved to new ‘about the project’ pages, and shortened the excessively long citation on the Dynamic Dialects site. I also added in a nice ‘top’ button that appears when you scroll down the page, and added in a ‘cite’ option to individual video overlays. I think we’re just about there now. Just the carousel images to update and a questionnaire about the new site to design and implement.
For the Bilingual Thesaurus I began working with the data I’d previously been sent, in JSON format. The file was pretty well structured, although I did have some questions relating to dates and languages. My initial task was to create a single MySQL table into which I would import the JSON data, and a simple PHP script that would go through each object in the JSON data, extract the individual variables and insert these into the table. After a bit of experimentation, I managed to get the data uploaded, resulting in 4779 rows. My next task was to rationalise the data into a relational database structure. For example, the original data had two language types (language of origin and language of citation), which were stored in an array in the JSON file. Each time a language appears its full text is listed, and sometimes the text has an initial question mark to denote uncertainty. Instead of this I created a ‘language’ table where each language (ignoring question marks) is listed once and is given a unique ID. There are 39 different languages in the data. Then I created a joining table that joins a headword entry with however many languages are needed. This table includes a field for the type of join (i.e. whether the language is ‘origin’ or ‘citation’) and a further field noting whether the join is uncertain (for those question marks). It’s a system that will allow much more flexible queries to be performed. I took a similar approach for dates and dictionary links too.
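The language handling is essentially a ‘look up or create’ step followed by an insert into the joining table, something like the sketch below (the table and column names, the example headword ID and the example languages are all made up for illustration):
```php
<?php
// Normalise a language name to a row in the 'language' table, creating it if needed.
$db = new PDO('mysql:host=localhost;dbname=bth;charset=utf8', 'user', 'pass');

function languageId(PDO $db, string $name): int {
    $clean = ltrim($name, '? ');   // ignore the leading question mark when looking it up
    $find = $db->prepare("SELECT id FROM language WHERE name = ?");
    $find->execute([$clean]);
    $id = $find->fetchColumn();
    if ($id === false) {
        $db->prepare("INSERT INTO language (name) VALUES (?)")->execute([$clean]);
        $id = $db->lastInsertId();
    }
    return (int)$id;
}

// Join a headword to a language, recording the join type and whether it's uncertain.
function joinLanguage(PDO $db, int $headwordId, string $name, string $type): void {
    $uncertain = (strpos($name, '?') === 0) ? 1 : 0;
    $db->prepare(
        "INSERT INTO headword_language (headword_id, language_id, type, uncertain)
         VALUES (?, ?, ?, ?)"
    )->execute([$headwordId, languageId($db, $name), $type, $uncertain]);
}

// e.g. for one headword pulled out of the JSON (IDs and names are hypothetical):
joinLanguage($db, 123, '?Anglo-Norman', 'origin');
joinLanguage($db, 123, 'Middle English', 'citation');
```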
I then set about splitting up the ‘path’ field, which was similarly stored as an array in the JSON file, with each part of the hierarchical path repeated for every headword that required it, with no unique IDs or any other information. This results in a lot of duplicated data, and it also means it’s impossible to search for a particular part of the hierarchy, as the same names are used multiple times to represent very different parts of the hierarchy.
I wrote a nice little script that I’m rather pleased with. It went through the paths of each headword, extracted each part of the path, checked whether it already existed in my new ‘category’ database, associated the existing entry if it did, and created and associated a new entry if it didn’t. Each part of the path is now listed just once with its own unique identifier and the ID of its parent category. Using this it will then be possible to generate a tree interface to the data.
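In essence the script walks down each path doing a find-or-create at every level, roughly like this (the ‘category’ table structure and the example path are assumptions):
```php
<?php
// Return the category ID for the deepest part of a path, creating any levels
// that don't exist yet, each linked to its parent.
$db = new PDO('mysql:host=localhost;dbname=bth;charset=utf8', 'user', 'pass');

function categoryIdForPath(PDO $db, array $path): ?int {
    $parentId = null; // top of the tree
    foreach ($path as $name) {
        $find = $db->prepare(
            "SELECT id FROM category WHERE name = ? AND " .
            ($parentId === null ? "parent_id IS NULL" : "parent_id = ?")
        );
        $find->execute($parentId === null ? [$name] : [$name, $parentId]);
        $id = $find->fetchColumn();
        if ($id === false) {
            // Not seen at this level before: create it under the current parent.
            $db->prepare("INSERT INTO category (name, parent_id) VALUES (?, ?)")
               ->execute([$name, $parentId]);
            $id = $db->lastInsertId();
        }
        $parentId = (int)$id; // descend one level
    }
    return $parentId;
}

// A hypothetical path: the same 'Farming' ID comes back every time this level is seen.
$catId = categoryIdForPath($db, ['Daily life', 'Farming']);
```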
I wrote a little test script that displays each headword, its original ‘path’ and then the ID and name of each hierarchical level (from bottom to top) in my new database, with the full hierarchy then listed underneath to check the original and generated forms match (which thankfully they all do). For example, ‘Farming’ has been given the ID 431, and each time it appears it’s this unique ‘farming’ category that is displayed. I’m very pleased with how this is all working out so far.
Other than these tasks I responded to some queries from other members of staff, for example Simon Taylor who wanted advice on a proposal he’s writing, Ronnie Young who wanted me to update some content on his Burns Paper Database website, Brianna Robertson-Kirkland who wanted to know the copyright implications of embedding YouTube videos, and Valentina Busin, for whom I created a Google Play store listing, basic app details and user accounts for a new app. I also began updating the interface to Rob Maslen’s Fantasy blog on Friday afternoon, but the server went weird and blocked me as I was halfway through working on things. I’ll have to sort this out first thing on Monday.
Week Beginning 10th September 2018
This was the sort of week when I worked on many different projects. I created new ‘copyright’ and ‘terms of use’ pages for the DSL website and also made a few other tweaks that had been requested. I created a second version of the Data Management Plan for Matthew Sangster’s project, based on feedback from him and the PI at Stirling, Katie Halsey. I had an email discussion with a member of staff in MVLS about an app one of her students would like to publish, and I spoke to Zanne Domoney-Lyttle about a website she would like to factor into a funding proposal she is putting together.
On Wednesday I met with the project team for the Kirkcudbrightshire place-names project, which is just starting up. I’d already set up a version of the REELS system for the project to use, so this was an opportunity to meet the team and go through how to use the content management system. It was a useful session as a couple of technical issues cropped up that I needed to fix after the meeting, namely:
- The ‘add element’ feature wasn’t working. It turned out that this was because I’d forgotten to migrate the contents of the ‘language’ table over from the REELS system, and as the system expected some data that was not there the element boxes didn’t load. This has now been sorted.
- I was asked to migrate the contents of the ‘sources’ table over from REELS, which I did. These can now be viewed through the KCB CMS by pressing on the ‘browse sources’ link. However, a lot of these are not going to be relevant as they’re specifically about Berwickshire.
- When demonstrating the REELS place-name search facilities I noted that a quick search for ‘t*’ was bringing back place-names that didn’t start with ‘t’. I found out why: The quick search also searches elements, so any place-name that has an element starting with ‘t’ was also returned. This is a bit confusing so perhaps we want to limit the quick search to headwords only. However, you can use the advanced search to search specifically for ‘Current place-names’, or indeed you can use the ‘browse’ feature to bring back current place-names starting with a particular letter.
- I noticed at the meeting that the CMS automatically calculates the altitude of a place and I had a feeling that this was using Google Maps. As it has been months since I set the facility up I had to check to make sure. It turns out this part of the site does indeed use Google Maps, and there are issues with using this service now, as I discussed last week. The CMS connects to the Google Maps API, passes the latitude and longitude to the service and Google returns the altitude for that location. However, I realised that there is no need to worry about this feature (or the Google Map embedded in the ‘edit record’ page) breaking as the system is already set up to use my Google account, which has an associated credit card. I wasn’t aware that it would potentially be using my credit card until now, but there you go. However, as the only place we use Google Maps is in the CMS, which can only be accessed by the project teams of REELS and KCB, I don’t think I’ll ever face a bill. The stats show that in the past 30 days there have been 278 calls to the Google Maps API and 8 calls to the Elevation API, and the free tier allows up to 28,000 calls to the former and 40,000 calls to the latter. So unless we have a particularly malicious member of staff who sits and refreshes their page thousands of times I think I’m safe!
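For reference, the altitude lookup amounts to a single call to Google’s Elevation API and a bit of JSON decoding, along the lines of the sketch below (the function name and the coordinates are just for illustration):
```php
<?php
// Return the altitude in metres for a latitude/longitude, or null on failure.
function altitudeFor(float $lat, float $lng, string $apiKey): ?float {
    $url = 'https://maps.googleapis.com/maps/api/elevation/json?locations='
         . $lat . ',' . $lng . '&key=' . urlencode($apiKey);
    $json = file_get_contents($url);
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    if (($data['status'] ?? '') !== 'OK' || empty($data['results'])) {
        return null;
    }
    return (float)$data['results'][0]['elevation'];
}

// e.g. a (made-up) spot in the Borders:
echo altitudeFor(55.77, -2.34, 'YOUR_API_KEY');
```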
I also spent some time this week going through the updates to all of the ‘Seeing Speech’ and ‘Dynamic Dialects’ pages that Eleanor had sent me and setting up the content. This included creating new versions of image files that don’t have big, thick borders, creating new MP4 versions of some video files that were in a different format that couldn’t be supported natively by HTML5, and formatting all of the text for the new pages. The latter also included amalgamating many small pages into single longer pages, as these tend to be preferred these days due to touchscreens. The new site isn’t live yet, and there are still some changes to be made to the homepage text and other pages, but the bulk of the new site is now in place. Hopefully we’ll be able to go live with the new design in the coming weeks.
The rest of my time this week was spent on Historical Thesaurus duties. I had a productive meeting with Marc and Fraser on Tuesday, and devoted a lot of my time this week to writing scripts to help in the matching up of the HT and OED data. This included creating a new statistics page that lists stats about the HT and OED categories and lexemes and what still needs to be matched up. As part of this task Marc wanted to know how many HT categories only contain OE words, and how many are empty. The latter was easy to do but the former was rather tricky, as it meant going through every HT category and then every lexeme in each of these categories to check for the presence of non-OE words. This took too long to do on the fly so instead I updated the database to include a new ‘OE only’ field. Running the script to generate data for this field took about 20 minutes, but now the data is in the database it’s really quick to query. It turns out there are 3175 HT categories that only contain OE words.
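The generation script is essentially one pass that flags categories containing at least one word and no non-OE words; a minimal sketch, assuming a hypothetical ‘oe’ flag on the lexeme table and an ‘oe_only’ column on the category table (the real field names will differ):
```php
<?php
// One-off pass to populate the 'OE only' flag on the category table.
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');

// Reset the flag, then set it for categories that have words but no non-OE words.
$db->exec("UPDATE ht_category SET oe_only = 0");
$db->exec(
    "UPDATE ht_category c
        SET c.oe_only = 1
      WHERE EXISTS (SELECT 1 FROM ht_lexeme l WHERE l.catid = c.catid)
        AND NOT EXISTS (SELECT 1 FROM ht_lexeme l
                         WHERE l.catid = c.catid AND l.oe = 0)"
);
```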
I also wrote a script that addresses the issue of lexemes not being matched up because of pesky apostrophes. We’ve also matched up lots of new categories since I last did a lexeme match so I thought I’d run one. The script finds every HT category that has been matched to an OED category, brings back all of the unmatched words in both HT and OED categories and then compares the ‘stripped’ fields for each to identify words that should be linked together. I ran the script across all matched categories and it has identified 24,795 words that are not currently matched but should be (i.e. their category is matched and the contents of the ‘stripped’ field in the HT and OED word tables are identical). I haven’t ticked these off yet, but it’s a nice big number of new matches.
I also created a script that for each unmatched OED subcategory finds its parent category. If this is matched to an HT category then the script finds this and returns all of its subcategories to see if there is one with the same name as the unmatched OED subcategory. This has actually worked very well. There are 4666 OED subcats that have a POS. Of these there are 3158 that have a parent that has been matched to an HT category. When looking at unmatched subcats in each of these HT maincats and comparing ‘stripped’ headings of each subcat to the OED subcat there are 2992 that match. I updated the script to mark off the matches but then something odd happened. When it marked off the matches it only reported 2710 subcat matches, which was a bit concerning, so I’ve reverted to a backup version of the category table that I’d made.
In order to investigate this discrepancy I updated the script so that any OED subcat that matches multiple HT subcats is now logged and is listed at the bottom of the page, together with counts of the duplicates and the total number of duplicates that are found (391). If you search the page for one of these IDs you can see where the duplicates occur. E.g. The OED subcat with ID 58953 (types of) within ‘clothing for body or trunk’ matches nine subcats within the joined HT maincat. This is because we’re looking at all subcats at all levels, and ‘types of’ crops up several times at different levels. I have therefore added in another check that identifies whether a match has the same subcat number. If there is one then ‘Subs match too’ appears in purple next to the green ‘Match’ text. This text appears for both single matches and multiple matches.
I’ve also added in some counts at the bottom of the page but above the list of duplicates. These appear in purple. There is a count of the matches where there are no duplicates. These are probably safe to tick off as proper matches. There are 2601 in total, out of 2992 subcat matches. Exact occurrences of these are marked in the output with the purple text ‘One match. Safe to log?’.
There is also a count of the possible matches where the subcat number is the same in both HT and OED data (where ‘subs match too’ appears in the output, as mentioned above). This is useful in identifying which of the duplicates might be the correct ones. There are 1732 matches where the sub numbers match, including both where there are duplicates and individual matches. If the subs don’t match where there is one match (e.g. 143009 “one’s lot” matching 136436 “one’s lot”) it is because the subcat order has been messed about with (in this example the OED subcat number is 02 while the HT subcat number is 02.02).
I think it should be relatively safe to log all occurrences where there is one match, whether the subcat number is the same or not. This would tick off 2601 categories. I think it should also be pretty safe to tick off matches where there are duplicates but the subcat number also matches. I’m not entirely sure how many that would tick off, but I would imagine it would be a fairly sizable portion of the 391 duplicates.
I also updated the script I had created last week that displays unmatched HT categories that have an ‘oedmaincat’ and therefore should be possible to match up to an OED category. Content is now displayed as a table to hopefully make it easier to read. I’ve added in a count of the words in the HT and OED categories and also the last word in each category, together with its dates. Where a category has multiple potential matches the first column has a red background colour and a ‘Y’ in it. I think it will be possible to automatically figure out the correct one for most of these multiples based on the words. E.g. the first category is HT 39514 ‘one who’ and its last word (well, only word) is ‘Malacologist’. Of the nine possible OED matches there is one whose last word is also ‘Malacologist’ so is no doubt the correct match. However, adding in the words shows that some potential direct matches have different contents, e.g. the first row ‘causing discomfort’ has 4 words but the matching OED category only has 3 (OED omits ‘discomfortable’). There is also often variation in the final words too, usually in spelling or use of punctuation, e.g. 15759 by occult methods has ‘point the bone’ while in the OED it’s ‘to point the (death) bone’. Using the ‘stripped’ field will catch a lot of these (e.g. ‘R.S.P.B’ and ‘RSPB’) but not all of them. Sometimes the word is completely different – e.g. 31915 pediculus corporis/body-louse has as its last word ‘typhus-louse’ while the corresponding category has the rather wonderful ‘pants rabbits’.
I made some further updates to this script to give cells a green background if the HT and OED numbers of words match and also if the last word (stripped) matches so you can see where the strong potential matches are. This works for categories where there are duplicate possibilities too. I’ve also added some stats to the bottom of the page. There are a total of 920 potential matches and of these 43 have multiple possibilities. Of these 32 have identical last words and are therefore probably the correct matches. Overall there are 708 strong matches (i.e. with the same number of words and the same last word), including going through the multiples. I would say it is probably safe to tick these 708 off. However, the output of this script overlaps with the output of the previous one. It is possible that most, or even all, of the matches identified by this script are already identified by the parent category match script. E.g. OED 43618 ‘shells’ is matched to HT 39522 ‘shells’ while it is also matched by the parent category match script.
I also created a script that lists all matched maincats and gives a count of the total number of subcats in each (not differentiating between matched and unmatched subcats). Note that for HT data I’ve used the full ‘T’ numbers of the maincat to find its subcats rather than using the ‘oedmaincat’ field. I’ve highlighted the rows where the numbers of subcats in the HT and OED data don’t match. Where there are more HT subcats than OED subcats the background colour is the green of the HT header. Where there are more OED subcats than HT subcats the background colour is the blue of the OED header.
The final script I created identifies gaps in the matched OED categories. Currently the script orders the matched categories in the HT category table by the OED catid. Where there is a gap between the previous OED catid and the current OED catid (e.g. OED catid 24 and 26) the script displays the HT and OED category information for the previous and next matched categories and then lists the unmatched OED categories that appear in the gap (a rough sketch of this appears after the list below). However, this is complicated by two things:
- Quite often the gap in OED numbering is caused by OED categories that have no POS and will therefore never be matched. I’ve marked these in the output of the script with a bold ‘No POS’.
- The ‘next’ matched category is often of a different part of speech. I guess where this happens we should be able to figure out whether the missing categories that have a POS are likely to be connected to the ‘previous’ or ‘next’ category, as their POS will likely match one or the other.
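The gap-finding itself boils down to ordering the matched OED catids and looking for non-adjacent neighbours, roughly as follows (placeholder table and column names again):
```php
<?php
// List the unmatched OED categories that sit in gaps between matched catids.
$db = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');

$matched = $db->query(
    "SELECT oedcatid FROM ht_category
      WHERE oedcatid IS NOT NULL ORDER BY oedcatid"
)->fetchAll(PDO::FETCH_COLUMN);

$inGap = $db->prepare(
    "SELECT catid, heading, pos FROM oed_category
      WHERE catid > ? AND catid < ? ORDER BY catid"
);

for ($i = 1, $n = count($matched); $i < $n; $i++) {
    $prev = (int)$matched[$i - 1];
    $next = (int)$matched[$i];
    if ($next - $prev <= 1) {
        continue; // no gap in the numbering here
    }
    echo "Gap between matched OED catids $prev and $next:\n";
    $inGap->execute([$prev, $next]);
    foreach ($inGap as $c) {
        // Categories with no POS will never be matched, so flag them.
        $note = ($c['pos'] === '' || $c['pos'] === null) ? ' [No POS]' : '';
        echo "  {$c['catid']} {$c['heading']} ({$c['pos']})$note\n";
    }
}
```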
This will need further discussion when I meet with Marc and Fraser again next week. My final HT task of the week was to set up a basic interface for the new ‘Thesaurus’ portal site that we’re going to launch. It still needs a lot of work (and some content) but it’s beginning to take shape.
Week Beginning 3rd September 2018
It was back to normality this week after last week’s ICEHL conference. I had rather a lot to catch up with after being out of the office for four days last week and spending the fifth writing up my notes. I spent about a day thinking through the technical issues for an AHRC proposal Matthew Sangster is putting together and then writing a first version of the Data Management Plan. I also had email conversations with Bryony Randall and Dauvit Broun about workshops they’re putting together that they each want me to participate in. I responded to a query from Richard Coates at Bristol who is involved with the English Place-Name Society about a database-related issue the project is experiencing, and I also met with Luca a couple of times to help him with an issue related to using OpenStreetMap maps offline. Luca needed to set up a version of a map-based interface he has created that needs to work offline, so he needed to download the map tiles for offline use. He figured out that this is possible with the Marble desktop mapping application (https://marble.kde.org/) but couldn’t figure out where the map tiles were stored. I helped him to figure this out, and also to fix a couple of JavaScript issues he was encountering. I was concerned that he’d have to set up a locally hosted map server for his JavaScript to connect to, but thankfully it turns out that all of the processing is done at the JavaScript end, and all you need is the required directory/subdirectory structure for map tiles and the PNG images themselves stored in this structure. It’s good to know for future use.
I also responded to queries from Sarah Phelan regarding the Medical Humanities Network and Kirsteen McCue about her Romantic National Song Network. Eleanor Lawson also got in touch with some text for one of the redesigned Seeing Speech website pages, so I added that. It also transpired that she had sent me a document containing lots of other updates in June, but I’d never received the email. It turns out she had sent it to a Brian Aitken at her own institution (QMU) rather than me. She sent the document on to me again and I’ll hopefully have some time to implement all of the required changes next week.
I also investigated an issue Thomas Clancy is having with his Saints Places website. The Google Maps used throughout the website are no longer working. After some investigation it would appear that Google is now charging for using its maps service. You can view information here: https://cloud.google.com/maps-platform/user-guide/. So you now have to set up an account with a credit card associated with it to use Google Maps on your website. Google offer $200 worth of free usage, and I believe you can set a limit that would mean if usage goes over that amount the service is blocked until the next monthly period. Pricing information can be found here: https://cloud.google.com/maps-platform/pricing/sheet/. The maps on the Saints website are ‘Dynamic Maps’, and although the information is pretty confusing I think the table on the above page says that the $200 of free credit would cover 28,000 loads of a map on the Saints website per month (the cost is $7 per 1000 loads), and every time a user loads a page with a map on it this is one load, so one user looking at several records will log multiple map loads.
This isn’t something I can fix and it has worrying implications for projects that have fixed periods of funding but need to continue to be live for years or decades after the period of funding. It feels like a very long time since Google’s motto was “Don’t be evil” and I’m very glad I moved over to using the Leaflet mapping library rather than Google a few years ago now.
I also spent a bit of time making further updates to the new Place-names of Kirkcudbrightshire website, creating some place-holder pages for the public website, adding in the necessary logos and a background map image, updating the parish three-letter acronyms in the database and updating the map in the front-end so that it defaults to showing the right part of Scotland.
I was engaged in some App related duties this week too, communicating with Valentina Busin in MVLS about publishing a student-created app. Pamela Scott in MVLS also contacted me to say that her ‘Molecular Methods’ app had been taken off the Android App store. After logging into the UoG Android account I found a bunch of emails from Google saying that about 6 of our apps had been taken down because they didn’t include a ‘child-directed declaration’. Apparently this is a new thing that was introduced and you have to tick a checkbox in the Developer console to say whether your app is primarily aimed at under 13 year-olds. Once that’s done your app gets added back to the store. I did this for the required apps and all was put right again about an hour later.
I spent about a day this week working on Historical Thesaurus duties. I set up a new ‘colophon’ page that will list all of the technologies we use on the HT website and I also returned to the ongoing task of aligning the HT and OED data. I created new fields for the HT and OED category and word tables to contain headings / words that are stripped of all non-alphanumeric characters (including spaces) and also all occurrences of ‘ and ’ and ‘ or ’ (with spaces round them). I also converted the text into all lower case. This means a word such as “in spite of/unþonc/maugre/despite one’s teeth” will be stored in the field as “inspiteofunþoncmaugredespiteonesteeth”. The idea is that it will be easier to compare HT and OED data with such extraneous information stripped out. With this in place I then ran a script that goes through all of the unmatched categories and finds any where the oedmaincat matches OED path, subcat matches OED sub, the part of speech matches and the ‘stripped’ headings match. This has identified 1556 new matches, which I’ve now logged in the database. This brings the total unmatched HT categories down to 10,478 (of which 1679 have no oedmaincat and presumably can’t be matched). The total unmatched OED categories is 13,498 (of which 8406 have no pos and so will probably never match an HT category). There are also a further 920 potential matches where the oedmaincat matches the path, the pos matches and the ‘stripped’ headings match, but the subcat numbers are different. I’ll need to speak to Marc and Fraser about these next week.
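The stripping boils down to something like the little function below (a sketch of the approach rather than the exact code; the /u flag on the regular expression keeps non-ASCII letters such as ‘þ’ intact):
```php
<?php
// Produce the 'stripped' form: lower-case, drop ' and ' / ' or ', then remove
// everything that isn't a letter or a digit (spaces included).
function stripForm(string $text): string {
    $text = mb_strtolower($text, 'UTF-8');
    $text = str_replace([' and ', ' or '], '', $text);
    return preg_replace('/[^\p{L}\p{N}]+/u', '', $text);
}

echo stripForm("in spite of/unþonc/maugre/despite one's teeth");
// inspiteofunþoncmaugredespiteonesteeth
```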
I spent most of Friday working on setting up the system for the ‘Records of Govan Old’ crowdsourcing site for Scott Spurlock. Although it’s not completely finished, things are beginning to come together. It’s a system that’s based on the ‘Scripto’ crowdsourcing tool (http://scripto.org/) that uses Omeka and MediaWiki to manage data and versioning. The interface I’ve set up is pretty plain at the moment but I’ve set up a couple of sample pages with placeholder text (Home and About). It’s also possible to browse collections – currently there is only one collection (Govan old images) but this could be used to have different collections for different manuscripts, for example. You can then view items in the collection, or from the menu choose ‘browse items’ to access all of them.
For now there are only two sample images in the system, which are images from a related manuscript that Scott previously gave me. Users can create a user account via MediaWiki, and if you then go to the ‘Browse items’ page and select one of the images to transcribe you can view the image in a zoomable / pannable image viewer, view any existing transcription that’s been made, view the history of changes made and, by pressing the ‘edit’ link, open a section that allows you to edit the transcription and add your own.
I’ve added in a bunch of buttons that place tags in the transcription area when they’re clicked on. They’re TEI tags so eventually (hopefully) we’ll be able to shape the texts into valid TEI XML documents. All updates made by users are tracked and you can view all previous versions of the transcriptions, so if anyone comes along and messes things up it’s easy to revert to an earlier version. There’s also an admin interface where you can view the pages and ‘protect’ them, which prevents future edits being made by anyone other than admin users.
There’s still a lot to be done with this. For example, at the moment it’s possible to add any tags and HTML to the transcription, which we want to prevent for security reasons as much as anything else. The ‘wiki’ that sits behind the transcription interface (which you see when creating an account) is also open for users to edit and mess up so that needs to be locked down too. I also want to update the item lists so that they display which items have not been transcribed, which have been started and which have been ‘protected’, to make it easier for users to find something to work on. I need to get the actual images that we’ll use in the tool before I do much more with this, I reckon.
Week Beginning 27th August 2018
I attended the ICEHL (International Conference on English Historical Linguistics) conference in Edinburgh this week (see http://www.conferences.cahss.ed.ac.uk/icehl20/). It was a pretty intense conference, running from 9-5 each day with up to 8 parallel sessions and workshops running in addition to plenaries, drinks receptions and a lovely conference dinner in the Playfair Library in the Old College. As Glasgow and Edinburgh are so geographically close I’d decided that rather than staying in a hotel I’d commute through each day, which turned out to be a bit of a mistake, as my door to door commute was two hours each way, which was pretty exhausting. I did for a time live in Glasgow and work at Edinburgh University so I should really have known better, but I guess I’d just blocked the horrendousness of the commute out of my mind.
Anyway, my blog this week is really just going to be a summary of some of the papers I saw at ICEHL. Although pretty much all of them were full of interesting stuff, not all of them were especially relevant to my own particular field of Digital Humanities, so I’ll try to focus more on those that did have a larger DH component. These were mostly all grouped into a day-long workshop that took place on the last day of the conference, with the theme of ‘Visualisations in Historical Linguistics’. I contributed to a paper that was given by Fraser on visualisations in the Historical Thesaurus during this workshop too.
Monday
Monday started with a plenary session about the ‘irregularisation’ of verbs in Early Modern English. The speaker showed some mathematical formulae that could be used to test for this, and showed how rules for predicting which past tense verb forms will be acquired during native language acquisition could be established. After this I attended a paper on ‘The lemmatisation of Old English class VII strong verbs on a lexical database’. The speaker discussed the Nerthus project (http://www.nerthusproject.com/) which has about 30,000 records of OE words, with data taken from many OE dictionaries. It includes alternative spellings and forms, the part of speech and other such data. The project incorporated three million files into a database and lemmatised words using a lemmatiser called Norna. The database itself is based on Filemaker. I also attended a paper on ‘Ambiguity resolution and the evolution of homophones in English’. This used the CELEX corpus (https://catalog.ldc.upenn.edu/LDC96L14) and the speaker said that in her sample about 22% of the data were homophones. These included diatones, where the noun and verb are spelled the same but stress is used to differentiate them (for example ‘contract’). The speaker showed visualisations generated by near infrared spectroscopy of the brain that showed the optical paths in the brain when diatones were spoken. This showed that different pathways were active when the noun or the verb form was heard.
I also attended a series of papers as part of the ‘Standardisation after Caxton’ workshop, which I found very interesting, even though it wasn’t massively connected to Digital Humanities. There was a handy introductory session where the speaker discussed Haugen’s standardisation model of codification, elaboration, selection and acceptance (e.g. see https://courses.nus.edu.sg/course/elltankw/history/Standardisation/B.htm). The speaker pointed out that standardisation was already under way before Caxton and previous studies have identified four types, of which three are all London based. However, we need to consider what’s going on beyond London and also consider multilingual factors, such as the influence of French. The speaker also pointed out that the convention is that variation ended ‘soon’ after Caxton, but that this might actually be as late as the 1800s, and that variation persists, especially in handwritten materials (and this continues to the present day). The speaker gave the example of alchemical works, which tended to be handwritten as they were illegal and contain much variation in spelling. Issues such as whether materials were private or public also need to be considered, as do social factors, so standardisation cannot be purely looked at as based on geography.
The next speaker gave a paper on this particular subject: ‘Broadening the horizon of the written Standard English debate: a view beyond the metropolis’. The speaker reiterated that the established view was that the standard developed from government (chancery) scribes in London, but that this view is now being challenged, and that a standard wouldn’t develop from one place. The speaker argued that there are a variety of processes in play, including regional and social. The speaker’s project looked at emerging standards in four locations: York, Bristol, Coventry and Norwich (see http://www.emergingstandards.eu/) and looked at trade and migration as well as politics. The four locations are the largest outside of London in the period and are situated in different Middle English dialect areas. The project is investigating urban vernaculars using corpora of local texts that have been transcribed using the http://www.histei.info/p/home.html tool. The project looked at the replacement of –th with –s. London appears to be the primary centre for this, but –s is already used in the North before it appears in London. The speaker pointed out that the text type was important (e.g. private letters vs more public documents) and also that certain verb types (e.g. do and have) were much slower to adopt –s. The data suggested that York reached an –s majority the earliest, while Bristol only has –th up to about 1600 before becoming more mixed. Coventry also only has –th up to around 1600 and then just a few –s examples after this. The speaker noted how text type and verb type play a big role in this, as does scribal preference.
The next speaker discussed ‘Charting spelling variation and editorial reliability in English historical letters’, pointing out that while EEBO is a good resource for printed materials, there is a lack of resource for manuscripts. The speaker’s project wanted to see whether an edition-based corpus like CEEC (Corpus of Early English Correspondence, pronounced ‘seek’ – see http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/index.html) could be used to look at private spelling. CEEC contains manuscript texts from 1402-1800 and consists of 5.2 million words, and the speaker investigated how reliable the editions are and whether it’s possible to work around editorial changes. The speaker also mentioned an n-gram browser for EEBO that can be accessed here: https://earlyprint.wustl.edu/tooleebospellingbrowser.html. The speaker’s project looked at the variation of spelling of ‘u’ and ‘v’ – e.g. ‘use’ and ‘vse’, ‘above’ and ‘aboue’. The use of ‘u’ and ‘v’ appeared to be very different in the CEEC texts as opposed to EEBO, but maybe this was because of editorial changes. The speaker focussed on a smaller text that kept ‘u’ and ‘v’ use intact – the Electronic Text Edition of Depositions 1560-1760 (see http://www.engelska.uu.se/research/english-language/electronic-resources/english-witness/) that contained 267,000 words. The speaker discovered that ‘u’ and ‘v’ usage here matches EEBO, which suggests CEEC is not reliable for ‘u’ and ‘v’ recording. The speaker also looked at the use of ‘ie’ and ‘ei’ in words like ‘friend’ and compared this in EEBO, CEEC and the depositions and discovered a similar pattern. This has resulted in the ERRATAS project (https://tuhat.helsinki.fi/portal/files/91629680/ERRATAS_flyer.pdf) that aims to estimate the reliability of manuscript editions without going back to the manuscripts. A checklist of textual features was fed into an Access database and this is run against texts to see (for example) whether they have features you would expect from the 1600s. This then allows you to identify more authentic editions and to create sub-corpora only containing these. Unfortunately the sub-corpus still didn’t give good results for ‘u’ and ‘v’, and the speaker reckoned this was because editions can be classed as ‘really good’ if most features are highly rated but some are rated poorly. The speaker pointed out that all editions are eclectic. However, from looking at the depositions the speaker noted that ‘u’ and ‘v’ standardisation occurred later and took longer than in print, and that the same could be observed for ‘ie’ and ‘ei’ too.
The following speaker looked at ‘Verb inflection in the early editions of the book of good manners’ and gave an overview of current thinking on standardisation, namely that standardisation of orthography happened about 1650 due to the combined efforts of spelling reformers, grammarians, schoolmasters, and also printers. Printers included master printers, journeymen, compositors, booksellers and publishers. The written language (especially spelling) was standardised and optional variability was suppressed. The speaker pointed out that it has been claimed that the earliest printers were not able to regularise spelling, or were not interested in doing so because they were foreign or lacked education. However, maintaining flexibility was a good thing for printers. Printers also tended to imitate the spelling of important authors. The speaker’s project looked at levels of consistency in the third person singular verb ending (e.g. –eth, -ith) in different editions of the Book of Good Manners translated from French by three printers before 1500. The speaker found that Caxton only uses non-final –e and was consistent in his usage even though he was the earliest.
The last speaker of the day looked at ‘Regularisation in the Corpus of Early English Correspondence’ and how to define and quantify spelling variation. The speaker pointed out that variation is when there are multiple forms for one function, or more orthographical forms for one lexico-grammatical unit. The speaker investigated the ratio of the number of forms to units (i.e. types). However, looking at ratios doesn’t take into consideration the distribution of tokens. The speaker also wanted to calculate entropy – the measure of uncertainty. The higher the value the more variability there is. However, this needs to be weighted in the calculations otherwise values such as 98,1,1 will give the same figure as 50,25,25. The weighting was implemented by measuring the relative frequency of the types. The speaker also used CEEC for data, and pointed out that it is not lemmatised, but existing part of speech tagging helps. The speaker ended up with 250,000 forms and also metadata about writers – gender, recipient relationship, authenticity (whether an autograph or written by a scribe). The speaker also used the process of bootstrapping (see https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/) where a sample of data is taken and randomised and this is done 1000 times, with the same query run on each sample to see whether the results are the same each time. The speaker noted that there were not many female writers in the dataset and there is low reliability associated with small sample sizes. A way to get around this is to use different time periods to get similar sample sizes. Results include noting that there is more variability in letters by women, that women mostly send letters to family, and that autographs are more variable than using scribes.
Tuesday
On Tuesday I attended the morning sessions that were focussed on Scots. The first paper looked at ‘The emergence of the vernacular in 15th century Scottish legal texts’. The speaker stated that by the 15th century Scots had a distinct orthography that differed from the standard form that developed in Edinburgh in the 16th century, which was based on English. The speaker looked at legal texts as these are linguistically conservative and took three sources: court records from Aberdeen from 1398-1511 (which are the oldest and most complete run of civic records available), the ‘Common Buke’ from Haddington from 1423-1470 (Haddington was the fourth largest town in Scotland in the 15th century) and the Newburgh (in Fife) burgh court book from 1459-1479, looking at how the vernacular spread in these documents. The speaker pointed out that multilingualism was common in legal and other medieval texts, using a mixture of Latin and Scots, with abbreviations used that could actually be in either language. The speaker identified the ‘matrix text’ – the most common language in each document. For Aberdeen, entries in Scots increase over the 15th century while in Haddington about half of records are in Scots and it’s possible to identify two town clerks, one of whom uses more Latin. In the same period as the Aberdeen records there are far fewer entries in Scots. In Newburgh 98% of entries are in Scots while in Aberdeen in the same period less than a quarter are in Scots. The speaker stated that vernacularisation in Aberdeen happened later and slower. This might have been due to scribal preference and diachronic change. Aberdeen was at the periphery while the other locations were closer to Edinburgh where laws were passed. But there are also different proportions of Scots depending on the content too. The speaker concluded that geographical, socio-political and economic matters need to be taken into consideration.
The second speaker was my colleague Carole Hough, who talked about the REELS project. The focus was on the evidence for Old Northumbrian in the place-names. Berwickshire was settled from Northumbria by Old English speakers and Old Northumbrian is one of the least well documented varieties of Old English. The evidence for it is documentary, epigraphic and toponymic but the first two are very limited and come from a few mostly religious texts like Cædmon’s Hymn. There is little previous research on place-names and REELS is doing this. In the Dictionary of Old English (letters A-H) there are 269 headwords with Old Northumbrian evidence, mostly religious, and place-names can give a different balance. REELS has identified 82 Old Northumbrian terms and 12 personal names, mostly concrete nouns (66) that are landscape features, buildings, creatures and people. The speaker gave examples for each letter from A-H. E.g. ‘Auchencrow’ comes from ‘Aldengraue’ and is the earliest example of ‘olden’. ‘Bassendean’ is from bæc-stan and is the only example in Scotland. Chirnside is a ‘churn shaped hill’ and shows the metaphorical connection between containers and landscape. ‘Fast Castle’ is ‘fastcastell’ in its earliest form and means ‘fortified castle’. The use of ‘fast’ to mean ‘strong’ has an earliest source in DOST some 200 years later than the place-name evidence. Similarly, ‘Lennel’ comes from OE ‘hlæne’ meaning lean (so ‘poor quality land’) and this evidence is 300 years earlier than DOST records.
The next speaker gave a paper on ‘A quantitative analysis of socio-political change on 18th century Scots’, stating that anglicisation and revitalisation were strong at the same time. It was the time of the union of parliaments and the ‘age of politeness’ where people were keen to use ‘correct’ English forms rather than local forms, but it was also a time when there was a ‘vernacular backlash’ when certain speakers chose to use more Scots terms, e.g. Burns, and the development of Scottish Standard English which became equally acceptable in ‘polite’ use. It was also the time of the Jacobite risings, when people rejected the union, of public unrest, also of anti-Scots discrimination and radicalisation stimulated by the French and US revolutions. The speaker looked at the interaction between language and politics, both in general society and in authors. The speaker created a corpus using a subsection of the Corpus of Modern Scots Writing, which was stored in a Labbcat corpus. Scots and English words were identified based on spellings and words in the corpus were tagged. 770,000 tokens were tagged to allow frequencies of Scots usage to be investigated in combination with other factors such as political alignment. The speaker used statistical methods, namely ‘conditional trees’ (c-trees, see https://www.rdocumentation.org/packages/partykit/versions/1.2-2/topics/ctree) and ‘random forests’ (see https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd) with analysis carried out using ‘R’. The data was split across genre, publication place, profession, showing the percentage of Scots or English usage. The speaker discovered that genre was the strongest predictor – more Scots words were used in ‘creative’ works while more English was used in ‘professional’ texts. The birthplace of politicians was important too, with Glasgow born politicians using more Scots words. The ‘random forest’ was made up of about 1000 c-trees, with data split into multiple subsets and the same calculations were then run on each set.
The final speaker in the session looked at the ‘Loss and reinstatement of /r/ and /l/ in varieties of Scottish English’, with /r/ being investigated in present day sources and /l/ in sources from the 15th to 18th centuries. The speaker pointed out that Labov has said that linguistic processes now are the same as historical ones and the speaker wanted to see if this was the case. The speaker used the SCOTS corpus and the BBC voices recordings to investigate the decrease in rhoticity in Scottish middle class speakers at the start of the 20th century and an increase again from the 70s onwards. For /l/ vocalisation the speaker looked at 4 historical dictionaries. The speaker compared these two sounds because they are from the same sound class of liquids. Rhoticity was categorised into four types (tap/trill, approximant, zero and other) in a variety of contexts (e.g. before fricatives, before consonants) and the pilot study looked at 6 speakers born in 3 different decades. For /l/ vocalisation the speaker wanted to identify its use in words like ‘gold’, ‘folk’, ‘full’ and ‘pull’.
The second plenary talk was about how competition is central to language change. The speaker discussed the use of cobweb and spiderweb, and how the former appears to be declining as the latter increases. The speaker also noted that some other forms such as ‘about’ and ‘without’ also do the same but it’s not clear what they are being replaced with, or how things correlate. Sometimes new forms just fit in without replacing anything. The speaker categorised types of competition as ‘squirrel type’, as when grey squirrels introduced to Europe led to a major decrease in native red squirrels – in language change such ‘squirrel’ changes have a direct causality between competitors. However, there are also ‘salmon type’ changes too – where causality is indirect. The speaker suggested that phonetic, morphological and some grammatical changes are ‘squirrel type’ while most grammaticalisation and typological shifts are more ‘salmon type’. ‘Squirrel type’ dominates historical linguistics because it allows accountability, more comprehensive study and is a closed system. It’s easier to do statistical analysis and easier to demonstrate the effect of competition. However, it can downplay the effect of other things. As an example of ‘salmon type’ the speaker discussed the rise of ‘want to’, which is part of a trend of modals (e.g. may, can, could, shall) declining while semi-modal use is increasing (be going to, have got to, want to, need to). The speaker wanted to know whether the changes interact and looked at ‘want to’ and alternative expressions (will/would) over time.
The speaker looked at different translations of Don Quixote over time. There have been many English translations from 1612 onwards and the speaker wanted to see how the use of modals in the translations has changed across 8 translations (2 US, 6 UK) plus the original Spanish, which were added to a corpus, with each text comprising about 400,000 words. However, the speaker pointed out that the translators might not have translated independently and would also have had different aims. The speaker identified 912 ‘want to’ tokens and a sample of ‘will’ and ‘would’ and compared the corpus with the Brown corpus, and they both showed similar trends. The speaker visualised the various translations of ‘quiero’ via network diagrams, with nodes being translations and the connections demonstrating which translations are possible in the same passage. The diagram was then simplified, leaving out the search term and grouping lexically similar items. Weak ties were also excluded, as were semantically general words. The speaker then grouped translations by volition – strength of desire, the time lag between desire and attainment, barriers and likelihood of attainment. This allowed the speaker to gain an insight into the competitors of ‘want to’ and its evolving meaning. In the 18th century it signified low subject control and moderate / strong desire while in the 21st century it has high subject control and is verging on a future marker.
For ‘will’ and ‘would’ there is complex polysemy – meanings shade into each other and the speaker noted that the original Spanish can help to disambiguate. Volitional meanings are on the decline and this is faster for ‘will’ than ‘would’. The speaker then discussed whether there was any causality and stated that the decline in ‘will’ is unlikely to have been caused by ‘want to’ but for ‘would’ it’s less clear.
After lunch I headed into the session about lexicon and spelling. The first paper was given by two speakers and was about the ‘Semantic Distribution of Antedated senses in the OED and HT’. The speakers discussed the work that is going on with dating in OED3, for example how ‘Scotswoman’ in OED2 has a date of 1820 while in OED3 the date is 1522. Similarly ‘Scotchwoman’ has been revised from 1818 to 1623. As these are the only words in this particular category in the Historical Thesaurus this means the entire category has now been antedated. The speakers wanted to investigate whether certain semantic fields have been more greatly affected by antedating. The top 10 branches that have the most antedated senses include ‘trade’ where 56% of senses have been revised and ‘people’ where senses have been revised by an average of 42 years. The speakers then discussed how branches could be weighted by splitting the senses into 100-year chunks and then ranking them in each period. Using this method all the major antedated categories are within ‘The Social World’ (except for ‘People’), although the speakers pointed out that not all data has been linked yet. The ‘branches’ referred to correspond to the ‘Tier 2’ categories in the thesaurus and include everything below that; for example, there are 22 categories within ‘Trade and Finance’. These could then be arranged by antedated senses and their size could be compared. The category of ‘Money’ appeared to be important, with several senses antedated by more than 100 years. General patterns seemed to be that compound words and verbs with affixes were more likely to be antedated. The sources of antedatings were also discussed. For ‘people’ The Times was used, as were journals of anthropology. Most sources were also used in OED2 and earlier. For the antedatings of nations and ethnicities in Early Modern English 65% are from books in EEBO.
The next paper looked at lexical replacement, ‘From Eadig to Happy’. The speaker discussed how lexical replacement in Middle English happened gradually by layering – new layers continually emerge and can co-exist with old layers. ‘Eadig’ has 1650 occurrences in the DOE corpus and also meant ‘wealth’ as well as ‘happy’. In ME ‘edi’ has about 100 occurrences up to 1400 in the corpus of ME prose, with a shift from the more concrete ‘wealth’ to the more abstract ‘happy’. The speaker pointed out that the Old Norse ‘happ’ is the source of ‘happy’, originally meaning ‘good luck’ but developing a new adjectival meaning from the noun. OE also had ‘gehæppre’, meaning ‘handy’, while ‘hap’ in ME meant a person’s lot. The speaker stated that ‘happy’ is not a direct loanword but instead the form comes in and is adapted following English rules. ‘Lucky’ also replaced some senses of ‘happy’.
The final paper I attended this day looked at the use of <u> and <v> in early modern English manuscripts. There is alternation between these in this period. The speaker looked at the court documents of the Salem witch trials from 1692-3, as writers were slower to adopt the conventions set by printers. The Salem documents were written by members of the public rather than professionals and there were more than 200 scribes in over 1000 documents (none of them female). The speaker looked at mixed instances in transcriptions that retained the original spellings, looking at medial ‘u’, initial ‘v’ and final ‘u’. The speaker found that there were no ‘u’ forms in initial position where a capital was required. ‘u’ is the most common form, representing both ‘u’ and ‘v’; in final position ‘u’ completely dominates, while medial ‘u’ represents vowels. In compounds (e.g. ‘herevnto’) ‘u’ is only used once. The speaker created profiles for each scribe and noted that the age of the scribe also affected the pattern of use. The speaker also noted that lexical variation needs to be considered too – e.g. preceding letters affect use, such as ‘av’ or ‘au’. Finally, the speaker noted that modern spelling conventions were not firmly in place until the 1690s.
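The scribal profiling could be approximated with something as simple as the following Python sketch (hypothetical token lists; real transcriptions would of course be loaded from the Salem documents):

from collections import Counter

def uv_profile(words):
    # count <u> and <v> in initial, medial and final position
    profile = Counter()
    for w in words:
        for i, ch in enumerate(w.lower()):
            if ch in "uv":
                pos = "initial" if i == 0 else "final" if i == len(w) - 1 else "medial"
                profile[(ch, pos)] += 1
    return profile

scribe_tokens = {"scribe_01": ["vnto", "haue", "herevnto", "you"],
                 "scribe_02": ["upon", "very", "loue"]}
for scribe, words in scribe_tokens.items():
    print(scribe, dict(uv_profile(words)))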
Wednesday
On Wednesday I attended the workshop on investigating meaning, which was being led by the LinguisticDNA project that I was involved with. The first paper was ‘Distributions of concepts in the Old Bailey Voices Corpus’. The speaker pointed out that the concept of domain models of language has been around for more than 200 years, since ‘the alphabet of human thought’. When looking at historical texts there’s a challenge as there aren’t that many tagged ones to choose from. The Old Bailey corpus was good because it had lots of female participants and also many examples of speakers of lower social status. There is data from about 200,000 trials and around 134 million words and it’s possible to trace the speakers through the trials and see what happened to them. It’s also linguistically controlled as it’s one single genre. However, defendants don’t always speak due to ‘plea bargaining’ in later texts. The focus of the project was on 1800-1820 as there is more speech in this period. The texts also have a lot of metadata – information about offences, gender of speakers etc. Most offences are theft so the project focussed on this. The speech is also split by role – legal males (there are no legal females), plus non-legal males and females (these are witnesses and defendants). The project annotated the corpus using the SAMUELS tagger to get the concepts. The Spacy tagger (https://spacy.io/api/tagger) was also used to get part of speech. Analysis was then undertaken using Python Jupyter notebooks (http://jupyter.org/), which allowed complex searches to be created. The project identified the most frequent concepts for different speaker types, e.g. legal males and negative questioning. It was possible to find the most characteristic concepts and discover which concepts appear more frequently than would be expected. For non-legal women the most frequent concepts were relationships and household related things while for non-legal males it was activities outside the house. The project also looked at grammatical analysis – e.g. how people were described, and agency: which concepts were used in more active or passive constructs.
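The ‘more frequently than expected’ comparison boils down to observed versus expected counts; a minimal sketch with invented figures (not the project’s actual concept counts) might look like this:

# concept counts for one speaker group vs the corpus as a whole (hypothetical numbers)
group_counts = {"relationships": 420, "household": 310, "law": 90}
corpus_counts = {"relationships": 1200, "household": 900, "law": 2500}
group_total = sum(group_counts.values())
corpus_total = sum(corpus_counts.values())

for concept, observed in group_counts.items():
    expected = corpus_counts[concept] * group_total / corpus_total
    print(concept, round(observed / expected, 2))  # >1 means over-represented for this group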
The next paper was by the researchers who created the Bilingual Thesaurus of Medieval England, who I will be working with in the coming months. They looked at semantic shifts and why these occur, looking specifically at lexical borrowing. Middle English was chosen as there is lots of borrowing in this period, from Latin, Norse and French, and at different levels of the semantic hierarchy. The speakers looked at the technical register in ME and French borrowings – looking at more precise and specific terms using the bilingual thesaurus as a data source. They looked at specific domains, e.g. buildings, and needed to look up the hierarchy as well as at lower levels to see which senses broaden or narrow over time. The speakers stated that lexical borrowing is a trigger for semantic change and terms often start with a specific meaning and then broaden. The semantic hierarchy is useful to see the levels of borrowing and also patterns: shifts or obsolescence, the types of words borrowed, whether different parts of speech behave differently. The semantic hierarchy was based on the Historical Thesaurus but was not directly mapped, and OED regional usage labels were also used. For the pilot study the speakers focussed on polysemy and whether this might lead to a semantic shift. They discovered that polysemy was unevenly distributed – building terms had the most while food preparation had the least. They discovered that there is a link between borrowing and native obsolescence – the domains with the highest proportion of loanwords have the highest proportion of obsolete native terms (with obsolete meaning obsolete by modern times). The bilingual thesaurus’ categories fit into the HT’s tiers 3-7 and the speakers investigated whether the number of subcategories was a sign of polysemy and / or technicality. They discovered that the more items there were in a category, the more synonymous terms there were and the more likely a semantic shift was. Polysemy was identified via the OED and other dictionaries and the speakers noted that technical vocabulary shouldn’t have much polysemy as technical terms should be distinct.
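The borrowing/obsolescence comparison could be expressed with a small pandas sketch along these lines (the term list and the ‘loanword’ / ‘obsolete’ flags are purely illustrative assumptions):

import pandas as pd

terms = pd.DataFrame({
    "domain":   ["buildings", "buildings", "food", "food", "trade"],
    "loanword": [True, False, False, False, True],
    "obsolete": [False, True, False, True, False],
})
native = terms[~terms["loanword"]]
summary = pd.DataFrame({
    "prop_loanwords": terms.groupby("domain")["loanword"].mean(),
    "prop_native_obsolete": native.groupby("domain")["obsolete"].mean(),
})
print(summary)  # domains high on one measure tend to be high on the other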
The third paper in the session looked at the ‘Semantics of whorishness in Jacobean drama’. The speaker looked at the ‘city comedies’ to get a good sense of what Jacobean drama was like, looking at authors such as Jonson and Dekker, and not including Shakespeare. The speaker made a corpus from texts taken from the Visualising English Print project (http://graphics.cs.wisc.edu/WP/vep/) and identified about 17 plays, comprising about 1 million words. The speaker identified terms for ‘whore’ from the Historical Thesaurus of English that were in use at the time and used the ‘Ubiqu+Ity’ tool to generate stats (see https://vep.cs.wisc.edu/ubiq/). The speaker looked at several hierarchical levels of the HT, covering things like licentiousness and unchastity. About 1500 words and phrases were identified and of these 304 were current in the period 1546-1606. These were arranged into groups and the Ubiqu+Ity tool then generated graphs showing the use of the words across all of the plays. The speaker noted that there was no consistency of use across all of the plays – some have lots of words in one category but none in others – and there were some outliers; for example, the play ‘Roaring Girl’ has a character called ‘Moll’ and this is also a ‘whore word’ so results for this play were skewed. The speaker also looked at the context of the words and colour coded words in different categories to show their proximity, and also looked at other collocations such as the use of pronouns – e.g. ‘You whore’ vs ‘Son of a whore’.
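The counting step behind those graphs is essentially tallying HT category terms per play; a toy Python version (placeholder term lists and play texts, not the speaker’s actual data) might be:

import re
from collections import Counter

# hypothetical HT-derived term lists grouped by category
categories = {"licentiousness": ["wanton", "lewd"],
              "unchastity": ["whore", "harlot", "strumpet"]}

def category_counts(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    freqs = Counter(tokens)
    return {cat: sum(freqs[t] for t in terms) for cat, terms in categories.items()}

plays = {"The Roaring Girl": "full play text here ...", "Westward Ho": "..."}
for title, text in plays.items():
    print(title, category_counts(text))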
The final paper of the session looked at ‘Systematically detecting patterns of social, historical and linguistic change’. The speaker stated that to systematically detect linguistic change the work has to be undertaken computationally. This can be done via a logic-based approach – using AI to find answers to questions – but this is difficult with large historical texts. An alternative is a distributional approach – looking for words with similar distributional properties and seeing whether they have similar meanings. This works well with large corpora but there are problems with synonymy, typography and antonymy. The speaker stated that looking for co-occurrence, capturing associations via context windows, is another approach – looking at collocations and document classification. The speaker mentioned using the TOEFL word similarity tests (multiple choice – given a word there are four potential synonyms that a person / AI has to choose from). The speaker linked this into topic modelling too. The speaker used texts from EEBO and the CLMET corpus and ran these through the Mallet topic modelling tool (http://mallet.cs.umass.edu/topics.php) to generate a conceptual map. Distributional semantics for each text were plotted on a multi-axis map. If the angle of the line for two texts is similar then the texts can be said to be similar. The speaker looked at the concept of poverty in 8 novels by Dickens and generated a heatmap to show occurrences within the text as opposed to a wider corpus (CLMET). The speaker used ‘kernel density estimation’ (see https://mathisonian.github.io/kde/) to look at the semantic distances between words and developed a network map of the results.
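As an illustration of the kernel density estimation step, a short sketch using scipy (toy word positions, not the speaker’s data or code) might be:

import numpy as np
from scipy.stats import gaussian_kde

# relative positions at which a target word (e.g. 'poverty') occurs in a text (invented)
positions = np.array([0.05, 0.07, 0.08, 0.40, 0.41, 0.85])
kde = gaussian_kde(positions, bw_method=0.1)
grid = np.linspace(0, 1, 200)
density = kde(grid)   # high values = passages dense in the target concept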
I returned to the ‘Investigating Meaning’ session after the coffee break, and the first paper looked at ‘A network methods approach to exploring conceptual forms’. The speaker focussed on trying to define one or more ‘meanings’ using co-occurrence patterns to yield networks, or ‘constellations’, of interconnected ideas without necessarily requiring a central word or phrase. When looking at associations or co-occurrence the speaker counted co-occurrences and divided by the frequencies in the total corpus to see whether words co-occur more often than expected. The distance can be changed (e.g. sentence, whole document) and the process will still work, and can be used to create network diagrams. The speaker stated it is then possible to compare different network maps and the process can also be continued without a central word. You can start with a ‘seed word’ and track the associations over time – some go, others come in and the seed word itself can also go. For example, looking at the networks associated with ‘grievances’ from 1800-1960, the central term disappears by 1920 but other connections remain. The speaker then asked what you can do with networks other than look at them, and discussed quantitative techniques for understanding political concepts in a linguistic context, focussing on ‘cliques’ – subnetworks in a larger network – and how frequently you have to pass through a node to reach another. These can be tracked over time without needing to worry about individual words. For example, ‘dissipation’ is not a current word for ‘drunkenness’ but it did appear in different periods. The speaker pointed out that it is possible to work out the relative strength of cliques using ‘betweenness centrality’ (see https://www.sci.unich.it/~francesc/teaching/network/betweeness.html), which works out centrality based on the shortest paths between nodes. Nodes that are the links between clusters are conceptually critical and have ‘high betweenness’. The speaker demonstrated a tool for displaying centrality and cliques which can currently be accessed here: http://54.194.211.202:3838/viewer-0-9/
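The clique and centrality measures mentioned are straightforward to compute with networkx; here’s a small sketch on a toy co-occurrence network (the edges are invented):

import networkx as nx

G = nx.Graph()
G.add_edges_from([("grievances", "petition"), ("grievances", "redress"),
                  ("petition", "parliament"), ("redress", "parliament"),
                  ("parliament", "taxation")])

cliques = list(nx.find_cliques(G))          # maximal cliques (subnetworks within the network)
centrality = nx.betweenness_centrality(G)   # nodes linking clusters score highly ('high betweenness')
print(sorted(centrality.items(), key=lambda kv: -kv[1]))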
The next speaker discussed ‘Mapping Discursive Concepts’, and presented an overview of some of the outputs of the LinguisticDNA project that I was involved with. The speaker stated that the project looked at working out meaning in Early Modern English texts via lexical co-occurrence – looking at every word in every text in EEBO-TCP (60,000 texts and 1 billion words). Lemmas were pre-processed with the MorphAdorner tool (http://morphadorner.northwestern.edu/morphadorner/) and then co-occurrences within a window of 50 tokens either side of a word were examined to identify trios rather than pairs of co-occurring terms. For example, if ‘diversity’ and ‘opinion’ are found together, what are the third terms that appear with them (e.g. ‘religion’)? This resulted in billions of trios in CSV files, which were analysed for statistical significance. The project developed a public interface (https://www.dhi.ac.uk/ldna/) that features noun lemmas that appear at least 5000 times, with pairs occurring at least 500 times and trios at least 50. The speaker stated that it is possible to find the prominent trios in subsets of texts, for example sermons, and to look at strength of association and unusualness. The speaker stated that this is different from topic modelling as the project is not categorising texts and is looking at co-occurrences within a window of 100 words, so it’s more focussed. The project is looking at identifying typical and atypical trios – working out what is weak and what is strong. This is calculated using a variant of PMI (pointwise mutual information) and looking at the range of differences. It’s possible to look at words that have a small number of pairs but a large number of trios (or vice-versa) and to map these out. The project still intends to make visualisations available and to link to semantic and pragmatic features. Expanding beyond trios to quads and more is also an option.
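To make the trio idea concrete, here is a rough and purely illustrative Python sketch of window-based trio extraction and a PMI-style score – an assumption about how such a measure could look, not the project’s actual pipeline:

import math
from collections import Counter
from itertools import combinations

def extract_trios(lemmas, window=50):
    # count trios of lemmas that co-occur within the window following each token
    trios = Counter()
    for i, w in enumerate(lemmas):
        span = set(lemmas[i + 1:i + 1 + window]) - {w}
        for pair in combinations(sorted(span), 2):
            trios[tuple(sorted((w,) + pair))] += 1
    return trios

def pmi_style(trio, trio_count, unigrams, total):
    # compare the observed trio frequency with what independent unigrams would predict
    expected = math.prod(unigrams[w] / total for w in trio)
    return math.log2((trio_count / total) / expected)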
The final speaker in the workshop also represented the Linguistic DNA project and discussed the ‘construction of co-occurrence clusters’. The speaker covered some of the same ground as the previous speaker and discussed the public interface the project is developing. The speaker pointed out that the interface contains a ‘stop list’ of words that are too frequent, such as ‘God, man, thing, Christ’. These all appear between 1.7 million and 6.5 million times and so would have swamped everything else. The speaker gave some examples of how searching for trios and pairs could work, for example looking at the trios that are linked to ‘life’ and ‘death’ – e.g. ‘body, soul, heaven, earth’.
Wednesday’s plenary speaker was Marc Alexander, who gave a wonderfully entertaining talk on ‘lexicalisation pressure’. The talk was focussed on the Historical Thesaurus, and more specifically on a number of the visualisations that I’d been involved in creating, so it was particularly nice for me to see everything being discussed. But as I already knew a lot about what was being discussed I didn’t make particularly copious notes. The speaker pointed out the important fact that there are no exact synonyms in the HT categories – all of the words contained therein have subtle differences. When discussing the numbers of words that are added to English over time, the speaker discussed the concept of ‘churn’ – periods where the total number of words doesn’t seem to change, but only because the number of words lost balances out the number of words gained. The speaker also pointed out that the importance of categories can’t necessarily be ascertained by the size of the category, as firstly the OED sometimes over-represents minor things, and also some concepts naturally have only a few words – e.g. there is only really one word for ‘terrorist’ but terrorism as a concept is important in modern times. The speaker also discussed ‘density’ – working out important word forms based on whether the word is reused in the same semantic field, for example reusing a noun as a verb (fish, record), or being used in compounds or different parts of speech, e.g. ‘run’ appearing 10 times within the category ‘swiftness’.
Thursday
On the final day of the conference I spent the entire day in the workshop on ‘Visualisations in historical linguistics’. The first speaker discussed ‘Visualising the interaction between grammar and style’. The speaker discussed using correspondence analysis (see http://www.mathematica-journal.com/2010/09/an-introduction-to-correspondence-analysis/) in order to visualise frequency counts graphically. The speaker mentioned the importance of reducing multidimensionality in order to make it easier to understand data – to bring variation in the data down to something that can appear on a biplot (a two-variable scatterplot). The speaker also discussed distance matrices (the distance between rows and columns in a table) and statistical approaches that can be used to analyse the data, e.g. chi-squared and weighted Euclidean distances. The speaker used R and the CA package for R (see http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/), running this on a manually transcribed set of 13 horse manual texts. The speaker created a correspondence plot using the Shiny package for R (https://shiny.rstudio.com/) with axes showing the degree of variation. The speaker also looked at Ælfric’s texts to see how exemplary they are of Old English using a similar method.
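For anyone curious what correspondence analysis actually does, the core arithmetic can be sketched in a few lines of numpy (the speaker used the CA package in R; the contingency table here is made up):

import numpy as np

N = np.array([[20, 5, 3], [4, 18, 6], [2, 7, 25]], dtype=float)   # texts x features (invented counts)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                                # row and column masses
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = np.diag(1 / np.sqrt(r)) @ U * sing                    # principal coordinates for the biplot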
The second paper was a discussion of HistoBankVis (http://subva.dbvis.de/histobankvis-v1.0/), a tool that’s been in development for the past few years. The speaker asked how useful visual analytic approaches are. In Historical Linguistics data tends to be high dimensional and contains subspaces, which makes it an interesting challenge for computer scientists. Linguists are often not good at looking at lists of numbers, so visualisations can help, and real-time visual analytics allow hypotheses to be tested immediately. The speaker looked at subject case and word order in the history of Icelandic using the IcePaHC corpus (https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)). The corpus was uploaded to the tool, the features the researcher is interested in can then be picked out, and the results can be visualised through the interface. The speaker demonstrated some histograms, bar charts and heatmaps, which were generated using chi-squared and Euclidean distance statistical methods to identify distances that are of statistical significance. Features could also be visualised using the parallel sets technique (see https://www.jasondavies.com/parallel-sets/) and the interface allows the user to drag and drop sections, change colours, and cite particular views of the data.
The third paper discussed ‘Visualising ‘excrescent’ <t> and <t> deletion in fifteenth century Scots’. This was part of the FITS project (http://www.amc.lel.ed.ac.uk/fits/) and looked at the relationships between sound and spelling, using data from the Linguistic Atlas of Older Scots 1360-1500. The project wanted to uncover what phonological facts underlie the diversity of spelling in Scots, and developed a series of ‘triads’: a pre-Scots sound (e.g. OE [i]), an Old Scots sound (e.g. OSc [I]) and an Old Scots spelling unit (e.g. OSc <y>), for example ‘fysch’ for ‘fish’. A search page allows you to search various forms, spellings, tokens and sources (http://www.amc.lel.ed.ac.uk/fits/search.html) and the project developed the ‘Medusa’ visualisation to represent interconnected graphemes, which can currently be accessed here: http://www.amc.lel.ed.ac.uk/fits/fits-display-synchronic-data3.html.
The fourth paper discussed ‘Stylo visualisations of Middle English Documents’. Stylo is a script for R that was primarily developed for authorship attribution (see https://eadh.org/projects/stylo-r-package). It establishes links between texts via multiple sweeps rather than being based on just one similarity. The project used Stylo and also exported the data for use in Gephi. The project used the MELD corpus (https://www.uis.no/research/history-languages-and-literature/the-mest-programme/a-corpus-of-middle-english-local-documents-meld/meld-files/) of texts from 1400-1525. The texts had been localised extralinguistically and covered lots of different counties. The speaker noted that as there is lots of spelling variation in ME, word n-grams are not much use, so the speaker used character n-grams instead. These were generated for each text and the scores for each text were then compared. Each ME text then had a unique set of character n-grams – the ‘spelling fingerprint’ of the text, like a DNA code. The speaker used trigrams as these gave the best resolution of the data. The trigrams respected word boundaries and the speaker picked out the 500 most frequent trigrams per text. The first attempt at visualisation used one line per text and different colours for each county. The focus was on genre and county – e.g. ‘letters’ in four counties. The speaker noted that letter genres (e.g. conveyances) appeared in the same area, probably due to their standardised vocabulary. The speaker then simplified the visualisations by focussing on trigrams with frequencies of between 50 and 200 and joined the texts from each county together, reducing the number to 40. The speaker demonstrated how Northern texts are generally separate from Southern texts. Data was then imported into Gephi to look more at the network connections – looking at the strength of links. The speaker noted that ‘all trigrams lead to Warwickshire’, and other features such as texts from the East Riding being different to those from the West and North Ridings. The speaker pointed out, however, that the number of texts per county is not the same and the lengths of the texts are different. This can affect the relationships and skew the figures.
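The ‘spelling fingerprint’ step is easy to sketch: character trigrams that respect word boundaries, keeping the most frequent per text (illustrative Python, not the speaker’s R/Stylo workflow):

from collections import Counter

def char_trigrams(text, top_n=500):
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"                 # underscores mark word boundaries
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return dict(counts.most_common(top_n))

# texts with overlapping trigram profiles can then be compared, or the overlaps exported to Gephi
print(char_trigrams("the yeire of oure lord"))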
Thursday’s plenary then followed, which was about ‘a typology of syntactic change in Postcolonial Englishes’. It focussed on language change in Indian and Singapore English and discussed why some variations die out and there is stabilisation over time. Both India and Singapore were colonies from around 1830 and were multilingual, but there are many differences between the two countries in terms of length of contact and the interaction of other languages. There has also been a decline in English since independence and no historical corpora are available. The speaker discussed examples of usage from both countries and discussed how these have developed.
After lunch I returned to the visualisation workshop for the next paper on ‘Fingerprinting historical texts’. The paper discussed the Text Variation Explorer version 2 (TVE2) – see http://www.uta.fi/sis/tauchi/virg/projects/dammoc/tve.html, although this appears to only be about TVE1. It’s a free and open source tool for visualising text, allowing you to gain an overview of the texts and spot variation; visualisations make such patterns much easier for the brain to comprehend. The tool splits texts into fragments of equal size and extracts hapax legomena, type / token ratios, average word length and suchlike. Results are then displayed in a stacked area chart. Features such as the most frequent words in each fragment can be passed through principal component analysis (see http://setosa.io/ev/principal-component-analysis/) to show clusters. The TVE2 interface allows users to drag and drop files into the system, and metadata files containing any information you want to search on can also be uploaded. As a case study the speaker discussed the Leicester letters collection – letters between Queen Elizabeth and her advisors in 1585/6, comprising 65,000 words. Using the tool the speaker demonstrated how personal pronouns could be extracted, how clusters of words could be generated and how the most frequent words could be viewed.
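The fragment-level measures TVE2 extracts can be sketched as follows (a minimal Python version, assuming a simple token list rather than the tool’s own file formats):

from collections import Counter

def fragment_features(tokens, size=1000):
    features = []
    for i in range(0, len(tokens), size):
        frag = tokens[i:i + size]
        freqs = Counter(frag)
        features.append({
            "ttr": len(freqs) / len(frag),                      # type / token ratio
            "hapax": sum(1 for c in freqs.values() if c == 1),  # hapax legomena
            "avg_word_len": sum(len(t) for t in frag) / len(frag),
        })
    return features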
The following talk was about ‘Visualising semantic category development using the Historical Thesaurus of English’. This was presented by Fraser Dallachy and I was a co-author of the paper. The session involved a discussion of the new types of visualisations that we have added to the HT in the past year or so: The sparklines and heatmaps, timelines and mini-timelines. There’s not much more for me to say here as there’s already plenty about such matters in other posts of mine.
The next speaker discussed ‘how to visualise high-dimensional data’, and gave an overview of data structures, such as vectors, matrices and manifolds. The speaker stated that when dealing with multi-dimensional data you either need to reduce the dimensionality to 3 or less in order to make the data comprehensible, or to use cluster analysis. The speaker also discussed principal component analysis, as an earlier speaker had done.
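As a quick example of the dimensionality reduction discussed, scikit-learn’s PCA will project high-dimensional feature vectors down to two components for plotting (placeholder data):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(20, 50)                        # 20 texts, 50 features (placeholder data)
coords = PCA(n_components=2).fit_transform(X)     # 2-D coordinates suitable for a scatterplot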
The next paper was on the subject of ‘Mapping Language Change’, with mapping here meaning actual maps. The speaker noted that static maps are problematic as they only give us a single snapshot in time, and using dynamic maps via GIS approaches is something that is so far underused in Historical Linguistics. The speaker used QGIS (https://qgis.org/en/site/), a desktop GIS package, throughout the talk, showing how maps can be set up with different layers that can be selected or deselected, how metadata can be incorporated, how spatial analysis could be used and how linguistic data could be linked to archaeological data. For example, the use of kinship terms can be linked to the paths of rivers and railways. The speaker also pointed out some of the problems of using historical data with GIS: small datasets, limited numbers of texts, unequal distribution (both temporal and spatial), uncertainty over the location of texts, and restricted metadata. The speaker illustrated how GIS could be used in Historical Linguistics by using the Index of Sources to the Linguistic Atlas of Middle English (http://www.lel.ed.ac.uk/ihd/laeme2/laeme2_framesZ.html). It was demonstrated how this could be mapped onto a map of the dioceses of England to show distances between monastic houses, to show how these relate to the principal towns, plotting medieval roads etc.
The final paper of the day was ‘Creating interactive visualisations of big datasets to explore the re-emergence of initial /h/’. The speaker stated that initial <h> was lost in Middle English and the project identified the use of ‘a’ and ‘an’ as a diagnostic, looking at collocates of these in a huge dataset, namely the Google Books n-gram corpus. This contains 4.5 million books and 468 billion words. It’s possible to download all the bigrams from this, and the speaker identified 362 high frequency bigram types of the form ‘an h…’ and ‘a h…’, representing 219 million bigram tokens. These were split into 5 categories and covered 5 centuries based on the publication date of the books. The project differentiated native words (happy, hand) from words borrowed from French, Latin, Greek and other languages. Some Norman words were also Germanic borrowings (e.g. hamlet). Some preserved a mute ‘h’ (e.g. hour, honest) and these were taken out. The speaker also looked at stress – long and short vowels following an ‘h’. The project developed an interface using the Shiny library that presented a series of settings on the left and line graphs in the main window. This allowed a variety of searches to be visualised, e.g. the proportion of ‘an’ over decades, borrowed words vs Germanic words. The speaker noted that the proportion of ‘an’ use decreases over time for both, but it is higher for borrowed words. Using the interface it is possible to zoom into sections of the graph and turn lines on or off. It’s also possible to select individual words to compare, e.g. hypocritical, hypocrisy, hypocrite, and you can move the mouse over the curve to view the underlying data. The data pre-processing was done in Python using a 5.2GB data file from Google Books.
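The core diagnostic – the proportion of ‘an’ before h-words per decade – can be sketched with a few lines of pandas (hypothetical counts standing in for the Google Books data):

import pandas as pd

bigrams = pd.DataFrame({
    "decade": [1750, 1750, 1850, 1850],
    "det":    ["an", "a", "an", "a"],
    "count":  [900, 300, 400, 1600],
})
totals = bigrams.pivot_table(index="decade", columns="det", values="count", aggfunc="sum")
totals["prop_an"] = totals["an"] / (totals["an"] + totals["a"])   # falls over time as /h/ re-emerges
print(totals)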
And with that the conference ended. It was a hugely interesting and useful four days, but also pretty exhausting, especially factoring in all of the commuting. I’m really glad I attended and I feel like I’ve learned a lot. But it will be nice to return to normality next week.