I attended ICEHL (the International Conference on English Historical Linguistics) in Edinburgh this week (see http://www.conferences.cahss.ed.ac.uk/icehl20/). It was a pretty intense conference, running from 9-5 each day with up to 8 parallel sessions and workshops running in addition to plenaries, drinks receptions and a lovely conference dinner in the Playfair Library in the Old College. As Glasgow and Edinburgh are so geographically close I’d decided that rather than staying in a hotel I’d commute through each day, which turned out to be a bit of a mistake, as my door to door commute was two hours each way, which was pretty exhausting. I did for a time live in Glasgow and work at Edinburgh University so I should really have known better, but I guess I’d just blocked the horrendousness of the commute out of my mind.
Anyway, my blog this week is really just going to be a summary of some of the papers I saw at ICEHL. Although pretty much all of them were full of interesting stuff not all of them were especially relevant to my own particular field of Digital Humanities, so I’ll try to focus more on those that did have a larger DH component. These were mostly all grouped into a day-long workshop that took place on the last day of the conference, with the theme of ‘Visualisations in Historical Linguistics’. I contributed to a paper that was given by Fraser on visualisations in the Historical Thesaurus during this workshop too.
Monday started with a plenary session about the ‘irregularisation’ of verbs in Early Modern English. The speaker showed some mathematical formulae that could be used to test for this, and showed how rules for predicting which past tense verb forms will be acquired during native language acquisition could be established. After this I attended a paper on ‘The lemmatisation of Old English class VII strong verbs on a lexical database’. The speaker discussed the Nerthus project (http://www.nerthusproject.com/) which has about 30,000 records of OE words, with data taken from many OE dictionaries. It includes alternative spellings and forms, the part of speech and other such data. The project incorporated three million files into a database and lemmatised words using a lemmatiser called Norna. The database itself is based on Filemaker. I also attended a paper on ‘Ambiguity resolution and the evolution of homophones in English’. This used the CELEX corpus (https://catalog.ldc.upenn.edu/LDC96L14) and the speaker said that in her sample about 22% of the data were homophones. These included diatones, where the noun and verb are spelled the same but stress is used to differentiate them (for example ‘contract’). The speaker showed visualisations generated by near infrared spectroscopy of the brain that showed the optical paths in the brain when diatones were spoken. This showed that different pathways were active when the noun or the verb form was heard.
I also attended a series of papers as part of the ‘Standardisation after Caxton’ workshop, which I found very interesting, even though it wasn’t massively connected to Digital Humanities. There was a handy introductory session where the speaker discussed Haugen’s standardisation model, of codification, elaboration, selection and acceptance (e.g. see https://courses.nus.edu.sg/course/elltankw/history/Standardisation/B.htm). The speaker pointed out that standardisation was already under way before Caxton and previous studies have identified four types, of which three are all London based. However, we need to consider what’s going on beyond London and also consider multilingual factors, such as the influence of French. The speaker also pointed out that the convention is that variation ended ‘soon’ after Caxton, but that this might actually be as late as the 1800s, and that variation persists, especially in handwritten materials (and this continues to the present day). The speaker gave the example of alchemical works, which tended to be handwritten as they were illegal and contain much variation in spelling. Issues such as whether materials were private or public also need to be considered, as do social factors, so standardisation cannot be purely looked at as based on geography.
The next speaker gave a paper on this particular subject: ‘Broadening the horizon of the written Standard English debate: a view beyond the metropolis’. The speaker reiterated that the established view was that the standard developed from government (chancery) scribes in London, but that this view is now being challenged, and that a standard wouldn’t develop from one place. The speaker argued that there are a variety of processes in play, including regional and social. The speaker’s project looked at emerging standards in four locations: York, Bristol, Coventry and Norwich (see http://www.emergingstandards.eu/) and looked at trade and migration as well as politics. The four locations are the largest outside of London in the period and are situated in different Middle English dialect areas. The project is investigating urban vernaculars using corpora of local texts that have been transcribed using the http://www.histei.info/p/home.html tool. The project looked at the replacement of –th with –s. London appears to be the primary centre for this, but –s is already used in the North before it appears in London. The speaker pointed out that the text type was important (e.g. private letters vs more public documents) and also that certain verb types (e.g. do and have) were much slower to adopt –s. The data suggested that York is a –s majority the earliest while Bristol only has –th up to about 1600 before becoming more mixed. Coventry also only has –th up to around 1600 and then just a few –s examples after this. The speaker noted how text type and verb type play a big role in this, as does scribal preference.
The next speaker discussed ‘Charting spelling variation and editorial reliability in English historical letters’, pointing out that while EEBO is a good resource for printed materials, there is a lack of resources for manuscripts. The speaker’s project wanted to see whether an edition based corpus like CEEC (Corpus of Early English Correspondence, pronounced ‘seek’ – see http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/index.html) could be used to look at private spelling. CEEC contains manuscript texts from 1402-1800 and consists of 5.2 million words, and the speaker investigated how reliable the editions are and whether it’s possible to work around editorial changes. The speaker also mentioned an n-gram browser for EEBO that can be accessed here: https://earlyprint.wustl.edu/tooleebospellingbrowser.html. The speaker’s project looked at the variation of spelling of ‘u’ and ‘v’ – e.g. ‘use’ and ‘vse’, ‘above’ and ‘aboue’. The use of ‘u’ and ‘v’ appeared to be very different in the CEEC texts as opposed to EEBO, but maybe this was because of editorial changes. The speaker focussed on a smaller text that kept ‘u’ and ‘v’ use intact – the Electronic Text Edition of Depositions 1560-1760 (see http://www.engelska.uu.se/research/english-language/electronic-resources/english-witness/) that contained 267,000 words. The speaker discovered that ‘u’ and ‘v’ usage here matches EEBO, which suggests CEEC is not reliable for ‘u’ and ‘v’ recording. The speaker also looked at the use of ‘ie’ and ‘ei’ in words like ‘friend’ and compared this in EEBO, CEEC and the depositions and discovered a similar pattern. This has resulted in the ERRATAS project (https://tuhat.helsinki.fi/portal/files/91629680/ERRATAS_flyer.pdf) that aims to estimate the reliability of manuscript editions without going back to the manuscripts.
A checklist of textual features was fed into an Access database and this is used to run against text to see (for example) if the texts have features you would expect from the 1600s. This then allows you to identify more authentic editions and to create sub-corpora only containing these. Unfortunately the sub-corpus still didn’t give good results for ‘u’ and ‘v’, and the speaker reckoned this was because editions can be classed as ‘really good’ if most features are highly rated but some are rated poorly. The speaker pointed out that all editions are eclectic. However, from looking at the depositions the speaker noted that ‘u’ and ‘v’ standardisation occurred later and took longer than in print, and that the same could be observed for ‘ie’ and ‘ei’ too.
The following speaker looked at ‘Verb inflection in the early editions of the book of good manners’ and gave an overview of current thinking of standardisation, namely that standardisation of orthography happened about 1650 due to the combined efforts of spelling reformers, grammarians, schoolmasters, and also printers. Printers included master printers, journeymen, compositors, booksellers and publishers. The written language (especially spelling) was standardised and optional variability was suppressed. The speaker pointed out that it has been claimed that the earliest printers were not able to regularise spelling, or were not interested in doing so because they were foreign or lacked education. However, maintaining flexibility was a good thing for printers. Printers also tended to imitate the spelling of important authors. The speaker’s project looked at levels of consistency in the third person singular verb ending (e.g. –eth, -ith) in different editions of the Book of Good Manners translated from French by three printers before 1500. The speaker found that Caxton only uses non-final –e and was consistent in his usage even though he was the earliest.
The last speaker of the day looked at ‘Regularisation in the Corpus of Early English Correspondence’ and how to define and quantify spelling variation. The speaker pointed out that variation is when there are multiple forms for one function, or more orthographical forms for one lexico-grammatical unit. The speaker investigated the ratio of the number of forms to units (i.e. types). However, looking at ratios doesn’t take into consideration the distribution of tokens. The speaker also wanted to calculate entropy – the measure of uncertainty. The higher the value the more variability there is. However, this needs to be weighted in the calculations otherwise values such as 98,1,1 will give the same figure as 50,25,25. The weighting was implemented by measuring the relative frequency of the types. The speaker also used CEEC for data, and pointed out that it is not lemmatised, but existing part of speech tagging helps. The speaker ended up with 250,000 forms and also metadata about writers – gender, recipient relationship, authenticity (whether an autograph or written by a scribe). The speaker also used the process of bootstrapping (see https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/) where a sample of data is taken and randomised and this is done 1000 times, with the same query run on each sample to see whether the results are the same each time. The speaker noted that there were not many female writers in the dataset and there is low reliability associated with small sample sizes. A way to get around this is to use different time periods to get similar sample sizes. Results include noting that there is more variability in letters by women, that women mostly send letters to family, and that autographs are more variable than using scribes.
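To make the entropy and bootstrapping points a bit more concrete, here is a minimal sketch in Python of how I understand the general idea. It is not the speaker's actual code: the function names and the toy spelling counts are my own inventions for illustration, and I am assuming the standard form of bootstrapping (resampling with replacement) and Shannon entropy over the relative frequencies of the spelling variants.

```python
import math
import random
from collections import Counter


def spelling_entropy(counts):
    """Shannon entropy (in bits) of the spelling variants of one unit.

    `counts` holds the token frequency of each variant; the relative
    frequencies act as the weighting mentioned in the talk.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def bootstrap_entropy(tokens, n_resamples=1000, seed=0):
    """Rough 95% interval for the entropy via resampling with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(tokens, k=len(tokens))
        estimates.append(spelling_entropy(Counter(sample).values()))
    estimates.sort()
    lower = estimates[int(0.025 * n_resamples)]
    upper = estimates[int(0.975 * n_resamples) - 1]
    return lower, upper


# Both units have three variants (same type ratio) but very different spread.
skewed = [98, 1, 1]    # one dominant spelling
spread = [50, 25, 25]  # genuinely variable spelling

print(spelling_entropy(skewed))
print(spelling_entropy(spread))

# Hypothetical token list for one unit, made up for illustration.
tokens = ["vse"] * 98 + ["use"] + ["vze"]
print(bootstrap_entropy(tokens))
```

A plain ratio of forms to units treats the two toy distributions identically, whereas the entropy figure is much lower for the first one, which is the behaviour the speaker was after. With few tokens the bootstrap interval widens, which ties in with the point about small samples being less reliable.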
On Tuesday I attended the morning sessions that were focussed on Scots. The first paper looked at ‘The emergence of the vernacular in 15th century Scottish legal texts’. The speaker stated that by the 15th century Scots had a distinct orthography that differed from the standard form that developed in Edinburgh in the 16th century, which was based on English. The speaker looked at legal texts as these are linguistically conservative and took three sources: court records from Aberdeen from 1398-1511 (which are the oldest and most complete run of civic records available), the ‘Common Buke’ from Haddington from 1423-1470 (Haddington was the fourth largest town in Scotland in the 15th century) and the Newburgh (in Fife) burgh court book from 1459-1479, looking at how the vernacular spread in these documents. The speaker pointed out that multilingualism was common in legal and other medieval texts, using a mixture of Latin and Scots, with abbreviations used that could actually be in either language. The speaker identified the ‘matrix text’ – the most common language in each document. For Aberdeen, entries in Scots increase over the 15th century while in Haddington about half of records are in Scots and it’s possible to identify two town clerks, one of whom uses more Latin. In the same period the Aberdeen records contain far fewer entries in Scots. In Newburgh 98% of entries are in Scots while in Aberdeen in the same period less than a quarter are in Scots. The speaker stated that vernacularisation in Aberdeen happened later and slower. This might have been due to scribal preference and diachronic change. Aberdeen was at the periphery while the other locations were closer to Edinburgh where laws were passed. But there are also different proportions of Scots depending on the content too. The speaker concluded that geographical, socio-political and economic matters need to be taken into consideration.
The second speaker was my colleague Carole Hough, who talked about the REELS project. The focus was on the evidence for Old Northumbrian in the place-names. Berwickshire was settled from Northumbria by Old English speakers and Old Northumbrian is one of the least well documented varieties of Old English. The evidence for it is documentary, epigraphic and toponymic but the first two are very limited and come from a few mostly religious texts like Cædmon’s Hymn. There is little previous research on place-names and REELS is doing this. In the Dictionary of Old English (letters A-H) there are 269 headwords with Old Northumbrian evidence, mostly religious, and place-names can give a different balance. REELS has identified 82 Old Northumbrian terms and 12 personal names, mostly concrete nouns (66) that are landscape features, buildings, creatures and people. The speaker gave examples for each letter from A-H. E.g. ‘Auchencrow’ comes from ‘Aldengraue’ and is the earliest example of ‘olden’. ‘Bassendean’ is from bæc-stan and is the only example in Scotland. Chirnside is a ‘churn shaped hill’ and shows the metaphorical connection between containers and landscape. ‘Fast Castle’ is ‘fastcastell’ in its earliest form and means ‘fortified castle’. The use of ‘fast’ to mean ‘strong’ has an earliest source in DOST some 200 years later than the place-name evidence. Similarly, ‘Lennel’ comes from OE ‘hlæne’ meaning lean (so ‘poor quality land’) and this evidence is 300 years earlier than DOST records.
The next speaker gave a paper on ‘A quantitative analysis of socio-political change on 18th century Scots’, stating that anglicisation and revitalisation were strong at the same time. It was the time of the union of parliaments and the ‘age of politeness’ where people were keen to use ‘correct’ English forms rather than local forms, but it was also a time when there was a ‘vernacular backlash’ when certain speakers chose to use more Scots terms, e.g. Burns, and the development of Scottish Standard English which became equally acceptable in ‘polite’ use. It was also the time of the Jacobite risings, when people rejected the union, of public unrest, also of anti-Scots discrimination and radicalisation stimulated by the French and US revolutions. The speaker looked at the interaction between language and politics, both in general society and in authors. The speaker created a corpus using a subsection of the Corpus of Modern Scots Writing, which was stored in a Labbcat corpus. Scots and English words were identified based on spellings and words in the corpus were tagged. 770,000 tokens were tagged to allow frequencies of Scots usage to be investigated in combination with other factors such as political alignment. The speaker used statistical methods, namely ‘conditional trees’ (c-trees, see https://www.rdocumentation.org/packages/partykit/versions/1.2-2/topics/ctree) and ‘random forests’ (see https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd) with analysis carried out using ‘R’. The data was split across genre, publication place, profession, showing the percentage of Scots or English usage. The speaker discovered that genre was the strongest predictor – more Scots words were used in ‘creative’ works while more English was used in ‘professional’ texts. The birthplace of politicians was important too, with Glasgow born politicians using more Scots words.
The ‘random forest’ was made up of about 1000 c-trees, with data split into multiple subsets and the same calculations were then run on each set.
The final speaker in the session looked at the ‘Loss and reinstatement of /r/ and /l/ in varieties of Scottish English’, with /r/ being investigated in present day sources and /l/ in sources from the 15th to 18th centuries. The speaker pointed out that Labov has said that linguistic processes now are the same as historical ones and the speaker wanted to see if this was the case. The speaker used the SCOTS corpus and the BBC voices recordings to investigate the decrease in rhoticity in Scottish middle class speakers at the start of the 20th century and an increase again from the 70s onwards. For /l/ vocalisation the speaker looked at 4 historical dictionaries. The speaker compared these two sounds because they are from the same sound class of liquids. Rhoticity was categorised into four types (tap/trill, approximant, zero and other) in a variety of contexts (e.g. before fricatives, before consonants) and the pilot study looked at 6 speakers born in 3 different decades. For /l/ vocalisation the speaker wanted to identify its use in words like ‘gold’, ‘folk’, ‘full’ and ‘pull’.
The second plenary talk was about how competition is central to language change. The speaker discussed the use of cobweb and spiderweb, and how the former appears to be declining as the latter increases. The speaker also noted that some other forms such as ‘about’ and ‘without’ also do the same but it’s not clear what they are being replaced with, or how things correlate. Sometimes new forms just fit in without replacing anything. The speaker categorised types of competition as ‘squirrel type’, as when grey squirrels introduced to Europe led to a major decrease in native red squirrels – in language change such ‘squirrel’ changes have a direct causality between competitors. However, there are also ‘salmon type’ changes too – where causality is indirect. The speaker suggested that phonetic, morphological and some grammatical changes are ‘squirrel type’ while most grammaticalisation and typological shifts are more ‘salmon type’. ‘Squirrel type’ dominates historical linguistics because it allows accountability, more comprehensive study and is a closed system. It’s easier to do statistical analysis and easier to demonstrate the effect of competition. However, it can downplay the effect of other things. As an example of ‘salmon type’ the speaker discussed the rise of ‘want to’, which is part of a trend of modals (e.g. may, can, could, shall) declining while semi-modal use is increasing (be going to, have got to, want to, need to). The speaker wanted to know whether the changes interact and looked at ‘want to’ and alternative expressions (will/would) over time.
The speaker looked at different translations of Don Quixote over time. There have been many English translations from 1612 onwards and the speaker wanted to see how the use of modals in the translations has changed through 8 translations (2 US, 6 UK) plus the original Spanish, which were added to a corpus, with each text comprising about 400,000 words. However, the speaker pointed out that the translators might not have translated independently and would also have had different aims. The speaker identified 912 ‘want to’ tokens and a sample of ‘will’ and ‘would’ and compared the corpus with the Brown corpus, and they both showed similar trends. The speaker visualised the various translations of ‘quiero’ via network diagrams, with nodes being translations and the connections demonstrating which translations are possible in the same passage. The diagram was then simplified, leaving out the search term and grouping lexically similar items. Weak ties were also excluded, as were semantically general words. The speaker then grouped translations by volition – strength of desire, the time lag between desire and attainment, barriers and likelihood of attainment. This allowed the speaker to gain an insight into the competitors of ‘want to’ and its evolving meaning. In the 18th century it signified low subject control and moderate / strong desire while in the 21st century it has high subject control and is verging on a future marker.
For ‘will’ and ‘would’ there is complex polysemy – meanings shade into each other and the speaker noted that the original Spanish can help to disambiguate. Volitional meanings are on the decline and this is faster for ‘will’ than ‘would’. The speaker then discussed whether there was any causality and stated that the decline in ‘will’ is unlikely to have been caused by ‘want to’ but for ‘would’ it’s less clear.
After lunch I headed into the session about lexicon and spelling. The first paper was given by two speakers and was about the ‘Semantic Distribution of Antedated senses in the OED and HT’. The speakers discussed the work that is going on with dating in OED3, for example how ‘Scotswoman’ in OED2 has a date of 1820 while in OED3 the date is 1522. Similarly ‘Scotchwoman’ has been revised from 1818 to 1623. As these are the only words in this particular category in the Historical Thesaurus this means the entire category has now been antedated. The speakers wanted to investigate whether certain semantic fields have been more greatly affected by antedating. The top 10 branches that have the most antedated senses include ‘trade’ where 56% of senses have been revised and ‘people’ where senses have been revised an average of 42 years. The speakers then discussed how branches could be weighted by splitting the senses into 100 year chunks and then ranked in each period. Using this method all the major antedated categories are within ‘The Social World’ (except for ‘People’), although the speakers pointed out that not all data has been linked yet. The ‘branches’ referred to correspond to the ‘Tier 2’ categories in the thesaurus and include everything below that, for example, there are 22 categories within ‘Trade and Finance’. These could then be arranged by antedated senses and their size could be compared. The category of ‘Money’ appeared to be important, with several senses antedated by more than 100 years. General patterns seemed to be that compound words and verbs with affixes were more likely to be antedated. The sources of antedatings were also discussed. For ‘people’ The Times was used, as were journals of anthropology. Most sources were also used in OED2 and earlier. For the antedatings of nations and ethnicities in Early Modern English 65% are from books in EEBO.
The next paper looked at lexical replacement, ‘From Eadig to Happy’. The speaker discussed how lexical replacement in Middle English happened gradually by layering – new layers continually emerge and can co-exist with old layers. ‘Eadig’ has 1650 occurrences in the DOE corpus and meant ‘wealth’ as well as ‘happy’. In ME ‘edi’ has about 100 occurrences up to 1400 in the corpus of ME prose, with a shift from more concrete ‘wealth’ to more abstract ‘happy’. The speaker pointed out that the Old Norse ‘happ’ is the source of ‘happy’, originally meaning ‘good luck’ but developing a new adjectival meaning from the noun. OE also had ‘gehæppre’, meaning ‘handy’ while ‘hap’ in ME meant a person’s lot. The speaker stated that ‘happy’ is not a direct loanword but instead the form comes in and is adapted following English rules. ‘Lucky’ also replaced some senses of ‘happy’.
The final paper I attended this day looked at the use of <u> and <v> in early modern English manuscripts. There is alternation between these in this period. The speaker looked at the court documents of the Salem witch trials from 1692-3 as writers were slower to adopt the conventions set by printers. The Salem documents were written by members of the public rather than professionals and there were more than 200 scribes in over 1000 documents (but no females). The speaker looked at mixed instances in transcriptions that retained the original spellings, looking at medial ‘u’, initial ‘v’ and final ‘u’. The speaker found that there were no ‘u’ forms in initials that should be capitals. ‘u’ is the most common and represents both ‘u’ and ‘v’ and in the final position ‘u’ completely dominates while medial ‘u’ represents vowels. In compounds (e.g. ‘herevnto’) ‘u’ is only used once. The speaker created profiles for each scribe and noted that the age of the scribe also affected the pattern of use. The speaker also noted that lexical variation also needs to be considered – e.g. preceding letters affect use, such as ‘av’ or ‘au’. Finally, the speaker noted that modern spelling conventions were not firmly in place until the 1690s.
On Wednesday I attended the workshop on investigating meaning, which was being led by the LinguisticDNA project that I was involved with. The first paper was ‘Distributions of concepts in the Old Bailey Voices Corpus’. The speaker pointed out that the concept of domain models of language has been around for more than 200 years since ‘the alphabet of human thought’. When looking at historical texts there’s a challenge as there aren’t that many tagged ones to choose from. The Old Bailey corpus was good because it had lots of female participants and also many examples of lower social speakers. There is data from about 200,000 trials and around 134 million words and it’s possible to trace the speakers through the trials and see what happened to them. It’s also linguistically controlled as it’s one single genre. However, defendants don’t always speak due to ‘plea bargaining’ in later texts. The focus of the project was on 1800-1820 as there is more speech in this period. The texts also have a lot of metadata – information about offences, gender of speakers etc. Most offences are theft so the project focussed on this. The speech is also split by role – legal males (there are no legal females), plus non-legal males and females (these are witnesses and defendants). The project annotated the corpus using the SAMUELS tagger to get the concepts. The Spacy tagger (https://spacy.io/api/tagger) was also used to get part of speech. Analysis was then undertaken using Python Jupyter notebooks (http://jupyter.org/), which allowed complex searches to be created. The project identified the most frequent concepts for different speaker types, e.g. legal males and negative questioning. It was possible to find the most characteristic concepts and discover what concepts appear more frequently than would be expected. For non-legal women the most frequent concepts were relationships and household related things while for non-legal males it was activities outside the house.
The project also looked at grammatical analysis – e.g. how people were described, and agency: what concepts were used in more active or passive constructs.
The next paper was by the researchers who created the Bilingual Thesaurus of Medieval England, who I will be working with in the coming months. They looked at semantic shifts and why these occur, looking specifically at lexical borrowing. Middle English was chosen as there is lots of borrowing in this period, from Latin, Norse and French, and at different levels of the semantic hierarchy. The speakers looked at the technical register in ME and French borrowings – looking at more precise and specific terms using the bilingual thesaurus as a data source. They looked at specific domains, e.g. buildings, and needed to look up the hierarchy as well as at lower levels to see which senses broaden or narrow over time. The speakers stated that lexical borrowing is a trigger for semantic change and terms often start with a specific meaning and then broaden. The semantic hierarchy is useful to see the levels of borrowing and also patterns: shifts or obsolescence, the types of words borrowed, whether different parts of speech behave differently. The semantic hierarchy was based on the Historical Thesaurus but was not directly mapped, and OED regional usage labels were also used. For the pilot study the speakers focussed on polysemy and whether this might lead to a semantic shift. They discovered that polysemy was unevenly distributed – building terms had the most while food preparation had the least. They discovered that there is a link between borrowing and native obsolescence – the domains with the highest proportion of loanwords have the highest proportion of obsolete native terms (with obsolete meaning obsolete by modern times). The bilingual thesaurus’ categories fit into the HT’s tiers 3-7 and the speakers investigated whether the number of subcategories was a sign of polysemy and / or technicality. They discovered that the greater the number of items in a category, the more synonymous terms there were and the more likely a semantic shift would be.
Polysemy was identified via the OED and other dictionaries and the speakers noted that technical vocabulary shouldn’t have much polysemy as technical terms should be distinct.
The third paper in the session looked at the ‘Semantics of whorishness in Jacobean drama’. The speaker looked at the ‘city comedies’ to get a good sense of what Jacobean drama was like, looking at authors such as Jonson and Dekker, and not including Shakespeare. The speaker made a corpus from texts taken from the Visualising English Print project (http://graphics.cs.wisc.edu/WP/vep/) and identified about 17 plays, comprising about 1 million words. The speaker identified terms for ‘whore’ from the Historical Thesaurus of English that were in use at the time and used the ‘Ubiqu+Ity’ tool to generate stats (see https://vep.cs.wisc.edu/ubiq/). The speaker looked at several hierarchical levels of the HT, covering things like licentiousness and unchastity. About 1500 words and phrases were identified and from these 304 were current in the period 1546-1606. These were arranged into groups and the Ubiqu+Ity tool then generated graphs showing the use of the words across all of the plays. The speaker noted that there was no consistency of use across all of the plays – some have lots of words in one category but none in others – and there were some outliers, for example the play ‘Roaring Girl’ has a character called ‘Moll’ and this is also a ‘whore word’ so results for this play were skewed. The speaker also looked at the context of the words and colour coded words in different categories to show their proximity, and also looked at other collocations such as the use of pronouns – e.g. ‘You whore’ vs ‘Son of a whore’.
The final paper of the session looked at ‘Systematically detecting patterns of social, historical and linguistic change’. The speaker stated that to systematically detect linguistic change this has to be undertaken computationally. This can be by a logic based approach – using AI to find answers to questions – but this is difficult with large historical texts. An alternative is a distributional approach – looking for words with similar distributional properties and seeing whether they have similar meanings. This works well with large corpora but there are problems with synonymy, typography and antonymy. The speaker stated that looking for co-occurrence, to capture associations via context windows, is another approach – looking at collocations and document classification. The speaker mentioned using the TOEFL word similarity tests (multiple choice – given a word there are four potential synonyms that a person / AI has to choose from). The speaker linked this into topic modelling too. The speaker used texts from EEBO and the CLMET corpus and ran these through the Mallet topic modelling tool (http://mallet.cs.umass.edu/topics.php) to generate a conceptual map. Distributional semantics for each text were plotted on a multi-axis map. If the angle of the line for two texts is similar then the texts can be said to be similar. The speaker looked at the concept of poverty in 8 novels by Dickens and generated a heatmap to show occurrences within the text as opposed to a wider corpus (CLMET). The speaker used ‘kernel density estimation’ (see https://mathisonian.github.io/kde/) to look at the semantic distances between words and developed a network map of the results.
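The ‘angle of the line’ idea is usually expressed as cosine similarity, so here is a small Python sketch of it. This is only my illustration and not the speaker's pipeline: I am assuming raw word counts as the axes (the talk used topic-model output from Mallet), and the function name and the two example snippets are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two texts' word-count vectors.

    Each distinct word is one axis; 1.0 means the vectors point the same
    way and 0.0 means the texts share no words at all.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical snippets, purely for illustration.
first = "the poor were sent to the workhouse"
second = "the workhouse took in the poor"
print(cosine_similarity(first, second))
```

Because only the direction of each vector matters, a long novel and a short pamphlet with the same proportions of words would come out as very similar, which is why the angle rather than the raw distance is used.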
I returned to the ‘Investigating Meaning’ session after the coffee break, and the first paper looked at ‘A network methods approach to exploring conceptual forms’. The speaker focussed on defining one or more ‘meanings’ using co-occurrence patterns to yield networks, or ‘constellations’ of interconnected ideas without necessarily requiring a central word or phrase. When looking at associations or co-occurrence the speaker counted co-occurrences and divided by the frequencies in the total corpus to see if they are more likely to occur. The distance can be changed (e.g. sentence, whole document) and the process will still work, and can be used to create network diagrams. The speaker stated it is then possible to compare different network maps and the process can also be continued without a central word. You can start with a ‘seed word’ and track the associations over time – some go, others come in and the seed word itself can also go. E.g. looking at the networks associated with ‘grievances’ from 1800-1960 the central term goes by 1920 but other connections remain. The speaker then asked what you can do with networks other than look at them. The speaker then discussed quantitative techniques for understanding political concepts in a linguistic context, focussing on ‘cliques’ – subnetworks in a larger network and how frequently you have to pass through a node to reach another. These can be tracked over time and there is no need to focus on individual words. For example, ‘dissipation’ is not a current word for ‘drunkenness’ but it did appear in different periods. The speaker pointed out that it is possible to work out the relative strength of cliques using ‘betweenness centrality’ (see https://www.sci.unich.it/~francesc/teaching/network/betweeness.html), which works out centrality based on the shortest path between nodes. Nodes that are the links between clusters are conceptually critical and have ‘high betweenness’.
The speaker demonstrated a tool for displaying centrality and cliques which can currently be accessed here: http://22.214.171.124:3838/viewer-0-9/
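For anyone wanting to see how ‘betweenness’ works in practice, below is a small sketch of my own in plain Python. It uses Brandes’ algorithm on an unweighted, undirected network, which is one standard way of computing betweenness centrality; I don’t know which implementation the speaker’s tool uses. The network in the usage note is a made-up toy example, not data from the talk.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality via Brandes' algorithm.

    `graph` maps each node to the set of its neighbours and is assumed
    to be unweighted and undirected (every edge listed in both directions).
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []                           # nodes in order of discovery
        preds = {v: [] for v in graph}       # predecessors on shortest paths
        sigma = {v: 0 for v in graph}        # number of shortest paths
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                         # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:              # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                         # accumulate from the furthest node back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair has been counted once from each end.
    return {v: c / 2 for v, c in score.items()}
```

As a usage example, take two tight clusters of three words each, joined only through a single linking word: `betweenness({"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"}, "x": {"c", "d"}, "d": {"x", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}})`. The linking node `x` gets the highest score because every shortest path between the two clusters has to pass through it, which is the ‘high betweenness’ the speaker described.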
The next speaker discussed ‘Mapping Discursive Concepts’, and presented an overview of some of the outputs of the Linguistic DNA project that I was involved with. The speaker stated that the project looked at working out meaning in Early Modern English text via lexical co-occurrence – looking at every word in every text in EEBO-TCP (60,000 texts and 1 billion words). Lemmas were pre-processed with the MorphAdorner tool (http://morphadorner.northwestern.edu/morphadorner/) and then co-occurrences within a window of 50 tokens either side of a word were looked at to identify trios rather than pairs of co-occurring words. For example, if ‘diversity’ and ‘opinion’ are found together what are the third terms that appear with them? (e.g. ‘religion’) This resulted in billions of trios in CSV files, which were analysed for statistical significance. The project developed a public interface (https://www.dhi.ac.uk/ldna/) that features noun lemmas that appear at least 5000 times, with pairs occurring at least 500 times and trios at least 50. The speaker stated that it is possible to find the prominent trios in subsets of texts, for example sermons, and to look at strength of association and unusualness. The speaker stated that this is different from topic modelling as the project is not categorising texts and is looking at co-occurrences within a window of 100 words so it’s more focussed. The project is looking at identifying typical and atypical trios – working out what is weak and what is strong. This is calculated using a variant of PMI (Pointwise mutual information) and looking at the range of differences. It’s possible to look at words that have a small number of pairs but a large number of trios (or vice-versa) and to map these out. The project still intends to make visualisations available and to link to semantic and pragmatic features. Expanding beyond trios to quads and more is also an option.
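As a rough illustration of the trio idea, here is a short Python sketch of my own. It is emphatically not the project’s pipeline: the way trios are counted (each word plus every pair of distinct words found in its window) and the three-way PMI formula are both assumptions on my part, since the talk only said that ‘a variant of PMI’ was used. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_trios(tokens, window=50):
    """Count unordered trios of distinct lemmas.

    For each node word we collect the distinct words within `window`
    tokens either side, then record the node plus every pair from that
    context. Assumption: a trio is counted once per node position.
    """
    trios = Counter()
    for i, node in enumerate(tokens):
        start = max(0, i - window)
        context = set(tokens[start:i] + tokens[i + 1:i + 1 + window])
        context.discard(node)
        for a, b in combinations(sorted(context), 2):
            trios[frozenset((node, a, b))] += 1
    return trios


def trio_pmi(trio, trios, tokens):
    """One possible three-way PMI: log2 of observed over expected probability.

    Assumption: the expected probability treats the three words as
    independent, using their relative frequencies in `tokens`.
    """
    if trios[trio] == 0:
        return float("-inf")  # the trio was never observed
    freq = Counter(tokens)
    p_joint = trios[trio] / sum(trios.values())
    p_independent = math.prod(freq[w] / len(tokens) for w in trio)
    return math.log2(p_joint / p_independent)
```

With a real corpus you would pass in the lemmatised tokens of a text and then filter the resulting trios by frequency thresholds of the kind the public interface uses, before ranking what is left by the association score.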
The final speaker in the workshop also represented the Linguistic DNA project and discussed the ‘construction of co-occurrence clusters’. The speaker covered some of the same ground as the previous speaker and discussed the public interface the project is developing. The speaker pointed out that the interface contains a ‘stop list’ of words that are too frequent, such as ‘God, man, thing, Christ’. These each appear between 1.7 million and 6.5 million times and so would have swamped everything else. The speaker gave some examples of how searching for trios and pairs could work, for example looking at the trios that are linked to ‘life’ and ‘death’ – e.g. ‘body, soul, heaven, earth’.
Wednesday’s plenary speaker was Marc Alexander, who gave a wonderfully entertaining talk on ‘lexicalisation pressure’. The talk was focussed on the Historical Thesaurus, and more specifically on a number of the visualisations that I’d been involved in creating, so it was particularly nice for me to see everything being discussed. But as I already knew a lot about what was being discussed I didn’t make particularly copious notes. The speaker pointed out the important fact that there are no exact synonyms in the HT categories – all of the words contained therein have subtle differences. When discussing the numbers of words that are added to English over time, the speaker discussed the concept of ‘churn’ – periods where the total number of words doesn’t seem to change, but only because the number of words lost balances out the number of words gained. The speaker also pointed out that the importance of categories can’t necessarily be ascertained by the size of the category, as firstly the OED sometimes over-represents minor things, and also some concepts naturally have only a few words – e.g. there is only really one word for ‘terrorist’ but terrorism as a concept is important in modern times. The speaker also discussed ‘density’ – working out important word forms based on whether the word is reused in the same semantic field, for example reusing a noun as a verb (fish, record), or being used in compounds or different parts of speech, e.g. ‘run’ appearing 10 times within the category ‘swiftness’.
On the final day of the conference I spent the entire day in the workshop on ‘Visualisations in historical linguistics’. The first speaker discussed ‘Visualising the interaction between grammar and style’. The speaker discussed using correspondence analysis (see http://www.mathematica-journal.com/2010/09/an-introduction-to-correspondence-analysis/) in order to visualise frequency counts geographically. The speaker mentioned the importance of reducing multidimensionality in order to make it easier to understand data – to bring variation in the data down to something that can appear on a biplot (a two-variable scatterplot). The speaker also discussed distance matrices (the distance between rows and columns in a table) and statistical approaches that can be used to analyse the data, e.g. chi-squared and weighted Euclidean distances. The speaker used R and the CA package for R (see http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/), running this on a manually transcribed set of 13 horse manual texts. The speaker created a correspondence plot using the Shiny package for R (https://shiny.rstudio.com/) with axes showing the degree of variation. The speaker also looked at Ælfric’s texts to see how exemplary they are of Old English using a similar method.
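The chi-squared (weighted Euclidean) distance that underpins correspondence analysis is simple enough to sketch. The speaker worked in R; the Python below is just my own illustration of the standard formula for the distance between two rows of a frequency table, where each row is first turned into a profile (proportions) and each column is weighted by the inverse of its overall share. The table layout (rows as texts, columns as features) is an assumption for the example.

```python
import math


def chi_squared_distance(table, row_a, row_b):
    """Chi-squared distance between two rows of a frequency table.

    `table` is a list of rows of counts. Assumption: every row and
    every column has a non-zero total, otherwise a profile or a column
    weight would involve division by zero.
    """
    grand_total = sum(sum(row) for row in table)
    # Column masses: each column's share of all the counts.
    col_mass = [sum(col) / grand_total for col in zip(*table)]
    # Row profiles: each row's counts as proportions of the row total.
    profile_a = [x / sum(table[row_a]) for x in table[row_a]]
    profile_b = [x / sum(table[row_b]) for x in table[row_b]]
    return math.sqrt(
        sum((a - b) ** 2 / m for a, b, m in zip(profile_a, profile_b, col_mass))
    )
```

Two texts that use the features in the same proportions come out at a distance of zero regardless of how long each text is, which is what makes this measure useful for comparing texts of different lengths.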
The second paper was a discussion of HistoBankVis (http://subva.dbvis.de/histobankvis-v1.0/), a tool that’s been in development for the past few years. The speaker asked how useful visual analytic approaches are. In Historical Linguistics data tends to be high dimensional and contains subspaces, which makes it an interesting challenge for computer scientists. Linguists are often not good at looking at lists of numbers so visualisations can help and real-time visual analytics allow hypotheses to be tested immediately. The speaker looked at subject case and word order in the history of Icelandic using the IcePaHC corpus (https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)). The corpus was uploaded to the tool and the features the researcher is interested in can then be picked out, then the results can be visualised through the interface. The speaker demonstrated some histograms, bar charts and heatmaps, which were generated using chi-squared and Euclidean distance statistical methods to identify distances that are of statistical significance. Features could also be visualised using the parallel sets technique (see https://www.jasondavies.com/parallel-sets/) and the interface allows the user to drag and drop sections, change colours and cite particular views of the data.
The third paper discussed ‘Visualising ‘excrescent’ <t> and <t> deletion in fifteenth century Scots’. This was part of the FITS project (http://www.amc.lel.ed.ac.uk/fits/) and looked at the relationships between sound and spelling, using data from the Linguistic Atlas of Older Scots 1360-1500. The project wanted to uncover what phonological facts underlie the diversity of spelling in Scots, and developed a series of ‘triads’: a pre-Scots sound (e.g. OE [i]), an Old Scots sound (e.g. OSc [I]) and an Old Scots spelling unit (e.g. OSc <y>), for example ‘fysch’ for ‘fish’. A search page allows you to search various forms, spellings, tokens and sources (http://www.amc.lel.ed.ac.uk/fits/search.html) and the project developed the ‘Medusa’ visualisation to represent interconnected graphemes, which can currently be accessed here: http://www.amc.lel.ed.ac.uk/fits/fits-display-synchronic-data3.html.
The fourth paper discussed ‘Stylo visualisations of Middle English Documents’. Stylo is a script for R that was primarily developed to determine authorship attribution (see https://eadh.org/projects/stylo-r-package). It establishes links between texts via multiple sweeps rather than being based on just one similarity. The project used Stylo and also exported the data for use in Gephi. The project used the MELD corpus (https://www.uis.no/research/history-languages-and-literature/the-mest-programme/a-corpus-of-middle-english-local-documents-meld/meld-files/) of texts from 1400-1525. Texts had been localised extralinguistically and came from lots of different counties. The speaker noted that as there is lots of spelling variation in ME, word n-grams are not much use. Therefore the speaker used character n-grams. These were generated for each text and the scores for each text were then compared. Each ME text then had a unique set of character n-grams – the ‘spelling fingerprint’ of the text, like DNA code. The speaker used trigrams as these gave the best resolution of the data. The trigrams respected word boundaries and the speaker picked out the 500 most frequent trigrams per text. The first attempt at visualisation used one line per text and different colours for each county. The focus was on genre and county – e.g. ‘letters’ in four counties. The speaker noted that letter genres (e.g. conveyances) appeared in the same area, probably due to their standardised vocabulary. The speaker then simplified the visualisations by focussing on trigrams with frequencies of between 50 and 200 and joined texts from each county together, reducing the number to 40. The speaker demonstrated how Northern texts are generally separate from Southern texts. Data was then imported into Gephi to look more at the network connections – looking at the strength of links.
The speaker noted that ‘all trigrams lead to Warwickshire’, and other features such as texts from the East Riding being different to those from the West and North Ridings. The speaker pointed out, however, that the number of texts per county is not the same and the lengths of the texts are different. This can affect the relationships and skew the figures.
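The ‘spelling fingerprint’ is easy to sketch in a few lines. The speaker used Stylo in R; what follows is only my own Python approximation of the general idea. I have assumed that ‘respecting word boundaries’ means a trigram never spans two words, and that tokenisation is a simple lowercase-and-split, neither of which was spelled out in the talk.

```python
from collections import Counter


def spelling_fingerprint(text, n=3, top=500):
    """Return the `top` most frequent character n-grams in a text.

    Assumption: n-grams are taken within words only, so no n-gram
    crosses a space, and words shorter than `n` contribute nothing.
    """
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts.most_common(top)
```

Running this over each text gives a list of (trigram, frequency) pairs per text, and it is these lists that would then be compared between texts or counties. As the speaker cautioned, raw frequencies depend on text length, so some form of normalisation would be needed before comparing texts of very different sizes.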
Thursday’s plenary then followed, which was about ‘a typology of syntactic change in Postcolonial Englishes’. It focussed on language change in Indian and Singapore English and discussed why some variations die out and there is stabilisation over time. Both India and Singapore were colonial from around 1830 and were multilingual, but there are many differences between the two countries in terms of length of contact and the interaction of other languages. There is also now a decline in English after independence and no historical corpora are available. The speaker discussed examples of usage from both countries and discussed how these have developed.
After lunch I returned to the visualisation workshop, for the next paper on ‘Fingerprinting historical texts’. The paper discussed the Text Variation Explorer version 2 (TVE2) – see http://www.uta.fi/sis/tauchi/virg/projects/dammoc/tve.html, although this appears to only be about TVE1. It’s a free and open source tool for visualising text, allowing you to gain an overview of the texts and spot variation, on the basis that visualisations are easier for the brain to comprehend. The tool splits texts into fragments of equal size and extracts hapax legomena, type / token ratios, average word length and such things. Results are then displayed in a stacked area chart. Features such as the most frequent words in each fragment can be passed through principal component analysis (see http://setosa.io/ev/principal-component-analysis/) to show clusters. The TVE2 interface allows users to drag and drop files into the system and metadata files containing any information you want to search on can also be uploaded. As a case study the speaker discussed the Laycester letters collection – letters between Queen Elizabeth and her advisors in 1585/6, comprising 65,000 words. Using the tool the speaker demonstrated how personal pronouns could be extracted, how clusters of words could be generated and how the most frequent words could be viewed.
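The per-fragment measures mentioned here are straightforward to compute, so here is a small Python sketch of my own showing the general idea. This is not TVE2’s code: the whitespace tokenisation, the default fragment size and the decision to drop any leftover tokens at the end (so that every fragment really is the same size) are all my assumptions.

```python
from collections import Counter


def fragment_stats(text, size=1000):
    """Split a text into equal-sized fragments and describe each one.

    Returns one dict per fragment with the number of hapax legomena
    (words occurring exactly once in the fragment), the type / token
    ratio and the average word length.
    """
    tokens = text.lower().split()
    stats = []
    # Step through complete fragments only; a short remainder is ignored.
    for start in range(0, len(tokens) - size + 1, size):
        fragment = tokens[start:start + size]
        counts = Counter(fragment)
        stats.append({
            "hapaxes": sum(1 for c in counts.values() if c == 1),
            "type_token_ratio": len(counts) / len(fragment),
            "avg_word_length": sum(len(t) for t in fragment) / len(fragment),
        })
    return stats
```

Plotting each of these values fragment by fragment gives the kind of overview of variation through a text that the tool’s charts provide.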
The following talk was about ‘Visualising semantic category development using the Historical Thesaurus of English’. This was presented by Fraser Dallachy and I was a co-author of the paper. The session involved a discussion of the new types of visualisations that we have added to the HT in the past year or so: The sparklines and heatmaps, timelines and mini-timelines. There’s not much more for me to say here as there’s already plenty about such matters in other posts of mine.
The next speaker discussed ‘how to visualise high-dimensional data’, and gave an overview of data structures, such as vectors, matrices and manifolds. The speaker stated that when dealing with multi-dimensional data you need either to reduce the dimensionality to three or fewer in order to make the data comprehensible, or to use cluster analysis. The speaker also discussed principal component analysis, as an earlier speaker had done.
The next paper was on the subject of ‘Mapping Language Change’, with mapping here meaning actual maps. The speaker noted that static maps are problematic as they only give us a single snapshot in time, and using dynamic maps via GIS approaches is something that is so far underused in Historical Linguistics. The speaker used QGIS (https://qgis.org/en/site/), a desktop GIS package, throughout the talk, showing how maps can be set up with different layers that can be selected or deselected, how metadata can be incorporated, how spatial analysis could be used and how linguistic data could be linked to archaeological data. For example, the use of kinship terms can be linked to the paths of rivers and railways. The speaker also pointed out some of the problems of using historical data with GIS: small datasets, limited number of texts, unequal distribution (both temporal and spatial), uncertainty of the location of texts, restricted metadata. The speaker illustrated how GIS could be used in Historical Linguistics by using the Index of Sources to the Linguistic Atlas of Middle English (http://www.lel.ed.ac.uk/ihd/laeme2/laeme2_framesZ.html). It was demonstrated how this could be mapped onto a map of the dioceses of England to show distances between monastic houses, to show how these relate to the principal towns, to plot medieval roads and so on.
The final paper of the day was ‘Creating interactive visualisations of big datasets to explore the re-emergence of initial /h/’. The speaker stated that initial <h> was lost in Middle English and the project identified the use of ‘a’ and ‘an’ as a diagnostic, looking at collocates of this in a huge dataset, namely the Google Books n-gram corpus. This contains 4.5 million books and 468 billion words. It’s possible to download all the bigrams from this, and the speaker identified 362 high frequency bigrams ‘an h…’ and ‘a h…’, consisting of 219 million bigrams. These were split into 5 categories and covered 5 centuries based on the publication date of the books. The project differentiated native words (happy, hand) versus borrowed words from French, Latin, Greek and other languages. Some Norman words were also Germanic borrowings (e.g. hamlet). Some preserved mute ‘h’ (e.g. hour, honest) and these were taken out. The speaker also looked at stress – long and short vowels following an ‘h’. The project developed an interface using the Shiny library that presented a series of settings on the left and line graphs in the main window. This allowed a variety of searches to be visualised, e.g. the proportion of ‘an’ over decades, borrowed words vs Germanic words. The speaker noted that the proportion of ‘an’ use decreases over time for both, but it is higher for borrowed words. Using the interface it is possible to zoom into sections of the graph and turn lines on or off. It’s also possible to select individual words to compare, e.g. hypocritical, hypocrisy, hypocrite, and you can move the mouse over the curve to view the underlying data. The data pre-processing was done in Python using a 5.2GB data file from Google Books.
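Since the pre-processing was done in Python, here is a guess at what the core calculation might look like. This is my own minimal sketch, not the speaker’s code: it assumes the bigram data has already been read into (bigram, year, count) tuples, and the function name is hypothetical. It works out, per decade, what proportion of ‘a h…’ / ‘an h…’ bigrams use ‘an’.

```python
from collections import defaultdict


def an_proportion_by_decade(rows):
    """Proportion of 'an' among 'a h...' / 'an h...' bigrams per decade.

    `rows` is an iterable of (bigram, year, count) tuples, for example
    ("an historical", 1850, 12). Anything that is not a two-word bigram
    of 'a' or 'an' plus an h-initial word is skipped.
    """
    an_counts = defaultdict(int)
    totals = defaultdict(int)
    for bigram, year, count in rows:
        parts = bigram.lower().split()
        if len(parts) != 2:
            continue
        article, word = parts
        if article not in ("a", "an") or not word.startswith("h"):
            continue
        decade = year // 10 * 10   # e.g. 1856 -> 1850
        totals[decade] += count
        if article == "an":
            an_counts[decade] += count
    # Skip decades whose total count is zero to avoid dividing by zero.
    return {d: an_counts[d] / totals[d] for d in sorted(totals) if totals[d]}
```

The resulting decade-to-proportion mapping is the sort of series that could be drawn as one line on the graphs described above; splitting the input rows into borrowed and native words beforehand would give the two lines the speaker compared.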
And with that the conference ended. It was a hugely interesting and useful four days, but also pretty exhausting, especially factoring in all of the commuting. I’m really glad I attended and I feel like I’ve learned a lot. But it will be nice to return to normality next week.