Week Beginning 22nd April 2019

As Monday was Easter Monday this was a four-day week for me.  I spent almost the entire time working on Gavin Miller’s new Glasgow Medical Humanities project.  This is a Wellcome Trust funded project that is going to take the existing University of Glasgow Medical Humanities Network resource (https://medical-humanities.glasgow.ac.uk/) that I helped set up for Megan Coyer a number of years ago and broaden it out to cover all institutions in Glasgow.  The project will have a new website and interface, with facilities to enable an administrator to manage all of the existing data, plus add new data from both UoG and other institutions.  I met with Gavin a few weeks ago to discuss how the new resource should function.  He had said he wanted the functionality of a blog with additional facilities to manage the data about Medical Humanities projects, people, teaching materials, collections and keywords.  The old site enabled any UoG based person to register and then log in to add data, but this feature was never really used – in reality all of the content was managed by the project administrators.  As the new site would no longer be restricted to UoG staff we decided that to keep things simple and less prone to spamming we would not allow people to register with the site, and that all content would be directly managed by the project team.  Anyone who wants to add or edit content would have to contact the project team and ask them to do so.

I wasn’t sure how best to implement the management of data.  The old site had a different view of certain pages when an admin user was signed in, enabling them to manage the data, but as we’re no longer going to let regular users sign in I’d rather keep the admin interface completely separate.  As a blog is required the main site will be WordPress powered, and there were two possible ways of implementing the admin interface for managing the project’s data.  The first approach would be to write a plug-in for WordPress that would enable the data to be managed directly through the WordPress Admin interface.  I took this approach with Gavin’s earlier SciFiMedHums project (https://scifimedhums.glasgow.ac.uk/).  However, this means the admin interface is completely tied to WordPress, and if we ever wanted to keep the database going but drop the WordPress parts the process would be complicated.  Also, embedding the data management pages within the WordPress Admin interface limits the layout options and can make the user interface more difficult to navigate.  This brings me to the second option, which is to develop a separate content management system for the data that connects to WordPress for user authentication but is not connected to WordPress in any other way.  I’ve taken this approach with several other projects, such as The People’s Voice (https://thepeoplesvoice.glasgow.ac.uk/).  It allows greater flexibility in the creation of the interface and lets the admin user log in with their WordPress details, and as the system and WordPress are very loosely coupled any future separation will be straightforward to manage.  The second option is the one I decided to adopt for the new project.
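To make the authentication hand-off concrete, here’s a minimal sketch of how a standalone CMS script can lean on WordPress for login checks and nothing else.  The file paths and the capability check are assumptions for illustration, not the project’s actual code:

```php
<?php
// admin/index.php – hypothetical entry point for the standalone CMS.
// This require is the CMS's only WordPress dependency: it bootstraps WP
// just far enough to ask about the visitor's login state and capabilities.
require_once '../wp-load.php';

// Only logged-in WordPress administrators may manage the data.
if ( ! is_user_logged_in() || ! current_user_can( 'manage_options' ) ) {
    // Send anyone else to the WordPress login form, returning here afterwards.
    wp_safe_redirect( wp_login_url( home_url( '/admin/' ) ) );
    exit;
}

// From here on the CMS works with its own tables and templates;
// nothing below this line needs WordPress at all.
```

If WordPress were ever dropped, only the require and the calls above would need replacing with another authentication mechanism.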

I spent the week installing WordPress, setting up a theme and some default pages, designing an initial banner image based on images from the old site, migrating the database to the new domain and tweaking it to make it applicable for data beyond the University of Glasgow and then developing the CMS for the project.  This allows an Admin user to add, edit and delete information about Medical Humanities projects, people, teaching materials, collections and keywords.  Thankfully I could adapt most of the code from the old site, although a number of tweaks had to be made along the way.

With the CMS in place I then began to create the front-end pages to access the data.  As with projects such as The People’s Voice, these pages connect to WordPress in order to pull in theme information, and are embedded within WordPress by means of menu items, but are otherwise separate entities with no connection to WordPress.  If in future the pages need to function independently of WordPress, the only updates required will be to delete a couple of lines of code that reference WordPress from the scripts, and everything else will continue to function.  I created new pages to allow projects and people to be browsed, results to be displayed and individual records to be presented.  Again, much of the code was adapted from the old website, and some new stuff was adapted from other projects I’ve worked on.  I didn’t quite manage to get all of the front-end functionality working by the end of the week, and I still have the pages for teaching materials, collections and keywords to complete next week.  The site is mostly in place, though.  Here’s a screenshot of one of the pages, but note that the interface, banner and colour scheme might change before the site goes live.
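As an illustration of just how loose that coupling is, here’s a sketch of the shape one of these front-end pages takes.  The file, database and table names are hypothetical; the point is that the WordPress references amount to a handful of lines:

```php
<?php
// browse-projects.php – hypothetical front-end data page.
// These calls are the page's only ties to WordPress: bootstrap WP and
// pull in the active theme's header (and the footer at the end).
// Remove them and the page runs standalone.
require_once 'wp-load.php';
get_header();

// The actual content queries the project's own database directly.
$db = new PDO( 'mysql:host=localhost;dbname=medhums', 'user', 'password' );
foreach ( $db->query( 'SELECT title FROM projects ORDER BY title' ) as $row ) {
    echo '<p>' . htmlspecialchars( $row['title'] ) . '</p>';
}

get_footer();
```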

In addition to working on this project I also got the DSL website working via HTTPS (https://dsl.ac.uk/), which took a bit of sorting out with Arts IT Support but is fully working now.  I also engaged in a pretty long email conversation about a new job role relating to the REF, and provided feedback on a new job description.  Next week I hope to complete the work on the Glasgow Medical Humanities site, do some work for the DSL, maybe find some time to get back into Historical Thesaurus issues and also begin work on the front-end features for the SCOSYA project.  Quite a lot to do, then.

Week Beginning 15th April 2019

It was a four-day week due to Good Friday, and I spent the beginning of the week catching up on things relating to last week’s conference – writing up my notes from the sessions and submitting my expenses claims.  Marc also dropped off a bunch of old STELLA materials, such as CD-ROMs, floppy disks, photos and books, so I spent a bit of time sorting through these.  I also took delivery of a new laptop this week, so spent some time installing things on it and getting it ready for work.

Apart from these tasks I completed a final version of the Data Management Plan for Ophira Gamliel’s project, which has now been submitted to the College, and I met with Luca Guariento to discuss his new role.  Luca will be working across the College in a similar capacity to my role within Critical Studies, which is great news both for him and the College.  I also made a number of tweaks to one of the song stories for the RNSN project and engaged in an email discussion about REF and digital outputs.

I had two meetings with the SCOSYA project this week to discuss the development of the front-end features for the project.  It’s getting to the stage where the public atlas will need to be developed and we met to discuss exactly what features it will need to include.  There will actually be four public interfaces – an ‘experts interface’ which will be very similar to the atlas I developed for the project team, a simplified atlas that will only include a selection of features and search types, the ‘story maps’ covering 15-25 particular features, and a ‘listening atlas’ that will present the questionnaire locations and samples of speech at each place.  There’s a lot to develop and Jennifer would like as much as possible to be in place for a series of events the project is running in mid-June, so I’ll need to devote most of May to the project.

I spent about a day this week working for DSL.  Rhona contacted me to say they were employing a designer who needed to know some details about the website (e.g. fonts and colours), so I got that information to her.  The DSL’s Facebook page has also changed so I needed to update that on the website too.  Last week Ann sent me a list of further updates that needed to be made to the WordPress version of the DSL website that we’ve been working on, so I implemented those.  This included sorting out the colours of various boxes and ensuring that these are relatively easy to update in future, adjusting the contents box that stays fixed on the page as the user scrolls on one of the background essay sections, creating some new versions of images and ensuring the citation pop-up worked on a new essay.

I spent a further half-day or so working on the REELS project, making updates to the ‘export for publication’ feature I’d created a few weeks ago.  This feature grabs all of the information about place-names and outputs it in a format that reflects how it should look on the printed page.  It is then possible to copy this output into Word and retain the formatting.  Carole has been using the feature and had sent me a list of updates.  This included enabling paragraph divisions in the analysis section.  Previously each place-name entry was a single paragraph, so paragraph tags in any sections contained within it had to be removed.  I have changed this so that each place-name uses an HTML <div> element rather than a <p> element, meaning any paragraphs within it can be represented.  However, this has potentially resulted in there being more vertical space between parts of the information than there was previously.
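The reason for the switch is that HTML doesn’t allow one <p> to contain another: a browser silently closes the outer paragraph when it meets the inner tag, which is why nested paragraph tags previously had to be stripped.  Here’s a rough sketch of the changed output code, with a hypothetical $entry record standing in for the real data:

```php
<?php
// Wrapping each entry in a <div> means the analysis text can now
// legitimately contain its own <p>…</p> blocks. (A <p> wrapper cannot:
// browsers auto-close it at the first nested <p>.)
echo '<div class="placename-entry">';
echo '<strong>' . htmlspecialchars( $entry['name'] ) . '</strong> ';
echo $entry['analysis']; // may contain <p> tags, preserved intact
echo '</div>';
```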

Carole had also noted that in some places on import into Word spaces were being interpreted as non-breaking spaces, meaning some phrases were moved down to a new line even though some of the words would fit on the line above.  I investigated this, but it was a bit of a weird one.  It would appear that Word uses a ‘non-breaking space’ in some places and not in others.  Such characters (represented in Word, if you turn markers on, by what looks like a superscript ‘o’ as opposed to a mid-dot) link words together and prevent them being split over multiple lines.  I couldn’t figure out why Word was applying them so inconsistently, so this is something that will need to be fixed after pasting into Word.  The simplest way is to turn markers on, select one of the non-breaking space characters, choose ‘Replace’ from the menu, paste this character into the ‘Find what’ box and then put a regular space in the ‘Replace with’ box (alternatively, typing ‘^s’ – Word’s find-and-replace code for a non-breaking space – into the ‘Find what’ box saves pasting the character in).  There were a number of other smaller tweaks to make to the script, such as fixing the appearance of tildes for linear features, adjusting the number of spaces between the place-name and other text and changing the way parishes were displayed, which brings me to the end of this week’s report.  It will be another four-day week next week due to Easter Monday, and I intend to focus on the new Glasgow Medical Humanities resource for Gavin Miller, and some further Historical Thesaurus work.

Week Beginning 8th April 2019

I was on holiday last week, and for most of this week I attended the ‘English Historical Lexicography in the Digital Age’ conference in Bergamo, Italy.  On Monday and Tuesday I prepared for the conference, at which I was speaking about the Bilingual Thesaurus.  I also responded to a query regarding the SPADE project, had a further conversation with Ophira Gamliel about her project and carried out some app account management duties.  I spent Wednesday travelling to the conference, which then ran from Thursday through to Saturday lunchtime.  It was an excellent conference in a lovely setting.  It opened with a keynote lecture by Wendy Anderson which focussed primarily on the Mapping Metaphor project.  It was great to see the project’s visualisations again and to hear about some of the research that can be carried out using the online resource, and the audience seemed interested in the project.  Another couple of potential expansions to the resource might be to link through to citations in the OED to analyse the context of a metaphorical usage, and to label categories as ‘concrete’ or ‘abstract’ to enable analysis of different metaphorical connections, such as concrete to abstract or concrete to concrete.  One audience member suggested showing clusters of words and their metaphorical connections using network diagrams, although I recall that we did look at such visualisations back in the early days of the project and decided against using them.

Wendy’s session was followed by a panel on historical thesauri.  Marc gave a general introduction to historical thesauri, which was interesting and informative – apparently these aren’t just historical thesauri, they are ‘Kay-Samuels’ thesauri.  Marc also suggested we write a sort of ‘best practice’ guide for creating historical thesauri, which I thought sounded like a very good idea.  After that there was a paper about the Bilingual Thesaurus, given by Louise Sylvester and me.  I think this all went very well, but I can’t really comment further on something I presented.  The next paper was given by Fraser and Rhona Alcorn about the new Historical Thesaurus of Scots project.  It was good to hear their talk as I learnt some new details about the project, which I’m involved with.  Rhona mentioned that the existing printed Scots Thesaurus doesn’t have any dates and mostly focusses on rural life, so although it will be useful to a certain extent the project needs to be much broader in scope and more historically focussed.  The project is due to end in January and I’ll be creating some sort of interface in December / January.  Fraser mentioned that one possible idea is to look for words in the dictionary definitions that are also present in the HT category path in order to put words into categories.  Other plans are to look at cognate terms (e.g. ‘kirk’ and ‘church’), sound shifts (e.g. ‘stan’ to ‘stone’), variant spellings and expanded variant forms.  We will also need to find a way to automatically extract the dates from the DSL data.

The final paper in the session was by Heather Pagan and was about using the semantic tags in the Anglo-Norman Dictionary to categorise entries.  The AND uses a range of semantic tags (e.g. ‘bot.’, ‘law’), but these are not used in every sense – only when clarification is needed.  The use of the tags is not consistent: lots of forms are used but not documented, and lists of tags only include those that are abbreviations.  The dictionary has been digitised and marked up in XML, with semantic tags marked as follows: <usage type="zool." />.  Multiple types can be associated with an entry and different variants have now been rationalised.  There are, however, some issues.  For example, sometimes other words appear in a bracket where a tag might be, even though they’re not semantic tags, and tags are not used when things are considered obvious – e.g. ‘sword’ is not tagged as a weapon.  There are also potential inconsistencies – ‘architecture’ vs ‘building’, ‘mathematical’ vs ‘arithmetic’, ‘maritime’ vs ‘naval’.  The AHRC funded a project to redevelop the tags, and it was decided that tags in modern English would be used as they are for a modern audience.  The project decided to use OED subject categories and ended up using 105 different tags.  These are not hierarchical, but allow for multiple tags to be applied to each word.  It is possible to browse the website by tags (http://www.anglo-norman.net/label-search.shtml) and to limit this by POS.  Heather ended by pointing out some of the biases that the use of tags has demonstrated – e.g. there is a tag for ‘female’ but not for ‘male’, and religion is considered ‘Christian’ by default.

The next panel was on semantic change in lexicography and consisted of three papers.  The first was about the use of the term ‘court language’ in different periods during 17th century revolutionary England.  The speaker discussed ‘lexical priming’, when words are primed for collocational use through encounters in speech and writing, and also ‘priming drift’, when the meaning of the words changes.  The source data was taken from EEBO and queried via CQPWeb, and an initial search was on the collocations of ‘language’.  There were lots of negative adjective collocates due to the polemic nature of the texts.  ‘Smooth language’ was looked at, and how its use changed from being associated negatively with the court and monarchy (suggesting falsehood and fakery) to being viewed as positive (e.g. sophisticated, elegant).  The term ‘court language’ followed a similar path.

The next speaker looked at the use of Indian keywords by English women travel writers in the 19th century.  The speaker talked of ‘languaging’ – the changes within a language, with a focus on the language activity of speakers rather than on the language system.  The speaker looked at the ‘Hobson-Jobson’ Anglo-Indian dictionary and noticed there were no references to women travel writers as sources.  The speaker created a corpus of travel books by women (about 1.3 million words) consisting of letters, recollections and narratives, but no literary texts.  These were all taken from Google Books and Project Gutenberg, and analysis of the corpus was undertaken using WordSmith, comparing results to the Corpus of Later Modern English (15m tokens) as a reference corpus, which included the Victorian Women Writers project.  Results were analysed using concordances, clusters and n-grams.

The last speaker of the day discussed semantic variation in the use of words to refer to North American ‘Indians’ from 1584 to 1724.  The speaker suggested there was ‘overlexicalisation’ during this period – many quasi-synonymous terms.  The speaker created a corpus based on the Jamestown Digital Archive, consisting of 650,000 words over six subcorpora of 25 years each.  Analysis was done using Sketch Engine.  The five most frequent terms were Indian, savage, inhabitant, heathen and native, and the speaker showed graphs of their frequency.  The use of the words was compared to quotations in the OED and the speaker categorised uses of the terms in the corpus as more ‘positive’, ‘neutral’ or ‘negative’.  For example, the use of ‘Indian’ is generally more neutral than negative, but there are peaks of more negative uses during periods of crisis, such as Bacon’s Rebellion in 1676.  The use of ‘savage’ was mostly negative, while ‘heathen’ was used mainly in a religious sense until 1676.  The speaker also noted how ‘inhabitant’ and ‘native’ ended up shifting to refer to the European settlers in the late 1600s.

Day two of the conference began with a talk about the definition of a legal term that is currently in dispute in the US, tracing its usage back through the documentary evidence.  The speaker used the Lexicons of Early Modern English, which looks to be a useful resource.  The next speaker was Rachel Fletcher, a PhD student at Glasgow, who discussed how to deal with texts on the boundary between the Old English and Middle English periods.  This is a fundamental issue for a period dictionary, but it is difficult to decide what is OE and what isn’t.  The Dictionary of Old English uses evidence from manuscripts after 1150, e.g. attestation of spellings, and it is up to the user to decide which words they want to consider as OE.  The DOE links through to the Corpus of Old English, so you can look at dates and authors and see all usage.  The speaker stated that now that many of the resources are available digitally it’s easier to switch from one resource to another and track entries between dictionaries.  Boundaries can be fuzzier and period changes are more of a continuum than previously, which is a good thing.

The next talk was a keynote lecture by Susan Rennie about the annotated Jamieson.  Susan wasn’t at the conference in person but gave her talk via Skype, which mostly went ok, although there were times when it was difficult to hear properly.  Susan discussed Jamieson’s Dictionary of the Scottish Language, completed in 1808.  It was the first completed dictionary of Scots and a landmark in historical lexicography.  Susan discussed her ‘Annotated Jamieson’ project and the impact Jamieson’s dictionary had on later dictionaries such as the DSL.

The next speaker was the conference organiser, Marina Dossena, who gave a paper about the lexicography of Scots.  She pointed out that in the late 19th century Scots was seen as dying out, and that in fact this view had been around for centuries – she traced it back to Pinkerton in 1786, who considered Scots good in poetry but unacceptable in general use.  She also pointed out that Scots is at the intersection of monolingual and bilingual lexicography, and that Scots has no dictionary where both headwords and definitions are in Scots.  The final speaker of the morning session looked at the stigmatisation of phonological change in 19th century newspapers, and the role newspapers and ‘letters to the editor’ played in stigmatising certain pronunciations.  The speaker used the Eighteenth-Century English Phonology Database (ECEP) as a source.

After lunch there were six half-hour papers without a break, including an hour-long keynote lecture, which made for a pretty intense and exhausting afternoon.  The first speaker in the session discussed letters written by women who wished to give up their babies in 18th century England.  These letters were sent to a ‘foundling’ hospital in London, mainly by young, lower-class, unmarried women living in London, though they may have come from elsewhere.  Most letters were not written directly by the women, but were signed (often with a cross) by them, and they differed in formality and length.  The speaker analysed 63 such petitions signed by single mothers from 1773 to 1799 that were sent to the governors of the hospital.  There were around 100 women a week trying to give their children to the hospital.  The speaker discussed some of the terms used for a baby being born, and how these were frequently in the passive voice, e.g. ‘be delivered of child’ appeared 18 times.  The speaker also showed screenshots of the Historical Thesaurus timeline, which was good to see.

The following speaker looked at how childhood became a culturally constructed life stage during 16th and 17th century England.  The speaker used the OED and HT for data, showing how in the 16th century children became thought of as autonomous human beings for the first time.  Different categories for child were analysed, including foetus, infant, child, boy and girl – some 101 senses across eight 25-year periods.  From OE up to the 15th century words for child were more limited and exhibited no emotion: children were seen as offspring or were defined by their role, e.g. ‘page’, ‘groom’.  During the 16th and 17th centuries substages come in and there is more emotional colouring to the language, including lots of animal metaphors and some plant ones.

The next speaker discussed a dictionary of homonymic proper names that is in production, focussing on some examples from British and American English and using data from the English Pronouncing Dictionary and the Longman Pronunciation Dictionary.  After this there followed a keynote lecture about the Salamanca Corpus.  This talk looked specifically at 18th century Northern English, but gave an introduction to the Salamanca Corpus too.  It is a collection of regional texts from the early modern period to the 20th century, covering the years 1500 to 1950.  It consists of manuscripts, comments of contemporary individuals, dictionaries and glossaries, the literary use of dialect, dialectal literature and (from the 19th century onwards) philological studies.  The speaker pointed out how the literary use of dialect starts with Chaucer and the Wakefield Master in the 14th century, and at this time it wasn’t used for humour.  It became more common in the 16th century as a means of characterisation, generally for humorous intent, with the main dialect forms being Kentish, Devonshire, Lancashire and Yorkshire.  The speaker then looked at the 18th century Northern section of the corpus, examining some specific texts and giving some examples, and noting that the section is quite small (about 160,000 words) and is almost all Yorkshire and Lancashire.

The following speaker introduced the online English Dialect Dictionary.  The printed version was released in six volumes from 1898-1905, and the online version has digitised this and made it searchable.  The period covered is 1700-1900 and there are about 70,000 entries.  A word must have been in use after 1700, with some written evidence of its use, for it to be included.  The final speaker looked at how some of the data from the EDD had originally been compiled, specifically Oxfordshire dialect words, with the speaker pointing out that the Oxfordshire dialect is one of the least researched dialects in Britain.  The speaker discussed the role of correspondents in compiling the material.  There are 750 listed in the dictionary, but there were likely many more than this.  They answered questions about usage and were recruited via newspapers and local dialect societies.  The distribution of correspondents varies across the country, with Yorkshire best represented (167), followed by Lancashire (62).  Oxfordshire only had 28.

On the third day there was a single keynote lecture about the historical lexicography of Canadian English, looking at the second edition of the Dictionary of Canadianisms on Historical Principles (DCHP-2), which is available online.  The speaker noted that it was only in the 1920s and 30s that the first native-born generation of people in Vancouver appeared, and contrasted this with the history of Europe, also illustrating the sheer size of Canada as opposed to Europe.  The speaker discussed the geographical spread of dialect terms, both across the provinces of Canada and across the world, using Google data to look at usage in different geographical areas based on the top-level domains of sites.  After this keynote there were some final remarks and discussions and the conference drew to a close.

There were some very interesting papers at the conference, and it was particularly nice to see how the Historical Thesaurus and the Dictionary of the Scots Language are being used for research.

Week Beginning 25th March 2019

I spent quite a bit of time this week helping members of staff with research proposals.  Last week I met with Ophira Gamliel in Theology to discuss a proposal she’s putting together, and this week I wrote an initial version of a Data Management Plan for her project, which took a fair amount of time as it’s a rather multi-faceted project.  I also met with Kirsteen McCue in Scottish Literature to discuss a proposal of her own, and I spent some time after our meeting looking through some of the technical and legal issues that her project is going to encounter.

I also added three new pages to Matthew Creasey’s transcription / translation case study for his Decadence and Translation project (available here: https://dandtnetwork.glasgow.ac.uk/recreations-postales/), sorted out some user account issues for the Place-names of Kirkcudbrightshire project and prepared an initial version of my presentation for the conference I’m speaking at in Bergamo the week after next.

I also helped Fraser to get some data for the new Scots Thesaurus project he’s running.  This is going to involve linking data from the DSL to the OED via the Historical Thesaurus, so we’re exploring ways of linking up DSL headwords to HT lexemes initially, as this will then give us a pathway to specific OED headwords once we’ve completed the HT/OED linking process.

My first task was to create a script that returned all of the monosemous forms in the DSL, which Fraser suggested would be words that only have one ‘sense’ in their entries.  The script I wrote goes through the DSL data and picks out all of the entries that have one <sense> tag in their XML.  For each of these it then generates a ‘stripped’ form using the same algorithm that I created for the HT stripped fields (e.g. removing non-alphanumeric characters).  It then looks through the HT lexemes for an exact match on the HT lexeme ‘stripped’ field.  If there is exactly one match then data about the DSL word and the matching HT word is added to the table.
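Here’s a rough sketch of that matching pass.  The variable, table and field names ($dslEntries, ht_lexemes and so on) are hypothetical stand-ins for the real database structure:

```php
<?php
// Assumes $db is an open PDO connection and $dslEntries holds the DSL
// records, each with 'headword' and 'xml' fields (hypothetical names).
$find = $db->prepare( 'SELECT COUNT(*) FROM ht_lexemes WHERE stripped = ?' );

foreach ( $dslEntries as $entry ) {
    // A candidate entry is one whose XML contains exactly one <sense> tag.
    $xml = simplexml_load_string( $entry['xml'] );
    if ( $xml === false || count( $xml->xpath( '//sense' ) ) !== 1 ) {
        continue;
    }

    // Build the 'stripped' form as for the HT stripped fields:
    // lower-cased, with non-alphanumeric characters removed.
    $stripped = preg_replace( '/[^a-z0-9]/', '', strtolower( $entry['headword'] ) );

    // Keep the pair only if exactly one HT lexeme shares this stripped form.
    $find->execute( array( $stripped ) );
    if ( (int) $find->fetchColumn() === 1 ) {
        // ...record the DSL word alongside the matching HT word here.
    }
}
```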

For DOST there are 42177 words with one sense, and of these 2782 are monosemous in the HT; for SND there are 24085 words with one sense, and of these 1541 are monosemous in the HT.  However, there are a couple of things to note.  Firstly, I have not added in a check for part of speech, as the DSL POS field is rather inconsistent, often doesn’t contain data at all, and where there are multiple POSes there is no consistent way to split them up: sometimes a comma is used, sometimes a space, and while a POS generally ends with a full stop, this isn’t the case in forms like ‘n.1’ and ‘n.2’.  Also, the DSL uses very different terms to the HT for POS, so without lots of extra work mapping out which corresponds to which it’s not possible to automatically match up an HT and a DSL POS.  But as there are only a few thousand rows it should be possible to manually pick out the good ones.

Secondly, a word might have one sense but have two completely separate entries in the same POS, so as things currently stand the returned rows are not necessarily ‘monosemous’.  See for example ‘bile’ (http://dsl.ac.uk/results/bile), which has four separate entries in SND that are nouns, plus three supplemental entries, so even though an individual entry for ‘bile’ contains one sense the word is clearly not monosemous.  After further discussions with Fraser I updated my script to count the number of times a DSL headword with one sense appears as a separate headword in the data.  If the word is a DOST word and it appears more than once in DOST this number is highlighted in red, and if it appears at all in SND that number is highlighted in red too; for SND words it’s the same but reversed.  There is rather a lot of red in the output, so I’m not sure how useful the data is going to be, but it’s a start.  I also generated lists of DSL entries that contain the text ‘comb.’ and ‘attrb.’ as these will need to be handled differently.
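A sketch of that duplicate-headword check for a DOST word, again with hypothetical table and column names:

```php
<?php
// Count how often a one-sense DOST headword recurs in each dictionary.
// 'entries', 'source' and 'headword' are assumed names, not the real schema.
$inDost = $db->prepare( "SELECT COUNT(*) FROM entries WHERE source = 'dost' AND headword = ?" );
$inSnd  = $db->prepare( "SELECT COUNT(*) FROM entries WHERE source = 'snd' AND headword = ?" );

$inDost->execute( array( $headword ) );
$inSnd->execute( array( $headword ) );

// A DOST word is flagged in red if it has more than one DOST entry,
// or appears in SND at all (the logic is reversed for SND words).
$flagDost = (int) $inDost->fetchColumn() > 1;
$flagSnd  = (int) $inSnd->fetchColumn() > 0;
```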

All of the above took up most of the week, but I did have a bit of time to devote to HT/OED linking issues, including writing up my notes and listing action items following last Friday’s meeting, and beginning to tick off a few of the items from this list.  Pretty much all I managed to do related to the issue of HT lexemes with identical details appearing in multiple categories, and updating the output of an existing script to make it more useful.

Point 2 on my list was “I will create a new version of the non-unique HT words (where a word with the same ‘word’, ‘startd’ and ‘endd’ appears in multiple categories) to display how many of these are linked to OED words and how many aren’t”.  I updated the script to add in a yes/no column for where there are links.  I’ve also added in additional columns that display the linked OED lexeme’s details.  Of the 154428 non-unique words, 129813 are linked.

Point 3 was “I will also create a version of the script that just looks at the word form and ignores dates”.  I’ve decided against doing this as just looking at word form without dates is going to lead to lots of connections being made where they shouldn’t really exist (e.g. all the many forms of ‘strike’).

Point 4 was “I will also create a version of the script that notes where one of the words with the same details is matched and the other isn’t, to see whether the non-matched one can be ticked off”, and this has proved both tricky to implement and pretty useful.  Tricky because a script can’t just compare the outputted forms sequentially – each identical form needs to be compared with every other.  But as I say, it’s given some good results.  There are 9056 words that aren’t matched but probably should be, which could potentially be ticked off.  Of course, this isn’t going to affect the OED ‘ticked off’ stats, but rather the HT stats.  I’ve also realised that this script currently doesn’t take POS into consideration – it just looks at word form, firstd and lastd, so this might need further work.
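To give an idea of the comparison involved, here’s a sketch of one way to do it: group the lexemes by their identifying details and look for groups containing both matched and unmatched members.  The names ($lexemes, oed_id and so on) are hypothetical:

```php
<?php
// Group lexemes by (word, firstd, lastd). POS is deliberately absent
// from the key, mirroring the current script's behaviour.
$groups = array();
foreach ( $lexemes as $lex ) {
    $key = $lex['word'] . '|' . $lex['firstd'] . '|' . $lex['lastd'];
    $groups[ $key ][] = $lex;
}

foreach ( $groups as $group ) {
    // Split each group into lexemes with and without an OED link.
    $matched   = array_filter( $group, function ( $l ) { return $l['oed_id'] !== null; } );
    $unmatched = array_filter( $group, function ( $l ) { return $l['oed_id'] === null; } );

    // A group with both kinds suggests the unmatched lexemes could be
    // ticked off using the details of their matched twins.
    if ( $matched && $unmatched ) {
        // ...output the unmatched lexemes for review here.
    }
}
```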

I’m going to be on holiday next week and away at a conference for most of the following week, so this is all from me for a while.