I focussed on two new projects that needed my input this week, as well as working on some more established projects and attending an event on Friday. One of the new projects was the pilot for Matthew Sangster’s Books and Borrowing project, which I started working on last week. After some manual tweaking of the script I had created to work with the data, and some manual reworking of the data itself, I managed to get 8138 rows out of 8199 uploaded. Most of the 61 rows that failed to upload were those where blank cells were merged together, and almost all of these denote blank pages or otherwise don’t contain an actual record. I considered fixing the rows that failed to upload, but as this is still just a test version of the data there wasn’t much point in spending time doing so: I will be getting a new dataset from Matthew later in the summer anyway.
I also created and executed a few scripts that will make the data more suitable for searching and browsing. One of these takes the ‘lent’ and ‘returned’ dates and splits them into separate day, month and year columns, converting the month to a numeric value too. This will make it easier to search by date or to order results by date (e.g. grouping by a specific month or searching within a range of time). Note that where the given date doesn’t exactly fit the pattern ‘dd mmm yyyy’ (e.g. values like ‘[blank]’ or ‘[see next line]’) the new date fields remain unpopulated. There are also some dates that don’t have a day, and a few typos (e.g. ‘fab’ instead of ‘feb’) that I haven’t done anything about yet.
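To give a rough idea of the date splitting, here is a minimal Python sketch rather than the actual script. The function name `split_date`, the choice to accept one- or two-digit days and the sample years used below are all assumptions made for illustration.

```python
import re

# Three-letter month abbreviations mapped to numeric values.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Assumed reading of 'dd mmm yyyy': 1-2 digit day, 3-letter month, 4-digit year.
DATE_PATTERN = re.compile(r"^(\d{1,2}) ([a-z]{3}) (\d{4})$", re.IGNORECASE)


def split_date(raw):
    """Return (day, month, year) as integers, or three Nones if the
    value does not exactly fit the expected pattern."""
    match = DATE_PATTERN.match(raw.strip())
    if not match:
        return (None, None, None)
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        # Unknown abbreviation such as the typo 'fab': leave unpopulated.
        return (None, None, None)
    return (int(day), month, int(year))
```

With this approach values such as ‘[blank]’, dates without a day and misspelled months all come back unpopulated, so they can be reviewed by hand later.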
I’ve also extracted the professor names from the ‘prof1’, ‘prof2’ and ‘prof3’ columns and have stored them as unique entries in a new ‘profs’ table. So for example ‘Dr Leechman’ only appears once in the table, even though he appears multiple times as either prof 1, 2 or 3 (225 times, in fact). There are 123 distinct profs in this table, although these will need further work as some are undoubtedly duplicates with slightly different forms. I’ve also created a joining table that links each record to its profs. This matches up the record ID with the unique ID for each prof, and also stores whether the prof was listed as 1, 2 or 3 for each record, in case this is of any significance.
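The structure of the ‘profs’ table and the joining table can be sketched in Python as follows. This is only an illustration using in-memory data rather than the real database, and the function name `build_prof_tables`, the record layout and the sequential IDs are assumptions.

```python
def build_prof_tables(records):
    """Build a unique list of profs and a joining table.

    `records` is assumed to be a list of dicts with the keys
    'id', 'prof1', 'prof2' and 'prof3'.
    """
    profs = {}   # prof name -> unique prof ID
    links = []   # (record ID, prof ID, position 1-3)
    for record in records:
        for position in (1, 2, 3):
            name = (record.get(f"prof{position}") or "").strip()
            if not name:
                continue  # empty cell: nothing to link
            # Reuse the existing ID, or assign the next one for a new name.
            prof_id = profs.setdefault(name, len(profs) + 1)
            links.append((record["id"], prof_id, position))
    return profs, links
```

Because names are matched exactly, variant spellings of the same person would each get their own ID, which is why the table still needs manual checking for duplicates.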
Similarly, I’ve extracted the unique normalised student names from the records and have stored these in a separate table. There are 862 unique student names, and a further linking table associates each of these with one or more records. I will also need to split the student names into forename and surname in order to generate a browse list of students. It might not be possible to do this fully automatically, as names like ‘Robert Stirling Junior’ and ‘Robert Stirling Senior’ would then end up with ‘Junior’ and ‘Senior’ as surnames. I guess a list of professors ordered by surname would be needed too.
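One possible way of handling the ‘Junior’ and ‘Senior’ problem is sketched below in Python. It is a simple heuristic and not something I have run on the data: the function name `split_name` and the list of suffixes are assumptions, and it treats the last remaining word as the surname, so multi-word surnames would still need manual correction.

```python
# Trailing words that should not be treated as surnames (assumed list).
SUFFIXES = {"junior", "senior"}


def split_name(full_name):
    """Split a name into (forename, surname, suffix)."""
    parts = full_name.split()
    suffix = ""
    # Peel off a trailing 'Junior' / 'Senior' before looking for the surname.
    if len(parts) > 1 and parts[-1].lower() in SUFFIXES:
        suffix = parts.pop()
    if not parts:
        return ("", "", suffix)
    surname = parts[-1]
    forename = " ".join(parts[:-1])
    return (forename, surname, suffix)
```

Keeping the suffix as a separate value means the two Robert Stirlings would both sort under ‘Stirling’ in a browse list while remaining distinguishable.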
I have also processed the images of the three manuscripts that appear in the records (2, 3 and 6). This involved running a batch script to rotate the images of the manuscripts that are to be read as landscape rather than portrait and to rename all files to remove spaces. I’ve also passed the images through the Zoomify tileset generator, so I now have tiles at various zoom levels for all of the images. I’m not going to be able to do any further work on this pilot project until the end of July, but it’s good to get some of the groundwork done.
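The renaming part of the batch processing could look something like the following Python sketch. It is not the script I actually ran: the function name `remove_spaces_from_filenames` is hypothetical, it assumes spaces are simply deleted rather than replaced, and it leaves out the rotation step, which would need an image library.

```python
from pathlib import Path


def remove_spaces_from_filenames(directory):
    """Rename every file in `directory` whose name contains spaces.

    Returns a list of (old name, new name) pairs for the files renamed.
    """
    renamed = []
    # sorted() builds the full listing first, so renaming is safe in the loop.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or " " not in path.name:
            continue
        target = path.with_name(path.name.replace(" ", ""))
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```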
The second new project I worked on this week was Alison Wiggins’s Account Books project. Alison had sent me an Access database containing records relating to the letters of Mary, Queen of Scots and she wanted me to create an online database and content management system out of this, to enable several researchers to work on the data together. Alison wanted this to be ready to use in July, which meant I had to try to get the system up and running this week. I spent about two days importing the data, which is split across 13 related database tables, and setting up the content management system. Facilities are now in place for Alison to create staff accounts and to add, edit, delete and associate information about archives, documents, editions, pages, people and places.
On Wednesday this week I had a further meeting with Marc and Fraser about the HT / OED data linking task. In preparation for the meeting I spent some time creating queries that generated some statistics about the data. There are 223250 matched categories, and of these there are 161378 where the number of HT and OED words are the same and 100% of words match. There are 19990 categories where the number of HT and OED words are the same but not all words match. There are 12796 categories where the number of HT words is greater than the number of OED words and 100% of OED words match, and 5878 categories where the number of HT words is greater than the number of OED words and less than 100% of OED words match. There are 16077 categories where the number of HT words is less than the number of OED words and 100% of HT words match, and in these categories there are 18909 unmatched OED lexemes that are ‘revised’. Finally, there are 7131 categories where the number of HT words is less than the number of OED words and less than 100% of HT words match, and in these categories there are 19733 unmatched OED lexemes that are ‘revised’. Hopefully these statistics will help when it comes to deciding what dates to adopt from the OED data.
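The six groups above can be expressed as a small Python sketch, which also confirms that the group totals add up to the 223250 matched categories. The real figures came from database queries; the function name `classify_category`, the bucket labels and the idea of passing in simple word counts are assumptions made for illustration.

```python
TOTAL_MATCHED = 223250

# Category counts reported above, keyed by (size comparison, match level).
BUCKET_COUNTS = {
    ("same", "full"): 161378,
    ("same", "partial"): 19990,
    ("ht_larger", "full"): 12796,
    ("ht_larger", "partial"): 5878,
    ("oed_larger", "full"): 16077,
    ("oed_larger", "partial"): 7131,
}


def classify_category(ht_words, oed_words, matched_words):
    """Place one matched category into one of the six groups."""
    if ht_words == oed_words:
        size = "same"
    elif ht_words > oed_words:
        size = "ht_larger"
    else:
        size = "oed_larger"
    # '100% match' is judged against the smaller of the two word lists.
    full = matched_words == min(ht_words, oed_words)
    return (size, "full" if full else "partial")


def counts_are_consistent():
    """Check that the six groups account for every matched category."""
    return sum(BUCKET_COUNTS.values()) == TOTAL_MATCHED
```

Judging the match percentage against the smaller word list mirrors the wording above, where ‘100% of OED words match’ is used when the HT has more words and ‘100% of HT words match’ when the OED has more.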
On Friday I went down to Lancaster University for the Encyclopedia of Shakespeare’s Language Symposium. This was a launch event for the project and there were sessions on the new resources that the project is going to make available, specifically the Enhanced Shakespearean Corpus. This corpus consists of three parts. The ‘First Folio Plus’ is a corpus of the first folio of plays from 1623, plus a few extra plays. It has additional layers of tagging, such as part of speech, and of annotation, such as social annotation (e.g. who was speaking to whom), gender and social rank. Part of speech was tagged using an adapted version of the CLAWS tagger and spelling variation was regularised using VARD2.
The second part is a corpus of comparative plays. This is a collection of plays by other authors from the same time period, which allows the comparison of Shakespeare’s language to that of his contemporaries. The ‘first folio’ has 38 plays from 1589-1613 while the comparative plays corpus has 46 plays from 1584-1626. Both are just over 1 million words in size and look at similar genres (comedy, tragedy, history) and a similar mixture of verse and prose.
The third part is the EEBO-TCP segment, which contains about 300 million words over 5700 texts, or about a quarter of EEBO. It doesn’t include Shakespeare’s texts but includes texts from five broad domains (literary, religious, administrative, instructional and informational) and many genres, and it allows researchers to tap into the meanings triggered in the minds of Elizabethan audiences.
The corpus uses CQPWeb, which was developed by Andrew Hardie at Lancaster, who spoke about the resource at the event. As we use this software for some projects at Glasgow it was good to see it demonstrated and to get some ideas about how it is being used for this new project. There were also several short papers that demonstrated the sorts of research that can be undertaken using the new corpus and the software. It was an interesting event and I’m glad I attended it.
Also this week I engaged in several email discussions with DSL people about working with the DSL data, advised someone who is helping Thomas Clancy get his proposal together, scheduled an ArtsLab session on research data, provided the RNSN people with some information they need for some printed materials they’re preparing and spoke to Gerry McKeever about an interactive map he wants me to create for his project.