With the strike action over (for now, at least) I returned to a full week of work, and managed to tackle a few items that had been pending for a while. I’d been asked to write a Technical Plan for an AHRC application for Faye Hammill in English Literature, but since then the changeover from four-page, highly structured Technical Plans to two-page, more free-flowing Data Management Plans has taken place. This was a good opportunity to write an AHRC Data Management Plan, and after following the advice on the AHRC website (http://www.ahrc.ac.uk/documents/guides/data-management-plan/) and consulting the additional documentation on the DCC’s DMPonline tool (https://dmponline.dcc.ac.uk/) I managed to write a plan that covered all of the points. There are still some areas where I need further input from Faye, but we do at least have a first draft now.
I also created a project website for Anna McFarlane’s British Academy funded project. The website isn’t live yet, so I can’t include the URL here, but Anna is happy with how it looks, which is good. After sorting that out I then returned to the REELS project. I created the endpoints in the API that would allow the various browse facilities we had agreed upon to function, and then built these features in the front-end. It’s now possible to (for example) list all sources and see which has the most place-names associated with it, or bring up a list of all of the years in which historical forms were first attested.
I spent quite a bit of time this week working on the extraction of words and their thematic headings from EEBO for the Linguistic DNA project. Before the strike I’d managed to write a script that went through a single file and counted up all of the occurrences of words, parts of speech and associated thematic headings, but I was a little confused that there appeared to be thematic heading data in column 6 and also column 10 of the data files. Fraser looked into this and figured out that the most likely thematic heading appeared in column 10, while other possible ones appeared in column 6. This was a rather curious way to structure the data, but once I knew about it I could set my script to focus on column 10, as we’re only interested in the most likely thematic heading.
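In outline, the per-file counting step looks something like the following Python sketch. The files are assumed to be tab-delimited; column 10 holds the most likely heading as described above, while putting the lemma and part of speech in the first two columns is my assumption about the layout.

```python
from collections import Counter

def count_headings(path):
    """Tally lemma / part-of-speech / thematic-heading combinations in one
    tab-delimited EEBO file. Column 10 (index 9) holds the most likely
    heading; lemma and POS in columns 1 and 2 is an assumption."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 10:
                continue  # skip malformed or short lines
            lemma, pos, heading = cols[0], cols[1], cols[9]
            counts[(lemma, pos, heading)] += 1
    return counts
```

A Counter keeps everything in memory for a single file, which is fine at this scale; the database only comes in when aggregating across files.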
I updated my script to insert data into a database rather than just hold things temporarily in an array, and I also wrapped the script in another function that then applied the processing to every file in a directory rather than just a single file. With this in place I set the script running on the entire EEBO directory. I was unsure whether running this on my desktop PC would be fast enough, but thankfully the entire dataset was processed in just a few hours.
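Wrapped up, the whole run looks roughly like this sketch, with sqlite standing in for whatever database engine is actually used; the table layout, column positions and the name heading_counts are all assumptions.

```python
import os
import sqlite3
from collections import Counter

def process_directory(dir_path, db_path):
    """Count (lemma, POS, heading) combinations in every tab-delimited file
    in a directory and store the per-file totals in a database. sqlite and
    the heading_counts table layout are assumptions for illustration."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS heading_counts (
                        filename TEXT, lemma TEXT, pos TEXT,
                        heading TEXT, freq INTEGER)""")
    for name in sorted(os.listdir(dir_path)):
        counts = Counter()
        with open(os.path.join(dir_path, name), encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) >= 10:
                    # column 10 holds the most likely thematic heading
                    counts[(cols[0], cols[1], cols[9])] += 1
        conn.executemany(
            "INSERT INTO heading_counts VALUES (?, ?, ?, ?, ?)",
            [(name, l, p, h, n) for (l, p, h), n in counts.items()])
        conn.commit()  # commit per file so a crash loses at most one file
    conn.close()
```

Committing after each file rather than at the end means an interrupted run leaves complete files behind, which matters for the resumption logic described below.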
My script finished processing all 14,590 files that I had copied from the J drive to my local PC, resulting in a whopping 70,882,064 rows entered into my database. Everything seemed to be going very well, but Fraser wasn’t sure I had all of the files, and he was correct. Having checked the J drive, there were 25,368 items, so the copying process must have silently failed at some point. And even more annoyingly, it didn’t fail in an orderly manner: for example, the earliest file I have on my PC is A00018, while there are several earlier ones on the J drive.
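A quick way to catch this sort of silent failure is to compare the two directory listings directly; a minimal sketch, assuming both the network drive and the local copy are accessible as ordinary paths:

```python
import os

def missing_files(source_dir, local_dir):
    """List filenames that exist on the source drive but not locally,
    to catch a copy that silently failed part-way through."""
    source = set(os.listdir(source_dir))
    local = set(os.listdir(local_dir))
    return sorted(source - local)
```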
I copied all of the files over again and decided that, rather than dropping the database and starting from scratch, I’d update my script to check whether a file had already been processed, meaning that only the missing 10,000 or so would be dealt with. However, in order to do this the script would need to query a 70-million-row database on the ‘filename’ column, which didn’t have an index. I began the process of creating an index, but indexing 70 million rows took a long time – several hours, in fact. I almost gave up and inserted all the data again from scratch, but I knew I would need this index in order to query the data anyway, so I decided to persevere. Thankfully the index finally finished building and I could then run my script to insert the missing 10,000 files, a process that took a bit longer as the script now had to query the database and update the index as well as insert the data. But finally all 25,368 files were processed, resulting in 103,926,008 rows in my database.
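The skip-if-already-processed check boils down to an index on the filename column plus a cheap existence query. Sketched here with sqlite, with the table name heading_counts being an assumption:

```python
import sqlite3

def ensure_filename_index(db_path):
    """Build the filename index if it doesn't exist yet. On tens of
    millions of rows this one statement can take hours to complete."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_filename ON heading_counts (filename)")
    conn.commit()
    return conn

def already_processed(conn, filename):
    """Check whether a file's rows are already in the table, so that only
    the missing files get re-processed on a second run."""
    cur = conn.execute(
        "SELECT 1 FROM heading_counts WHERE filename = ? LIMIT 1", (filename,))
    return cur.fetchone() is not None
```

Without the index, each of these existence checks would be a full table scan; with it, the lookup is effectively instant, which is why building it was worth the wait.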
The script and the data are currently located on my desktop PC, but if Fraser and Marc want to query it I’ll need to get this migrated to a web server of some sort, so I contacted Chris about this. Chris said he’d sort a temporary solution out for me, which is great. I then set to work writing another script that would extract summary information for the thematic headings and insert this into another table. After running the script this table now contains a total count of each word / part of speech / thematic heading across the entire EEBO collection. Where a lemma appears with multiple parts of speech these are treated as separate entities and are not added together. For example, ‘AA Creation NN1’ has a total count of 4609 while ‘AA Creation NN2’ has a total count of 19, and these are separate rows in the table.
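The summary step is essentially a GROUP BY over the raw rows; a sketch (sqlite again, with table and column names assumed) that keeps different parts of speech as separate rows, just as described above:

```python
import sqlite3

def build_summary(db_path):
    """Collapse the per-file rows into one total per lemma / POS / heading
    combination across the whole collection. Table and column names are
    assumptions; sqlite stands in for the real database engine."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS heading_totals")
    # Different parts of speech are deliberately kept apart: grouping on
    # (lemma, pos, heading) means NN1 and NN2 rows are not added together.
    conn.execute("""CREATE TABLE heading_totals AS
                    SELECT lemma, pos, heading, SUM(freq) AS total
                    FROM heading_counts
                    GROUP BY lemma, pos, heading""")
    conn.commit()
    conn.close()
```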
Whilst working with the data I noticed that a significant amount of it is unusable. Of the almost 104 million rows of data, over 20 million have been given the heading ’04:10’, and a lot of these are words that probably could have been cleaned up before the data was fed into the tagger. Many are mis-classified words that have an asterisk or a dash at the start; if the asterisk or dash had been removed, the word could have been successfully tagged. For example, there are 88 occurrences of ‘*and’ that have been given the heading ’04:10’ and the part of speech ‘FO’. Basically, about a fifth of the dataset has an unusable thematic heading, and much of this is data that could have been useful if it had been pre-processed a little more thoroughly.
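The pre-processing that would have rescued much of this data is trivial; something like the following hypothetical helper (not part of the actual tagging pipeline) is all it would have taken:

```python
import re

def strip_leading_junk(word):
    """Strip leading asterisks and dashes so that words like '*and' could
    be tagged properly. A hypothetical clean-up helper, illustrating the
    pre-processing step the tagger's input apparently never had."""
    return re.sub(r"^[*\-]+", "", word)
```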
Anyway, after tallying up the frequencies across all texts I then wrote a script to query this table and extract a ‘top 10’ list of lemma / pos combinations for each of the 3,972 headings that are used. The output has one row per heading and a column for each of the top 10 (or fewer if there are fewer than 10). Each entry currently has the lemma, then the pos in brackets, then the total frequency across all 25,000 texts after a bar, as follows: christ (NP1) | 1117625. I’ve sent this to Fraser and once he gets back to me I’ll proceed further.
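The ‘top 10’ extraction and formatting can be sketched as follows (sqlite once more, with the summary table name heading_totals and its columns assumed):

```python
import sqlite3

def top_ten_per_heading(db_path):
    """For each thematic heading, pull the ten most frequent lemma / POS
    combinations and format them as 'lemma (POS) | frequency'. The
    heading_totals table name and columns are assumptions."""
    conn = sqlite3.connect(db_path)
    rows = {}
    headings = [h for (h,) in conn.execute(
        "SELECT DISTINCT heading FROM heading_totals")]
    for heading in headings:
        top = conn.execute(
            """SELECT lemma, pos, total FROM heading_totals
               WHERE heading = ? ORDER BY total DESC LIMIT 10""",
            (heading,)).fetchall()
        rows[heading] = ["{} ({}) | {}".format(l, p, t) for l, p, t in top]
    conn.close()
    return rows
```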
In addition to the above big tasks, I also dealt with a number of smaller issues. Thomas Widmann of SLD had asked me to get some DSL data from the API for him, so I sent that on to him. I updated the ‘favicon’ for the SPADE website, fixed a couple of issues for the Medical Humanities Network website, and dealt with a couple of issues with legacy websites: for SWAP I deleted the input forms as these were sending spam to Carole, and I also fixed an encoding issue with the Emblems websites that had crept in when the sites had been moved to a new server.
I also heard this week that IT Services are going to move all project websites to HTTPS from HTTP. This is really good news as Google has started to rank plain HTTP sites lower than HTTPS sites, plus Firefox and Chrome give users warnings about HTTP websites. Chris wanted to try migrating one of my sites to HTTPS and we did this for the Scots Corpus. There were some initial problems with the certificate not working for the ‘www’ subdomain but Chris quickly fixed this and everything appeared to be working fine. Unfortunately, although everything was fine within the University network, the University’s firewall was blocking HTTPS requests from external users, meaning no-one outside of the University network could access the site. Thankfully someone contacted Wendy about this and Chris managed to get the firewall updated.
I also did a couple of tasks for the SCOSYA project, and spoke to Gary about the development of the front-end, which I think is going to need to start soon. Gary is going to try and set up a meeting with Jennifer about this next week. On Friday afternoon I attended a workshop about digital editions that Sheila Dickson in German had organised. There were talks about the Cullen project, the Curious Travellers project, and Sheila’s Magazin zur Erfahrungsseelenkunde project. It was really interesting to hear about these projects and their approaches to managing transcriptions.