With the strike action over (for now, at least) I returned to a full week of work, and managed to tackle a few items that had been pending for a while. I’d been asked to write a Technical Plan for an AHRC application for Faye Hammill in English Literature, but since then the changeover from four-page, highly structured Technical Plans to two-page, more free-flowing Data Management Plans has taken place. This was a good opportunity to write an AHRC Data Management Plan, and after following the advice on the AHRC website (http://www.ahrc.ac.uk/documents/guides/data-management-plan/) and consulting the additional documentation on the DCC’s DMPonline tool (https://dmponline.dcc.ac.uk/) I managed to write a plan that covered all of the points. There are still some areas where I need further input from Faye, but we do at least have a first draft now.
I also created a project website for Anna McFarlane’s British Academy funded project. The website isn’t live yet, so I can’t include the URL here, but Anna is happy with how it looks, which is good. After sorting that out I then returned to the REELS project. I created the endpoints in the API that would allow the various browse facilities we had agreed upon to function, and then built these features in the front-end. It’s now possible to (for example) list all sources and see which has the most place-names associated with it, or bring up a list of all of the years in which historical forms were first attested.
I spent quite a bit of time this week working on the extraction of words and their thematic headings from EEBO for the Linguistic DNA project. Before the strike I’d managed to write a script that went through a single file and counted up all of the occurrences of words, parts of speech and associated thematic headings, but I was a little confused that there appeared to be thematic heading data in column 6 and also column 10 of the data files. Fraser looked into this and figured out that the most likely thematic heading appeared in column 10, while other possible ones appeared in column 6. This was a rather curious way to structure the data, but once I knew about it I could set my script to focus on column 10, as we’re only interested in the most likely thematic heading.
I updated my script to insert data into a database rather than just hold things temporarily in an array, and I also wrapped the script in another function that then applied the processing to every file in a directory rather than just a single file. With this in place I set the script running on the entire EEBO directory. I was unsure whether running this on my desktop PC would be fast enough, but thankfully the entire dataset was processed in just a few hours.
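The counting-and-inserting process described above can be sketched roughly as follows. This is an illustrative Python/SQLite version rather than the PHP/MySQL scripts I actually wrote, and the table layout and column positions (word and part of speech in the first two columns, with the most likely thematic heading in column 10) are assumptions for the purposes of the sketch:

```python
import os
import sqlite3

def process_file(path, cur):
    """Count word / POS / thematic heading combinations in one tab-delimited file."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 10:
                continue
            # Column positions are assumed; column 10 holds the most likely heading.
            word, pos, heading = cols[0], cols[1], cols[9]
            key = (os.path.basename(path), word, pos, heading)
            counts[key] = counts.get(key, 0) + 1
    cur.executemany(
        "INSERT INTO frequencies (filename, word, pos, heading, freq) "
        "VALUES (?, ?, ?, ?, ?)",
        [(*k, v) for k, v in counts.items()],
    )

def process_directory(directory, db):
    """Wrap the single-file processing so it runs over every file in a directory."""
    cur = db.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS frequencies "
                "(filename TEXT, word TEXT, pos TEXT, heading TEXT, freq INTEGER)")
    for name in sorted(os.listdir(directory)):
        process_file(os.path.join(directory, name), cur)
    db.commit()
```

Counting in memory per file and inserting the totals in one batch keeps the number of database round-trips down, which matters when there are tens of millions of rows to write.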
My script finished processing all 14,590 files that I had copied from the J drive to my local PC, resulting in a whopping 70,882,064 rows entered into my database. Everything seemed to be going very well, but Fraser wasn’t sure I had all of the files, and he was correct. Having checked the J drive, there were 25,368 items, so when I copied the files across the process must have silently failed at some point. Even more annoyingly, it didn’t fail in an orderly manner: for example, the earliest file I have on my PC is A00018, while there are several earlier ones on the J drive.
I copied all of the files over again and decided that rather than dropping the database and starting from scratch I’d update my script to check whether a file had already been processed, meaning that only the missing 10,000 or so would be dealt with. However, in order to do this the script would need to query a 70-million-row database on the ‘filename’ column, which didn’t have an index. I began the process of creating an index, but indexing 70 million rows took a long time: several hours, in fact. I almost gave up and inserted all the data again from scratch, but I knew I would need this index in order to query the data anyway, so I decided to persevere. Thankfully the index finally finished building and I could then run my script to insert the missing 10,000 files, a process that took a bit longer as the script now had to query the database and update the index as well as insert the data. But finally all 25,368 files were processed, resulting in 103,926,008 rows in my database.
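The already-processed check and the index it depends on look something like the following. Again this is a Python/SQLite sketch of the logic rather than the actual PHP/MySQL setup, and the table and index names are illustrative:

```python
import sqlite3

def ensure_filename_index(db):
    # Without this index, the EXISTS check below would have to scan all
    # ~70 million rows for every one of the 25,368 files.
    db.execute("CREATE INDEX IF NOT EXISTS idx_filename "
               "ON frequencies (filename)")

def is_processed(db, filename):
    """Return True if rows for this file are already in the database."""
    row = db.execute(
        "SELECT EXISTS (SELECT 1 FROM frequencies WHERE filename = ?)",
        (filename,),
    ).fetchone()
    return bool(row[0])
```

The main loop then simply skips any file for which `is_processed` returns True, so re-running the script only deals with the files that failed to copy the first time.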
The script and the data are currently located on my desktop PC, but if Fraser and Marc want to query it I’ll need to get this migrated to a web server of some sort, so I contacted Chris about this. Chris said he’d sort a temporary solution out for me, which is great. I then set to work writing another script that would extract summary information for the thematic headings and insert this into another table. After running the script this table now contains a total count of each word / part of speech / thematic heading across the entire EEBO collection. Where a lemma appears with multiple parts of speech these are treated as separate entities and are not added together. For example, ‘AA Creation NN1’ has a total count of 4609 while ‘AA Creation NN2’ has a total count of 19, and these are separate rows in the table.
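The summary step boils down to a single aggregate query. Because the grouping includes the part of speech, ‘AA Creation NN1’ and ‘AA Creation NN2’ end up as separate rows, as described above. This is a SQLite sketch with illustrative table and column names, not the project’s actual schema:

```python
import sqlite3

# Total each word / part of speech / thematic heading combination
# across the whole collection; lemmas with multiple parts of speech
# are kept as separate rows.
SUMMARY_SQL = """
CREATE TABLE heading_totals AS
SELECT heading, word, pos, SUM(freq) AS total
FROM frequencies
GROUP BY heading, word, pos
"""
```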
Whilst working with the data I noticed that a significant amount of it is unusable. Of the almost 104 million rows of data, over 20 million have been given the heading ‘04:10’, and a lot of these are words that probably could have been cleaned up before the data was fed into the tagger. Many are mis-classified words that have an asterisk or a dash at the start; if the asterisk or dash had been removed then the word could have been successfully tagged. For example, there are 88 occurrences of ‘*and’ that have been given the heading ‘04:10’ and the part of speech ‘FO’. Basically, about a fifth of the dataset has an unusable thematic heading, and much of this is data that could have been useful if it had been pre-processed a little more thoroughly.
Anyway, after tallying up the frequencies across all texts I then wrote a script to query this table and extract a ‘top 10’ list of lemma / POS combinations for each of the 3,972 headings that are used. The output has one row per heading and a column for each of the top 10 (or fewer, if there are fewer than 10). Each currently has the lemma, then the POS in brackets and the total frequency across all 25,000 texts after a bar, as follows: christ (NP1) | 1117625. I’ve sent this to Fraser and once he gets back to me I’ll proceed further.
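The ‘top 10’ extraction can be sketched as a simple group-sort-slice over the summary rows, producing the ‘lemma (POS) | frequency’ strings described above. A minimal Python version, assuming the rows come back from the summary table as (heading, word, pos, total) tuples:

```python
def top_ten(rows, n=10):
    """Group (heading, word, pos, total) rows by heading and return the
    n most frequent lemma/POS combinations per heading, formatted as
    'lemma (POS) | frequency'."""
    by_heading = {}
    for heading, word, pos, total in rows:
        by_heading.setdefault(heading, []).append((total, word, pos))
    out = {}
    for heading, items in by_heading.items():
        items.sort(reverse=True)  # highest frequency first
        out[heading] = [f"{w} ({p}) | {t}" for t, w, p in items[:n]]
    return out
```

Headings with fewer than ten combinations simply get shorter lists, matching the ‘or fewer’ behaviour of the output file.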
In addition to the above big tasks, I also dealt with a number of smaller issues. Thomas Widmann of SLD had asked me to get some DSL data from the API for him, so I sent that on to him. I updated the ‘favicon’ for the SPADE website, I fixed a couple of issues for the Medical Humanities Network website, and I dealt with a couple of issues with legacy websites. For SWAP I deleted the input forms, as these were sending spam to Carole, and I also fixed an encoding issue with the Emblems websites that had crept in when the sites were moved to a new server.
I also heard this week that IT Services are going to move all project websites to HTTPS from HTTP. This is really good news as Google has started to rank plain HTTP sites lower than HTTPS sites, plus Firefox and Chrome give users warnings about HTTP websites. Chris wanted to try migrating one of my sites to HTTPS and we did this for the Scots Corpus. There were some initial problems with the certificate not working for the ‘www’ subdomain but Chris quickly fixed this and everything appeared to be working fine. Unfortunately, although everything was fine within the University network, the University’s firewall was blocking HTTPS requests from external users, meaning no-one outside of the University network could access the site. Thankfully someone contacted Wendy about this and Chris managed to get the firewall updated.
I also did a couple of tasks for the SCOSYA project, and spoke to Gary about the development of the front-end, which I think is going to need to start soon. Gary is going to try and set up a meeting with Jennifer about this next week. On Friday afternoon I attended a workshop about digital editions that Sheila Dickson in German had organised. There were talks about the Cullen project, the Curious Travellers project, and Sheila’s Magazin zur Erfahrungsseelenkunde project. It was really interesting to hear about these projects and their approaches to managing transcriptions.
This was the fourth and final week of strike action and I was therefore off work all week.
This was the third week of the strike action and I therefore only worked on Friday. I started the day making a couple of further tweaks to the ‘Storymap’ for the RNSN project. I’d inadvertently uploaded the wrong version of the data just before I left work last week, which meant the embedded audio players weren’t displaying, so I fixed that. I also added a new element language to the REELS database and added the new logo to the SPADE project website (see http://spade.glasgow.ac.uk/).
With these small tasks out of the way I spent the rest of the day on Historical Thesaurus and Linguistic DNA duties. For the HT I had previously created a ‘fixed’ header that appears at the top of the page when you start scrolling down, so you can always see what it is you’re looking at and quickly jump to other parts of the hierarchy. You can also click on a subcategory to select it, which adds the subcategory ID to the URL, allowing you to quickly bookmark or cite a specific subcategory. I made this live today, and you can test it out here: http://historicalthesaurus.arts.gla.ac.uk/category/#id=157035. I also fixed a layout bug that was making the quick search box appear in less than ideal places on certain screen widths, and I updated the display of the category and tree on narrow screens: now the tree is displayed beneath the category information and a ‘jump to hierarchy’ button appears. This, in combination with the ‘top’ button, makes navigation much easier on narrow screens.
I then started looking at the tagged EEBO data. This is a massive dataset (about 50GB of text files) that contains each word in a subset of EEBO that has been semantically tagged. I need to extract frequency data from this dataset, i.e. how many times each tag appears both in each text and overall. I have initially started to tackle this using PHP and MySQL as these are the tools I know best. I’ll see how feasible such an approach is, and if it’s going to take too long to process the whole dataset I’ll investigate using parallel computing and shell scripts, as I did for the Hansard data. I managed to get a test script working that goes through one of the files in about a second, which is encouraging. I did encounter a bit of a problem processing the lines, though. Each line is tab delimited and, rather annoyingly, PHP’s fgetcsv function doesn’t treat ‘empty’ tabs as separate columns. This was giving me really weird results: if a row had any empty tabs, the data I was expecting to appear in certain columns wasn’t there. Instead I had to use the ‘explode’ function on each line, splitting it up by the tab character (\t), and this thankfully worked. I still need confirmation from Fraser that I’m extracting the right columns, as strangely there appear to be thematic heading codes in multiple columns. Once I have confirmation I’ll be able to set the script running on the whole dataset (once I’ve incorporated the queries for inserting the frequency data into the database I’ve created).
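To illustrate the tab-splitting issue (sketched in Python here rather than PHP): splitting explicitly on the tab character preserves empty fields, so column positions stay stable even when some columns are blank, which is exactly what explode("\t", $line) gives you and what a CSV-style parser that collapses empty fields does not:

```python
# An example tab-delimited line with two empty columns in the middle;
# the column values themselves are made up for illustration.
line = "word\tPOS\t\t\tAA:01\n"
cols = line.rstrip("\n").split("\t")
# cols == ['word', 'POS', '', '', 'AA:01'] - empty columns kept in place,
# so cols[4] reliably holds the fifth column.
```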
This week was the second week of the UCU strike action, meaning I only worked on Thursday and Friday. Things were further complicated by the heavy snow, which meant the University was officially closed from Wednesday to Friday. However, I usually work from home on Thursdays anyway, so I just worked as I normally would. And on Friday I travelled into work without too much difficulty in order to participate in some meetings that had been scheduled.
I spent most of Thursday working on the REELS project, making tweaks to the database and content management system and working on the front end. I updated the ‘parts of speech’ list that’s used for elements, adding in ‘definite article’ and ‘preposition’, and also added in the full text in addition to the abbreviations to avoid any confusion. Last week I added ‘unknown’ to the elements database, with ‘na’ for the language. Carole pointed out that ‘na’ was appearing as the language when ‘unknown’ was selected, which it really shouldn’t do, so I updated the CMS and the front-end to ensure that this is hidden. I also wrote a blog post about the technical development of the front end. It’s not gone live yet but once it has I’ll link through to it. I also updated the quick search so that it only searches current place-names, elements and grid references, and I’ve fixed the ‘altitude’ field in the advanced search so that you can enter more than 4 characters into it.
In addition to this I spent some of the day catching up with emails and I also gave Megan Coyer detailed instructions on how to use Google Docs to perform OCR on an image based PDF file. This is a pretty handy trick to know and it works very well, even on older printed documents (so long as the print quality is pretty good). Here’s how you go about it:
You need to go to Google Drive (https://drive.google.com) then drag and drop the PDF into there, which keeps it as a PDF. Then right click on the thumbnail of the PDF and select ‘Open With…’ and then select Google Docs and it converts it into text (a process which can take a while depending on the size of your PDF). You can then save the file, download it as a Word file etc.
After trudging through the snow on Friday morning I managed to get into my office for 9am, and worked through until 5 without a lunch break as I had so much to try and do. At 10:30 I had a meeting with Jane Stuart-Smith and Eleanor Lawson about revamping the Seeing Speech website. I spent about an hour before this meeting going through the website and writing down a list of initial things I’d like to improve, and during our very useful two-hour meeting we went through this list, and discussed some other issues as well. It was all very helpful and I think we all have a good idea of how to proceed with the developments. Jane is going to try and apply for some funding to do the work, so it’s not something that will be tackled straight away, but I should be able to make good progress with it once I get the go-ahead.
I went straight from this meeting to another one with Marc and Fraser about updates to the Historical Thesaurus and work on the Linguistic DNA project. This was another useful and long meeting, lasting at least another two hours. I can’t really go into much detail about what was discussed here, but I have a clearer idea now of what needs to be done for LDNA in order to get frequency data from the EEBO texts, and we have a bit of a roadmap for future Historical Thesaurus updates, which is good.
After these meetings I spent the rest of the day working on an updated ‘Storymap’ for Kirsteen’s RNSN project. This involved stitching together four images of sheet music to use as a ‘map’ for the story, updating the position of all of the ‘pins’ so they appeared in the right places, updating the images used in the pop-ups, embedding some MP3 files in the pop-ups and other such things. Previously I was using the ‘make a storymap’ tools found here: https://storymap.knightlab.com/ which meant all our data was stored on a Google server and referenced files on the Knightlab servers. This isn’t ideal for longevity, as if anything changes either at Google or Knightlab then our feature breaks. Also, I wanted to be able to tweak the code and the data. For these reasons I instead downloaded the source code and added it to our server, and grabbed the JSON datafile generated by the ‘make a’ tool and added this to our server too. This allowed me to update the JSON file to make an HTML5 Audio player work in the pop-ups and it will hopefully allow me to update the code to make images in the pop-ups clickable too.
This week marked the start of the UCU’s strike action, which I am participating in. This meant that I only worked from Monday to Wednesday. It was quite a horribly busy week as I tried to complete some of the more urgent things on my ‘to do’ list before the start of the strike, while other things I had intended to complete unfortunately had to be postponed. I spent some time on Monday writing a section containing details about the technical methodology for a proposal Scott Spurlock is intending to submit to the HLF. I can’t really say too much about it here, but it will involve crowdsourcing, and I therefore had to spend time researching the technologies and workflows that might work best for the project and then writing the required text. Also on Monday I discovered that the AHRC now has some guidance on its website about the switchover from Technical Plans to Data Management Plans. There are some sample materials and accompanying support documentation, which is very helpful. This can currently be found here: http://www.ahrc.ac.uk/peerreview/peer-review-updates-and-guidance/ although this doesn’t look like it will be a very permanent URL. Thankfully there will be a transition period up to the 29th of March during which proposals can be submitted with either a Technical Plan or a DMP. This will make things easier for a few projects I’m involved with.
Also on Monday Gary Thoms contacted me to say there were some problems with the upload facilities for the SCOSYA project, so I spent some time trying to figure out what was going on there. What has happened is that Google seem to have restricted access to their geocoding API, which the upload script connects to in order to get the latitude and longitude of the ‘display town’. Instead of returning data, Google was returning an error saying we had exceeded our quota of requests. This was because previously I was just connecting to their API without registering for an API key, which used to work just fine but now is intermittent. Keep refreshing this page: https://maps.googleapis.com/maps/api/geocode/json?address=Aberdour+scotland and you’ll see it returns data sometimes and an error about exceeding the quota other times.
After figuring this out I created an API account for the project with Google. If I pass the key they gave me in the URL this now bypasses the restrictions. We are allowed up to 2,500 requests a day and up to 5000 requests in 100 seconds (that’s what they say – not sure how that works if you’re limited to 2,500 a day) so we shouldn’t encounter a quota error again.
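Building the keyed request looks something like this. The endpoint and the address/key parameters are the ones used above; the key value itself is of course a placeholder, and the actual request and JSON handling are elided:

```python
import urllib.parse

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(address, key):
    """Build a geocoding request URL for a display town, with the API key
    appended so the request isn't subject to the anonymous quota."""
    return GEOCODE_ENDPOINT + "?" + urllib.parse.urlencode(
        {"address": address + " scotland", "key": key}
    )

# e.g. urllib.request.urlopen(geocode_url("Aberdour", my_key)) would then
# return JSON with the latitude and longitude of the town.
```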
Thankfully the errors Gary was encountering with a second file turned out to be caused by typos in the questionnaire: an invalid postcode had been given. There were issues with a third questionnaire, which was giving an error on upload without stating what the error was; this was odd, as I’d added in some fairly comprehensive error handling. After some further investigation it turned out to be caused by the questionnaire containing a postcode that didn’t actually exist. In order to get the latitude and longitude for a postcode my scripts connect to an external API, which returns the data in the ever so handy JSON format. However, a while ago the API I was connecting to started to go a bit flaky, and for this reason I added in a connection to a second external API if the first one gave a 404. But now the initial API has gone offline completely, and was taking ages to even return a 404, which was really slowing down the upload script. Not only that, but the second API didn’t handle ‘unknown’ postcode errors in the same way: the first API returned a nice error message, while the second one just returned an empty JSON file. This meant my error handler wasn’t picking up that there was a postcode error and thus gave no feedback. I have now completely dropped the first API and connect directly to the second one, which speeds up the upload script dramatically. I have also updated my error handler so it knows how to handle an empty JSON file from this API.
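The updated error handling amounts to treating an empty response the same as an explicit error. A Python sketch of the idea, with the response field names (lat/lng) being my own assumption rather than the second API’s actual format:

```python
import json

def parse_postcode_response(body):
    """Return (lat, lng) from a postcode API response, or None if the
    postcode could not be resolved. An empty body (what the second API
    returns for unknown postcodes) is treated as a postcode error."""
    if not body.strip():
        return None  # empty response = unknown postcode
    data = json.loads(body)
    if not data:  # '{}' or '[]' also mean no result
        return None
    try:
        return (data["lat"], data["lng"])
    except (KeyError, TypeError):
        return None
```

The upload script can then give the user a proper ‘postcode not found’ message whenever this returns None, instead of failing silently.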
On Tuesday I fixed a data upload error with the Thesaurus of Old English, spoke to Graeme about the AHRC’s DMPs and spent the morning working on the Advanced Search for the REELS project. Last week I had completed the API for the advanced search and had started on the front end, and this week I managed to complete the front end for the search, including auto-complete fields where required, and supplying facilities to export the search results in CSV and JSON format. There was a lot more to this task than I’m saying here but the upshot is that we have a search facility that can be used to build up some pretty complex queries.
On Tuesday afternoon we had a project meeting for the REELS project where I demonstrated the front end facilities and we discussed some further updates that would be required for the content management system. I tackled some of these on Wednesday. The biggest issue was with adding place-name elements to historical forms. If you created a new element through the page where elements are associated with historical forms an error was encountered that caused the entire script to break and display a blank page. Thankfully after a bit of investigation I figured out what was causing this and fixed it. I also implemented the following:
- Added gender to elements
- Added ‘Epexegetic’ to the ‘role’ list
- When adding new elements to a place-name or historical form no language is selected by default, meaning entering text into the ‘element’ field searches all languages. The language appears in brackets after each element in the returned list. Once selected the element’s language is then selected in the ‘language’ list. You can still select a language before typing in an element to limit the search to that specific language
- All traces of ‘ScEng’ have been removed
- I’d noticed that when no element order was specified, the various elements would sometimes appear in a random order on the ‘manage elements’ page. I’ve made it so that if no element order is entered, the elements always appear in the order in which they were originally added.
- When a historical form has been given elements these now appear in the table of historical forms on the ‘edit place’ page, so you can tell which forms already have elements (and what they are) without needing to load the edit a historical form page.
- Added an ‘unknown’ element. As all elements need a language I’ve assigned this to ‘Not applicable (na)’ for now.
Also on Wednesday I had to spend some time investigating why an old website of mine wasn’t displaying characters properly. This was caused by the site being moved to a new server a couple of weeks ago. It turned out that the page fragments (of which there are several thousand) are encoded as ANSI when they need to be UTF-8. I thought it would be a simple task to batch process the files to convert them, but doing something as simple as batch converting from ANSI to UTF-8 is proving to be stupidly difficult. I still haven’t found a way to do it. I tried following the Powershell example here: https://superuser.com/questions/113394/free-ansi-to-utf8-multiple-files-converter
But it turns out you can only convert to UTF-8 with a BOM, which adds bad characters to the start of the file as displayed on the website, and there’s no easy way to get it without the BOM, as discussed here: https://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom
I then followed some of the possible methods listed here: https://gist.github.com/dogancelik/2a88c81d309a753cecd8b8460d3098bc UTFCast used to offer a free ‘lite’ version that would have worked, but now they only offer the paid version, plus a demo. I’ve installed the demo, but it too only allows conversion to UTF-8 with a BOM. I got a macro working in Notepad++, but it turns out macros are utterly pointless, as you can’t set them to run on multiple files at once: you need to open each file and then play the macro each time. I also installed the Python Script plugin for Notepad++ and tried to run the script listed on the above page, but nothing happened at all, not even an error message. It was all very frustrating and I had to give up due to a lack of time. Graeme (who was also involved in this project back in the day) has an old program that can do the batch converting and he gave me a copy, so I’ll try it when I get the chance.
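For what it’s worth, the batch conversion that proved so awkward in Powershell and Notepad++ is straightforward if you have Python to hand: read each fragment as Windows-1252 (what Windows tools usually mean by ‘ANSI’) and write it back as UTF-8, which Python emits without a BOM by default. The directory path would of course be the folder of page fragments; this is a sketch rather than what I actually ran:

```python
import os

def convert_ansi_to_utf8(directory):
    """Re-encode every file under directory from Windows-1252 to UTF-8
    (written without a BOM, which is Python's default)."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="cp1252") as f:
                text = f.read()
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
```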
So that was my three-day week. Next week I’ll be on strike on Monday to Wednesday so will be back at work on Thursday.