Week Beginning 22nd October 2018

I returned to work on Monday after being off last week.  As usual there were a bunch of things waiting for me to sort out when I got back, so most of Monday was spent catching up with things.  This included replying to Scott Spurlock about his Crowdsourcing project, responding to a couple of DSL related issues, updating access restrictions on the SPADE website, reading through the final versions of the DMP and other documentation for Matt Sangster and Katie Halsey’s project, updating some details on the Medical Humanities Network website, responding to a query about the use of the Thesaurus of Old English and speaking to Thomas Clancy about his Iona proposal.

With all that out of the way I returned to the OED / HT data linking issues for the Historical Thesaurus.  In my absence last week Marc and Fraser had made some further progress with the linking, and had made further suggestions as to what strategies I should attempt to implement next.  Before I left I was very much in the middle of working on a script that matched words and dates, and I hadn’t had time to figure out why this script was bringing back no matches.  It turns out the HT ‘fulldate’ field was using long dashes, whereas I was joining the OED GHT dates with a short dash.  So all matches failed.  I replaced the long dashes with short ones and the script then displayed 2733 ‘full matches’ (where every stripped lexeme and its dates match) and 99 ‘partial matches’ (where more than 6 and 80% match both dates and stripped lexeme text).  I also added in a new column that counts the number of matches not including dates.
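To illustrate the dash problem, here is a minimal Python sketch of the sort of normalisation that gets the two date formats to agree. The function names and sample values are made up for the example, and it assumes the dates can be compared as plain strings once the dashes are consistent; the real script may well do this differently.

```python
import re

# Dash-like characters that may turn up in a date range
# (hyphen variants, figure dash, en dash, em dash, minus sign).
LONG_DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2212]")


def normalise_date_range(fulldate: str) -> str:
    """Swap any long dash for a plain hyphen and trim spaces around it."""
    with_hyphens = LONG_DASHES.sub("-", fulldate.strip())
    return re.sub(r"\s*-\s*", "-", with_hyphens)


def dates_match(ht_fulldate: str, oed_start: str, oed_end: str) -> bool:
    """Compare an HT 'fulldate' value with OED dates joined by a short dash."""
    return normalise_date_range(ht_fulldate) == f"{oed_start}-{oed_end}"
```

Without the normalisation step a comparison such as `"1205\u20131450" == "1205-1450"` is always false, which is exactly why every match failed.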

Marc had alerted me to an issue where the number of OED matches was coming back as more than 100% so I then spent some time trying to figure out what was going on here.  I updated both the ‘with dates’ and ‘no date check’ versions of the lexeme pattern matching scripts to add in the text ‘perc error’ to any percentage that’s greater than 100, to more easily search for all occurrences.  There are none to be found in the script with dates, as matches are only added to the percentage score if their dates match too.  On the ‘no date check’ script there are several of these ‘perc error’ rows and they’re caused for the most part by a stripped form of the word being identical to an existing non-stripped form.  E.g. there are separate lexemes ‘she’ and ‘she-‘ in the HT data, and the dash gets stripped, so ‘she’ in the OED data ends up matching two HT words.  There are some other cases that look like errors in the original data, though.  E.g. in OED catid 91505 severity there’s the HT word ‘hard (OE-)’ and ‘hard (c1205-)’ and we surely shouldn’t have this word twice.  Finally there are some forms where stripping out words results in duplicates – e.g. ‘pro and con’ and ‘pro or con’ both end up as ‘pro con’ in both OED and HT lexemes, leading to 4 matches where there should only be 2.  There are no doubt situations where the total percentage is pushed over the 80% threshold or to 100% by a duplicate match – any duplicate matches where the percentage doesn’t get over 100 are not currently noted in the output.  This might need some further work. Or, as I previously said, with the date check incorporated the duplicates are already filtered out, so it might not be so much of an issue.
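The over-100% problem can be reproduced with a small Python sketch. The stripping rule here is only a guess at the behaviour described above (lower-casing, dropping 'and' / 'or' and punctuation), and both percentage functions are hypothetical rather than the actual scripts. The second one uses a multiset intersection so that each OED form can only be matched as many times as it occurs, which is one possible way of capping the score.

```python
import re
from collections import Counter


def strip_lexeme(word: str) -> str:
    """Assumed stripping rule: lower-case, drop 'and'/'or', drop punctuation."""
    word = re.sub(r"\s+(?:and|or)\s+", " ", word.lower())
    word = re.sub(r"[^\w\s]", "", word)
    return " ".join(word.split())


def naive_percentage(oed_words, ht_words):
    """Counts every HT form equal to each OED form, so duplicates inflate the score."""
    ht = [strip_lexeme(w) for w in ht_words]
    matches = sum(ht.count(strip_lexeme(w)) for w in oed_words)
    return 100 * matches / len(oed_words)


def capped_percentage(oed_words, ht_words):
    """Multiset intersection: a stripped form matches at most as often as it occurs."""
    oed = Counter(strip_lexeme(w) for w in oed_words)
    ht = Counter(strip_lexeme(w) for w in ht_words)
    matches = sum((oed & ht).values())
    return 100 * matches / len(oed_words)
```

With OED word 'she' and HT words 'she' and 'she-', the naive version reports 200% while the capped version reports 100%. The same goes for 'pro and con' and 'pro or con' appearing on both sides.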

I also then moved on to a new script that looks at monosemous forms.  This script gets all of the unmatched OED categories that have a POS and at least one word and for each of these categories it retrieves all of the OED words.  For each word the script queries the OED lexeme table to get a count of the number of times the word appears.  Note that this is the full word, not the ‘stripped’ form, as the latter might end up with erroneous duplicates, as mentioned above.  Each word, together with its OED date and GHT dates (in square brackets) and a count of the number of times it appears in the OED lexeme table is then listed.  If an OED word only appears once (i.e. is monosemous) it appears in bold text.  For each of these monosemous words the script then queries the HT data to find out where and how many times each of these words appears in the unmatched HT categories.  All queries keep to the same POS but otherwise look at all unmatched categories, including those without an OEDmaincat.  Four different checks are done, with results appearing in different columns: HT words where full word (not the stripped variety) matches and the GHT start date matches the HT start date; failing that, HT words where the full word matches but the dates don’t; failing either of these, HT words where the stripped forms of the words match and the dates match; failing all these, HT words where the stripped forms match but the dates don’t.  For each of these the HT catid, OEDmaincat (or the text ‘No Maincat’ if there isn’t one), subcat, POS, heading, lexeme and fulldate are displayed.  There are lots of monosemous words that just don’t appear in the HT data.  These might be new additions or we might need to try pattern matching.  Also, sometimes words that are monosemous in the OED data are polysemous in the HT data.  These are marked with a red background in the data (as opposed to green for unique matches).  Examples of these are ‘sedimental’, ‘meteorologically’, ‘of age’.  
Any category that has a monosemous OED word that is polysemous in the HT has a red border.  I also added in some stats below the table.  In our unmatched OED categories there are 24184 monosemous forms.  There are 8086 OED categories that have at least one monosemous form that matches exactly one HT form.  There are 220 OED monosemous forms that are polysemous in the HT.  Now we just need to decide how to use this data.
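The four-step fallback described above can be sketched in Python as follows. This is only an outline under some assumptions: each word is represented as a dictionary with hypothetical keys 'word', 'stripped' and 'start', the HT rows passed in are already limited to unmatched categories with the same POS, and the labels are invented for the example.

```python
from collections import Counter


def monosemous_words(all_oed_words):
    """Return the set of full (unstripped) words that appear exactly once."""
    return {word for word, count in Counter(all_oed_words).items() if count == 1}


def classify_matches(oed, ht_rows):
    """Try the four checks in order and return the first one that finds anything."""
    checks = [
        ("full word and date",
         lambda h: h["word"] == oed["word"] and h["start"] == oed["start"]),
        ("full word only",
         lambda h: h["word"] == oed["word"]),
        ("stripped form and date",
         lambda h: h["stripped"] == oed["stripped"] and h["start"] == oed["start"]),
        ("stripped form only",
         lambda h: h["stripped"] == oed["stripped"]),
    ]
    for label, test in checks:
        hits = [h for h in ht_rows if test(h)]
        if hits:
            return label, hits
    return None, []


def is_polysemous_in_ht(hits):
    """More than one HT hit for a monosemous OED word (the red rows)."""
    return len(hits) > 1
```

Returning as soon as one check succeeds mirrors the 'failing that' ordering, so a word never shows up in more than one column.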

Also this week I looked into an issue one of the REELS team was having when accessing the content management system (it turns out that some anti-virus software was mislabelling the site as having some kind of phishing software in it), and responded to a query about the Decadence and Translation Network website I’d set up.  I also started to look at sourcing some Data Management Plans for an Arts Lab workshop that Dauvit Broun has asked me to help with next week, and I started to prepare my presentation for the Digital Editions workshop next week, which took a fair amount of time.  I also met with Jennifer Smith and a new member of the SCOSYA project team on Friday morning to discuss the project and to show the new member of staff how the content management system works.  It looks like my involvement with this project might be starting up again fairly soon.

On Tuesday Jeremy Smith contacted me to ask me to help out with a very last minute proposal that he is putting together.  I can’t say much about the proposal, but it had a very tight deadline and required rather a lot of my time from the middle of the week onwards (and even into the weekend).  This involved lots of email exchanges, time spent reading documentation, meeting with Luca, who might be doing the technical work for the project if it gets funded, and writing a Data Management Plan for the project.  This all meant that I was unable to spend time working on other projects I’d hoped to work on this week, such as the Bilingual Thesaurus.  Hopefully I’ll have time to get back into this next week, once the workshops are out of the way.

Week Beginning 3rd September 2018

It was back to normality this week after last week’s ICEHL conference.  I had rather a lot to catch up with after being out of the office for four days last week and spending the fifth writing up my notes.  I spent about a day thinking through the technical issues for an AHR proposal Matthew Sangster is putting together and then writing a first version of the Data Management Plan.  I also had email conversations with Bryony Randall and Dauvit Broun about workshops they’re putting together that they each want me to participate in.  I responded to a query from Richard Coates at Bristol who is involved with the English Place-Name Society about a database related issue the project is experiencing, and I also met with Luca a couple of times to help him with an issue related to using OpenStreetMap maps offline.  Luca needed to set up a version of a map-based interface he has created that works offline, so he needed to download the map tiles for offline use.  He figured out that this is possible with the Marble desktop mapping application (https://marble.kde.org/) but couldn’t figure out where the map tiles were stored.  I helped him to figure this out, and also to fix a couple of JavaScript issues he was encountering.  I was concerned that he’d have to set up a locally hosted map server for his JavaScript to connect to, but thankfully it turns out that all of the processing is done at the JavaScript end, and all you need is the required directory / subdirectory structure for map tiles and the PNG images themselves stored in this structure.  It’s good to know for future use.
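For reference, here is a small Python sketch of how a tile's location in such a directory structure can be worked out. It uses the standard OpenStreetMap 'slippy map' tile numbering and assumes a zoom / x / y.png layout, which is the common convention but is an assumption here: the layout Marble actually uses on disk may differ, and the function names are invented for the example.

```python
import math
from pathlib import PurePosixPath


def tile_for(lat: float, lon: float, zoom: int):
    """Standard slippy-map tile numbers for a WGS84 coordinate at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that coordinates on the far edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_path(root: str, lat: float, lon: float, zoom: int) -> str:
    """Relative path of the PNG for that tile, assuming a zoom/x/y.png layout."""
    x, y = tile_for(lat, lon, zoom)
    return str(PurePosixPath(root) / str(zoom) / str(x) / f"{y}.png")
```

Because the path is worked out entirely from the coordinates and zoom level, the browser only needs the static files to be in the right place, which is why no map server is required.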

I also responded to queries from Sarah Phelan regarding the Medical Humanities Network and Kirsteen McCue about her Romantic National Song Network.  Eleanor Lawson also got in touch with some text for one of the redesigned Seeing Speech website pages, so I added that.  It also transpired that she had sent me a document containing lots of other updates in June, but I’d never received the email.  It turns out she had sent it to a Brian Aitken at her own institution (QMU) rather than me.  She sent the document on to me again and I’ll hopefully have some time to implement all of the required changes next week.

I also investigated an issue Thomas Clancy is having with his Saints Places website.  The Google Maps used throughout the website are no longer working.  After some investigation it would appear that Google is now charging for using its maps service.  You can view information here: https://cloud.google.com/maps-platform/user-guide/.  So you now have to set up an account with a credit card associated with it to use Google Maps on your website.  Google offer $200 worth of free usage, and I believe you can set a limit that would mean if usage goes over that amount the service is blocked until the next monthly period.  Pricing information can be found here: https://cloud.google.com/maps-platform/pricing/sheet/.  The maps on the Saints website are ‘Dynamic Maps’, and although the information is pretty confusing I think the table on the above page says that the $200 of free credit would cover 28,000 loads of a map on the Saints website per month (the cost is $7 per 1000 loads), and every time a user loads a page with a map on it this is one load, so one user looking at several records will log multiple map loads.
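The arithmetic behind that estimate is simple enough to check. This is a throwaway Python sketch using only the two figures quoted above ($200 of credit and $7 per 1000 loads); the function name is invented and it ignores any other pricing tiers Google may apply.

```python
def free_map_loads(monthly_credit_usd: int, cost_per_thousand_usd: int) -> int:
    """Whole number of map loads covered by the monthly free credit."""
    return monthly_credit_usd * 1000 // cost_per_thousand_usd


# 200 * 1000 // 7 gives 28571, i.e. roughly the 28,000 loads mentioned above.
loads = free_map_loads(200, 7)
```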

This isn’t something I can fix and it has worrying implications for projects that have fixed periods of funding but need to continue to be live for years or decades after the period of funding.  It feels like a very long time since Google’s motto was “Don’t be evil” and I’m very glad I moved over to using the Leaflet mapping library rather than Google a few years ago now.

I also spent a bit of time making further updates to the new Place-names of Kirkcudbrightshire website, creating some place-holder pages for the public website, adding in the necessary logos and a background map image, updating the parish three-letter acronyms in the database and updating the map in the front-end so that it defaults to showing the right part of Scotland.

I was engaged in some App related duties this week too, communicating with Valentina Busin in MVLS about publishing a student-created app.  Pamela Scott in MVLS also contacted me to say that her ‘Molecular Methods’ app had been taken off the Android App store.  After logging into the UoG Android account I found a bunch of emails from Google saying that about 6 of our apps had been taken down because they didn’t include a ‘child-directed declaration’.  Apparently this is a new thing that was introduced and you have to tick a checkbox in the Developer console to say whether your app is primarily aimed at under 13 year-olds.  Once that’s done your app gets added back to the store.  I did this for the required apps and all was put right again about an hour later.

I spent about a day this week working on Historical Thesaurus duties.  I set up a new ‘colophon’ page that will list all of the technologies we use on the HT website and I also returned to the ongoing task of aligning the HT and OED data.  I created new fields for the HT and OED category and word tables to contain headings / words that are stripped of all non-alphanumeric characters (including spaces) and also all occurrences of ‘ and ‘ and ‘ or ‘ (with spaces round them).  I also converted the text into all lower case.  This means a word such as “in spite of/unþonc/maugre/despite one’s teeth” will be stored in the field as “inspiteofunþoncmaugredespiteonesteeth”.  The idea is that it will be easier to compare HT and OED data with such extraneous information stripped out.  With this in place I then ran a script that goes through all of the unmatched categories and finds any where the oedmaincat matches OED path, subcat matches OED sub, the part of speech matches and the ‘stripped’ headings match.  This has identified 1556 new matches, which I’ve now logged in the database.  This brings the total unmatched HT categories down to 10,478 (of which 1679 have no oedmaincat and presumably can’t be matched).  The total number of unmatched OED categories is 13,498 (of which 8406 have no pos and so will probably never match an HT category).  There are also a further 920 potential matches where the oedmaincat matches the path, the pos matches and the ‘stripped’ headings match, but the subcat numbers are different.  I’ll need to speak to Marc and Fraser about these next week.
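A minimal Python sketch of the stripping and comparison described above is given below. It is an illustration rather than the actual script: the function names and dictionary keys are hypothetical, and it assumes 'alphanumeric' includes non-ASCII letters such as þ, which the example above suggests.

```python
import re


def strip_heading(text: str) -> str:
    """Lower-case, drop ' and ' / ' or ', then drop every non-alphanumeric character."""
    text = text.lower()
    text = re.sub(r" (?:and|or) ", " ", text)
    # str.isalnum() is Unicode-aware, so letters such as thorn survive.
    return "".join(ch for ch in text if ch.isalnum())


def categories_match(ht: dict, oed: dict) -> bool:
    """All four conditions: maincat/path, subcat/sub, part of speech, stripped heading."""
    return (
        ht["oedmaincat"] == oed["path"]
        and ht["subcat"] == oed["sub"]
        and ht["pos"] == oed["pos"]
        and strip_heading(ht["heading"]) == strip_heading(oed["heading"])
    )
```

Calling `strip_heading("in spite of/unþonc/maugre/despite one’s teeth")` gives `inspiteofunþoncmaugredespiteonesteeth`. Dropping the subcat condition from `categories_match` would give the looser comparison behind the 920 potential matches.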

I spent most of Friday working on setting up the system for the ‘Records of Govan Old’ crowdsourcing site for Scott Spurlock.  Although it’s not completely finished things are beginning to come together.  It’s a system that’s based on the ‘Scripto’ crowdsourcing tool (http://scripto.org/) that uses Omeka and MediaWiki to manage data and versioning.  The interface I’ve set up is pretty plain at the moment but I’ve set up a couple of sample pages with placeholder text (Home and About).  It’s also possible to browse collections – currently there is only one collection (Govan old images) but this could be used to have different collections for different manuscripts, for example.  You can then view items in the collection, or from the menu choose ‘browse items’ to access all of them.

For now there are only two sample images in the system, which are images from a related manuscript that Scott previously gave me.  Users can create a user account via MediaWiki.  If you then go to the ‘Browse items’ page and select one of the images to transcribe you can view the image in a zoomable / pannable image viewer, view any existing transcription that’s been made and view the history of changes made.  If you press the ‘edit’ link a section will open that allows you to edit the transcription and add your own.

I’ve added in a bunch of buttons that place tags in the transcription area when they’re clicked on.  They’re TEI tags so eventually (hopefully) we’ll be able to shape the texts into valid TEI XML documents.  All updates made by users are tracked and you can view all previous versions of the transcriptions, so if anyone comes along and messes things up it’s easy to revert to an earlier version.  There’s also an admin interface where you can view the pages and ‘protect’ them, which prevents future edits being made by anyone other than admin users.

There’s still a lot to be done with this.  For example, at the moment it’s possible to add any tags and HTML to the transcription, which we want to prevent for security reasons as much as anything else.  The ‘wiki’ that sits behind the transcription interface (which you see when creating an account) is also open for users to edit and mess up so that needs to be locked down too.  I also want to update the item lists so that they display which items have not been transcribed, which have been started and which have been ‘protected’, to make it easier for users to find something to work on.  I need to get the actual images that we’ll use in the tool before I do much more with this, I reckon.
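One possible way of restricting the tags is an allow-list, sketched below in Python. This is not how Scripto or MediaWiki handle it and doesn't use their APIs. The set of permitted TEI element names is hypothetical, and the approach only lets through bare tags with no attributes, escaping everything else.

```python
import html
import re

# Hypothetical subset of TEI elements the buttons might insert.
ALLOWED_TAGS = {"add", "del", "unclear", "gap", "lb"}

# Matches only simple tags without attributes, e.g. <del>, </del> or <lb/>.
SIMPLE_TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\s*/?>")


def sanitise_transcription(text: str) -> str:
    """Keep allow-listed bare tags and escape all other markup."""
    parts = []
    position = 0
    for match in SIMPLE_TAG.finditer(text):
        parts.append(html.escape(text[position:match.start()]))
        if match.group(1) in ALLOWED_TAGS:
            parts.append(match.group(0))
        else:
            parts.append(html.escape(match.group(0)))
        position = match.end()
    parts.append(html.escape(text[position:]))
    return "".join(parts)
```

Anything with attributes (a script tag with a src, a link with an href and so on) never matches the simple pattern, so it ends up escaped along with the rest of the text.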

Week Beginning 19th March 2018

With the strike action over (for now, at least) I returned to a full week of work, and managed to tackle a few items that had been pending for a while.  I’d been asked to write a Technical Plan for an AHRC application for Faye Hammill in English Literature, but since then the changeover from four-page, highly structured Technical Plans to two-page more free-flowing Data Management Plans has taken place.  This was a good opportunity to write an AHRC Data Management Plan, and after following the advice on the AHRC website (http://www.ahrc.ac.uk/documents/guides/data-management-plan/) and consulting the additional documentation on the DCC’s DMPonline tool (https://dmponline.dcc.ac.uk/) I managed to write a plan that covered all of the points.  There are still some areas where I need further input from Faye, but we do at least have a first draft now.

I also created a project website for Anna McFarlane’s British Academy funded project.  The website isn’t live yet, so I can’t include the URL here, but Anna is happy with how it looks, which is good.  After sorting that out I then returned to the REELS project.  I created the endpoints in the API that would allow the various browse facilities we had agreed upon to function, and then built these features in the front-end.  It’s now possible to (for example) list all sources and see which has the most place-names associated with it, or bring up a list of all of the years in which historical forms were first attested.

I spent quite a bit of time this week working on the extraction of words and their thematic headings from EEBO for the Linguistic DNA project.  Before the strike I’d managed to write a script that went through a single file and counted up all of the occurrences of words, parts of speech and associated thematic headings, but I was a little confused that there appeared to be thematic heading data in column 6 and also column 10 of the data files.  Fraser looked into this and figured out that the most likely thematic heading appeared in column 10, while other possible ones appeared in column 6.  This was a rather curious way to structure the data, but once I knew about it I could set my script to focus on column 10, as we’re only interested in the most likely thematic heading.

I updated my script to insert data into a database rather than just hold things temporarily in an array, and I also wrapped the script in another function that then applied the processing to every file in a directory rather than just a single file.  With this in place I set the script running on the entire EEBO directory.  I was unsure whether running this on my desktop PC would be fast enough, but thankfully the entire dataset was processed in just a few hours.
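The general shape of this kind of processing is sketched below in Python. Several things are assumptions: that the files are tab-separated text, that the word and part of speech sit in the columns given by the hypothetical `WORD_COL` and `POS_COL` constants, and that the files match a `*.txt` pattern. Only the use of column 10 for the most likely thematic heading comes from the description above.

```python
from collections import Counter
from pathlib import Path

# 0-based column indexes. Column 10 in the data is index 9.
# The word and part-of-speech positions are guesses for the example.
WORD_COL, POS_COL, HEADING_COL = 0, 2, 9


def count_file(path, counts: Counter) -> None:
    """Add one file's (word, pos, heading) occurrences to the running counts."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            columns = line.rstrip("\n").split("\t")
            if len(columns) > HEADING_COL:
                key = (columns[WORD_COL], columns[POS_COL], columns[HEADING_COL])
                counts[key] += 1


def count_directory(directory, pattern: str = "*.txt") -> Counter:
    """Apply the single-file processing to every matching file in a directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        count_file(path, counts)
    return counts
```

For a dataset of this size the counts would be written to a database rather than held in memory, but the wrapping of a per-file function in a per-directory one is the same.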

My script finished processing all 14,590 files that I had copied from the J drive to my local PC, resulting in a whopping 70,882,064 rows entered into my database.  Everything seemed to be going very well, but Fraser wasn’t sure I had all of the files, and he was correct.  Having checked the J drive, there were 25,368 items, so when I had copied the files across the process must have silently failed at some point.  And even more annoyingly it didn’t fail in an orderly manner.  E.g. the earliest file I have on my PC is A00018 while there are several earlier ones on the J drive.

I copied all of the files over again and decided that rather than dropping the database and starting from scratch I’d update my script to check to see whether a file had already been processed, meaning that only the missing 10,000 or so would be dealt with.  However, in order to do this the script would need to query a 70 million row database for the ‘filename’ column, which didn’t have an index.  I began the process of creating an index, but indexing 70 million rows took a long time – several hours, in fact.  I almost gave up and inserted all the data again from scratch, but the thing is I knew I would need this index in order to query the data anyway, so I decided to persevere.  Thankfully the index finally finished building and I could then run my script to insert the missing 10,000 files, a process that took a bit longer as the script now had to query the database and also update the index as well as insert the data.  But finally all 25,368 files were processed, resulting in 103,926,008 rows in my database.
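The 'skip what's already there' approach can be sketched with Python's built-in sqlite3 module. The database engine, table name and columns here are all assumptions made for the example (the description above doesn't say which database was used); the point is simply the index on the filename column and the check before inserting.

```python
import sqlite3


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) tokens table and an index on its filename column."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens "
        "(filename TEXT, word TEXT, pos TEXT, heading TEXT, freq INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_tokens_filename ON tokens (filename)")
    return conn


def already_processed(conn: sqlite3.Connection, filename: str) -> bool:
    """True if any row exists for this file. The index keeps this lookup fast."""
    row = conn.execute(
        "SELECT 1 FROM tokens WHERE filename = ? LIMIT 1", (filename,)
    ).fetchone()
    return row is not None


def insert_file(conn: sqlite3.Connection, filename: str, rows) -> int:
    """Insert (word, pos, heading, freq) rows for a file unless it is already present."""
    if already_processed(conn, filename):
        return 0
    rows = list(rows)
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
        [(filename, *row) for row in rows],
    )
    conn.commit()
    return len(rows)
```

A separate small table listing processed filenames would avoid needing the big index for this particular check, but as the index is needed for querying anyway it does double duty here.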

The script and the data are currently located on my desktop PC, but if Fraser and Marc want to query it I’ll need to get this migrated to a web server of some sort, so I contacted Chris about this.  Chris said he’d sort a temporary solution out for me, which is great.  I then set to work writing another script that would extract summary information for the thematic headings and insert this into another table.  After running the script this table now contains a total count of each word / part of speech / thematic heading across the entire EEBO collection.  Where a lemma appears with multiple parts of speech these are treated as separate entities and are not added together.  For example, ‘AA Creation NN1’ has a total count of 4609 while ‘AA Creation NN2’ has a total count of 19, and these are separate rows in the table.
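In SQL terms the summary step is essentially a GROUP BY, as in the following Python sqlite3 sketch. The table and column names are hypothetical, and it assumes each row of the source table holds a per-file frequency for a word / part of speech / heading combination, which may not match the real structure.

```python
import sqlite3


def build_summary(conn: sqlite3.Connection) -> None:
    """Total each word / pos / heading combination across every file."""
    conn.execute("DROP TABLE IF EXISTS summary")
    conn.execute(
        "CREATE TABLE summary AS "
        "SELECT heading, word, pos, SUM(freq) AS total "
        "FROM tokens GROUP BY heading, word, pos"
    )
    conn.commit()


def total_for(conn: sqlite3.Connection, heading: str, word: str, pos: str) -> int:
    """Look up one combination's total, or 0 if it never occurs."""
    row = conn.execute(
        "SELECT total FROM summary WHERE heading = ? AND word = ? AND pos = ?",
        (heading, word, pos),
    ).fetchone()
    return row[0] if row else 0
```

Including the part of speech in the grouping is what keeps the NN1 and NN2 rows separate rather than adding them together.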

Whilst working with the data I noticed that a significant amount of it is unusable.  Of the almost 104 million rows of data, over 20 million have been given the heading ’04:10’ and a lot of these are words that probably could have been cleaned up before the data was fed into the tagger.  A lot of these are mis-classified words that have an asterisk or a dash at the start.  If the asterisk / dash had been removed then the word could have been successfully tagged.  E.g. there are 88 occurrences of ‘*and’ that have been given the heading ’04:10’ and part of speech ‘FO’.  Basically about a fifth of the dataset has an unusable thematic heading, and much of this is data that could have been useful if it had been pre-processed a little more thoroughly.

Anyway, after tallying up the frequencies across all texts I then wrote a script to query this table and extract a ‘top 10’ list of lemma / pos combinations for each of the 3,972 headings that are used.  The output has one row per heading and a column for each of the top 10 (or less if there are less than 10).  This currently has the lemma, then the pos in brackets and the total frequency across all 25,000 texts after a bar, as follows: christ (NP1) | 1117625.  I’ve sent this to Fraser and once he gets back to me I’ll proceed further.
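The 'top 10' extraction can be sketched in Python like this. The input is assumed to be (heading, lemma, pos, total) tuples from the summary table, and the function name and tie-breaking rule (alphabetical when totals are equal) are invented for the example.

```python
from collections import defaultdict


def top_lemmas(summary_rows, limit: int = 10) -> dict:
    """Map each heading to its most frequent lemma / pos combinations, formatted for output."""
    by_heading = defaultdict(list)
    for heading, lemma, pos, total in summary_rows:
        by_heading[heading].append((total, lemma, pos))

    report = {}
    for heading, entries in by_heading.items():
        # Highest total first. Ties are broken alphabetically so output is stable.
        entries.sort(key=lambda entry: (-entry[0], entry[1], entry[2]))
        report[heading] = [
            f"{lemma} ({pos}) | {total}" for total, lemma, pos in entries[:limit]
        ]
    return report
```

Headings with fewer than `limit` combinations simply end up with a shorter list, which corresponds to the partially filled rows in the output.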

In addition to the above big tasks, I also dealt with a number of smaller issues.  Thomas Widmann of SLD had asked me to get some DSL data from the API for him, so I sent that on to him.  I updated the ‘favicon’ for the SPADE website, I fixed a couple of issues for the Medical Humanities Network website, and I dealt with a couple of issues with legacy websites:  For SWAP I deleted the input forms as these were sending spam to Carole.  I also fixed an encoding issue with the Emblems websites that had crept in when the sites had been moved to a new server.

I also heard this week that IT Services are going to move all project websites to HTTPS from HTTP.  This is really good news as Google has started to rank plain HTTP sites lower than HTTPS sites, plus Firefox and Chrome give users warnings about HTTP websites.  Chris wanted to try migrating one of my sites to HTTPS and we did this for the Scots Corpus.  There were some initial problems with the certificate not working for the ‘www’ subdomain but Chris quickly fixed this and everything appeared to be working fine.  Unfortunately, although everything was fine within the University network, the University’s firewall was blocking HTTPS requests from external users, meaning no-one outside of the University network could access the site.  Thankfully someone contacted Wendy about this and Chris managed to get the firewall updated.

I also did a couple of tasks for the SCOSYA project, and spoke to Gary about the development of the front-end, which I think is going to need to start soon.  Gary is going to try and set up a meeting with Jennifer about this next week.  On Friday afternoon I attended a workshop about digital editions that Sheila Dickson in German had organised.  There were talks about the Cullen project, the Curious Travellers project, and Sheila’s Magazin zur Erfahrungsseelenkunde project.  It was really interesting to hear about these projects and their approaches to managing transcriptions.

Week Beginning 22nd January 2018

I spent much of this week working on the REELS project.  My first task was to implement a cross-reference system for place-names in the content management system.  When a researcher edits a place they now see a new ‘X-Refs’ field between ‘Research Notes’ and ‘Pronunciation’.  If they start typing the name of a place an autocomplete list appears featuring matching place-names and their current parishes.  Clicking on a name and then pressing the ‘edit’ button below the form then saves the new cross reference.  Multiple cross references can be added by pressing the ‘add another’ button to the right of the field.  When a cross reference has been added it is listed in this section as a link, allowing the researcher to jump to the referenced place-name and it’s also possible to delete a cross reference by pressing on the ‘delete’ button next to the place-name.  Cross references are set up to work both ways.  If the researcher adds a reference from ‘Abbey Park’ to ‘Abbey Burn’ then whenever s/he views the ‘Abbey Burn’ record the cross reference to ‘Abbey Park’ will also display, and deleting a reference from one place-name deletes it from the other too – i.e. one-way references can’t exist.
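One simple way to guarantee that one-way references can't exist is to store each cross reference as an unordered pair, as in this Python sketch. It illustrates the idea rather than the CMS code (which sits on top of a database), and the class and method names are hypothetical.

```python
class CrossReferences:
    """Stores each cross reference once, as an unordered pair of place-names."""

    def __init__(self):
        self._pairs = set()

    def add(self, place_a: str, place_b: str) -> None:
        """Link two places. A place can't reference itself."""
        if place_a != place_b:
            self._pairs.add(frozenset((place_a, place_b)))

    def remove(self, place_a: str, place_b: str) -> None:
        """Deleting from either side removes the single shared pair."""
        self._pairs.discard(frozenset((place_a, place_b)))

    def for_place(self, place: str) -> list:
        """All places linked to the given place, whichever side added the link."""
        return sorted(
            next(iter(pair - {place})) for pair in self._pairs if place in pair
        )
```

After `add('Abbey Park', 'Abbey Burn')`, calling `for_place('Abbey Burn')` returns `['Abbey Park']`, and removing the reference from either record removes it from both.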

I also fixed an issue with the CMS that Eila had alerted me to: historical forms that have no start dates but do feature end dates weren’t being listed with any dates at all.  It turned out I’d set things up so that dates were only processed if a start date was present, which was pretty easy to rectify.  For the rest of my time on the project I wrote a specification document for the front end features I will be developing for the project.  This took up a lot of the week, as I had to spend time thinking about the features the front end will include and how things like the search, browse and map will interoperate and function.  This has also involved trying out the various existing place-name resources and thinking about which aspects of these sites work or don’t work so well.
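The date bug was of the following general kind, shown here as a small Python sketch. The display format (a leading dash for forms with only an end date) is invented for the example and isn't necessarily what the CMS shows.

```python
def format_dates(start, end) -> str:
    """Format a historical form's dates, handling a missing start or end date."""
    if start and end:
        return f"{start}-{end}"
    if start:
        return str(start)
    if end:
        # This is the case that was being skipped: no start date, but an end date.
        return f"-{end}"
    return ""
```

The original logic effectively wrapped everything in a single 'if start' test, so the third branch was never reached.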

My initial specification document is just over 3000 words long and describes the features that the front end will include, the sorts of search and browse options that will be available, which fields will be displayed, how the map interface will work and such things. I emailed it to the rest of the team on Friday for feedback, which I will hopefully get during next week.  It is just an initial idea of how things work and once I actually get down to developing the site things might change, but it’s useful to get things down in writing at this stage just in case there’s anything I’ve missed or people would prefer features to work differently.  I hope to begin development of the features next week.

Also this week I spent a bit of time on the RNSN project.  I switched a few things around on the website and I also began working with some slides that Brianna had sent me.  We are going to make ‘stories’ about particular songs, and we’d decided to investigate a couple of existing tools in order to do this.  The first is a timeline library (https://timeline.knightlab.com/) while the second is similar to a timeline but works with maps instead (https://storymap.knightlab.com/).  Initially I created a timeline based on the slides, but I quickly realised that there weren’t really enough different dates across the slides for this to work very well.  There weren’t any specific places mentioned in the slides either, so it seemed like the storymap library wouldn’t be a good fit either.  However, I then remembered that storymap can be set up to work with images rather than a map as a base layer, allowing you to ‘pin’ slides onto specific parts of the image (see https://storymap.knightlab.com/gigapixel/).  Brianna sent me an image of a musical score that she wanted to use as a background image and I followed the steps required to create a tileset from this and set it up for use with the library.  The image wasn’t really of high enough quality to be used for this purpose, but as a test it worked just fine.  I then created the required slides, attached them to the image, added in images and sound files and we then had a test version of a story up and running.  It’s going to need some further work before it can be published, but it’s good to know that this approach is going to work.

I also had some Burns related duties to attend to this week, what with Burns’ Night being on Thursday.  We added some new songs to the Burns website (http://burnsc21.glasgow.ac.uk/) and I dealt with a request to use our tour maps on another blog (see https://blog.historicenvironment.scot/2018/01/burns-nicht/).

I met with Luca this week to discuss how he’s using Exist DB, XQuery and other XML technologies in order to create the Curious Travellers website.  I hadn’t realised that it was possible to use these technologies without any other server-side scripting language, but apparently it is possible for Exist to handle all of the page requests and output data in the required format for users (e.g. HTML or even JSON formatted data).  It was very interesting to learn a bit about how these technologies work.  We also had a chat about Joanna’s project, and I had an email conversation with her about how I might be involved in the project.

I made some further tweaks to the NRECT website for Stuart Gillespie, responded to a query from Megan Coyer about the management of the Medical Humanities Network website and met with Anna McFarlane to discuss putting together a blog for her new project.  I also updated all of the WordPress sites to the latest version as a new security release was made available this week.

Week Beginning 31st October 2016

This week was rather an unusual one due to my son being ill with the winter vomiting virus.  I had to take Tuesday off as annual leave at short notice to look after him and then I also ended up having to work from home on Wednesday in addition to my usual Thursday.  I spent quite a bit of Monday creating a new version of the ‘Essentials of Old English’ app that fixed an issue with the Glossary.  I’d started on this last Friday but ended up spending ages just installing updates and getting to the point where I could build new versions of the app.  Thankfully I managed to get this sorted on Monday, although it still took an awfully long time to build, test and deploy the iOS and Android versions.  However, they are now both available (version 1.2) through the App and Play stores.  I also spent a couple of hours on Monday replying to a rather detailed email for the REELS project and also spent a bit of time getting some of the Hansard sample data to Marc.

I was off work on Tuesday, which unfortunately meant rescheduling the meeting I was supposed to have with Marc and Fraser about updating the Historical Thesaurus with new data from the OED people.  This is going to take place next Monday now instead.  As I haven’t heard back from the SCOSYA people since the last updates I made I decided to hold off with any new developments here until after we have our next meeting (also next Monday), so this gave me some time I could spend on the ‘Basics of English Metre’ app.  I spent most of Wednesday, Thursday and some of Friday afternoon on this, and I’m very pleased to say I have now completed a first draft of the app.  It took a bit of time to get back into developing the app as it had been a while since I last worked on it.  There were also some rather tricky exercises that I needed to implement as part of Unit 3 of the app, which took some planning, testing and refining.  It feels very satisfying to have all of the exercises fully operational now, though.  The next stage will be to get people to test it, make required updates, create a ‘web’ version with the University website’s look and feel and then start the arduous wrapping and deploying of the actual app versions.  It feels like the end is in sight now, though, and I’ve already started to think about what old resource to tackle next.  Having said that, I still need to make Android versions of the ‘Grammar’ and ‘ARIES’ apps first.

On Friday I dealt with some issues relating to the University’s iOS developer account, created some text for Marc in relation to getting access to a collection of historical textual data and created a new feature for the Medical Humanities Network that lists the email addresses of all members in a format that can be pasted into Outlook.  Next week I’ll return to SCOSYA work and will no doubt be working on the Historical Thesaurus again.

Week Beginning 24th October 2016

I had a very relaxing holiday last week and returned to work on Monday.  When I got back to work I spent a bit of time catching up with things, going through my emails, writing ‘to do’ lists and things like that and once that was out the way I settled back down into some actual work.

I started off with Mapping Metaphor.  Wendy had noticed a bug with the drop-down ‘show descriptors’ buttons in the search results page, which I swiftly fixed.  Carole had also completed work on all of the Old English metaphor data, so I uploaded this to the database.  Unfortunately, this process didn’t go as smoothly as previous data uploads due to some earlier troubles with this final set of data (this data was the dreaded ‘H27’ data, which originally was one category but which was split into several smaller categories, which caused problems for Flora’s Access database that the researchers were using).

Rather than updating rows, the data upload added new ones, and this was because the ordering of cat1 and cat2 appeared to have been reversed since stage 4 of the data processing.  For example, in the database cat1 is ‘1A16’ and cat2 is ‘2C01’ but in the spreadsheet these are the other way round.  Thankfully this was consistently the case, so once identified it was easy to rectify the problem.  For Old English we now have a complete set of metaphorical connections, consisting of 2488 connections and 4985 example words.  I also fixed a slight bug in the category ordering for OE categories and replied to Wendy about a query she had received regarding access to the underlying metaphor data.  After that I updated a few FAQs and all was complete.
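To illustrate the kind of check that avoids duplicate rows when a category pair arrives in the opposite order, here is a minimal Python sketch.  It is not the upload script itself: the function name and the dictionary standing in for the database table are assumptions made purely for the example.

```python
def find_existing(db_rows, cat1, cat2):
    """Return the key of an existing connection, trying both orderings.

    db_rows is a dict keyed by (cat1, cat2) tuples; it is a hypothetical
    stand-in for the real database table.
    """
    if (cat1, cat2) in db_rows:
        return (cat1, cat2)
    if (cat2, cat1) in db_rows:
        # The spreadsheet has the pair reversed relative to the database
        return (cat2, cat1)
    return None  # genuinely new connection
```

With a lookup like this, a spreadsheet row giving ‘2C01’ then ‘1A16’ would resolve to the existing ‘1A16’ / ‘2C01’ row and update it rather than insert a new one.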

Also this week I undertook some more AHRC work, which took up the best part of a day, and I replied to a request from Gavin Miller about a Medical Humanities Network mailing list.  We’ve agreed a strategy to implement such a thing, which I hope to undertake next week.  I also chatted to Chris about migrating the Scots Corpus website to a new server.  The fact that the underlying database is PostgreSQL rather than MySQL is causing a few issues here, but we’ve come up with a solution to this.

I spent a couple of days this week working on the SCOSYA project, continuing with the updates to the ‘consistency data’ views that I had begun before I went away.  I added an option to the page that allows staff to select which ‘code parents’ they want to include in the output.  This defaults to ‘all’ but you can narrow this down to any of them as required.  You can also select or deselect ‘all’ which ticks / unticks all the boxes.  The ‘in-browser table’ view now colour codes the codes based on their parent, with the colour assigned to a parent listed across the top of the page.  The colours are randomly assigned each time the page loads so if two colours are too similar a user can reload the page and different ones will take their place.

Colour coding is not possible in the CSV view as CSV files are plain text and can’t have any formatting.  However, I have added the colour coding to the chart view, which colours both the bars and the code text based on each code’s parent.  I’ve also added in a little feature that allows staff to save the charts.

I then added in the last remaining required feature to the consistency data page, namely making the figures for ‘percentage high’ and ‘percentage low’ available in addition to ‘percentage mixed’.  In the ‘in-browser’ and CSV table views these appear as new rows and columns alongside ‘% mixed’, giving you the figures for each location and for each code.  In the Chart view I’ve updated the layout to make a ‘stacked percentage’ bar chart.  Each bar is the same height (100) but is split into differently coloured sections to reflect the parts that are high, low and mixed.  I’ve made ‘mixed’ appear at the bottom rather than between high and low as mixed is most important and it’s easier to track whatever is at the bottom.  This change in chart style does mean that the bars are no longer colour coded to match the parent code (as three colours are now needed per bar), but the x-axis label still has the parent code colour so you can still see which code belongs to each parent.
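As a rough illustration of the data behind a ‘stacked percentage’ bar, the Python sketch below turns three counts into segments that always add up to 100.  How the high, low and mixed counts are derived is not covered here, and the function name and the order of ‘low’ and ‘high’ above ‘mixed’ are assumptions for the example.

```python
def stacked_percentages(high, low, mixed):
    """Convert raw counts into percentage segments for one bar.

    Segments are returned in drawing order, bottom first, so 'mixed'
    sits at the bottom of the bar.  The order of 'low' and 'high'
    above it is an assumption.
    """
    total = high + low + mixed
    if total == 0:
        # Nothing to draw: avoid dividing by zero
        return [("mixed", 0.0), ("low", 0.0), ("high", 0.0)]
    return [
        ("mixed", 100.0 * mixed / total),
        ("low", 100.0 * low / total),
        ("high", 100.0 * high / total),
    ]
```

Because every bar is normalised to the same height, the bars can be compared by the size of their bottom segment alone, which is why putting ‘mixed’ at the bottom makes it the easiest to track.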

I spent most of the rest of the week working with the new OED data for the Historical Thesaurus.  I had previously made a page that lists all of the HT categories and notes which OED categories match up, or if there are no matching OED categories.  Fraser had suggested that it would be good to be able to approach this from the other side – starting with OED categories and finding which ones have a matching HT category and which ones don’t.  I created such a script, and I also updated both this and the other script so that the output would either display all of the categories or just those that don’t have matches (as these are the ones we need to focus on).

I then focussed on creating a script that matches up HT and OED words for each category where the HT and OED categories match up.  What the script does is as follows:

  1. Finds each HT category that has a matching OED category
  2. Retrieves the lists of HT and OED words in each
  3. For each HT word displays it and the HT ‘fulldate’ field
  4. For each HT word it then checks to see if an OED word matches.  This checks the HT’s ‘wordoed’ column against the OED’s ‘ght_lemma’ column and also the OED’s ‘lemma’ column (as I noticed sometimes the ‘ght_lemma’ column is blank but the ‘lemma’ column matches)
  5. If an OED word matches the script displays it and its dates (OED ‘sortdate’ (=start) and ‘enddate’)
  6. If there are any additional OED words in the category that haven’t been matched to an HT word these are then displayed
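
The steps above can be sketched in Python roughly as follows.  This is a simplified illustration rather than the actual script: the column names ‘wordoed’, ‘ght_lemma’, ‘lemma’, ‘fulldate’, ‘sortdate’ and ‘enddate’ come from the description above, but the function name, the use of dictionaries for rows and the rule that each OED word is matched at most once are all assumptions.

```python
def match_category_words(ht_words, oed_words):
    """Pair HT words with OED words for one matched category.

    ht_words:  list of dicts with 'wordoed' and 'fulldate' keys
    oed_words: list of dicts with 'ght_lemma', 'lemma', 'sortdate'
               and 'enddate' keys
    Returns (pairs, unmatched_oed), where pairs is a list of
    (ht_word, oed_word_or_None) tuples.
    """
    used = set()  # indexes of OED words already paired (assumption)
    pairs = []
    for ht in ht_words:
        found = None
        target = ht.get("wordoed")
        if target:
            for i, oed in enumerate(oed_words):
                if i in used:
                    continue
                # Check 'ght_lemma' and fall back to 'lemma', as
                # 'ght_lemma' is sometimes blank
                if target in (oed.get("ght_lemma"), oed.get("lemma")):
                    found = oed
                    used.add(i)
                    break
        pairs.append((ht, found))
    # Any OED words left over have no HT counterpart in this category
    unmatched_oed = [o for i, o in enumerate(oed_words) if i not in used]
    return pairs, unmatched_oed
```

An exact string comparison like this would not pair HT ‘Roche(‘s) limit’ with OED ‘Roche limit’, so that OED word would end up in the unmatched list for manual checking.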


Note that this script has to process every word in the HT thesaurus and every word in the OED thesaurus data so it’s rather a lot of data.  I tried running it on the full dataset but this resulted in Firefox crashing.  And Chrome too.  For this reason I’ve added a limit on the number of categories that are processed.  By default the script starts at 0 and processes 10,000 categories.  ‘Data processing complete’ appears at the bottom of the output so you can tell it’s finished, as sometimes a browser will just silently stop processing.  You can look at a different section of the data by passing parameters to it – ‘start’ (the row to start at) and ‘rows’ (the number of rows to process).  I’ve tried it with 50,000 categories and it worked for me, but any more than that may result in a crashed browser.  I think the output is pretty encouraging.  The majority of OED words appear to match up, and for the OED words that don’t I could create a script that lists these and we could manually decide what to do with them – or we could just automatically add them, but there are definitely some in there that should match – such as HT ‘Roche(‘s) limit’ and OED ‘Roche limit’.  After that I guess we just need to figure out how we handle the OED dates.  Fraser, Marc and I are meeting next week to discuss how to take this further.
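The ‘start’ and ‘rows’ parameters amount to processing the category list in slices.  A minimal Python sketch of the idea is below; the function names are made up for the example and only the parameter names and the default of 10,000 come from the description above.

```python
def select_batch(categories, start=0, rows=10000):
    """Return the slice of categories to process in this run."""
    if start < 0 or rows < 0:
        raise ValueError("start and rows must not be negative")
    return categories[start:start + rows]


def process_in_batches(categories, rows=10000):
    """Yield (start, batch) pairs that together cover the whole list."""
    if rows <= 0:
        raise ValueError("rows must be greater than zero")
    for start in range(0, len(categories), rows):
        yield start, select_batch(categories, start, rows)
```

Running one batch per request keeps the size of each page of output bounded, which is what stops the browser from falling over.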

I was left with a bit of time on Friday afternoon which I spent attempting to get the ‘Essentials of Old English’ app updated.  A few weeks ago it was brought to my attention that some of the ‘C’ entries in the glossary were missing their initial letters.  I’ve fixed this issue in the source code (it took a matter of minutes) but updating the app even for such a small fix takes rather a lot longer than this.  First of all I had to wait until gigabytes of updates had been installed for MacOS and Xcode, as I hadn’t used either for a while.  After that I had to update Cordova, and then I had to update the Android developer tools.  Cordova kept failing to build my updated app because it said I hadn’t accepted the license agreement for Android, even though I had!  This was hugely frustrating, and eventually I figured out the problem was I had the Android tools installed in two different locations.  I’d updated Android (and accepted the license agreements) in one place, but Cordova uses the tools installed in the other location.  After realising this I made the necessary updates and finally my project built successfully.  Unfortunately about three hours of my Friday afternoon had by that point been used up and it was time to leave.  I’ll try to get the app updated next week, but I know there are more tedious hoops to jump through before this tiny fix is reflected in the app stores.


Week Beginning 8th August 2016

This was my first five-day week in the office for rather a long time, what with holidays and conferences.  I spent pretty much all of Monday and some of Tuesday working on the Technical Plan for a proposal Alison Wiggins is putting together.  I can’t really go into any details here at this stage, but the proposal is shaping up nicely and the relatively small technical component is now fairly clearly mapped out.  Fingers crossed that it receives funding.  I spent a small amount of time on a number of small-scale tasks for different projects, such as getting some content from the DSL server for Ann Ferguson and fixing a couple of issues with the Glasgow University Guardian that central IT services had contacted me about.  I also emailed Scott Spurlock in Theology to pass on my notes from the crowdsourcing sessions of DH2016, as I thought they might be of some use to him, and I had an email conversation with Gerard McKeever in Scottish Literature about a proposal he is putting together that has a small technical component he wanted advice on.  I also had an email conversation with Megan Coyer about the issues relating to her Medical Humanities Network site.

The remainder of the week was split between two projects.  First up is the Scots Syntax Atlas project.  Last week I began working through a series of updates to the content management system for the project.  This week I completed the list of items that I’d agreed to implement for Gary when we met a few weeks ago.  This consisted of the following:

  1. Codes can now be added via ‘Add Code’.  This now includes an option to select attributes for the new code too
  2. Attributes can now be added via ‘Add Attribute’.  This allows you to select the codes to apply the attribute to.
  3. There is a ‘Browse attributes’ page which lists all attributes and the number of codes associated with each.
  4. Clicking on an attribute in this list displays the code associations and allows you to edit the attribute – both its name and associated codes
  5. There is a ‘Browse codes’ page that lists the codes, the number of questionnaires each code appears in, the attributes associated with each code and the example sentences for each code.
  6. Clicking on a code in this list brings up a page for the code that features a list of its attributes and example sentences, plus a table containing the data for every occurrence of this code in a questionnaire, including some information about each questionnaire, a link through to the full questionnaire page, plus the rating information.  You can order the table by clicking on the headings.
  7. Through this page you can edit the attributes associated with the code
  8. Through this page you can also add / edit example sentences for the code.  This allows you to supply both the ‘Q code’ and the sentence for as many sentences as are required.
  9. I’ve also updated the ‘browse questionnaires’ page to make the ‘interview date’ the same ‘yyyy-mm-dd’ format as the upload date, to make it easier to order the table by this column in a meaningful way.
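
The reason the ‘yyyy-mm-dd’ format in the last item helps with ordering is that sorting such dates as plain text also sorts them chronologically.  The short Python sketch below shows this; the ‘dd/mm/yyyy’ input format and the sample dates are made up for illustration, as is the function name.

```python
from datetime import datetime


def to_iso(date_text, fmt="%d/%m/%Y"):
    """Convert a date string in the given format to 'yyyy-mm-dd'."""
    return datetime.strptime(date_text, fmt).strftime("%Y-%m-%d")


# Invented sample dates in an assumed dd/mm/yyyy format
dates = ["03/02/2016", "15/01/2016", "01/12/2015"]

# A plain text sort of the originals is not chronological...
text_sorted = sorted(dates)

# ...but a plain text sort of the converted dates is
iso_sorted = sorted(to_iso(d) for d in dates)
```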

With all of this out of the way I can now start on developing the actual atlas interface for the project, although I need to meet with Gary to discuss exactly what this will involve. I’ve arranged to meet with him next Monday.

The second project I worked on was the Edinburgh Gazetteer project for Rhona Brown.  I set up the WordPress site for the project website, through which the issues of the Gazetteer will be accessible, as will the interactive map of ‘reform societies’.  I’ve decided to publish these via a WordPress plugin that I’ll create for the project, as it seemed the easiest way to integrate the content with the rest of the WordPress site.  The plugin won’t have any admin interface component, but will instead focus on providing the search and browse interface for the issues of the Gazetteer and the map, via a WordPress shortcode.

I tackled the thorny issue of OCR for the Gazetteer’s badly printed pages again this week.  I’m afraid it’s looking like this is going to be hopeless.  I should really have looked at some of the pages whilst we were preparing the proposal because if I’d seen the print quality then I would never have suggested OCR as a possibility.  I think the only way to extract the text in a useable way will be manual transcription.  We might be able to get the images online and then instigate some kind of rudimentary crowd-sourcing approach.  There aren’t that many pages (325 broadsheet pages in total) so it might be possible.

I tried three different OCR packages – Tesseract (which Google uses for Google Books), ABBYY Finereader, and Omnipage Pro (these are considered to be the best OCR packages available).  I’m afraid none of them give usable results.  The ABBYY one looks to me to be the best, but I would still consider it unusable, even for background search purposes, and it would probably take more time to manually correct it than it would to just transcribe the page manually.

Here is one of the better sections that was produced by ABBYY:

“PETioN^c^itlzensy your intentiofofoubtlefs SS^Q c.bferve a dignified con du& iti this important Caiife. You wife to difcuft and to decide with deliberation; My opinion refpe&ing inviolability is well known. I declare^ my principled &t a time when a kind Of fu- perftitious refpcftjVfiasgenerallyentetfoinedforthisin¬violability, yet I think .that you ought to treat a qtief- tion of fo much’magnitude diftin&ly -from all ..flfoers. i A number of writings had already appeared, all. of ’ which are eagerly read and -compared  */;.,- France, *”1t”

Here is the same section in Tesseract:

“Pz”‘-rzo,\1.—a“.’€:@i1;izens, your iiitenziogzcloubtlefs is to

c:‘oferv’e_ a dig1]lfiQia-COI1£‘l_lX€,l’.l_l) this important ‘ca_ufe.

You with to ‘clil’cii’fs_and to decide with deliberation‘.

My opinion refpeéling inviolability is Well l”l°“’“–


red my principles atra, tiine when a kind of in-


‘us refpc&_jw:as gener-allAy_ Efained for tl1isin-


.3331’ y, yet–E tllivllkrtllgt .y'{ou_6,ugl1l’— l° ‘Feat ‘$1_Fl”e{‘

t¢o;aof_fo‘inuch magnitude diitinélly from all ‘filters-

, X number of wiitiiigs had already nap” ared, all. of

‘ill tell’ are eagerly read and compared Fl‘,-“‘“-“ea “=1”

“Europe haveitl-ieir eyesup 0 i g 53‘. “Ure-”


And here it is outputted from Omnipage:

“PETIet\-.” Citizens, your intention doubtlefs is to cufeive a dignified conduct it, this important eaufe. You wifil to cffcufs and to decide with deliberation. fly opinion rcfncaing inviolability is well known. I declared my principles it a time when a kind of fu¬

i               tcitained for this in¬Pcrftitioas tc.pc~t tva,gcncrilly en

vioiabilit)•, yet I tlftok:that you ought to treata quef¬tic»t of fo much magnitude d!Stin4ly from all others. A number of writings had already appeared, all of whidi are eagerly read anti compared,     France, ail Europe I:ave their eyes Upon- you m this great ca ufe.”

As an experiment I manually transcribed the page myself, timing how long it took. Here is how the section should read:

“Petition- “Citizens, your intention doubtless is to observe a dignified conduct in this important cause.  You wish to discuss and to decide with deliberation.  My opinion respecting inviolability is well known.  I declared my principles at a time when a kind of superstitious respect was generally entertained for this inviolability, yet I think that you ought to treat a question of so much magnitude distinctly from all others. A number of writings had already appeared, all of which are eagerly read and compared.  France, all Europe have their eyes upon you in this great cause.”

It took about 100 minutes to transcribe the full page.  As there are 325 images then full transcription would take 32,500 minutes, which is about 541 hours.  Working solidly for 7 hours a day on this would mean full transcription would take one person about 77 and a half days, which is rather a long time.  I wonder if there might be members of the public who would be interested enough in this to transcribe a page or two?  It might be more trouble than it’s worth to pursue this, though.  I will return to the issue of OCR, and see if anything further can be done, for example training the software to recognise long ‘s’, but I decided to spend the rest of the week working on the browse facility for the images instead.
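For anyone wanting to check the estimate, the arithmetic can be written out as a few lines of Python.  The three input figures are the ones given above; the variable names are just for the example.

```python
minutes_per_page = 100   # time taken for the one page transcribed
pages = 325              # number of page images
hours_per_day = 7        # solid working hours per day

total_minutes = minutes_per_page * pages     # 32,500 minutes
total_hours = total_minutes / 60             # a little under 542 hours
working_days = total_hours / hours_per_day   # a little over 77 days
```

This assumes every page takes about as long as the sample page did, which is unlikely to hold exactly.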

I created three possible interfaces for the project website, and after consulting Rhona I completed an initial version of the interface, which incorporates the ‘Edinburgh Gazetteer’ logo with inverted colours (to get away from all that beige that you end up with so much of when dealing with digitising old books and manuscripts).  Rhona and I also agreed that I would create a system for associating keywords with each page, and I created an Excel spreadsheet through which Rhona could compile these.

I also created an initial interface for the ‘browse issues’ part of the site.  I based this around the OpenLayers library, which I configured to use tiled versions of the scanned images that I created using an old version of Zoomify that I had kicking around.  This allows users to pan around the large images of each broadsheet page and zoom in on specific sections to enable reading.

I created a ‘browse’ page for the issues, split by month.  There are thumbnails of the first page of each, which I generated using ImageMagick and a little PHP script.  Further PHP scripts extracted dates from the image filenames, created database records, renamed the images, grouped images into issues and things like that.

You can jump to a specific month by pressing on the buttons at the top of the ‘browse’ page, and clicking on a thumbnail opens the issue at the first page.

When you’ve loaded a page the image is loaded into the ‘zoom and pan’ interface.  I might still rework this so it uses the full page width and height as on wide monitors there’s an awful lot of unused white space at the moment.  The options above the image allow you to navigate between pages (if you’re on page one of an issue the ‘previous’ button takes you to the last page of the previous issue.  If you’re on the last page of the issue the ‘next’ button takes you to page one of the next issue).  And I added in other buttons that allow you to load the full image and return to the Gazetteer index page.
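The ‘previous’ and ‘next’ behaviour across issue boundaries can be sketched in Python as below.  This is an illustration rather than the site’s code: the function name and the list of page counts are assumptions, and returning nothing at the very first and very last page is also an assumption, as the behaviour there is not described above.

```python
def neighbour(issues, issue_idx, page_idx, step):
    """Return (issue_idx, page_idx) for the previous or next page.

    issues is a list of page counts, one per issue, in date order.
    step is -1 for 'previous' and +1 for 'next'.  Indexes are
    zero-based.  Returns None when there is nowhere to go (assumption).
    """
    new_page = page_idx + step
    if 0 <= new_page < issues[issue_idx]:
        # Still inside the current issue
        return issue_idx, new_page
    new_issue = issue_idx + step
    if not 0 <= new_issue < len(issues):
        return None
    # Going back lands on the last page of the previous issue;
    # going forward lands on page one of the next issue
    if step < 0:
        return new_issue, issues[new_issue] - 1
    return new_issue, 0
```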

All in all it’s been a very productive week.




Week Beginning 1st August 2016

This was a very short week for me as I was on holiday until Thursday.  I still managed to cram a fair amount into my two days of work, though.  On Thursday I spent quite a bit of time dealing with emails that had come in whilst I’d been away.  Carole Hough emailed me about a slight bug in the Old English version of the Mapping Metaphor website.  With the OE version all metaphorical connections are supposed to default to a strength of ‘both’ rather than ‘strong’ like with the main site.  However, when accessing data via the quick and advanced search the default was still set to ‘strong’, which was causing some confusion as this was obviously giving different results to the browse facilities, which defaulted to ‘both’.  Thankfully it didn’t take long to identify the problem and fix it.  I also had to update a logo for the ‘People’s Voice’ project website, which was another very quick fix.  Luca Guariento, who is the new developer for the Curious Travellers project, emailed me this week to ask for some advice on linking proper names in TEI documents to a database of names for search purposes and I explained to him how I am working with this for the ‘People’s Voice’ project, which has similar requirements.  I also spoke to Megan Coyer about the ongoing maintenance of her Medical Humanities Network website and fixed an issue with the MemNet blog, which I was previously struggling to update.  It would appear that the problem was being caused by an out of date version of the sFTP helper plugin, as once I updated that everything went smoothly.

I also set up a new blog for Rob Maslen, who wants to use it to allow postgrad students and others in the University to post articles about fantasy literature.  I also managed to get Rob’s Facebook group integrated with the blog for his fantasy MLitt course.  I’ve also got the web space set up for Rhona’s Edinburgh Gazetteer project, and extracted all of the images for this project too.  I spent about half of Friday working on the Technical Plan for the proposal Alison Wiggins is putting together and I now have a clearer picture of how the technical aspects of the project should fit together.  There is still quite a bit of work to do on this document, however, and a number of further questions I need to speak to Alison about before I can finish things off.  Hopefully I’ll get a first draft completed early next week, though.

The remainder of my short working week was spent on the SCOSYA project, working on updates to the CMS.  I added in facilities to create codes and attributes through the CMS, and also to browse these types of data.  This includes facilities to edit attributes and view which codes have which attributes and vice-versa.  I also began work on a new page for displaying data relating to each code – for example which questionnaires the code appears in.  There’s still work to be done here, however, and hopefully I’ll get a chance to continue with this next week.

Week Beginning 22nd February 2016

I divided my time this week primarily between three projects: REELS, The People’s Voice and the Mapping Metaphor follow-on project. For REELS I continued with the content management system. After completing the place-name element management systems last week I decided this week to begin to tackle the bigger issue of management scripts for place-names themselves. This included migrating parish details into the database from a spreadsheet that Eila had previously sent me and migrating the classification codes from the Fife place-name database. I began work on the script that will process the addition of a new place-name record, creating the form that project staff will fill in, including facilities to add any number of map sheet records.

I initially included facilities to associate place-name elements with this ‘add’ form, which proved to be rather complicated. A place-name may have any number of elements and these might already exist in our element database. I created an ‘autocomplete’ facility whereby a user starts to type an element and the system queries the database and brings back a list of possible matching items. This was complicated by the fact that elements have different languages, and the list that’s returned should be different depending on what language has been selected. There are also many fields that the user needs to complete for each element, more so if the element doesn’t already exist in the database. I began to realise that including all of this in one single form would be rather too overwhelming for users and decided instead to split the creation and management of place-names across multiple forms. The ‘Add’ page would allow the user to create the ‘core’ record, which wouldn’t include place-name elements and historical forms. These materials will instead be associated with the place-name via the ‘browse place-names’ table, with separate pages specifically for elements and historical forms. Hopefully this set-up will be straightforward to use.
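To give an idea of the language-aware lookup behind the ‘autocomplete’ facility, here is a minimal Python sketch.  The real feature queries the database; this version filters an in-memory list instead, and the function name, the ‘name’ and ‘language’ keys and the limit of ten results are all assumptions for the example.

```python
def autocomplete(elements, typed, language, limit=10):
    """Return elements in the chosen language that start with the typed text.

    elements is a list of dicts with 'name' and 'language' keys, a
    hypothetical stand-in for the element table.
    """
    prefix = typed.strip().lower()
    if not prefix:
        return []  # don't return the whole list for an empty box
    matches = [
        e for e in elements
        if e["language"] == language
        and e["name"].lower().startswith(prefix)
    ]
    return sorted(matches, key=lambda e: e["name"].lower())[:limit]
```

Passing the selected language with every lookup is what makes the returned list change when the user picks a different language for the element.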

After reaching this decision I shelved the work I’d done on associating place-name elements and instead set to work on completing the ‘core’ place-name data upload form. This led me onto another interesting task. The project will be recording grid references for places, and I had previously worked out that it would be possible for a script to automatically generate the latitude and longitude from this figure, which in turn would allow for altitude to be retrieved from Google Maps. I used a handy PHP based library available here: http://www.jstott.me.uk/phpcoord/ to generate the latitude and longitude from the grid reference and then I integrated a Google Map in order to get the altitude (or elevation as Google calls it) based on instructions found here: https://developers.google.com/maps/documentation/javascript/elevation. By the end of the week I had managed to get this working, other than actually storing the altitude data, which I should hopefully be able to get sorted next week.
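As background to the grid reference work, the first step in any such conversion is turning the lettered grid reference into a numeric easting and northing in metres.  The Python sketch below shows that step only, assuming standard two-letter Ordnance Survey National Grid references; it is not the phpcoord library’s code, the function name is made up, and the further conversion to latitude and longitude (which the library handles) is not attempted here.

```python
def grid_ref_to_easting_northing(grid_ref):
    """Convert an OS grid reference such as 'NT 251 735' to metres.

    Returns the (easting, northing) of the south-west corner of the
    square that the reference identifies.
    """
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not all("A" <= c <= "Z" for c in letters) or "I" in letters:
        raise ValueError("expected two grid letters (the grid does not use I)")
    if not digits.isdigit() or len(digits) % 2 != 0 or len(digits) > 10:
        raise ValueError("expected an even number of digits, at most ten")

    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    # The grid lettering skips 'I', so shift later letters down by one
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Work out which 100km square the letters refer to
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Scale the digits up to metres (e.g. three digits = 100m units)
    half = len(digits) // 2
    scale = 10 ** (5 - half)
    easting = e100km * 100000 + int(digits[:half]) * scale
    northing = n100km * 100000 + int(digits[half:]) * scale
    return easting, northing
```

For example, ‘NT 251 735’ falls in the 100km square three across and six up, giving an easting of 325100 and a northing of 673500.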

For The People’s Voice project I had an email conversation with the RA Michael Shaw about the structure of the database. Michael had met with Catriona to discuss the documentation I had previously created relating to the database and the CSV template form. Michael had sent me some feedback and this week I created a second version of the database specification, the template form and the accompanying guidelines based on this feedback. I think we’re pretty much in agreement now on how to proceed and next week I hope to start on the content management system for the project.

For Metaphor in the Curriculum I continued with my work to port all of the visualisation views from relying on server-side data and processing to a fully client-side model instead. Last week I had completed the visualisation view and had begun on the tabular view. This week I managed to complete the tabular view, the card view and also the timeline view. Although that sentence was very quick to read, actually getting all of this done took some considerable time and effort, but it is great to get it all sorted, especially as I had some doubts earlier on as to whether it would even be possible. I still need to work on the interface, which I haven’t spent much time adapting for the App yet. I also managed to complete the textual ‘browse’ feature this week as well, using jQuery Mobile’s collapsible lists to produce an interface that I think works pretty well. I still haven’t tackled the search facilities yet, which is something I hope to start on next week.

In addition to this I attended a meeting with the Burns people, who are working towards publishing a new section on the website about song performance. We discussed where the section should go, how it should function and how the materials will be published. It was good to catch up with the team again. I also had a chat with David Shuttleton about making some updates to the Cullen online resource, which I am now responsible for. I spent a bit of time going through the systems and documentation and getting a feel for how it all fits together. I also made a couple of small tweaks to the Medical Humanities Network website to ensure that people who sign up have some connection to the University.

Week Beginning 8th February 2016

It’s been another busy week, but I have to keep this report brief as I’m running short of time and I’m off next Monday. I came into work on Monday to find that the script I had left executing on the Grid to extract all of the Hansard data had finished working successfully! It left me with a nice pile of text files containing SQL insert statements – about 10Gb of them. As we don’t currently have a server on which to store the data I instead started a script executing that runs each SQL insert command on my desktop PC and puts the data into a local MySQL database. Unfortunately it looks like it’s going to take a horribly long time to process the data. I’m putting the estimate at about 229 days.

My arithmetic skills are sometimes rather flaky so here’s how I’m working out the estimate. My script is performing about 2000 inserts a minute. There are about 1200 output files and based on the ones I’ve looked at they contain about 550,000 lines each. 550,000 x 1200 = 660,000,000 lines in total. This figure divided by 2000 gives the number of minutes it would take (330,000). Divide this by 60 gives the number of hours (5,500). Divide this by 24 gives the number of days (229). My previous estimate for doing all of the processing and uploading on my desktop PC was more than 2 years, so using the Grid has speeded things up enormously, but we’re going to need something more than my desktop PC to get all of the data into a usable form any time soon. Until we get a server for the database there’s not much more I can do.
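The same working can be written out as a few lines of Python for anyone who wants to check it or plug in different figures.  The three inputs are the rough figures given above; the variable names are just for the example.

```python
inserts_per_minute = 2000   # observed rate on the desktop PC
files = 1200                # approximate number of output files
lines_per_file = 550_000    # approximate lines per file

total_lines = files * lines_per_file         # 660,000,000
minutes = total_lines / inserts_per_minute   # 330,000
hours = minutes / 60                         # 5,500
days = hours / 24                            # just over 229
```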

On Tuesday this week we had a REELS team meeting where we discussed some of the outstanding issues relating to the structure of the database (amongst other things). This was very useful and I think we all now have a clear idea of how the database will be structured and what it will be able to do. After the meeting I wrote up and distributed an updated version of my database specification document and I also worked with some map images to create a more pleasing interface for the project website (it’s not live yet though, so no URL). Later in the week I also created the first version of the database for the project, based on the specification document I’d written. Things are progressing rather nicely at this stage.

I spent a bit of time fixing some issues that had cropped up with other projects. The Medical Humanities Network people wanted a feature of the site tweaked a little bit, so I did this. I also fixed an issue with the lexeme upload facility of the Scots Corpus, which was running into some maximum form size limits. I had a funeral to attend on Thursday afternoon so I was away from work for that.

The rest of the week was spent on Mapping Metaphor duties. Ellen had sent me a new batch of stage 5 data to upload, so I got that set up. We now have 15,762 metaphorical connections, down from 16,378 (some have been recategorised as ‘noise’). But there are now 6561 connections that have sample lexemes, up from 5407. Including first lexemes and other lexemes, we now have 16,971 sample lexemes in the system, up from 12,851. We had a project meeting on Friday afternoon, and it was good to catch up with everyone. I spent the remainder of the week working on the app version of the visualisations. I’m making some good progress with the migration to a ‘pure javascript’ system now. The top-level map is now complete, including links, changes in strength, circles and the card popup. I’ve also begun working on the drilldown visualisations, getting the L2 / L3 hybrid view mostly working (but not yet expanding categories), the info-box containing the category descriptors working and the card for the hybrid view working. I’m feeling a bit more positive about getting the app version completed now, thankfully.