Week Beginning 25th July 2016

This week was another four-day week for me as I’d taken the Friday off.  I will also be off until Thursday next week.  I was involved in a lot of different project and had a few meeting this week.  Wendy contacted me this week with a couple of queries regarding Mapping Metaphor.  One part of this was easy – adding a new downloadable material to the ‘Metaphoric’ website.  The involved updating the ZIP files and changing a JSON file to make the material findable in the ‘browse’ feature. The other issue was a bit more troublesome.  In the Mapping Metaphor ‘browse’ facilities in the main site, the OE site and the ‘Metaphoric’ site Carole had noticed that the number of metaphorical connections given for the top level categories didn’t match up with the totals given for the level two categories within these top level ones.  E.g. Browse view gives the External World total as 13115, but adding up the individual section totals comes to 17828.

It took quite a bit of investigation to work out what was causing this discrepancy, but I finally figured out how to make the totals consistent and applied the update to the main site, the OE site and the Metaphoric website (but not the app, as I’ll need to submit a new version to the stores to get the change implemented there).

There were inconsistencies in the totals at both the top level and Level 2.  These were caused by metaphorical connections within a single category only being counted once: a connection from Category 1 to Category 2 counts as two ‘hits’ (one for Category 1 and another for Category 2), but a connection from Category 1 to another Category 1 counts as only one ‘hit’.  The same was true for Level 2 categories: 1A to 1B is a ‘hit’ for each category, but 1A to another 1A is only one ‘hit’.

It could be argued that this is an acceptable way to count things, but on our browse page we have to go from the bottom up, as we display the number of metaphorical connections each Level 3 category is involved in.  Here’s another example:

2C has two Level 3 categories, 2C01 and 2C02.  2C01 has 127 metaphorical connections and 2C02 has 141, making a total of 268.  However, one of these connections is between 2C01 and 2C02, so in the Level 2 count (‘how many connections involve a 2C category as either cat1 or cat2?’) this connection was only being counted once, meaning the 2C total was only showing 267 connections instead of 268.

It could be argued that 2C does only have 267 metaphorical connections, but as our browse page shows the individual number of connections for each Level 3 category, we need to include these ‘duplicates’, otherwise the numbers for levels 1 and 2 don’t match up.

Perhaps using the term ‘metaphorical connections’ on the browse page is misleading.  We only have a total of 15,301 ‘metaphorical connections’ in our database.  What we’re actually counting on the browse page is the number of times a category appears in a metaphorical connection, as either cat1, cat2 or both.  But at least the figures used are now consistent.
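To make the counting distinction concrete, here’s a minimal sketch in Python of the two approaches.  The connection list is illustrative rather than the real Mapping Metaphor data, and the function names are my own:

```python
def distinct_connection_count(connections, prefix):
    # Old approach: each connection involving the prefix counts once,
    # so a 2C01-2C02 connection adds just 1 to the 2C total.
    return sum(1 for cat1, cat2 in connections
               if cat1.startswith(prefix) or cat2.startswith(prefix))

def appearance_count(connections, prefix):
    # New approach: count every appearance of the prefix, so a
    # 2C01-2C02 connection adds 2 to the 2C total.
    return sum(cat1.startswith(prefix) + cat2.startswith(prefix)
               for cat1, cat2 in connections)

# Three illustrative connections; imagine the other 265 involving 2C.
connections = [("2C01", "1A01"), ("2C01", "2C02"), ("2C02", "3B02")]
print(distinct_connection_count(connections, "2C"))  # 3
print(appearance_count(connections, "2C"))           # 4
```

With the full dataset, the first method gives 267 for 2C while the second gives 268, matching the sum of the Level 3 totals (127 + 141).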

On Monday I had a meeting with Gary Thoms to discuss further developments of the Content Management System for the SCOSYA project.  We agreed that I would work on a number of different tasks for the CMS: adding a new field to the template and ensuring the file upload scripts can process it; adding a facility to manually enter a questionnaire into the CMS rather than uploading a spreadsheet; adding example sentences and ‘attributes’ to the questionnaire codes, together with facilities in the CMS for managing these; and creating some new ‘browse’ facilities for accessing the data.  It was a very useful meeting, and after writing up my notes I set to work on some of the tasks.  By the end of my working week I had updated the file upload template, the database and the pages for viewing and editing questionnaires in the CMS.  I had also created the database tables and fields necessary for holding information about example sentences and attributes, and I created the ‘add record’ facility.  There is still quite a lot to do here, and I’ll return to it after my little holiday.  I’ll also need to get started on the map interface for the data too – the ‘atlas’ itself.
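For anyone curious about the shape of this, here’s a rough sketch of the kind of tables involved.  The real SCOSYA schema isn’t reproduced here, so the table names, column names and the inserted example are all hypothetical:

```python
import sqlite3

# Hypothetical schema sketch: one table for example sentences attached
# to questionnaire codes, one for arbitrary attributes on those codes.
conn = sqlite3.connect("scosya_sketch.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS example_sentence (
    id       INTEGER PRIMARY KEY,
    code     TEXT NOT NULL,   -- the questionnaire code the sentence illustrates
    sentence TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS code_attribute (
    id    INTEGER PRIMARY KEY,
    code  TEXT NOT NULL,
    name  TEXT NOT NULL,
    value TEXT
);
""")
# The 'add record' facility then boils down to inserts like this
# (the code and sentence below are invented examples):
conn.execute("INSERT INTO example_sentence (code, sentence) VALUES (?, ?)",
             ("D7", "She's after finishing her tea."))
conn.commit()
```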

On Tuesday I had a meeting with Rob Maslen to discuss a new website he wants to set up to allow members of the university to contribute stories and articles involving fantasy literature.  We also discussed his existing website and some possible enhancements to this.  I’ll aim to get these things done over the summer.

Last week Marc had contacted me about a new batch of Historical Thesaurus data that had been sent to us by the OED people, and I spent a bit of time this week looking at it.  The data is XML-based and I managed to figure out how it all fits together, but as yet I’m having trouble seeing how it relates to our HT data.

For example, ‘The Universe (noun)’ in the OED data has an ID of 1628 and a ‘path’ of ‘01.01’, which looks like it should correspond to our hierarchical structure, but in our system ‘The Universe (noun)’ has the number ‘01.01.10 n’.  The words listed in the OED data for this category are also different to ours.  We have the Old English words, which are not part of the OED data, but there are other differences too, e.g. the OED data has ‘creature’ but this is not in the HT data.  Dates differ as well, e.g. in our data ‘World’ is ‘1390-‘ while in the OED data it’s ‘?c1200’.

It doesn’t look to me like there is anything in the XML that links to our primary keys – at least not the ones in the online HT database.  The ID in the XML for ‘The Universe (noun)’ is 1628, but in our system the ID for this category is 5635.  The category with ID 1628 in our system is ‘Pool :: artificially confined water :: contrivance for impounding water :: weir :: place of’, which is rather different to ‘The Universe’!

I’ve also checked to see whether there might be an ID for each lexeme that is the same as our ‘HTID’ field (if there was, we could get to the category ID from this), but alas there doesn’t seem to be one either.  For example, the lexeme ‘world’ has a ‘refentry’ of ‘230262’, but this is the HTID for a completely different word in our system.  There are ‘GHT’ (Glasgow Historical Thesaurus) tags for each word, but frustratingly an ID isn’t one of them – only the original lemma, dates and Roget category.  I hope aligning the data is going to be possible, as it’s looking more than a little tricky from my initial investigation.  I’m going to meet with Marc and Fraser later in the summer to look into this in more detail.
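To illustrate the sort of structure I’m describing, here’s a small Python sketch.  The XML fragment is invented for illustration – ‘refentry’ is the attribute mentioned above, but the element names and the GHT tag layout are my guesses, not the real OED XML:

```python
import xml.etree.ElementTree as ET

# Invented fragment: a category with one lexeme and its GHT tags.
sample = """
<category id="1628" path="01.01" label="The Universe (noun)">
  <lexeme refentry="230262" lemma="world">
    <ght lemma="world" dates="1390-" roget="..."/>
  </lexeme>
</category>
"""

root = ET.fromstring(sample)
for lexeme in root.iter("lexeme"):
    refentry = lexeme.get("refentry")  # looks like an ID, but doesn't match our HTID
    ght = lexeme.find("ght")
    if ght is not None:
        # With no shared IDs, any alignment would have to match the GHT
        # lemma and dates against the corresponding fields in the HT database.
        print(refentry, ght.get("lemma"), ght.get("dates"))
```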

On Wednesday I met with Rhona Brown from Scottish Literature to discuss a project of hers that is just starting and that I will be doing the technical work for.  The project is a small grant funded by the Royal Society of Edinburgh and its main focus is to create a digital edition of the Edinburgh Gazetteer, a short-lived but influential journal that was published in the 1790s.

The Mitchell has digitised the journal and this week I managed to see the images for the first time.  Our original plan was to run the images through OCR software in order to get some text that would be used behind the scenes for search purposes, with the images being the things users directly interact with.  However, now I’ve seen the images I’m not so sure this approach is going to work, as the print quality of the original materials is pretty poor.  I tried running one of the images through Tesseract, which is the OCR engine Google uses for its Google Books project, and the results were not at all promising.  Practically every word is wrong, although it looks like it has at least identified multiple columns – in places, anyway.  However, this was just a first attempt and there are various things I can do to make the images more suitable, and possibly to ‘train’ the OCR software too.  I will try other OCR software as well.  We are also going to produce an interactive map of various societies that emerged around this time, so I created an Excel template and some explanatory notes for Rhona to use to compile the information.  I also contacted Chris Fleet of the NLS Maps department about the possibility of reusing the base map from 1815 that he very kindly helped us to use for the Burns highland tour feature.  Chris got back to me very quickly to say this would be fine, which is great.
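As a concrete example of the kind of image preparation I have in mind for the OCR, here’s a minimal sketch using the pytesseract wrapper and Pillow.  The filename and threshold value are illustrative, and this is a first-pass experiment rather than a settled pipeline:

```python
from PIL import Image, ImageFilter
import pytesseract

img = Image.open("gazetteer_page_01.png")      # hypothetical scan filename
img = img.convert("L")                         # greyscale
img = img.filter(ImageFilter.MedianFilter(3))  # reduce speckling in the scan
img = img.point(lambda p: 255 if p > 140 else 0)  # crude binarisation; threshold is a guess

# '--psm 1' asks Tesseract for automatic page segmentation with orientation
# detection, which may help with the multi-column layout.
text = pytesseract.image_to_string(img, config="--psm 1")
print(text)
```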

On Wednesday I also met with Frank Hopfgartner from HATII to discuss an idea he has had to visualise a corpus of German radio plays.  We discussed various visualisation options and technologies, the use of corpus software, and topic modelling; hopefully some of this was useful to him.  I also spent some time this week chatting to Alison Wiggins via email about the project she is currently putting together.  I am going to write the Technical Plan for the proposal, so we had a bit of a discussion about the various technical aspects and how things might work.  This is another thing I will have to prioritise when I get back from my holidays.  It’s certainly been a busy few days.