Week Beginning 22nd August 2016

On Monday I completed upgrading the 22 WordPress sites that I’m responsible for.  I ran into a couple of issues with one of the sites that took a little while to get to the bottom of (I ended up having to manually upgrade the files rather than just pressing the nice and easy ‘update’ button) but I got there in the end.  I spent most of the rest of the day preparing materials for my PDR.  Thankfully this blog is really useful for remembering just what I’ve done over the past year, so I could just read through posts and pull out some useful information.  For example, I’ve worked on almost 20 different projects in the past year and I’ve contributed technical sections to about 10 external research proposals.

On Tuesday Gary and I had our ‘SCOSYA atlas requirements’ meeting that had been postponed from last week due to Gary being ill.  It was a good meeting and we worked out an initial set of requirements for a prototype version of the Atlas.  After the meeting I prepared a requirements document based on our discussions and sent it to Gary for feedback.

On Wednesday I spent a lot of time finishing off my PDR materials and submitted them ahead of my PDR meeting next week.  Other than that I gave some advice to Jennifer Smith, Kirsteen McCue, Megan Coyer and Gavin Miller, who had all emailed me with questions.

I spent pretty much all of the rest of the week working on the prototype version of the SCOSYA Atlas interface.  This will initially only be available through the CMS, so I can’t provide any links to it here.  The Atlas is going to be based around the Leaflet.js mapping library, so my first task was to get this working.  This turned out to be a slightly frustrating process as the ‘Quick Start’ guide is not especially clear in places, all of which related to setting up a tile layer for the map.  I had expected to be able to just point Leaflet at OpenStreetMap (which it turns out you can do), but the guide steers you towards getting tile images via another provider called Mapbox.  Mapbox in turn uses the OpenStreetMap tile images, but gives you more freedom to customise the look and feel of the map.  I decided to try the Mapbox approach, which required setting up a (free) Mapbox account.  Rather worryingly, Mapbox only allows 50,000 map views per month for free.  I’m hoping this isn’t going to cause an issue with the Atlas, but if it does I’ll be able to change one line of code and point directly at OpenStreetMap instead.
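
To give a flavour of the setup, the basic map plus the ‘point directly at OpenStreetMap’ option amounts to something like the following sketch (the div id and the starting coordinates are placeholders rather than the Atlas’s actual values):

    var map = L.map('scosya-map').setView([56.5, -4.2], 7);  // placeholder: roughly centred on Scotland

    // The one-line fallback: pull tiles straight from the OpenStreetMap tile server
    L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
        attribution: '&copy; OpenStreetMap contributors',
        maxZoom: 19
    }).addTo(map);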

The confusion with using Mapbox when following the Leaflet instructions is that it requires you to have a ‘Mapbox project ID’, and looking round the Mapbox interface I just couldn’t figure out where to set this up.  Eventually I realised that the only way to get one was to go into the ‘classic’ part of the Mapbox website, as project IDs appear to be a feature that is being phased out.  It was only afterwards that I realised the ‘your.mapbox.project.id’ text in the Leaflet quick start guide was clickable and would have taken me to the right place straight away.  The next bit of confusion was the placement of the access token.  You have to paste this into the following URL: https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token={accessToken} but what I didn’t realise was that you have to remove the braces {} from around the token for it to work.  Thankfully, with these two bits of confusion out of the way I finally had a working map.
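
The Mapbox version of the tile layer, used in place of the OpenStreetMap layer in the sketch above, looks roughly like this (the project ID and token values are placeholders; passing the token as an option sidesteps the brace problem, since Leaflet substitutes any {name} in the URL template from the options object):

    L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token={accessToken}', {
        id: 'your.mapbox.project.id',        // placeholder: the 'classic' Mapbox project id
        accessToken: 'pk.your-access-token', // placeholder: the Mapbox access token
        attribution: 'Map data &copy; OpenStreetMap contributors, imagery &copy; Mapbox',
        maxZoom: 18
    }).addTo(map);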

What I wanted to do initially was have a map that displayed the locations of the questionnaires as points on the map.  As latitude and longitude data is already stored in the system for each questionnaire (automatically generated based on postcode), it wasn’t too tricky to implement this.  However, I decided that I would create a proper API (Application Programming Interface) for the project, rather than an ad-hoc bunch of scripts that spit out JSON.  In order to do this I did a bit of research into APIs and the RESTful approach (Representational State Transfer).  I read a handy beginner’s guide to REST APIs: http://www.andrewhavens.com/posts/20/beginners-guide-to-creating-a-rest-api/ and then found a useful little site that talks through the basics of building such a thing in PHP: http://coreymaynard.com/blog/creating-a-restful-api-with-php/

After following the above I managed to create an API endpoint that spits out all of the locations of the questionnaires, without accepting any parameters.  This is a good starting point and it was all I needed for my first map.  I’ll be expanding the functionality of the project API as I proceed with the Atlas.
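
On the map side, consuming the endpoint is straightforward.  The endpoint path and field names below are invented for illustration rather than being the project’s real ones, but the client-side usage is essentially this:

    // Hypothetical shape of the response (one object per questionnaire location):
    // [ { "location": "Dundee", "postcode": "DD1 4HN", "lat": 56.46, "lng": -2.97, "questionnaires": 3 }, ... ]
    var request = new XMLHttpRequest();
    request.open('GET', '/api/locations', true);   // endpoint path is an assumption
    request.onload = function () {
        if (request.status === 200) {
            var locations = JSON.parse(request.responseText);
            addLocationsToMap(locations);          // see the circle-drawing sketch below
        }
    };
    request.send();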

With the data now available in the always handy JSON format I was able to update my JavaScript file to populate the map with the locations of the questionnaires.  I decided to use fuzzy circles rather than map ‘pins’ because the locations aren’t exact points – a good approach that I learned from one of the DH2016 sessions I attended.  I’ve also colour coded the circles based on the number of questionnaires that have been completed at each place (between 1 and 4), and added toggles so users can turn each level on or off – e.g. to show just those locations that have 4 questionnaires.  I added the Leaflet Label plugin so that the name and postcode of each location appear when the mouse hovers over its circle, and users can also click on a marker to view the town name and postcode.
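
In outline, the circle-drawing code looks something like the following – the colours, radius and field names are illustrative rather than the project’s actual values, and ‘bindLabel’ comes from the Leaflet Label plugin:

    var levelColours = { 1: '#fee5d9', 2: '#fcae91', 3: '#fb6a4a', 4: '#cb181d' };  // made-up palette
    var levelGroups  = { 1: L.layerGroup(), 2: L.layerGroup(), 3: L.layerGroup(), 4: L.layerGroup() };

    function addLocationsToMap(locations) {
        locations.forEach(function (loc) {
            var circle = L.circle([loc.lat, loc.lng], 3000, {   // radius in metres: a fuzzy area, not a pin
                color: levelColours[loc.questionnaires],
                fillColor: levelColours[loc.questionnaires],
                fillOpacity: 0.5,
                weight: 1
            });
            circle.bindLabel(loc.location + ' (' + loc.postcode + ')');   // hover label (Leaflet.label plugin)
            circle.bindPopup(loc.location + '<br>' + loc.postcode);       // click popup (core Leaflet)
            levelGroups[loc.questionnaires].addLayer(circle);
        });
        // Each group can then be added to or removed from the map via the level toggles,
        // e.g. levelGroups[4].addTo(map) or map.removeLayer(levelGroups[4])
    }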

I’ve also added in a feature that allows the exact location and zoom level of the map to be bookmarked or shared (using the leaflet-hash plugin https://github.com/mlevans/leaflet-hash).  For example, if you’re zoomed in on Dundee and you want to share this you can copy the URL in the address bar, and clicking on that link will take you straight back there (assuming you’re logged into the CMS, of course).  I’ll need to update this feature to also take into account which map views are currently active.  Providing this facility was another useful ‘best practice’ nugget from DH2016!
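
The hash plugin itself is a one-liner once the map exists:

    // Keeps the zoom level and centre in the URL fragment, e.g. #13/56.4620/-2.9707
    var hash = new L.Hash(map);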

I also added in a ‘display options’ menu that animates in from the left when you press the ‘menu’ button in the top-left of the map (using the ‘SlideMenu’ plugin https://github.com/unbam/Leaflet.SlideMenu).  This is where the search and browse facilities are going to go, but I haven’t got round to implementing these yet.  I’m going to have to rework the JavaScript behind the map before I go much further, as what I’ve created so far was just a test based on one single view.  I’ll be starting on this on Monday.  Working with Leaflet (once I actually got it going) is proving to be a lot of fun.  Here’s a screenshot of the ‘work in progress’ to finish off:

[Screenshot: scosya-leaflet – the work-in-progress Atlas interface]

Week Beginning 15th August 2016

I was expecting to spend a fair amount of this week working on the Scots Syntax Atlas project, beginning to develop the atlas interface following a meeting with Gary on Monday.  Unfortunately, Gary was off sick and we had to postpone our meeting, which meant I couldn’t get started on the Atlas, as Gary and I need to meet to agree just how it should work.  However, I did meet with Flora, who is working as an administrator for the project.  I created a user account for her and showed her how to use the content management system.  I spent most of the rest of the week improving my XML and TEI skills, something I’ve been meaning to do for a long time.  This was prompted by the proposal Alison Wiggins is putting together, which I would be doing the technical work for and which would involve a lot of TEI.

Alison had previously given me three pages of transcriptions of a late 16th century account book that she had created using rudimentary XML.  What I wanted to do was figure out how best to convert this to TEI XML – which elements and attributes should be used, which TEI modules would be required, and other such things.  I had previously got to grips with the basics of TEI, XML and the Oxygen XML editor last winter, whilst doing some work for The People’s Voice project, so I had the materials from that to get me started.

Initially I just played around with elements and attributes using Notepad++ and referencing the TEI documentation (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/), but I had a chat with Graeme Cannon (HATII’s XML expert) and he reminded me that if I used Oxygen I could start a new document from a TEI P5 template, and the editor would then give helpful warnings and other feedback about which elements and attributes can go where.  This was hugely useful, and Graeme also gave me some very welcome pointers on structuring the document.

After playing around with the structure of the text and figuring out which elements and attributes might suit, I used the TEI Roma tool http://tei.oucs.ox.ac.uk/Roma/ to generate a bespoke RELAX NG schema for the transcription, containing just the TEI modules that were required.  I say ‘bespoke’, but actually the ‘TEI for Manuscript Description’ template that Roma provides was pretty much exactly what I required.  After generating the RNG file I associated it with my XML file in place of the more generic (and externally hosted) TEI schema.  I also created a nice little CSS stylesheet for use in Oxygen’s ‘Author view’, which I set up to display separate pages, to give individual account entries their own border, to show proper names in bold, and things like that.
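
For reference, associating a local schema and a CSS stylesheet with a TEI file is done with two processing instructions at the top of the XML document – something like the following, where the filenames are just placeholders:

    <?xml-model href="account-book.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
    <?xml-stylesheet type="text/css" href="account-book.css"?>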

Going through the original transcript brought up lots of questions, and I had an email conversation with Alison over the course of the week where we discussed these.  The transcription included the noting of scribal hands, and I recorded this in the TEI transcription using the ‘hand’ attribute, which in turn links to a section of the TEI header called ‘handNotes’, where each hand and information about it is defined.  The transcription also included currency values in the form of pounds, shillings and pence, so I used a ‘num’ element for these, with the ‘type’ attribute specifying ‘LSD’ (there will be other currency types such as crowns and groats) and the actual value contained in the ‘n’ attribute, with pounds, shillings and pence separated by full stops – e.g. n=”6.5.2” is six pounds, five shillings and two pence.
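
As a rough illustration of the approach (the hand descriptions and the transcribed wording here are invented for the example rather than taken from the account book):

    <!-- In the TEI header (within profileDesc): one handNote per scribal hand -->
    <handNotes>
      <handNote xml:id="hand1">Main scribal hand</handNote>
      <handNote xml:id="hand2">Second, later hand</handNote>
    </handNotes>

    <!-- In the transcription the hand attribute points at these ids (hand="#hand1"),
         and a sum of six pounds, five shillings and two pence is encoded as: -->
    <num type="LSD" n="6.5.2">vi li. v s. ii d.</num>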

Alison wanted to categorise each entry into one or more categories, and for this I created a simple taxonomy using the TEI ‘taxonomy’ element.  The taxonomy is located in the TEI header, within the ‘classDecl’ element, which in turn sits within the ‘encodingDesc’ element.  I made a very simple, non-hierarchical taxonomy, consisting of a list of categories, each with an ID and a name.  I then linked to one or more of these categories from the entry ‘div’ elements.  The most obvious attribute to link from the ‘div’ to the taxonomy seemed to me to be ‘ref’, or possibly ‘target’; however, neither of these can be used with a ‘div’ element in default TEI.  Instead I chose to use the ‘ana’ attribute (which is for denoting analysis of the text).  It didn’t feel quite right to use this attribute, but it does sort of make sense, as placing entries in categories is analysis of a sort, and the attribute allows multiple IDs to be specified.
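
Something along these lines, with the category names and IDs made up for the purposes of the example:

    <!-- In the TEI header: encodingDesc > classDecl -->
    <classDecl>
      <taxonomy xml:id="entry-categories">
        <category xml:id="cat-household"><catDesc>Household expenses</catDesc></category>
        <category xml:id="cat-wages"><catDesc>Wages and payments</catDesc></category>
      </taxonomy>
    </classDecl>

    <!-- An entry div assigned to two categories via the global 'ana' attribute -->
    <div type="entry" ana="#cat-household #cat-wages">
      ...
    </div>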

The other main aspect of the mark-up was proper names – people and places.  I decided to tag these using the ‘name’ element, with the ‘type’ attribute used to state whether the name is a person or a place.  There are more specific elements, such as ‘persName’, but you need to include a specific TEI module to use these and I wanted to keep things simple.  There is quite a lot of information associated with names, such as titles, forenames, surnames, gender etc, so I decided that rather than store all of this in the XML file I would record it in an Excel spreadsheet, with an ‘ID’ column linking to the ‘ref’ attribute of the ‘name’ element.  This worked out pretty well, resulting in a spreadsheet containing about 50 names for the three-page transcription, some of which appear multiple times in the text.
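
For example (the names and the ID scheme here are invented; the real values simply need to match the ‘ID’ column of the spreadsheet):

    Paid to <name type="person" ref="person-023">Johnne Andersoun</name> for work
    at <name type="place" ref="place-007">Dunfermline</name> ...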

Working with these transcriptions has been a really useful experience.  I’ve wanted to gain more experience with TEI XML for a number of years now, and while working on The People’s Voice project last winter was a great start, working on the account book transcriptions has really improved my understanding of how TEI and mark-up in general works.

I met with Alison on Friday and we went through the transcriptions, following a set of guidelines that I’d put together as a Word document over the course of the week.  She seems pretty happy with how things are working out and will now go off and transcribe more pages using the schema and guidelines that I have created.  No doubt she will encounter things that I haven’t covered this week, but when she does a brief meeting should hopefully allow us to decide what course of action to take.

Other than this main task I spent a bit of time this week compiling a list of all of the WordPress sites I’ve set up for staff over the years.  It turns out there are 22 of them, which is rather a lot!  I’m going to use this list to ensure that I regularly upgrade the WordPress version for each site and check that all is well with them.  I began doing this on Friday, getting through about half of them.  I noticed that one site’s database was much larger than it should have been, which led me to discover that one staff member’s account had been hijacked (almost certainly due to an easy-to-crack password) and that obfuscated JavaScript had been added to all of the pages, apparently attempting to redirect them to malicious servers.  I cleaned up the site and reset the user’s password.  I’ll keep a more regular check on these kinds of things in future, I reckon.

Week Beginning 8th August 2016

This was my first five-day week in the office for rather a long time, what with holidays and conferences.  I spent pretty much all of Monday and some of Tuesday working on the Technical Plan for a proposal Alison Wiggins is putting together.  I can’t really go into any details here at this stage, but the proposal is shaping up nicely and the relatively small technical component is now fairly clearly mapped out.  Fingers crossed that it receives funding.  I spent a small amount of time on a number of small-scale tasks for different projects, such as getting some content from the DSL server for Ann Ferguson and fixing a couple of issues with the Glasgow University Guardian that central IT services had contacted me about.  I also emailed Scott Spurlock in Theology to pass on my notes from the crowdsourcing sessions of DH2016, as I thought they might be of some use to him, and I had an email conversation with Gerard McKeever in Scottish Literature about a proposal he is putting together that has a small technical component he wanted advice on.  I also corresponded with Megan Coyer about the issues relating to her Medical Humanities Network site.

The remainder of the week was split between two projects.  First up is the Scots Syntax Atlas project.  Last week I began working through a series of updates to the content management system for the project.  This week I completed the list of items that I’d agreed to implement for Gary when we met a few weeks ago.  This consisted of the following:

  1. Codes can now be added via ‘Add Code’.  This includes an option to select attributes for the new code too.
  2. Attributes can now be added via ‘Add Attribute’.  This allows you to select the codes to apply the attribute to.
  3. There is a ‘Browse attributes’ page which lists all attributes and the number of codes associated with each.
  4. Clicking on an attribute in this list displays the code associations and allows you to edit the attribute – both its name and its associated codes.
  5. There is a ‘Browse codes’ page that lists the codes, the number of questionnaires each code appears in, the attributes associated with each code and the example sentences for each code.
  6. Clicking on a code in this list brings up a page for the code that features a list of its attributes and example sentences, plus a table containing the data for every occurrence of the code in a questionnaire, including some information about each questionnaire, a link through to the full questionnaire page, and the rating information.  You can order the table by clicking on the headings.
  7. Through this page you can edit the attributes associated with the code.
  8. Through this page you can also add / edit example sentences for the code.  This allows you to supply both the ‘Q code’ and the sentence for as many sentences as are required.
  9. I’ve also updated the ‘browse questionnaires’ page to put the ‘interview date’ in the same ‘yyyy-mm-dd’ format as the upload date, to make it easier to order the table by this column in a meaningful way.

With all of this out of the way I can now start on developing the actual atlas interface for the project, although I need to meet with Gary to discuss exactly what this will involve. I’ve arranged to meet with him next Monday.

The second project I worked on was the Edinburgh Gazetteer project for Rhona Brown.  I set up the WordPress site for the project website, through which the issues of the Gazetteer will be accessible, as will the interactive map of ‘reform societies’.  I’ve decided to publish these via a WordPress plugin that I’ll create for the project, as it seemed the easiest way to integrate the content with the rest of the WordPress site.  The plugin won’t have any admin interface component, but will instead focus on providing the search and browse interface for the issues of the Gazetteer and the map, via a WordPress shortcode.

I tackled the thorny issue of OCR for the Gazetteer’s badly printed pages again this week.  I’m afraid it’s looking hopeless.  I should really have looked at some of the pages while we were preparing the proposal, because if I’d seen the print quality I would never have suggested OCR as a possibility.  I think the only way to extract the text in a usable way will be manual transcription.  We might be able to get the images online and then instigate some kind of rudimentary crowdsourcing approach.  There aren’t that many pages (325 broadsheet pages in total) so it might be possible.

I tried three different OCR packages – Tesseract (which Google uses for Google Books), ABBYY FineReader, and OmniPage Pro – these being generally considered the best OCR packages available.  I’m afraid none of them gives usable results.  The ABBYY output looks to me to be the best, but I would still consider it unusable, even for background search purposes, and it would probably take more time to manually correct it than it would to just transcribe the page manually.

Here is one of the better sections that was produced by ABBYY:

“PETioN^c^itlzensy your intentiofofoubtlefs SS^Q c.bferve a dignified con du& iti this important Caiife. You wife to difcuft and to decide with deliberation; My opinion refpe&ing inviolability is well known. I declare^ my principled &t a time when a kind Of fu- perftitious refpcftjVfiasgenerallyentetfoinedforthisin¬violability, yet I think .that you ought to treat a qtief- tion of fo much’magnitude diftin&ly -from all ..flfoers. i A number of writings had already appeared, all. of ’ which are eagerly read and -compared  */;.,- France, *”1t”

Here is the same section in Tesseract:

“Pz”‘-rzo,\1.—a“.’€:@i1;izens, your iiitenziogzcloubtlefs is to
c:‘oferv’e_ a dig1]lfiQia-COI1£‘l_lX€,l’.l_l) this important ‘ca_ufe.
You with to ‘clil’cii’fs_and to decide with deliberation‘.
My opinion refpeéling inviolability is Well l”l°“’“–
red my principles atra, tiine when a kind of in-
‘us refpc&_jw:as gener-allAy_ Efained for tl1isin-
.3331’ y, yet–E tllivllkrtllgt .y'{ou_6,ugl1l’— l° ‘Feat ‘$1_Fl”e{‘
t¢o;aof_fo‘inuch magnitude diitinélly from all ‘filters-
, X number of wiitiiigs had already nap” ared, all. of
‘ill tell’ are eagerly read and compared Fl‘,-“‘“-“ea “=1”
“Europe haveitl-ieir eyesup 0 i g 53‘. “Ure-”

And here is the same section as output by OmniPage:

“PETIet\-.” Citizens, your intention doubtlefs is to cufeive a dignified conduct it, this important eaufe. You wifil to cffcufs and to decide with deliberation. fly opinion rcfncaing inviolability is well known. I declared my principles it a time when a kind of fu¬

i               tcitained for this in¬Pcrftitioas tc.pc~t tva,gcncrilly en

vioiabilit)•, yet I tlftok:that you ought to treata quef¬tic»t of fo much magnitude d!Stin4ly from all others. A number of writings had already appeared, all of whidi are eagerly read anti compared,     France, ail Europe I:ave their eyes Upon- you m this great ca ufe.”

As an experiment I manually transcribed the page myself, timing how long it took. Here is how the section should read:

“Petition- “Citizens, your intention doubtless is to observe a dignified conduct in this important cause.  You wish to discuss and to decide with deliberation.  My opinion respecting inviolability is well known.  I declared my principles at a time when a kind of superstitious respect was generally entertained for this inviolability, yet I think that you ought to treat a question of so much magnitude distinctly from all others. A number of writings had already appeared, all of which are eagerly read and compared.  France, all Europe have their eyes upon you in this great cause.”

It took about 100 minutes to transcribe the full page.  As there are 325 images, full transcription would take around 32,500 minutes, which is about 541 hours.  Working solidly for 7 hours a day on this would mean full transcription would take one person about 77 and a half days, which is rather a long time.  I wonder if there might be members of the public who would be interested enough in this to transcribe a page or two?  It might be more trouble than it’s worth to pursue this, though.  I will return to the issue of OCR and see if anything further can be done, for example training the software to recognise the long ‘s’, but I decided to spend the rest of the week working on the browse facility for the images instead.

I created three possible interfaces for the project website, and after consulting Rhona I completed an initial version of the interface, which incorporates the ‘Edinburgh Gazetteer’ logo with inverted colours (to get away from all the beige you so often end up with when digitising old books and manuscripts).  Rhona and I also agreed that I would create a system for associating keywords with each page, and I created an Excel spreadsheet through which Rhona can compile these.

I also created an initial interface for the ‘browse issues’ part of the site.  I based this around the OpenLayers library, which I configured to use tiled versions of the scanned images, generated using an old version of Zoomify that I had kicking around.  This allows users to pan around the large images of each broadsheet page and zoom in on specific sections to enable reading.
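
For anyone curious, the viewer setup is roughly along the following lines.  This is an OpenLayers 3 sketch with placeholder dimensions, paths and element ids; the exact options depend on the OpenLayers version and on how Zoomify exported the tiles:

    var imgWidth = 4000, imgHeight = 6000;        // pixel size of the scanned page (placeholder values)

    var source = new ol.source.Zoomify({
        url: '/tiles/issue-page1/',               // folder of Zoomify tiles (placeholder path)
        size: [imgWidth, imgHeight]
    });

    var map = new ol.Map({
        target: 'gazetteer-viewer',               // placeholder element id
        layers: [new ol.layer.Tile({ source: source })],
        view: new ol.View({
            // Zoomify tiles use pixel coordinates, with y running downwards
            resolutions: source.getTileGrid().getResolutions(),
            center: [imgWidth / 2, -imgHeight / 2],
            zoom: 1
        })
    });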

I created a ‘browse’ page for the issues, split by month.  There are thumbnails of the first page of each, which I generated using ImageMagick and a little PHP script.  Further PHP scripts extracted dates from the image filenames, created database records, renamed the images, grouped images into issues and things like that.

You can jump to a specific month by pressing on the buttons at the top of the ‘browse’ page, and clicking on a thumbnail opens the issue at the first page.

When you’ve loaded a page the image appears in the ‘zoom and pan’ interface.  I might still rework this so it uses the full page width and height, as on wide monitors there’s an awful lot of unused white space at the moment.  The options above the image allow you to navigate between pages (if you’re on page one of an issue the ‘previous’ button takes you to the last page of the previous issue, and if you’re on the last page of an issue the ‘next’ button takes you to page one of the next issue).  I also added buttons that allow you to load the full image and return to the Gazetteer index page.

All in all it’s been a very productive week.

Week Beginning 1st August 2016

This was a very short week for me as I was on holiday until Thursday.  I still managed to cram a fair amount into my two days of work, though.  On Thursday I spent quite a bit of time dealing with emails that had come in whilst I’d been away.  Carole Hough emailed me about a slight bug in the Old English version of the Mapping Metaphor website.  With the OE version all metaphorical connections are supposed to default to a strength of ‘both’ rather than ‘strong’ as on the main site.  However, when accessing data via the quick and advanced search the default was still set to ‘strong’, which was causing some confusion as this obviously gave different results from the browse facilities, which default to ‘both’.  Thankfully it didn’t take long to identify the problem and fix it.  I also had to update a logo for the ‘People’s Voice’ project website, which was another very quick fix.  Luca Guariento, who is the new developer for the Curious Travellers project, emailed me this week to ask for some advice on linking proper names in TEI documents to a database of names for search purposes, and I explained to him how I am handling this for the ‘People’s Voice’ project, which has similar requirements.  I also spoke to Megan Coyer about the ongoing maintenance of her Medical Humanities Network website and fixed an issue with the MemNet blog, which I was previously struggling to update.  It would appear that the problem was being caused by an out-of-date version of the sFTP helper plugin, as once I updated that everything went smoothly.

I also set up a new blog for Rob Maslen, who wants to use it to allow postgrad students and others in the University to post articles about fantasy literature, and I managed to get Rob’s Facebook group integrated with the blog for his fantasy MLitt course.  I also got the web space set up for Rhona’s Edinburgh Gazetteer project and extracted all of the images for it.  I spent about half of Friday working on the Technical Plan for the proposal Alison Wiggins is putting together, and I now have a clearer picture of how the technical aspects of the project should fit together.  There is still quite a bit of work to do on this document, however, and a number of further questions I need to discuss with Alison before I can finish things off.  Hopefully I’ll get a first draft completed early next week, though.

The remainder of my short working week was spent on the SCOSYA project, working on updates to the CMS.  I added in facilities to create codes and attributes through the CMS, and also to browse these types of data.  This includes facilities to edit attributes and view which codes have which attributes and vice-versa.  I also began work on a new page for displaying data relating to each code – for example which questionnaires the code appears in.  There’s still work to be done here, however, and hopefully I’ll get a chance to continue with this next week.