Week Beginning 8th August 2016

This was my first five-day week in the office for rather a long time, what with holidays and conferences.  I spent pretty much all of Monday and some of Tuesday working on the Technical Plan for a proposal Alison Wiggins is putting together.  I can’t really go into any details here at this stage, but the proposal is shaping up nicely and the relatively small technical component is now fairly clearly mapped out.  Fingers crossed that it receives funding.  I spent a small amount of time on a number of small-scale tasks for different project, such as getting some content from the DSL server for Ann Ferguson and fixing a couple of issues with the Glasgow University Guardian that central IT services had contacted me about.  I also emailed Scott Spurlock in Theology to pass on my notes from the crowdsourcing sessions of DH2016, as I thought they might be of some use to him, and I had an email conversation with Gerard McKeever in Scottish Literature about a proposal he is putting together that has a small technical component he wanted advice on.  I also had an email conversation with Megan Coyer about the issues relating to her Medical Humanities Network site.

The remainder of the week was split between two projects.  First up is the Scots Syntax Atlas project.  Last week I began working through a series of updates to the content management system for the project.  This week I completed the list of items that I’d agreed to implement for Gary when we met a few weeks ago.  This consisted of the following:

  1. Codes can now be added via ‘Add Code’.  This now includes an option to select attributes for the new code too
  2. Attributes can now be added via ‘Add Attribute’.  This allows you to select the codes to apply the attribute to.
  3. There is a ‘Browse attributes’ page which lists all attributes and the number of codes associated with each.
  4. Clicking on an attribute in this list displays the code associations and allows you to edit the attribute – both its name and associated codes
  5. There is a ‘Browse codes’ page that lists the codes, the number of questionnaires each code appears in, the attributes associated with each code and the example sentences for each code.
  6. Clicking on a code in this list brings up a page for the code that features a list of its attributes and example sentences, plus a table containing the data for every occurrence of this code in a questionnaire, including some information about each questionnaire, a link through to the full questionnaire page, plus the rating information.  You can order the table by clicking on the headings.
  7. Through this page you can edit the attributes associated with the code
  8. Through this page you can also add / edit example sentences for the code.  This allows you to supply both the ‘Q code’ and the sentence for as many sentences as are required.
  9. I’ve also updated the ‘browse questionnaires’ page to make the ‘interview date’ the same ‘yyyy-mm-dd’ format as the upload date, to make it easier to order the table by this column in a meaningful way.

With all of this out of the way I can now start on developing the actual atlas interface for the project, although I need to meet with Gary to discuss exactly what this will involve. I’ve arranged to meet with him next Monday.

The second project I worked on was the Edinburgh Gazetteer project for Rhona Brown.  I set up the WordPress site for the project website, through which the issues of the Gazetteer will be accessible, as will the interactive map of ‘reform societies’.  I’ve decided to publish these via a WordPress plugin that I’ll create for the project, as it seemed the easiest way to integrate the content with the rest of the WordPress site.  The plugin won’t have any admin interface component, but will instead focus on providing the search and browse interface for the issues of the Gazetteer and the map, via a WordPress shortcode.

I tackled the thorny issue of OCR for the Gazetteer’s badly printed pages again this week.  I’m afraid it’s looking like this is going to be hopeless.  I should really have looked at some of the pages whilst we were preparing the proposal because if I’d seen the print quality then I would never have suggested OCR as a possibility.  I think the only way to extract the text in a useable way will be manual transcription.  We might be able to get the images online and then instigate some kind of rudimentary crowd-sourcing approach.  There aren’t that many pages (325 broadsheet pages in total) so it might be possible.

I tried three different OCR packages – Tesseract (which Google uses for Google Books), ABBYY Finereader, and Omipage Pro (these are considered to be the best OCR packages available).  I’m afraid none of them give usable results.  The ABBYY one looks to me to be the best, but I would still consider it unusable, even for background search purposes, and it would probably take more time to manually correct it than it would to just transcribe the page manually.

Here is one of the better sections that was produced by ABBYY:

“PETioN^c^itlzensy your intentiofofoubtlefs SS^Q c.bferve a dignified con du& iti this important Caiife. You wife to difcuft and to decide with deliberation; My opinion refpe&ing inviolability is well known. I declare^ my principled &t a time when a kind Of fu- perftitious refpcftjVfiasgenerallyentetfoinedforthisin¬violability, yet I think .that you ought to treat a qtief- tion of fo much’magnitude diftin&ly -from all ..flfoers. i A number of writings had already appeared, all. of ’ which are eagerly read and -compared  */;.,- France, *”1t”

Here is the same section in Tesseract:

“Pz”‘-rzo,\1.—a“.’€:@i1;izens, your iiitenziogzcloubtlefs is to

c:‘oferv’e_ a dig1]lfiQia-COI1£‘l_lX€,l’.l_l) this important ‘ca_ufe.

You with to ‘clil’cii’fs_and to decide with deliberation‘.

My opinion refpeéling inviolability is Well l”l°“’“–

 

red my principles atra, tiine when a kind of in-

 

‘us refpc&_jw:as gener-allAy_ Efained for tl1isin-

 

.3331’ y, yet–E tllivllkrtllgt .y'{ou_6,ugl1l’— l° ‘Feat ‘$1_Fl”e{‘

t¢o;aof_fo‘inuch magnitude diitinélly from all ‘filters-

, X number of wiitiiigs had already nap” ared, all. of

‘ill tell’ are eagerly read and compared Fl‘,-“‘“-“ea “=1”

“Europe haveitl-ieir eyesup 0 i g 53‘. “Ure-”

 

And here it is outputted from Omnipage:

“PETIet\-.” Citizens, your intention doubtlefs is to cufeive a dignified conduct it, this important eaufe. You wifil to cffcufs and to decide with deliberation. fly opinion rcfncaing inviolability is well known. I declared my principles it a time when a kind of fu¬

i               tcitained for this in¬Pcrftitioas tc.pc~t tva,gcncrilly en

vioiabilit)•, yet I tlftok:that you ought to treata quef¬tic»t of fo much magnitude d!Stin4ly from all others. A number of writings had already appeared, all of whidi are eagerly read anti compared,     France, ail Europe I:ave their eyes Upon- you m this great ca ufe.”

As an experiment I manually transcribed the page myself, timing how long it took. Here is how the section should read:

“Petition- “Citizens, your intention doubtless is to observe a dignified conduct in this important cause.  You wish to discuss and to decide with deliberation.  My opinion respecting inviolability is well known.  I declared my principles at a time when a kind of superstitious respect was generally entertained for this inviolability, yet I think that you ought to treat a question of so much magnitude distinctly from all others. A number of writings had already appeared, all of which are eagerly read and compared.  France, all Europe have their eyes upon you in this great cause.”

It took about 100 minutes to transcribe the full page.  As there are 325 images then full transcription would take 32,500 minutes, which is about 541 hours.  Working solidly for 7 hours a day on this would mean full transcription would take one person about 77 and a half days, which is rather a long time.  I wonder if there might be members of the public who would be interested enough in this to transcribe a page or two?  It might be more trouble than it’s worth to pursue this, though.  I will return to the issue of OCR, and see if anything further can be done, for example training the software to recognise long ‘s’, but I decided to spend the rest of the week working on the browse facility for the images instead.

I created three possible interfaces for the project website, and after consulting Rhona I completed an initial version of the interface, which incorporates the ‘Edinburgh Gazetteer’ logo with inverted colours (to get away from all that beige that you end up with so much of when dealing with digitising old books and manuscripts).  Rhona and I also agreed that I would create a system for associating keywords with each page, and I created an Excel spreadsheet through which Rhona could compile these.

I also created an initial interface for the ‘browse issues’ part of the site.  I based this around the OpenLayers library, which I configured to use tiled versions of the scanned images that I created using an old version of Zoomify that I had kicking around.  This allows users to pan around the large images of each broadsheet page and zoom in on specific sections to enable reading.

I created a ‘browse’ page for the issues, split by month.  There are thumbnails of the first page of each, which I generated using ImageMagick and a little PHP script.  Further PHP scripts extracted dates from the image filenames, created database records, renamed the images, grouped images into issues and things like that.

You can jump to a specific month by pressing on the buttons at the top of the ‘browse’ page, and clicking on a thumbnail opens the issue at the first page.

When you’ve loaded a page the image is loaded into the ‘zoom and pan’ interface.  I might still rework this so it uses the full page width and height as on wide monitors there’s an awful lot of unused white space at the moment.  The options above the image allow you to navigate between pages (if you’re on page one of an issue the ‘previous’ button takes you to the last page of the previous issue.  If you’re on the last page of the issue the ‘next’ button takes you to page one of the next issue).  And I added in other buttons that allow you to load the full image and return to the Gazetteer index page.

All in all it’s been a very productive week.