Week Beginning 22nd January 2018

I spent much of this week working on the REELS project.  My first task was to implement a cross-reference system for place-names in the content management system.  When a researcher edits a place they now see a new ‘X-Refs’ field between ‘Research Notes’ and ‘Pronunciation’.  If they start typing the name of a place an autocomplete list appears featuring matching place-names and their current parishes.  Clicking on a name and then pressing the ‘edit’ button below the form then saves the new cross reference.  Multiple cross references can be added by pressing the ‘add another’ button to the right of the field.  When a cross reference has been added it is listed in this section as a link, allowing the researcher to jump to the referenced place-name and it’s also possible to delete a cross reference by pressing on the ‘delete’ button next to the place-name.  Cross references are set up to work both ways.  If the researcher adds a reference from ‘Abbey Park’ to ‘Abbey Burn’ then whenever s/he views the ‘Abbey Burn’ record the cross reference to ‘Abbey Park’ will also display, and deleting a reference from one place-name deletes it from the other too – i.e. one-way references can’t exist.
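The 'both ways' behaviour can be sketched as follows. This is only an illustration: the class name, the numeric place IDs and the in-memory Set are all assumptions, and the real CMS presumably stores the pairs in a database table. The idea is that each pair is stored once under an ordered key, so a one-way reference cannot exist.

```javascript
// Hypothetical in-memory store (not the CMS's actual code).
class CrossReferenceStore {
  constructor() {
    this.pairs = new Set(); // each entry is "lowerId:higherId"
  }

  // Order the two IDs so that (a, b) and (b, a) produce the same key.
  static key(a, b) {
    return a < b ? `${a}:${b}` : `${b}:${a}`;
  }

  add(a, b) {
    if (a === b) return; // a place cannot reference itself
    this.pairs.add(CrossReferenceStore.key(a, b));
  }

  remove(a, b) {
    this.pairs.delete(CrossReferenceStore.key(a, b));
  }

  // All places linked to the given ID, whichever side the link was added from.
  referencesFor(id) {
    const result = [];
    for (const pair of this.pairs) {
      const [low, high] = pair.split(":").map(Number);
      if (low === id) result.push(high);
      else if (high === id) result.push(low);
    }
    return result;
  }
}
```

Adding a reference from place 1 to place 2 makes it visible from either record, and removing it from either side removes it for both.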

I also fixed an issue with the CMS that Eila had alerted me to: historical forms that have no start dates but do feature end dates weren’t being listed with any dates at all.  It turned out I’d set things up so that dates were only processed if a start date was present, which was pretty easy to rectify.  For the rest of my time on the project I wrote a specification document for the front end features I will be developing for the project.  This took up a lot of the week, as I had to spend time thinking about the features the front end will include and how things like the search, browse and map will interoperate and function.  This has also involved trying out the various existing place-name resources and thinking about which aspects of these sites work or don’t work so well.
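The kind of fix described for the historical form dates might look something like the sketch below. The function name, parameters and output format are hypothetical; the point is simply that the end date is handled even when no start date is present.

```javascript
// Hypothetical helper: the name and the output format are assumptions.
function formatDateRange(start, end) {
  const hasStart = start !== null && start !== undefined && start !== "";
  const hasEnd = end !== null && end !== undefined && end !== "";
  if (hasStart && hasEnd) return `${start}-${end}`;
  if (hasStart) return String(start);
  if (hasEnd) return `-${end}`; // end date only: the case that was being skipped
  return "";
}
```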

My initial specification document is just over 3000 words long and describes the features that the front end will include, the sorts of search and browse options that will be available, which fields will be displayed, how the map interface will work and such things. I emailed it to the rest of the team on Friday for feedback, which I will hopefully get during next week.  It is just an initial idea of how things work and once I actually get down to developing the site things might change, but it’s useful to get things down in writing at this stage just in case there’s anything I’ve missed or people would prefer features to work differently.  I hope to begin development of the features next week.

Also this week I spent a bit of time on the RNSN project.  I switched a few things around on the website and I also began working with some slides that Brianna had sent me.  We are going to make ‘stories’ about particular songs, and we’d decided to investigate a couple of existing tools in order to do this.  The first is a timeline library (https://timeline.knightlab.com/) while the second is similar to a timeline but works with maps instead (https://storymap.knightlab.com/).  Initially I created a timeline based on the slides, but I quickly realised that there weren’t really enough different dates across the slides for this to work very well.  There weren’t any specific places mentioned in the slides either, so it seemed like the storymap library wouldn’t be a good fit.  However, I then remembered that storymap can be set up to work with images rather than a map as a base layer, allowing you to ‘pin’ slides onto specific parts of the image (see https://storymap.knightlab.com/gigapixel/).  Brianna sent me an image of a musical score that she wanted to use as a background image and I followed the steps required to create a tileset from this and set it up for use with the library.  The image wasn’t really of high enough quality to be used for this purpose, but as a test it worked just fine.  I then created the required slides, attached them to the image, added in images and sound files and we then had a test version of a story up and running.  It’s going to need some further work before it can be published, but it’s good to know that this approach is going to work.

I also had some Burns related duties to attend to this week, what with Burns’ Night being on Thursday.  We added some new songs to the Burns website (http://burnsc21.glasgow.ac.uk/) and I dealt with a request to use our tour maps on another blog (see https://blog.historicenvironment.scot/2018/01/burns-nicht/).

I met with Luca this week to discuss how he’s using Exist DB, XQuery and other XML technologies in order to create the Curious Travellers website.  I hadn’t realised that it was possible to use these technologies without any other server-side scripting language, but apparently it is possible for Exist to handle all of the page requests and output data in the required format for users (e.g. HTML or even JSON formatted data).  It was very interesting to learn a bit about how these technologies work.  We also had a chat about Joanna’s project, and I had an email conversation with her about how I might be involved in the project.

I made some further tweaks to the NRECT website for Stuart Gillespie, responded to a query from Megan Coyer about the management of the Medical Humanities Network website and met with Anna McFarlane to discuss putting together a blog for her new project.  I also updated all of the WordPress sites to the latest version as a new security release was made available this week.

Week Beginning 15th January 2018

I worked on a number of projects and gave advice to several members of staff this week.  Megan Coyer sent me an example document that she will need to perform OCR on in order to extract the text from the digitised images.  The document is a periodical from Proquest’s British periodicals collection and was a PDF containing digitised images.  I was hoping that the full text would be indexed in the PDF and allow searching using Acrobat’s search facility (as is possible with some supposedly image based PDFs) but unfortunately this was not the case.  Proquest’s website states that ‘All of this material is available in page image format with fully searchable text. Users can filter results by article type and download articles as either PDFs or JPEG page images’ so it would appear that they limit the fully searchable text to their own system, and the only outputs they make available to download are purely image based.  Megan needs access to the full text so we’re going to have to do our own OCR.

I downloaded a free OCR package based on the Tesseract engine used by Google Books (https://github.com/A9T9/Free-Ocr-Windows-Desktop/releases) and experimented with the document.   The software allows PDFs to be OCRed, but when I ran the first page of the PDF through the software the results were terrible, resulting in a text file that was completely unusable.  This didn’t look promising at all, but via a subscription to Proquest from the University library we can access the actual image files.  I downloaded the first page and running this through the OCR software was a huge improvement, with only a few very minor errors cropping up.  I’m guessing this is because the images contained in the PDF are of a much lower resolution than the actual image files that are available, although there may be other factors involved too.  But whatever the reason, it looks like it will be possible to extract the text, which is very promising.

On Monday I met with Honor Riley, the RA on The People’s Voice project to discuss the poems database and how we will add song recordings to the database, and also to the main project website.  It was a useful meeting and we figured out a method that should work.  On Tuesday I implemented the methods we had agreed upon the previous day.  Honor had given me an initial batch of song recordings and I converted these from WAV to MP3 and uploaded them to the project’s WordPress site.  I then updated the poems database front end I’d previously created to include an HTML5 audio player that references the MP3’s URL (if one is included for the poem’s record) and also links through to a page about the song that will be set up via WordPress.  I also updated the poem database front end to include a new facility that will allow poems to be browsed for by publication.  That should be most of the technical things completed ahead of the project’s launch next month.

On Tuesday afternoon I attended a meeting for the REELS project.  It’s been a while since I’ve been involved with this project, but we’ve reached the stage where I will need to start working on the front end for the project’s data.  We had a long and very useful meeting where we discussed the requirements for the front end – the sorts of search and browse facilities that we want to include and how the map interface should work.  We also looked at a few existing map-based place-name websites to get some ideas from those.  I’m hoping to be able to start work on the front-end next week.

There were a few other tweaks I needed to make to the REELS content management system.  Simon had encountered an issue with a special character not being saved in the database.  It was the ‘e-caudata’ character (ę) and even though I had the table set up as UTF-8 this was still failing to insert properly.

It turns out MySQL’s ‘utf8’ character set only supports UTF-8 characters that take up a maximum of 3 bytes by default, and this character takes up 4 bytes.  What’s needed instead is to set MySQL up to run in ‘utf8mb4’ (multi-byte 4) mode.  But setting the collation alone didn’t fix this for me; I had to convert the character set to utf8mb4 as well:

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

There’s a very handy page about this here: https://mathiasbynens.be/notes/mysql-utf8mb4

I also added in a new column to the ‘browse place-names’ page in the content management system for displaying former parishes, which makes things easier for the project team.

I bumped into Jane during the week and we talked a bit about updates to a couple of project websites.  I also replied to an email from Helen Kingstone in English Literature who wanted some advice on corpus linguistics tools.  I carried out some App admin duties too, as colleagues in MVLS are in the process of creating a new app and needed my help to set up a new user account.  I also arranged to meet with Anna McFarlane to discuss a website for a research project she’s putting together, and I responded to an email from Joanna Kopaczyk about the proposal she’s currently putting together.

I spent a fun few hours on Friday morning designing a little website for Stuart Gillespie that will accompany his printed volume ‘Newly Recovered English Classical Translations 1600-1800’.  The Annexe to the volume is going to be made available from this website, and Stuart had sent me a copy of the volume’s cover, so I could take some design cues from it.  I created a nice little responsive interface that I think complements the printed volume very well.  I can’t link to it yet, though, as the site currently doesn’t feature any content.

I spent the rest of the week on Historical Thesaurus duties.  My first task was to create an updated version of one of my scripts for showing links between HT and OED categories, specifically the one that brings back categories that match and displays lists of words within each that either match or don’t.  Previously the script would just bring back all of the matches, or a subset (e.g. the first 1000), but Fraser wanted a version that you could request a specific category and both it and its subcats would be returned.  I managed to create such a script fairly quickly and it seems to fit the bill.

I also tweaked the ‘fixed header’ I created last week.  I’d made it so that if you select a subcat then the ID of that subcat replaces the maincat ID in the address bar.  However, if you deselect the subcat the ID does not revert, when really it makes sense for it to do so.  A swift update to the code and it now does.  Much better.  I also updated the font used in the header at Marc’s suggestion.  All I need now is final approval and this new feature can go live.

I then continued to investigate how to add in arrows and curved edges to the timeline.  By updating the timeline code I figured out how to add in circles and squares to the beginning and end of a timeline shape.  By rotating the square 45 degrees I could make it into a diamond that, when positioned next to the shape, poked out as if it was an arrow.  This rotating took some figuring out as with D3.js you can’t just rotate a shape after it has been added, otherwise all subsequent shapes appear on this rotated line as well.  Instead you need to specify the ‘x’ and ‘y’ coordinates and the rotation at the point of creation in the same call, like so:


var x = getXPos(d,i) - 10;
var y = getStackPosition(d,i) + 10;
return "translate(" + x + "," + y + ") rotate(-45)";
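To make the snippet above easier to try out on its own, here is a minimal sketch of the same string-building step as a standalone function. The function name and its parameters are hypothetical (they stand in for the results of getXPos and getStackPosition); the offsets of 10 and the rotation of -45 degrees are taken from the snippet.

```javascript
// Hypothetical wrapper: xPos and stackPos stand in for the values that
// getXPos(d, i) and getStackPosition(d, i) return in the timeline code.
function markerTransform(xPos, stackPos) {
  var x = xPos - 10;      // shift left so the diamond pokes out of the shape
  var y = stackPos + 10;  // shift down to line up with the row
  // Translation and rotation are combined in one SVG transform string,
  // so the rotation only applies to this one shape.
  return "translate(" + x + "," + y + ") rotate(-45)";
}
```

In D3 a string like this would typically be supplied as the value of the shape's 'transform' attribute when the shape is appended.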


My initial version with different ends had arrows at the left hand side of all timeline shapes and circles at the right hand side, but with this proof of concept in place I could then add in a filter to only add in the shapes when required.  This meant updating the structure of the data that is fed to the timeline code to add in new fields for whether a timeline shape is ‘ante’ or ‘circa’ (or neither).  It took some time to update the script that generates the data to account for all of the various ‘ac’ fields in the database and figure out whether these apply to the start date or end date, but I got there in the end.  I also had to work with the D3.js ‘filter’ option, which is what needs to be used instead of a regular ‘if’ statement.  E.g. this is how I check to see whether a start date is ‘ante’ or is Old English (both of which need arrows to the left):

.filter(function(d){ var check = false; if(d.starting_ac == "a" || d.starting_time == -27454550400000) check = true; return check;})
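The same test can be written as a named predicate, which is a little easier to read and to check in isolation. This is a sketch rather than the code used on the site: the constant and function names are made up, and the sample data is invented. The large negative number in the filter is the millisecond timestamp for midnight UTC on 1st January 1100, so it can be generated with Date.UTC rather than typed in.

```javascript
// Midnight UTC on 1 January 1100, in milliseconds since the epoch.
// This equals the literal -27454550400000 used in the filter above.
const OE_START = Date.UTC(1100, 0, 1);

// Hypothetical predicate: true for 'ante' start dates and for Old English
// words, both of which need an arrow on the left.
function needsLeftArrow(d) {
  return d.starting_ac === "a" || d.starting_time === OE_START;
}

// Invented sample rows using the field names from the filter above.
const sampleRows = [
  { word: "one", starting_ac: "a", starting_time: Date.UTC(1400, 0, 1) },
  { word: "two", starting_ac: "", starting_time: OE_START },
  { word: "three", starting_ac: "c", starting_time: Date.UTC(1500, 0, 1) }
];
const arrowRows = sampleRows.filter(needsLeftArrow);
```

A function like this can be passed to a plain array filter, as here, or to the filter method of a D3 selection, which accepts a function of the bound datum.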

With this in place I then had a timeline with shapes that have different ends depending on the types of dates that are present, and I must say I’m pretty pleased with how this is working out, as I was worried I wouldn’t be able to get this feature working.  Here’s a screenshot:

Note that there are further things to do.  For example, some dates have an ‘ante’ end date, but I’m not currently sure how these should be handled.  Also using dots for single years means it’s not possible to differentiate between ‘circa’ single dates and regular single dates.  Marc, Fraser and I will need to meet again to consider how best to deal with these instances.

My final task for the week was to look into sorting the timeline.  Currently the timeline is listed by date, but we want an option to list it alphabetically by word as well.  I managed to get a rudimentary sort function working, but as of yet it redraws the whole timeline when the sort option is selected and I’d rather it animated the moving of rows instead, which would look a lot nicer.  This might take some work, though.
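A rudimentary sort of this kind could be sketched as below. The row structure (a 'word' string and a numeric 'start') and the function name are assumptions for the sake of the example, not the actual timeline data format.

```javascript
// Hypothetical row shape: { word: string, start: number }.
// Returns a sorted copy so the original order is left untouched.
function sortTimelineRows(rows, mode) {
  const copy = rows.slice();
  if (mode === "word") {
    copy.sort((a, b) => a.word.localeCompare(b.word)); // alphabetical by word
  } else {
    copy.sort((a, b) => a.start - b.start); // default: earliest start date first
  }
  return copy;
}
```

Returning a copy means the date order and the alphabetical order can both be kept around, which might help later if the rows are to be animated from one order to the other rather than redrawn.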

Week Beginning 8th January 2018

This was my first full, five-day week back after the Christmas holidays, and I spent the majority of it continuing to work on the new timeline visualisation for the Historical Thesaurus, plus some other interface updates that were proposed during the meeting Marc, Fraser and I had last week.  I managed to make quite a bit of progress on the visualisation and also the way in which dates are stored in the underlying database.  The HT has many different date fields, but the main ones are ‘firstd’, ‘midd’, and ‘lastd’.  Each of these has a second ‘b’ field where a potential second, later date can be added, which gives (for example) ‘1400/50’ as a date.  These ‘b’ fields generally (but not always) contain dates as two, or even one-digit numbers, so in the previous example the ‘b’ field just holds ‘50’ and not ‘1450’.  If a date was ‘1400/6’ the ‘b’ field might just have a ‘6’ in it, while if a date was 1395/1410 all four digits would be stored in the ‘b’ field.  The current setup is therefore inconsistent and makes the data difficult for scripts to work with, so we decided to update the ‘b’ fields to always use four digits.  I wrote a script to do this, and successfully updated all of the ‘b’ dates.  I also then updated the timeline visualisation to always use the ‘b’ date for the end date of a timeline, if it existed.  I then wrote two further scripts, one to check that all ‘b’ dates are actually after the main dates (it turns out there are a handful that aren’t, or are identical to the main date), and the other to list all of the words that have a ‘b’ date that is less than five years away from the main date, as in such cases it is likely that the date should actually just be a ‘circa’ instead.
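The conversion and the two checks can be sketched as follows. The function names are hypothetical and this is not the script that was run on the database; it assumes that a short ‘b’ value replaces the last digits of the main date, as in the ‘1400/50’ and ‘1400/6’ examples above.

```javascript
// Expand a short 'b' value to a full year by borrowing the leading digits
// of the main date. Assumption: a short 'b' overwrites the final digits.
function expandBDate(mainDate, bDate) {
  const main = String(mainDate);
  const b = String(bDate);
  if (b.length >= main.length) return Number(b); // already a full year
  return Number(main.slice(0, main.length - b.length) + b);
}

// Classify an expanded 'b' date against its main date, mirroring the two
// checking scripts: 'b' must be later, and very small gaps look like 'circa'.
function checkBDate(mainDate, fullBDate) {
  if (fullBDate <= mainDate) return "not-after-main";
  if (fullBDate - mainDate < 5) return "possible-circa";
  return "ok";
}
```

With these, ‘1400/50’ expands to 1450, ‘1400/6’ to 1406, and ‘1395/1410’ is left as 1410.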

I also wrote some further checking scripts for dates, including one to pull out all occasions where the fields connecting dates together (which can either be a dash to indicate a range or a plus to indicate separate occurrences) have two dashes in a row, or where there is a final dash where the word is set as ‘current’.  These are probably errors as it means two ranges are next to each other, which shouldn’t happen.  E.g. ‘1200-1400-1600’, or ‘1600-1800-’ don’t make much sense.  Another date checking script I wrote was to find all words that have a ‘plus’ connecting dates together (e.g. ‘1400 + 1800’) where the amount of time between the two dates is less than 150 years.  There was a rule when compiling the HT that if there were less than 150 years between dates these shouldn’t be treated as a ‘plus’ gap.  There were quite a few words that had a gap of less than 150 years and I sent the resulting output of my script to Fraser and Marc for them to check through.
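As a rough illustration of these two checks, the sketch below treats a word's dates as a list of years plus a list of the connectors between them. That representation, and both function names, are assumptions made for the example; the real database holds this information across several separate fields. A trailing dash (as in ‘1600-1800-’) is modelled as one extra connector at the end.

```javascript
// True when two range dashes sit next to each other,
// e.g. connectors ["-", "-"] for '1200-1400-1600' or '1600-1800-'.
function hasDoubleRange(connectors) {
  return connectors.some((c, i) => c === "-" && connectors[i + 1] === "-");
}

// Return the pairs of dates joined by a plus where the gap is below the
// minimum (150 years, per the HT compilation rule mentioned above).
function shortPlusGaps(dates, connectors, minGap = 150) {
  const gaps = [];
  connectors.forEach((c, i) => {
    if (c === "+" && i + 1 < dates.length && dates[i + 1] - dates[i] < minGap) {
      gaps.push([dates[i], dates[i + 1]]);
    }
  });
  return gaps;
}
```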

Turning to the timeline script itself, I fixed a couple of outstanding issues from last week, namely the pop-ups were not appearing in the right place for words that had multiple date periods.  This is because I had assigned an ID to the word row rather than each individual block of time.  I had to update the way in which I was generating the data for the timeline, and tweak the timeline JavaScript a bit, but thankfully I got the pop-ups working properly.  I had also noticed that some ‘dot’ end dates were extending up to ‘current’, which meant something was wrong with my date processing algorithm.  It turned out I’d missed out an equals sign in my code, and adding this in sorted the issue.

An update to the HT website that Marc was keen to implement in addition to the timeline visualisations is a ‘fixed’ header for the category browse page.  Such a header would appear ‘fixed’ at the top of the screen as the user scrolls down the page, thus enabling the user to tell at a glance what category they are looking at, even when far down the page.  I’d implemented something similar to this for the DSL website a few years ago (e.g. go here and start scrolling down the page: http://dsl.ac.uk/entry/snd/dreich) so reckoned it would be pretty straightforward to do something similar for the HT.  It took a bit of time to get a test version working, as I had to create new, test versions of several files (e.g. JavaScript, CSS, API, PHP) in order to be able to play about without breaking the live site.

In the test version, when the top of the category heading section scrolls off the page the fixed header fades in, and when it scrolls into view again the fixed header fades out.  Currently the header takes up the full width of the screen and has the same background colour as the main HT banner.  I’ve also added in the HT logo, which you can click to return to the homepage.  It’s a bit fuzzy looking in Chrome (but not other browsers), though.  The heading displays the noun hierarchy for the current category, which reflects the tree structure that is currently open on the page.  You can click on any level in the hierarchy to jump to it.  The current category’s Catnum, PoS and Heading are also displayed.  After some helpful feedback from Fraser I also added in a means of selecting a subcategory and for the subcategory hierarchy to be added to the fixed header too, which works as follows:

  1. Clicking on a subcategory gives its box a yellow border, which I think is pretty useful as you can then scroll about the page and quickly find the thing you’re interested in again.
  2. Clicking on the box also replaces the ID in the URL with the subcat URL, so you can now much more easily bookmark a subcat, or share the URL.  Previously you had to open the ‘cite’ box for the subcat to get the URL for a specific subcat.
  3. Clicking on a highlighted subcat removes the highlighting, in case you don’t like the yellow.  Note that this does not currently reset the ID in the URL to the maincat URL, but I think I will update this.
  4. Highlighting a category adds the subcat hierarchy to the fixed header so you can see at a glance the pathway from the very top of the HT to the subcat you’re looking at.
  5. When you follow a URL to a subcat ID the subcat is automatically highlighted and the subcat hierarchy is automatically added to the fixed header, in addition to the page scrolling to the subcat (as it previously did).
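The selection behaviour in the list above can be sketched as a small state function. Everything here is hypothetical (the state shape, the function name and the example IDs), and it shows the intended behaviour where deselecting a subcat puts the maincat ID back in the URL, which item 3 notes is not yet the case in the test version.

```javascript
// Hypothetical state shape: { maincat, selectedSubcat, urlId }.
// Clicking a subcat selects it; clicking the selected one again deselects it.
function toggleSubcat(state, clickedId) {
  if (state.selectedSubcat === clickedId) {
    // Deselect: remove the highlight and revert the URL to the maincat ID.
    return { maincat: state.maincat, selectedSubcat: null, urlId: state.maincat };
  }
  // Select: highlight the subcat and use its ID in the URL.
  return { maincat: state.maincat, selectedSubcat: clickedId, urlId: clickedId };
}
```

Keeping this as a pure function separates the decision about what is highlighted from the code that updates the page and the address bar.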

I think this will all be very helpful to users, and although it is not currently live, here is a screenshot showing how it works:

Returning to the timeline, I have changed the x axis so that it now starts at 1100 rather than 1000.  The 1100 label now displays as ‘OE*’ and if you click on it you now get the same message that is displayed on the MM timeline, namely “The English spoken by the Anglo-Saxons before c.1150, with the earliest written sources c.700”.  OE words on the timeline are no longer displayed as dots but instead have rectangles starting at the left edge of the visualisation and ending at 1150.  Once I figure out how to add in curved and pointy ends these will be given a pointy arrow on the left and a curve on the right.  I also added in faint horizontal lines between the individual timelines, to help keep your eye in a line.  Here’s an example of how things currently look:

I also started to investigate how to add in these ‘curved’ and ‘pointy’ ends to the rectangles in the timeline.  This is going to be rather tricky to implement as it means reverse engineering and then extending the timeline library I’m using, and also trying to figure out just how to give rectangles curved edges in D3, or how to append an arrow to a rectangle.  I’ll also need to find a way to pass data about ‘circa’ and ‘ante’ dates to the timeline library.  Thankfully I made a bit of progress on all of this.  It turns out I can add any additional fields that I want to the timeline’s JSON structure, so adding in ‘circa’ fields etc. will not be a problem.  Also, the timeline library’s code is pretty well structured and easy to follow.  I’ve managed to update it so that it checks for my ‘circa’ fields (but doesn’t actually do anything about them yet).  Also, there are ways of giving rectangles rounded corners in D3 (e.g. https://bl.ocks.org/mbostock/3468167) so this might work ok (although it’s not quite so simple as I will need to extend the rectangle beyond its allotted space in the timeline before the curves start).  Arrows still might prove tricky, though.  I’ll continue with this next week.

Other than HT related work I did a few other bits and bobs.  I met with Graeme to discuss a UTF8 issue he was experiencing with a database of his.  I met with Megan Coyer to discuss an upcoming project that will involve OCR, I had a chat with Luca about a Technical Plan he is putting together, I responded to a request from Stuart Gillespie about a URL he needs to incorporate into a printed volume, I helped Craig Lamont out with an issue relating to Google Analytics for the ‘Edinburgh’s Enlightenment’ site we put together a while back, I tracked down some missing sound files for the SPADE project and read through and gave feedback on a document Rachel had written about setting up Polyglot, and I had a conversation with Eleanor Lawson and Jane Stuart-Smith about future updates to the Seeing Speech website.  All in all it’s been a pretty busy week.

Week Beginning 1st January 2018

I returned to work after the Christmas holidays on Thursday this week, and spent the day dealing with a few issues that had cropped up whilst I’d been away.  The DSL Advanced Search had stopped working on Wednesday this week.  I remembered that this had happened a few years ago and was caused by an issue with the Apache Solr search engine, which the advanced search uses.  Previously restarting the server had sorted the issue but this didn’t work this time.  Thankfully after speaking to Chris about this we realised that Solr runs on an Apache Tomcat server rather than the main Apache server software, and this had been updated the day before.  It would appear that the update had stopped Solr working, but restarting Tomcat got things working again.  I also made a minor tweak to the Scots Corpus for Wendy.

After that, and dealing with a few emails, I returned to the Historical Thesaurus timeline visualisations I’d created the day before the Christmas holidays.  I’d emailed Marc and Fraser about these before the holidays and they’d got back to me with some encouraging comments.  The initial visualisations I’d made only worked with the approximate start and end dates from the Thesaurus database – the ‘apps’ and ‘appe’ dates that give a single start and end date for each word.  However, the actual dates for lexemes are considerably more complicated than this.  In fact there are 18 fields relating to dates in the underlying database that allow different ranges of dates to be recorded, for example ‘OE + a1400/50–c1475 + a1746– now History’.  Writing an algorithm that could process every different possible permutation of the date fields proved to be rather tricky and took quite a bit of time to get my head around.  I managed to get an algorithm working by mid-morning on the Friday, and although this still needs quite a bit of detailed testing it does at least seem to work (and does work with the above example), giving a nice series of dots and dashes along a timeline.

Marc, Fraser and I met on Friday to discuss the timeline and how we might improve on it and integrate it into the site.  Our meeting lasted almost three hours and was very useful.  It looks like the feature I created just because I had some free time and I had wanted to experiment is going to be fully integrated with many aspects of the site.  The only downside is there is now a massive amount of additional functionality we want to implement, and I know I’m going to be pretty busy with other projects once the new year properly gets under way, so it might take quite a while to get all this up and running.  Still, it’s exciting, though.  Also whilst working through my algorithm I’d spotted some occurrences where the dates were wrong, in that they had a range of dates where a range did not make sense, e.g. ‘OE–c1200–a1500’.  I generated a few CSV files with such rows (there are a couple of hundred) and Marc and Fraser are going to try and sort them out.  After our lengthy meeting I started to add in pop-ups to the timeline, so that when you click on an item a pop-up opens displaying information about the word (e.g. the word and its full date text).  I still need to do some work on this, but it’s good to get the basics in place.  Here’s a screenshot showing the timeline using the full date fields and with a pop-up open: