Week Beginning 15th January 2018

I worked on a number of projects and gave advice to several members of staff this week.  Megan Coyer sent me an example document that she will need to perform OCR on in order to extract its text.  The document is a periodical from Proquest's British Periodicals collection, supplied as a PDF of digitised page images.  I was hoping that the full text would be indexed in the PDF, allowing searching with Acrobat's search facility (as is possible with some supposedly image-based PDFs), but unfortunately this was not the case.  Proquest's website states that 'All of this material is available in page image format with fully searchable text. Users can filter results by article type and download articles as either PDFs or JPEG page images', so it would appear that they limit the fully searchable text to their own system, and the only outputs they make available for download are purely image based.  Megan needs access to the full text, so we're going to have to do our own OCR.

I downloaded a free OCR package based on the Tesseract engine used by Google Books (https://github.com/A9T9/Free-Ocr-Windows-Desktop/releases) and experimented with the document.  The software allows PDFs to be OCRed, but when I ran the first page of the PDF through it the results were terrible: the resulting text file was completely unusable.  This didn't look promising at all, but via the University library's subscription to Proquest we can access the actual image files.  I downloaded the first page and running this through the OCR software was a huge improvement, with only a few very minor errors cropping up.  I'm guessing this is because the images embedded in the PDF are of a much lower resolution than the actual image files that are available, although there may be other factors involved too.  Whatever the reason, it looks like it will be possible to extract the text, which is very promising.
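If we end up needing to process a lot of pages it would make sense to script this rather than use the desktop tool.  The same Tesseract engine can be driven programmatically; below is a minimal sketch using the tesseract.js Node wrapper (API as in recent versions of that library; the filename is just a placeholder):

// A minimal OCR sketch using the tesseract.js wrapper around Tesseract.
// 'page1.jpg' stands in for a downloaded page image.
const Tesseract = require('tesseract.js');

Tesseract.recognize('page1.jpg', 'eng')
    .then(function (result) {
        // result.data.text holds the recognised plain text
        console.log(result.data.text);
    })
    .catch(function (err) {
        console.error('OCR failed:', err);
    });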

On Monday I met with Honor Riley, the RA on The People's Voice project, to discuss the poems database and how we will add song recordings both to the database and to the main project website.  It was a useful meeting and we figured out a method that should work.  On Tuesday I implemented the methods we had agreed upon the previous day.  Honor had given me an initial batch of song recordings; I converted these from WAV to MP3 and uploaded them to the project's WordPress site.  I then updated the poems database front end I'd previously created to include an HTML5 audio player that references the MP3's URL (if one is included for the poem's record) and also links through to a page about the song that will be set up via WordPress.  I also updated the poems database front end to include a new facility that allows poems to be browsed by publication.  That should be most of the technical work completed ahead of the project's launch next month.
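The audio player itself needs very little code.  Here's a rough sketch of the approach (the field name and container ID are made up for illustration, not the actual database fields):

// Sketch: add an HTML5 audio player for a poem's MP3, if one exists.
// 'poem.mp3Url' is a hypothetical field holding the uploaded file's URL.
if (poem.mp3Url) {
    var player = document.createElement("audio");
    player.controls = true;      // show the browser's built-in controls
    player.src = poem.mp3Url;    // the MP3 uploaded to the WordPress site
    document.getElementById("poem-audio").appendChild(player);
}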

On Tuesday afternoon I attended a meeting for the REELS project.  It’s been a while since I’ve been involved with this project, but we’ve reached the stage where I will need to start working on the front end for the project’s data.  We had a long and very useful meeting where we discussed the requirements for the front end – the sorts of search and browse facilities that we want to include and how the map interface should work.  We also looked at a few existing map-based place-name websites to get some ideas from those.  I’m hoping to be able to start work on the front end next week.

There were a few other tweaks I needed to make to the REELS content management system.  Simon had encountered an issue with a special character not being saved in the database.  It was the ‘e-caudata’ character (ę) and even though I had the table set up as UTF-8 this was still failing to insert properly.

It turns out that MySQL's default 'utf8' character set only supports characters that take up a maximum of 3 bytes, and this character takes up 4.  What's needed instead is to set MySQL up to use 'utf8mb4' (multi-byte 4).  Setting the collation alone didn't fix this for me; I had to convert the table's character set to utf8mb4 as well:

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

There’s a very handy page about this here: https://mathiasbynens.be/notes/mysql-utf8mb4
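One related point covered on that page: the connection between the application and MySQL has its own character set, so if 4-byte characters still fail after converting the table it's worth making sure the connection is also using utf8mb4, e.g.:

SET NAMES utf8mb4;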

I also added in a new column to the ‘browse place-names’ page in the content management system for displaying former parishes, which makes things easier for the project team.

I bumped into Jane during the week and we talked a bit about updates to a couple of project websites.  I replied to an email from Helen Kingstone in English Literature, who wanted some advice on corpus linguistics tools, and did a bit of app admin, as colleagues in MVLS are in the process of creating a new app and needed my help to set up a new user account.  I also arranged to meet with Anna McFarlane to discuss a website for a research project she's putting together, and responded to an email from Joanna Kopaczyk about the proposal she's currently working on.

I spent a fun few hours on Friday morning designing a little website for Stuart Gillespie that will accompany his printed volume 'Newly Recovered English Classical Translations 1600-1800'.  The Annexe to the volume is going to be made available from this website, and Stuart had sent me a copy of the volume's cover so I could take some design cues from it.  I created a nice little responsive interface that I think complements the printed volume very well.  I can't link to it yet, though, as the site currently doesn't feature any content.

I spent the rest of the week on Historical Thesaurus duties.  My first task was to create an updated version of one of my scripts for showing links between HT and OED categories, specifically the one that brings back categories that match and displays lists of words within each that either match or don't.  Previously the script would just bring back all of the matches, or a subset (e.g. the first 1000), but Fraser wanted a version where you could request a specific category and have both it and its subcats returned.  I managed to create such a script fairly quickly and it seems to fit the bill.
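At its core this is a simple lookup; a rough sketch of the sort of query involved (the table and column names here are invented for illustration and don't reflect the real HT schema):

-- bring back the requested maincat plus any subcats attached to it
SELECT * FROM categories
WHERE catid = :requested_id OR maincat_id = :requested_id;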

I also tweaked the 'fixed header' I created last week.  I'd made it so that if you select a subcat then the ID of that subcat replaces the maincat ID in the address bar.  However, if you deselected the subcat the ID didn't revert, when really it makes sense for it to do so.  A swift update to the code and it now does.  Much better.  I also updated the font used in the header at Marc's suggestion.  All I need now is final approval and this new feature can go live.
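Swapping the ID in the address bar without triggering a page reload is a job for the History API; a minimal sketch of the behaviour (the function and variable names are invented for illustration):

// Sketch: show the selected category's ID in the address bar.
// replaceState updates the URL without adding a history entry.
function setAddressBarId(id) {
    history.replaceState(null, "", "?catid=" + id);
}

// selecting a subcat shows its ID; deselecting reverts to the maincat's
setAddressBarId(subcatSelected ? subcatId : maincatId);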

I then continued to investigate how to add arrows and curved edges to the timeline.  By updating the timeline code I figured out how to add circles and squares to the beginning and end of a timeline shape.  By rotating the square 45 degrees I could turn it into a diamond that, when positioned next to the shape, poked out like an arrowhead.  This rotation took some figuring out, as with D3.js you can't just rotate a shape after it has been added, otherwise all subsequent shapes are drawn in the rotated coordinate space as well.  Instead you need to specify the 'x' and 'y' coordinates and the rotation at the point of creation, in the same call, like so:

.attr("transform", function (d, i) {
    // position the marker relative to the timeline shape...
    var x = getXPos(d, i) - 10;
    var y = getStackPosition(d, i) + 10;
    // ...and rotate it 45 degrees in the same transform, so the
    // rotation doesn't affect subsequently drawn shapes
    return "translate(" + x + "," + y + ") rotate(-45)";
})

My initial version with different ends had arrows on the left-hand side of all timeline shapes and circles on the right-hand side, but with this proof of concept in place I could then add a filter so that the end shapes are only drawn when required.  This meant updating the structure of the data that is fed to the timeline code to add new fields recording whether a timeline shape is 'ante' or 'circa' (or neither).  It took some time to update the script that generates the data to account for all of the various 'ac' fields in the database and to figure out whether these apply to the start date or the end date, but I got there in the end.  I also had to work with the D3.js 'filter' option, which is what needs to be used instead of a regular 'if' statement when working with a selection.  E.g. this is how I check whether a start date is 'ante' or is Old English (both of which need arrows on the left):

.filter(function(d){ var check = false; if(d.starting_ac == "a" || d.starting_time == -27454550400000) check = true; return check;})
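Putting the two pieces together, the approach looks roughly like this (the selection, data and size values are illustrative rather than the actual timeline code):

// Sketch: draw a rotated square ('arrow') only at the left-hand end of
// shapes whose start date is 'ante' or Old English.
svg.selectAll(".timeline-end").data(entries).enter()
    .append("rect")
    .filter(function (d) {
        // -27454550400000 is the sentinel timestamp used for Old English dates
        return d.starting_ac == "a" || d.starting_time == -27454550400000;
    })
    .attr("width", 14)
    .attr("height", 14)
    .attr("transform", function (d, i) {
        var x = getXPos(d, i) - 10;
        var y = getStackPosition(d, i) + 10;
        return "translate(" + x + "," + y + ") rotate(-45)";
    });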

With this in place I then had a timeline with shapes that have different ends depending on the types of dates that are present, and I must say I'm pretty pleased with how this is working out, as I was worried I wouldn't be able to get this feature working.  Here's a screenshot:

[Screenshot: timeline shapes with arrow and circle ends]

Note that there are further things to do.  For example, some dates have an ‘ante’ end date, but I’m not currently sure how these should be handled.  Also using dots for single years means it’s not possible to differentiate between ‘circa’ single dates and regular single dates.  Marc, Fraser and I will need to meet again to consider how best to deal with these instances.

My final task for the week was to look into sorting the timeline.  Currently the timeline is listed by date, but we want an option to list it alphabetically by word as well.  I managed to get a rudimentary sort function working, but as yet it redraws the whole timeline when the sort option is selected, and I'd rather it animated the moving of rows instead, which would look a lot nicer.  This might take some work, though.
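The likely route is a D3 transition that moves each row to its new vertical position rather than redrawing everything; a rough sketch of the idea (the row selection and position helper are hypothetical):

// Sketch: after re-sorting the data, animate each row to its new position
// instead of clearing and redrawing the whole timeline.
svg.selectAll(".timeline-row")
    .transition()
    .duration(750)
    .attr("transform", function (d) {
        // newRowPosition would return the row's vertical offset under
        // the newly selected sort order
        return "translate(0," + newRowPosition(d) + ")";
    });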