Week Beginning 24th August 2020

I needed a further two trips to the dentist this week, which lost me some time as my dentist is on the other side of the city from where I live (but very handy for my office at work, which I’m not currently allowed to use).  Despite these interruptions I managed to get a decent amount done this week.  For the Books and Borrowing project I processed the images of a register from Westerkirk library.  For this register I needed to stitch together the images of the left and right pages to make a single image, as each spread features a table that covers both pages.  As we didn’t want to manually join hundreds of images I wrote a script to do this, leaving a margin between the two images as they don’t line up perfectly.  I used the command-line tool ImageMagick to achieve this – firstly adding the margin to the left-hand image and secondly joining this to the right-hand image.  I then needed to generate tilesets of the images using Zoomify, but when I came to do so the converter processed the images the wrong way round – treating them as portrait rather than landscape and producing tilesets that were all wrong.  I realised that when joining the page images together the image metadata hadn’t been updated: two portrait images were joined to make one landscape image, but the metadata still stated that the image was portrait, which confused the Zoomify converter.  I therefore had to run the images through ImageMagick again to strip out all of the metadata and then rotate the images 90 degrees clockwise, which resulted in a set of images I could then upload to the server.
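
The joining process boils down to a couple of ImageMagick calls wrapped in a script, roughly along the lines of the sketch below – the filenames, the 60-pixel margin width and the use of PHP’s exec() are placeholders rather than the exact commands used.

```php
<?php
// Sketch of the ImageMagick steps described above; filenames and the
// 60-pixel margin are placeholders rather than the exact values used.

$left   = 'page-left.jpg';
$right  = 'page-right.jpg';
$joined = 'spread.jpg';

// Add a margin to the right-hand edge of the left page image, since the
// two pages don't line up perfectly.
exec('convert ' . escapeshellarg($left) . ' -gravity east -splice 60x0 left-margined.jpg');

// Join the padded left image and the right page image side by side.
exec('convert left-margined.jpg ' . escapeshellarg($right) . ' +append ' . escapeshellarg($joined));

// Strip the stale portrait metadata and rotate 90 degrees clockwise so the
// Zoomify converter reads the spread correctly.
exec('convert ' . escapeshellarg($joined) . ' -strip -rotate 90 ' . escapeshellarg($joined));
```

The ‘-strip’ option removes the embedded profiles and comments – including the orientation metadata that was confusing the converter.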

Also this week I made some further tweaks to Matthew Sangster’s pilot project featuring the Glasgow Student data, which we will be able to go live with soon.  This involved adding in a couple of missing page images, fixing some encoding issues with Greek characters in a few book titles, fixing a bug that was preventing the links to pages from the frequency lists from working, ensuring any rows that are to be omitted from searches were actually being omitted, and adding tooltips to the table column headers to describe what the columns mean.

I also made some progress with the redevelopment of the Anglo-Norman Dictionary.  I had a Zoom meeting with the editors on Wednesday, which went very well, and resulted in me making some changes to the system documentation I had previously written.  I also worked on an initial structure for the new dictionary website, setting up WordPress for the ancillary pages and figuring out how to create a WordPress theme that is based on Bootstrap.  This was something I hadn’t done before and it was a good learning experience.  It mostly went pretty smoothly, but getting a WordPress menu to use Bootstrap’s layout was a little tricky.  Thankfully someone has already solved the issue and has made the code available to use (see https://github.com/wp-bootstrap/wp-bootstrap-navwalker) so I could just integrate this with my theme.
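
Wiring the navwalker into the theme follows the pattern in the library’s documentation, roughly as below – the menu location name, text domain and Bootstrap classes here are placeholders rather than the theme’s actual settings.

```php
<?php
// functions.php – register a menu location and load the navwalker class,
// following the library's documented setup. The location name and the
// 'and-theme' text domain are placeholders.
function and_theme_setup() {
	register_nav_menus( array(
		'primary' => __( 'Primary Menu', 'and-theme' ),
	) );
	// The class file is shipped inside the theme directory.
	require_once get_template_directory() . '/class-wp-bootstrap-navwalker.php';
}
add_action( 'after_setup_theme', 'and_theme_setup' );

// header.php – output the WordPress menu using Bootstrap's nav markup
// by passing the walker to wp_nav_menu().
wp_nav_menu( array(
	'theme_location'  => 'primary',
	'depth'           => 2,
	'container'       => 'div',
	'container_class' => 'collapse navbar-collapse',
	'menu_class'      => 'navbar-nav',
	'fallback_cb'     => 'WP_Bootstrap_Navwalker::fallback',
	'walker'          => new WP_Bootstrap_Navwalker(),
) );
```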

I completed work on the theme and generated placeholder pages and menu items for all the various parts of the site.  The page design is just a temporary one for now, looking very similar to the Books and Borrowing CMS design, but it will be replaced with something more suitable in time.  With this in place I regenerated the XML data from the existing CMS based on the final ‘entry_hash’ data I had.  This was even more successful than my first attempt with an earlier version of the data last week, with all but 35 of the 54,025 dictionary entries being generated.  This XML has the same structure as the files the editors are using, so we will now be able to standardise on this structure.

With the new data imported I then started work on an API for the site.  This will process all requests for data and return the data in either JSON or CSV format (with the front-end using JSON).  I created the endpoints necessary to make the ‘browse’ panel work – returning a section of the dictionary as headwords and links based on either entry ‘slugs’ (the URL-safe versions of headwords) or headword text, depending on whether the ‘browse up/down’ option or the ‘jump to’ option is chosen.  I also created an endpoint for displaying an entry, which returns all of the data for an entry including its full XML record.
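
As a rough illustration of the JSON/CSV switching, an endpoint might look something like the following – the ‘format’ parameter and the shape of the data are hypothetical, not the actual API structure.

```php
<?php
// Hypothetical sketch of an endpoint returning data as JSON or CSV.
// The 'format' parameter and the fields below are assumptions.

function output_response( array $data, string $format = 'json' ) {
	if ( $format === 'csv' ) {
		header( 'Content-Type: text/csv' );
		$out = fopen( 'php://output', 'w' );
		fputcsv( $out, array_keys( $data ) );   // column headings
		fputcsv( $out, array_values( $data ) ); // values
		fclose( $out );
	} else {
		header( 'Content-Type: application/json' );
		echo json_encode( $data );
	}
}

// e.g. an 'entry' endpoint would look the slug up in the database and
// return the headword plus the full XML record (values elided here).
output_response( array(
	'slug'     => $_GET['slug'] ?? '',
	'headword' => '',
	'xml'      => '',
), $_GET['format'] ?? 'json' );
```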

I then began work on the ‘entry’ page in the front-end, focussing initially on the ‘browse’ feature.  By the end of the week this was fully operational, allowing the user to scroll up and down the list, select an item to load it, or enter text into the ‘jump to’ box.  There’s also a pop-up where information about how to use the browse can be added.  The ‘jump to’ still needs some work, as if you type quickly it sometimes gets confused about which content to show.  I haven’t done anything about displaying the entry yet, other than showing the headword; currently the full versions of both the editors’ XML and the existing system’s XML are displayed.  Below is a screenshot of how things currently look:

My last task of the week for the AND was to write a script to extract all of the headwords, variants and deviants from the entries to enable the quick search to work.  I set the script running and by the time it had finished executing there were more than 150,000 entries in the ‘forms’ table I’d created.
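
The extraction script follows a pattern something like the sketch below – the XML element names and the layout of the ‘forms’ table are assumptions rather than the actual markup and schema.

```php
<?php
// Rough sketch of extracting searchable forms from each entry's XML.
// Element names ('lemma', 'variant', 'deviant'), the database details and
// the 'forms' table structure are all assumptions.

$db      = new PDO( 'mysql:host=localhost;dbname=and', 'user', 'password' );
$insert  = $db->prepare( 'INSERT INTO forms (entry_id, form, type) VALUES (?, ?, ?)' );
$entries = $db->query( 'SELECT id, xml FROM entries' );

foreach ( $entries as $row ) {
	$doc = simplexml_load_string( $row['xml'] );
	if ( $doc === false ) {
		continue; // skip entries whose XML fails to parse
	}
	// Store each class of form with its type so the quick search can
	// match against headwords, variants and deviants alike.
	foreach ( array( 'headword' => '//lemma', 'variant' => '//variant', 'deviant' => '//deviant' ) as $type => $xpath ) {
		foreach ( $doc->xpath( $xpath ) as $node ) {
			$insert->execute( array( $row['id'], trim( (string) $node ), $type ) );
		}
	}
}
```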

Also this week I helped Rob Maslen to migrate his ‘City of Lost Books’ blog to a new URL, had a chat with the DSL people about updates to the search results page based on the work I did last week, and spoke with Thomas Clancy about three upcoming place-names projects.

I also returned to the Historical Thesaurus project and our ongoing attempts to extract dates from the Oxford English Dictionary in order to update the dates of attestation in the Historical Thesaurus.  Firstly, I noticed that there were some issues with end dates for ranged dates before 1000 and I’ve fixed these (there were about 50 or so).  Secondly, I noticed there are about 20 dates that don’t have a ‘year’, presumably because the ‘year’ attribute in the XML was empty.  Some of these I can fix (and I have), but others have an empty ‘fullyear’ as well, meaning the date tag was presumably empty in the XML, so I deleted these.
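
The clean-up rule for those unfixable dates boils down to something like the following – the table name is an assumption, and only rows where both fields are empty are removed.

```php
<?php
// Sketch of the clean-up step for dates with no 'year', assuming a table
// called 'oed_dates'; the real table and column layout may differ.

$db = new PDO( 'mysql:host=localhost;dbname=ht', 'user', 'password' );

// Dates with an empty 'year' but a usable 'fullyear' were fixed by hand;
// those where 'fullyear' is empty as well were presumably empty date tags
// in the XML, so they can simply be removed.
$db->exec( "DELETE FROM oed_dates WHERE (year IS NULL OR year = '') AND (fullyear IS NULL OR fullyear = '')" );
```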

We still needed to figure out how to handle OED dates that have a dot in them.  These are sometimes used (well, used about 4,000 times) to show roughly where a date falls so that it is placed correctly in the sequence of dates (e.g. ‘14..’ is given the year ‘1400’).  But sometimes a date has both a dot and a specific year (e.g. ‘14..’ but ‘1436’).  We figured out that this is to ensure the date is ordered correctly after an earlier specific date.  Fraser therefore wanted these dates to be ‘ante’ the next known date.  I wrote a script that finds all lexemes that have at least one date with a dot and a specific year, then for each of these lexemes gets all of its dates in order.  Each date is displayed, with the ‘fullyear’ shown first and the ‘year’ in brackets, and any ‘.’ date is highlighted in yellow.  For each ‘.’ date the script then tries to find the next date in the sequence that isn’t another ‘.’ date (as sometimes there are several in a row).  If it finds one, the ‘.’ date takes that next date’s ‘year’ with ‘a’ added; if it doesn’t (e.g. if the ‘.’ date is the last date for the lexeme) it retains its own year, again with ‘a’ added.  Next week I will run this script to actually update the data and we will then move on to using the new OED data with the HT’s lexemes.
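
For a single lexeme the logic looks roughly like the sketch below, working on its dates sorted into sequence – the row structure and the ‘ante’ flag are assumptions about how the data is stored, with ‘fullyear’ and ‘year’ as described above.

```php
<?php
// Sketch of the '.' date logic for one lexeme's dates, already in sequence.
// Each row is assumed to have 'fullyear' (e.g. '14..') and 'year' (e.g. 1436);
// the 'ante' flag is a placeholder for however the 'a' marker is recorded.

function resolve_dot_dates( array $dates ) {
	foreach ( $dates as $i => $date ) {
		// Only dates written with a dot need resolving.
		if ( strpos( $date['fullyear'], '.' ) === false ) {
			continue;
		}
		// Find the next date in the sequence that isn't itself a '.' date.
		$nextYear = null;
		for ( $j = $i + 1; $j < count( $dates ); $j++ ) {
			if ( strpos( $dates[ $j ]['fullyear'], '.' ) === false ) {
				$nextYear = $dates[ $j ]['year'];
				break;
			}
		}
		// If one is found the '.' date becomes 'ante' that year; otherwise
		// (e.g. the '.' date is the last for the lexeme) it keeps its own
		// year, still marked as 'ante'.
		$dates[ $i ]['year'] = ( $nextYear !== null ) ? $nextYear : $date['year'];
		$dates[ $i ]['ante'] = true;
	}
	return $dates;
}
```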