Week Beginning 24th August 2020

I needed a further two trips to the dentist this week, which lost me some time due to my dentist being on the other side of the city from where I live (but very handy for my office at work, which I’m not currently allowed to use). Despite these interruptions I managed to get a decent amount done this week. For the Books and Borrowing project I processed the images of a register from Westerkirk library. For this register I needed to stitch together the images of the left and right pages to make a single image, as each spread features a table that covers both pages. As we didn’t want to have to manually join hundreds of images I wrote a script that did this, leaving a margin between the two images as they don’t line up perfectly. I used the command-line tool ImageMagick to achieve this – firstly adding the margin to the left-hand image and secondly joining this to the right-hand image. I then needed to generate tilesets of the images using Zoomify, but when I came to do so the converter processed the images the wrong way round – treating them as portrait rather than landscape and resulting in tilesets that were all wrong. I realised that when joining the page images together the image metadata hadn’t been updated: two portrait images had been joined together to make one landscape image, but the metadata still stated that the image was portrait, which confused the Zoomify converter. I therefore had to run the images through ImageMagick again to strip out all of the metadata and then rotate the images 90 degrees clockwise, which resulted in a set of images I could then upload to the server.
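For anyone interested, the joining step boils down to a couple of ImageMagick calls per spread. Below is a rough Python sketch of the sort of thing my script did – the filenames, margin width and background colour are just placeholders rather than the actual values I used:

```python
import subprocess
from pathlib import Path

MARGIN = 60  # placeholder gap (in pixels) to leave between the two pages


def join_spread(left: Path, right: Path, out: Path) -> None:
    """Join a left-hand and right-hand page image into one landscape spread."""
    padded = out.with_suffix(".padded.jpg")
    # First add a blank margin to the right-hand edge of the left page...
    subprocess.run(["convert", str(left), "-gravity", "east", "-background",
                    "white", "-splice", f"{MARGIN}x0", str(padded)], check=True)
    # ...then append the right-hand page horizontally.
    subprocess.run(["convert", str(padded), str(right), "+append", str(out)],
                   check=True)
    # Strip the (now misleading) metadata so tools such as the Zoomify
    # converter don't treat the landscape result as a portrait image.
    subprocess.run(["mogrify", "-strip", str(out)], check=True)


Path("joined").mkdir(exist_ok=True)
join_spread(Path("page001_left.jpg"), Path("page001_right.jpg"),
            Path("joined/page001.jpg"))
```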

Also this week I made some further tweaks to Matthew Sangster’s pilot project featuring the Glasgow Student data, which we will be able to go live with soon. This involved adding in a couple of missing page images, fixing some encoding issues with Greek characters in a few book titles, fixing a bug that was preventing the links to pages from the frequency lists from working, ensuring any rows that are to be omitted from searches were actually being omitted, and adding in tooltips for the table column headers to describe what the columns mean.

I also made some progress with the redevelopment of the Anglo-Norman Dictionary.  I had a Zoom meeting with the editors on Wednesday, which went very well, and resulted in me making some changes to the system documentation I had previously written.  I also worked on an initial structure for the new dictionary website, setting up WordPress for the ancillary pages and figuring out how to create a WordPress theme that is based on Bootstrap.  This was something I hadn’t done before and it was a good learning experience.  It mostly went pretty smoothly, but getting a WordPress menu to use Bootstrap’s layout was a little tricky.  Thankfully someone has already solved the issue and has made the code available to use (see https://github.com/wp-bootstrap/wp-bootstrap-navwalker) so I could just integrate this with my theme.

I completed work on the theme and generated placeholder pages and menu items for all the various parts of the site. The page design is just a temporary one for now, looking very similar to the Books and Borrowing CMS design, but this will be replaced with something more suitable in time. With this in place I regenerated the XML data from the existing CMS based on the final ‘entry_hash’ data I had. This was even more successful than my first attempt with an earlier version of the data last week and resulted in all but 35 of the 54,025 dictionary entries being generated. This XML has the same structure as the files being used by the editors, so we will now be able to standardise on this structure.

With the new data imported I then started work on an API for the site.  This will process all requests for data and will then return the data in either JSON or CSV format (with the front-end using JSON).  I created the endpoints necessary to make the ‘browse’ panel work – returning a section of the dictionary as headwords and links either based on entry ‘slugs’ (the URL-safe versions of headwords) or headword text, depending on whether the ‘browse up/down’ option or the ‘jump to’ option is chosen.  I also created an endpoint for displaying an entry, which returns all of the data for an entry including its full XML record.
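The API itself isn’t written in Python, but the shape of the endpoints is easy to illustrate. The following is purely an illustrative sketch using Flask and a tiny in-memory dataset – the paths, parameter names and data are invented for the example rather than being the actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# A tiny in-memory stand-in for the dictionary; the real data lives in a database.
ENTRIES = [
    {"slug": "abandoner", "headword": "abandoner", "xml": "<entry>...</entry>"},
    {"slug": "abatre", "headword": "abatre", "xml": "<entry>...</entry>"},
]


@app.route("/api/browse/<slug>")
def browse(slug):
    # Return a window of headwords and slugs around the given entry,
    # for the 'browse up/down' panel.
    idx = next((i for i, e in enumerate(ENTRIES) if e["slug"] == slug), 0)
    window = ENTRIES[max(0, idx - 10):idx + 10]
    return jsonify([{"slug": e["slug"], "headword": e["headword"]} for e in window])


@app.route("/api/jumpto")
def jump_to():
    # Return headwords at or after the typed text, for the 'jump to' option.
    text = request.args.get("q", "")
    return jsonify([e["headword"] for e in ENTRIES if e["headword"] >= text][:20])


@app.route("/api/entry/<slug>")
def entry(slug):
    # Return all of the data for one entry, including its full XML record.
    return jsonify(next((e for e in ENTRIES if e["slug"] == slug), {}))
```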

I then began work on the ‘entry’ page in the front-end, focussing initially on the ‘browse’ feature. By the end of the week this was fully operational, allowing the user to scroll up and down the list, select an item to load it or enter text into the ‘jump to’ box. There’s also a pop-up where info about how to use the browse can be added. The ‘jump to’ still needs some work, as if you type into it quickly it sometimes gets confused as to what content to show. I haven’t done anything about displaying the entry yet, other than displaying the headword. Currently the full versions of both the editors’ XML and the existing system’s XML are displayed. Below is a screenshot of how things currently look:

My last task of the week for the AND was to write a script to extract all of the headwords, variants and deviants from the entries to enable the quick search to work.  I set the script running and by the time it had finished executing there were more than 150,000 entries in the ‘forms’ table I’d created.
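In outline the extraction script just walks every entry’s XML, pulls out each form and writes it to the new table. A simplified Python version might look like this – the element names (‘lemma’, ‘variant’, ‘deviant’) and the table structure are placeholders rather than the actual markup:

```python
import sqlite3
from lxml import etree

# Placeholder element names - the real entry XML may label its forms differently.
FORM_TAGS = ("lemma", "variant", "deviant")

db = sqlite3.connect("and.db")
db.execute("""CREATE TABLE IF NOT EXISTS forms
              (id INTEGER PRIMARY KEY, entry_id TEXT, form TEXT, type TEXT)""")

tree = etree.parse("all_entries.xml")
for entry in tree.iter("entry"):
    entry_id = entry.get("id", "")
    for tag in FORM_TAGS:
        for el in entry.iter(tag):
            form = "".join(el.itertext()).strip()
            if form:
                db.execute("INSERT INTO forms (entry_id, form, type) VALUES (?, ?, ?)",
                           (entry_id, form, tag))
db.commit()
```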

Also this week I helped Rob Maslen to migrate his ‘City of Lost Books’ blog to a new URL, had a chat with the DSL people about updates to the search results page based on the work I did last week and had a chat with Thomas Clancy about three upcoming place-names projects.

I also returned to the Historical Thesaurus project and our ongoing attempts to extract dates from the Oxford English Dictionary in order to update the dates of attestation in the Historical Thesaurus. Firstly, I noticed that there were some issues with end dates for ranged dates before 1000 and I’ve fixed these (there were about 50 or so). Secondly, I noticed there are about 20 dates that don’t have a ‘year’, presumably because the ‘year’ attribute in the XML was empty. Some of these I can fix (and I have), but others have an empty ‘fullyear’ too, meaning the date tag was presumably empty in the XML, and I therefore deleted these.

We still needed to figure out how to handle OED dates that have a dot in them. These are sometimes used (about 4,000 times, in fact) to show roughly where a date falls so that it is placed correctly in the sequence of dates (e.g. ‘14..’ is given the year ‘1400’). But sometimes a date has a dot and a specific year (e.g. ‘14..’ but ‘1436’). We figured out that this is to ensure the correct ordering of the date after an earlier specific date. Fraser therefore wanted these dates to be ‘ante’ the next known date. I therefore wrote a script that finds all lexemes that have at least one date with a dot and a specific year, then for each of these lexemes it gets all of the dates in order. Each date is displayed, with the ‘fullyear’ displayed first and the ‘year’ in brackets. If the date is a ‘.’ date then it is highlighted in yellow. For each of these the script then tries to find the next date in the sequence that isn’t another ‘.’ date (as sometimes there are several in a row). If it finds one then this row’s ‘year’ becomes that date’s year with ‘a’ added. If it doesn’t find one (e.g. if the ‘.’ date is the last date for the lexeme) then it retains the year from the ‘.’ date but with ‘a’ added. Next week I will run this script to actually update the data and we will then move on to using the new OED data with the HT’s lexemes.
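As a rough illustration of the look-ahead logic (with simplified data structures rather than the actual database tables):

```python
def resolve_dot_dates(dates):
    """dates: list of dicts with 'fullyear' (the OED date text) and 'year'
    (the numeric year), already sorted into citation order.  Returns the
    updates to apply as (index, new_year, prefix) tuples."""
    updates = []
    for i, d in enumerate(dates):
        if "." not in d["fullyear"]:
            continue
        # Look ahead for the next date that isn't itself a '.' date.
        nxt = next((x for x in dates[i + 1:] if "." not in x["fullyear"]), None)
        if nxt:
            # Treat the dotted date as 'ante' the next known date.
            updates.append((i, nxt["year"], "a"))
        else:
            # No later specific date: keep the year but still mark it 'ante'.
            updates.append((i, d["year"], "a"))
    return updates


# Example: '14..' recorded with the specific year 1436, followed by 1445
print(resolve_dot_dates([
    {"fullyear": "a1400", "year": 1400},
    {"fullyear": "14..",  "year": 1436},
    {"fullyear": "1445",  "year": 1445},
]))  # -> [(1, 1445, 'a')]
```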

Week Beginning 17th August 2020

I lost most of Tuesday this week to root canal surgery, which was uncomfortable and exhausting but thankfully not too painful.  Unfortunately my teeth are still not right and I now have a further appointment booked for next week, but at least the severe toothache that I had previously has now stopped.

I continued to work on the requirements document for the redevelopment of the Anglo-Norman Dictionary this week, and managed to send a first completed version of it to Heather Pagan for feedback. It will no doubt need some further work, but it’s good to have a clearer picture of how the new version of the website will function. Also this week I investigated another bizarre situation with the AND’s data. I have access to the full dataset used to power the existing website as a single XML file containing all of the entries. The Editors are also working on individual entries as single XML files that are then uploaded to the existing website using a content management system. What we didn’t realise until now is that the structure of the XML files is transformed when an entry is ingested into the online system. For example, the ‘language’ tag is changed from <language lang="M.E."/> to <lbl type="lang">M.E.</lbl>. Similarly, part of speech is transformed from <pos type="s."/> to <gramGrp><pos>s.</pos></gramGrp>. We have no idea why the developer of the system chose to do this, as it seems completely unnecessary and it’s a process that doesn’t appear to be documented anywhere. The crazy thing is that the transformed XML still then needs to be further transformed to HTML for display, so what appears on screen is two steps removed from the data the editors work with. It also means that I don’t have access to the data in the form the editors are working with, meaning I can’t just take their edits and use them in the new site.

As we ideally want to avoid a situation where we have two structurally different XML datasets for the dictionary, I wanted to try and find a way to transform the data I have into the structure used by the editors. I attempted to do this by looking at the code for the existing content management system to try to decipher where the XML gets transformed. There is an option for extracting an entry from the online system for offline editing, and this transforms the XML into the format used by the editors. I figured that if I could understand how this process works and replicate it, I would be able to apply it to the full XML dictionary file, giving me the complete dataset in the same format the editors are working with, and we could just use this in the redevelopment.

It was not easy to figure out what the system is up to, but I managed to ascertain that when you enter a headword for export this triggers a Perl script, which in turn uses an XSLT stylesheet; I managed to track down a version of this stylesheet that appears to have been last updated in 2014. I then wrote a little script that takes the XML of the entry for ‘padlock’ as found in the online data and applies this stylesheet to it, in the hope that it would give me an XML file identical to the one exported by the CMS.
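The test script itself wasn’t in Python, but applying a stylesheet to a single entry amounts to very little code. With lxml (and placeholder filenames) it’s roughly:

```python
from lxml import etree

# Placeholder filenames: the entry XML comes from the full dictionary dump
# and the stylesheet from the CMS code.
stylesheet = etree.XSLT(etree.parse("cms_export.xsl"))
entry = etree.parse("padlock_online.xml")

result = stylesheet(entry)
print(str(result))  # compare this against the file exported by the CMS
```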

The script successfully executed, but the resulting XML was not quite identical to the file exported by the CMS. There was no ‘doctype’ or DTD reference; the ‘attestation’ ID was the entry ID with an auto-incrementing ‘C’ number appended to it (AND-201-02592CE7-42F65840-3D2007C6-27706E3A-C001) rather than the ID of the <cit> element (C-11c4b015); and <dateInfo> was not processed, with only the contents of the tags within <dateInfo> being displayed.

I’m not sure why these differences exist. It’s possible I only have access to an older version of the XSLT file, and I’m guessing this must be the case because the missing or differently formatted data does not appear to be instated elsewhere (e.g. in the Perl script). What I then did was modify the XSLT file to ensure that the changes are applied: the doctype is added in, the ‘attestation’ ID is correct and the <dateInfo> section contains the full data.

I could try applying this script to every entry in the full data file I have, although I suspect there will be other situations that the XSLT file I have is not set up to successfully process.

I therefore tried to investigate another alternative, which was to write a script that will pass the headword of every dictionary item to the ‘Retrieve an entry for editing’ script in the CMS, saving the results of each.  I considered that might be more likely to work reliably for every entry, but that we may run into issues with the server refusing so many requests.  After a few test runs, I set the script loose on all 53,000 or so entries in the system and although it took several hours to run, the process did appear to work for the most part.  I now have the data in the same structure as the editors work with, which should mean we can standardise on this format and abandon the XML structure used by the existing online system.
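The harvesting script was essentially a loop over the headword list with a polite pause between requests. A simplified sketch (with an invented URL and parameter name, and no error handling beyond logging failures) would be:

```python
import time
from pathlib import Path

import requests

# The CMS URL and parameter name here are placeholders, and any authentication
# or session handling is omitted.
CMS_URL = "https://example.org/and-cms/retrieve_entry"

Path("exported").mkdir(exist_ok=True)
headwords = [line.strip() for line in
             Path("headwords.txt").read_text(encoding="utf-8").splitlines()
             if line.strip()]

for hw in headwords:
    resp = requests.get(CMS_URL, params={"headword": hw}, timeout=60)
    if resp.ok:
        Path(f"exported/{hw}.xml").write_text(resp.text, encoding="utf-8")
    else:
        print(f"Failed for {hw}: {resp.status_code}")
    time.sleep(0.5)  # pause between requests to avoid hammering the server
```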

Also this week I fixed an issue with links through to the Bosworth and Toller Old English dictionary from the Thesaurus of Old English. Their site has been redeveloped and they’ve changed the way their URLs work without setting up redirects from the old URLs, meaning all our links from words in the TOE to words on their site were broken. URLs for their entries now just use a unique ID rather than the word (e.g. http://bosworthtoller.com/28286), which seems like a bit of a step backwards. They’ve also got rid of length marks and are using acute accents on characters instead, which is a bit strange. The change to an ID in the URL means we can no longer link to a specific entry, as we can’t possibly know what IDs they’re using for each word. However, we can link to their search results page instead – e.g. http://bosworthtoller.com/search?q=sōfte works – and I updated the TOE to use such links.

I also continued with the processing of OED dates for use in the Historical Thesaurus, after my date extraction script finished executing over the weekend. This week I investigated OED dates that have a dot in them instead of a full date. There are 4,498 such dates and these mostly have the lower date as the one recorded in the ‘year’ attribute by the OED, e.g. ‘138.’ is 1380 and ‘17..’ is 1700. However, sometimes a specific date is given in the ‘year’ attribute despite the presence of a full stop in the date tag. For example, one entry has ‘1421’ in the ‘year’ attribute but ‘14..’ in the date tag. There are just over a thousand dates where there are two dots but the ‘year’ given does not end in ‘00’. Fraser reckons this is to do with ordering the dates in the OED and I’ll need to do some further work on this next week.

In addition to the above I continued to work on the Books and Borrowing project. I made some tweaks to the CMS to make it easier to edit records: when a borrowing record is edited the page now automatically scrolls down to the record that was edited, and this also happens for books and borrowers when accessed and edited from the ‘Books’ and ‘Borrowers’ tabs in a library. I also wrote an initial script that will help to merge some of the duplicate author records we have in the system due to existing data in different formats being uploaded from different libraries. It strips all of the non-alpha characters from the forename and surname fields, makes them lower case and then joins them together. So, for example, author ID (AID) 111 has ‘Arthur’ as forename and ‘Bedford’ as surname while AID 1896 has nothing for forename and ‘Bedford, Arthur, 1668-1745’ as surname. When stripped and joined together these both become ‘bedfordarthur’ and we have a match.
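The matching key is simple enough to show in a few lines of Python (this is the gist of it rather than the exact code):

```python
import re


def author_key(forename: str, surname: str) -> str:
    """Strip non-alphabetical characters, lower-case and join the two fields."""
    return re.sub(r"[^a-z]", "", ((surname or "") + (forename or "")).lower())


# AID 111: forename 'Arthur', surname 'Bedford'
# AID 1896: no forename, surname 'Bedford, Arthur, 1668-1745'
print(author_key("Arthur", "Bedford"))                # bedfordarthur
print(author_key("", "Bedford, Arthur, 1668-1745"))   # bedfordarthur
```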

There are 162 matches that have been identified, some consisting of more than two matched author records.  I exported these as a spreadsheet.  Each row includes the author’s AID, title, forename, surname, othername, born and died (each containing ‘c’ where given), a count of the number of books the record is associated with and the AID of the record that is set to be retained for the match.  This defaults to the first record, which also appears in bold, to make it easier to see where a new batch of duplicates begins.

The editors can then go through this spreadsheet and reassign the ‘AID to keep’ field to a different row. E.g. for Francis Bacon the AID to keep is given as 1460; if the second record for Francis Bacon should be kept instead, the editor would just need to change the value in this column for all three Francis Bacons to the AID for that row, which is 163. Similarly, if something has been marked as a duplicate and it’s wrong, the ‘AID to keep’ can be set accordingly. E.g. there are four ‘David Hume’ records, but looking at the dates at least one of these is a different person; to keep the record with AID 1610 separate, replace the AID 1623 in the ‘AID to keep’ column with 1610. It is likely that this spreadsheet will also be used to manually split up the imported authors that just have all their data in the surname column. Someone could, for example, take the record that has ‘Hume, David, 1560?-1630?’ in the surname column and split this into the correct columns.

I also generated a spreadsheet containing all of the authors that appear to be unique. This will also need checking, as there are a few duplicates that haven’t been picked up. For example, AID 1956 ‘Heywood, Thomas, d. 1641’ and 1570 ‘Heywood, Thomas, -1641.’ haven’t been matched because of that ‘d’. Similarly, AID 1598 ‘Buffon, George Louis Leclerc, comte de, 1707-1788’ and 2274 ‘Buffon, Georges Louis Leclerc, comte de, 1707-1788.’ haven’t been matched up because one is ‘George’ and the other ‘Georges’. Accented characters have also not been properly matched, e.g. AID 1457 ‘Beze, Theodore de, 1519-1605’ and 397 ‘Bèze, Théodore de, 1519-1605.’. I could add in a Levenshtein test that matches up things that are one character different and update the script to properly take accented characters into account for matching purposes, or these are things that could just be sorted out manually.
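A minimal sketch of how that might work, operating on the normalised keys and assuming a cut-off of a single edit (both of which would need checking against the real data):

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Replace accented characters with their unaccented equivalents."""
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def near_match(key_a: str, key_b: str) -> bool:
    """Treat two author keys as duplicates if, once accents are folded,
    they are at most one character apart."""
    return levenshtein(fold_accents(key_a), fold_accents(key_b)) <= 1


print(near_match("bezetheodorede", "bèzethéodorede"))   # True
print(near_match("heywoodthomasd", "heywoodthomas"))    # True (the stray 'd.')
```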

Ann Fergusson of the DSL got back to me this week after rigorously testing the search facilities of our new test versions of the DSL API (V2 containing data from the original API and V3 containing data that has been edited since the original API was made). Ann had spotted some unexpected behaviour in some of the searches and I spent some time investigating these and fixing things where possible. There were some cases where incorrect results were being returned when a ‘NOT’ search was performed on a selected source dictionary, due to the positioning of the source dictionary in the query. This was thankfully easy to fix. There was also an issue with some exact searches of the full text failing to find entries. When the full text is ingested into Solr all of the XML tags are stripped out, and if there are no spaces between tagged words then the words end up squashed together. For example: ‘Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>’. With the tags (and punctuation) stripped out we’re left with ‘WestminsterB’, so an exact search for ‘westminster’ fails to find this entry, while a search for ‘westminsterb’ finds it, which confirms the diagnosis. I suspect this situation is going to crop up quite a lot, so I will need to update the script that prepares content for Solr so that it adds spaces after tags before stripping them and then removes multiple spaces between words.
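The preparation script isn’t in Python, but the fix amounts to something like the following:

```python
import re


def strip_for_solr(xml_fragment: str) -> str:
    """Turn tags (and punctuation) into spaces rather than removing them
    outright, then collapse runs of whitespace, so that adjacent tagged
    words stay separate instead of being squashed together."""
    text = re.sub(r"<[^>]+>", " ", xml_fragment)   # tags become spaces
    text = re.sub(r"[^\w\s]", " ", text)           # punctuation becomes spaces
    return re.sub(r"\s+", " ", text).strip()       # collapse multiple spaces


sample = "Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>"
print(strip_for_solr(sample))  # 'Westminster B Attrib' rather than 'WestminsterB'
```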

Week Beginning 10th August 2020

I was back at work this week after spending two weeks on holiday, during which time we went to Skye, Crinan and Peebles.  It was really great to see some different places after being cooped up at home for 19 weeks and I feel much better for having been away.  Unfortunately during this week I developed a pretty severe toothache and had to make an emergency appointment with the dentist on Thursday morning.  It turns out I need root canal surgery and am now booked in to have this next Tuesday, but until then I need to try and just cope with the pain, which has been almost unbearable at times, despite regular doses of both ibuprofen and paracetamol.  This did affect my ability work a bit on Thursday afternoon and Friday, but I managed to struggle through.

On my return to work from my holiday on Monday I spent some time catching up with emails that had accumulated while I was away, including replying to Roslyn Potter in Scottish Literature about a project website, replying to Jennifer Smith about giving a group of students access to the SCOSYA data and making changes to the Berwickshire Place-names website to make it more attractive to the REF reviewers based on feedback passed on by Jeremy Smith. I also created a series of high-resolution screenshots of the resource for Carole Hough for a publication and had an email chat with Luca Guariento about linked open data.

I also fixed some issues with the Galloway Glens project that Thomas Clancy had spotted, including an issue with the place-name element page, which was not ordering accented characters properly – all accented characters were being listed at the end rather than alongside their non-accented versions. It turned out that while the underlying database orders accented characters correctly, for the elements list I need to get a list of elements used in place-names and a list of elements used in historical forms and then combine these lists and reorder the resulting single list. This part of the process was not dealing with all accented characters, only a limited set that I’d created for Berwickshire that also dealt with ashes and thorns. Instead I added in a function taken from WordPress that converts all accented characters to their unaccented equivalents for the purposes of ordering, and this ensured the elements list was ordered correctly.
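The function I used is PHP (taken from WordPress), but the idea is just to build an unaccented sort key while leaving the displayed form alone – something along these lines, with the element list and character mappings here being purely illustrative:

```python
import unicodedata


def sort_key(element: str) -> str:
    """Build an unaccented, lower-case key purely for ordering; special
    characters like ash and thorn get rough equivalents."""
    text = element.lower().replace("æ", "ae").replace("þ", "th").replace("ð", "d")
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))


elements = ["dùn", "allt", "càrn", "achadh", "baile"]
print(sorted(elements, key=sort_key))
# ['achadh', 'allt', 'baile', 'càrn', 'dùn'] - accented forms interfiled correctly
```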

The rest of my week was divided between three projects, the first of which was the Books and Borrowing project. For this I spent some time working with some of the digitised images of the register pages. We now have access to the images from Westerkirk library, and in these the records appear in a table that spreads across both recto and verso pages, but we only have images of the individual pages. The project RA who is transcribing the records is treating both recto and verso as a single ‘page’ in the system, which makes sense. We therefore need to stitch the recto and verso images together into one single image to be associated with this ‘page’. I downloaded all of the images and have found a way to automatically join two page images together. However, there is rather a lot of overlap in the images, meaning the book appears to have two joins and some columns are repeated. I could possibly try to automatically crop the images before joining them, but there is quite a bit of variation in the size of the overlap so this is never going to be perfect and may result in some information getting lost. The other alternative would be to manually crop and join the images, which I did some experimentation with. It’s still not perfect due to the angle of the page changing between shots, but it’s a lot better. The downside with this approach is that someone would have to do the task. There are about 230 images, so about 115 joins, each one taking 2-3 minutes to create, so maybe about 5 or so hours of effort. I’ve left it with the PI and Co-I to decide what to do about this. I also downloaded the images for Volume 1 of the register for Innerpeffray library and created tilesets for these that will allow the images to be zoomed and panned. I also fixed a bug relating to adding new book items to a record and responded to some feedback about the CMS.

My second major project of the week was the Anglo-Norman Dictionary. This week I began writing a high-level requirements document for the new AND website that I will be developing. This meant going through the existing site in detail and considering which features will be retained, how things might be handled better, and how I might develop the site. I made good progress with the document, and by the end of the week I’d covered the main site. Next week I need to consider the new API for accessing the data and the staff pages for uploading and publishing new or newly edited entries. I also responded to a few questions from Heather Pagan of the AND about the searches and read through and gave feedback on a completed draft of the AHRC proposal that the team are hoping to submit next month.

My final major project of the week was the Historical Thesaurus, for which I updated and re-executed my OED date extraction script based on feedback from Fraser and Marc. It was a long and complicated process to update the script as there are literally millions of dates and some issues only appear a handful of times, so tracking them down and testing things is tricky. However, I made the following changes: I added a ‘sortdate_new’ column to the main OED lexeme table that holds the sortdate value from the new XML files, which may differ from the original value. I’ve done some testing and, rather strangely, there are many occasions where the new sortdate differs from the old but the ‘revised’ flag is not set to ‘true’. I also updated the new OED date table to include a new column that contains the full date text, as I thought this would be useful for tracing back issues; e.g. if the OED date is ‘?c1225’ this is stored here. The actual numeric year in my table now comes from the ‘year’ attribute in the XML instead, which always contains the numeric value for the OED date, e.g. <q year="1330"><date>c1330</date></q>. New lexemes in the data are now getting added into the OED lexeme table and are also having their dates processed. I’ve added a new column called ‘newaugust2020’ to track these new lexemes. We’ll possibly have to try and match them up with existing HT lexemes at some point, unless we can consider them all to be ‘new’, meaning they’ll have no matches. The script also now stores all of the various OE dates, rather than a single OE date of 650 being added for all. I set the script running on Thursday and by Sunday it had finished executing, resulting in 3,912,109 dates being added and 4,061 new words.
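For reference, pulling the two values out of a quotation element is trivial – a minimal illustration with lxml, with the surrounding file handling and database writes obviously omitted:

```python
from lxml import etree

fragment = '<q year="1330"><date>c1330</date></q>'
q = etree.fromstring(fragment)

year = int(q.get("year"))      # 1330 - the numeric year used for my 'year' column
fullyear = q.findtext("date")  # 'c1330' - the full date text I now also store
print(year, fullyear)
```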