I worked on many different projects this week, and the largest amount of my time went into the redevelopment of the Anglo-Norman Dictionary. I processed a lot of the data this week and have created database tables and written extraction scripts to export labels, parts of speech, forms and cross references from the XML. The data extracted will be used for search purposes, for display on the website in places such as the search results or will be used to navigate between entries. The scripts will also be used when updating data in the new content management system for the dictionary when I write it. I have extracted 85,397 parts of speech, 31,213 cross references, 150,077 forms and their types (lemma / variant / deviant) and 86,269 labels which correspond to one of 157 unique labels (usage or semantic), which I also extracted.
I have also finished work on the quick search feature, which is now fully operational. This involved creating a new endpoint in the API for processing the search. This includes the query for the predictive search (i.e. the drop-down list of possible options that appears as you type), which returns any forms that match what you’re typing in and the query for the full quick search, which allows you to use ‘?’ and ‘*’ wildcards (and also “” for an exact match) and returns all of the data about each entry that is needed for the search results page. For example, if you type in ‘from’ in the ‘Quick Search’ box a drop-down list containing all matching forms will appear. Note that these are forms not only headwords so they include lemmas but also variants and deviants. If you select a form that is associated with one single entry then the entry’s page will load. If you select a form that is associated with more than one entry then the search results page will load. You can also choose to not select an item from the drop-down list and search for whatever you’re interested in. For example, enter ‘*ment’ and press enter or the search button to view all of the forms ending in ‘ment’, as the following screenshot demonstrates (note that this is not the final user interface but one purely for test purposes):
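As a rough illustration of how the ‘?’ and ‘*’ wildcards and the quoted exact match could be interpreted, here is a minimal Python sketch. It is not the code behind the API: the function names are made up, straight double quotes stand in for the quote marks, and treating an unquoted term with no wildcards as a whole-form match is an assumption.

```python
import re

def wildcard_to_regex(term):
    """Translate a quick-search term into a case-insensitive regex (hypothetical helper)."""
    # A quoted term is an exact match, so any wildcard characters inside are literal.
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return re.compile(re.escape(term[1:-1]) + r"\Z", re.IGNORECASE)
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")   # any run of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.IGNORECASE)

def search_forms(term, forms):
    """Return the forms that match the term from start to end."""
    pattern = wildcard_to_regex(term)
    return [form for form in forms if pattern.match(form)]
```

With this sketch, `search_forms("*ment", forms)` keeps only the forms ending in ‘ment’. In a database-backed search the same translation would more likely target SQL `LIKE` patterns (`%` and `_`).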
With this example you’ll see that the results are paginated, with 100 results per page. You can browse through the pages using the next and previous buttons or select one of the pages to jump directly to it. You can bookmark specific results pages too. Currently the search results display the lemma and homonym number (if applicable) and display whether the entry is an xref or not. Associated parts of speech appear after the lemma. Each one currently has a tooltip and we can add in descriptions of what each POS abbreviation means, although these might not be needed. All of the variant / deviant forms are also displayed as otherwise it can be quite confusing for users if the lemma does not match the term the user entered but a form does. All associated semantic / usage labels are also displayed. I’m also intending to add in earliest citation date and possibly translations to the results as well, but I haven’t extracted them yet.
When you click on an entry from the search results this loads the corresponding entry page. I have updated this to add in tabs to the left-hand column. In addition to the ‘Browse’ tab there is a ‘Results’ tab and a ‘Log’ tab. The latter doesn’t contain anything yet, but the former contains the search results. This allows you to browse up and down the search results in the same way as the regular ‘browse’ feature, selecting another entry. You can also return to the full results page. I still need to do some tweaking to this feature, such as ensuring the ‘Results’ tab loads by default if coming from a search result. The ‘clear’ option also doesn’t currently work properly. I’ll continue with this next week.
For the Books and Borrowing project I spent a bit of time getting the page images for the Westerkirk library uploaded to the server and the page records created for each corresponding page image. I also made some final tweaks to the Glasgow Students pilot website that Matthew Sangster and I worked on and this is now live and available here: https://18c-borrowing.glasgow.ac.uk/.
There are three new place-name related projects starting up at the moment and I spent some time creating initial websites for all of these. I still need to add in the place-name content management systems for two of them, and I’m hoping to find some time to work on this next week. I also spoke to Joanna Kopaczyk about a website for an RSE proposal she’s currently putting together and gave some advice to some people in Special Collections about a project that they are planning.
On Tuesday I had a Zoom call with the ‘Editing Robert Burns’ people to discuss developing the website for phase two of the Editing Robert Burns project. We discussed how the website would integrate with the existing website (https://burnsc21.glasgow.ac.uk/) and discussed some of the features that would be present on the new site, such as an interactive map of Burns’ correspondence and a database of forged items.
I also had a meeting with the Historical Thesaurus people on Tuesday and spent some time this week continuing to work on the extraction of dates from the OED data, which will feed into a new second edition of the HT. I fixed all of the ‘dot’ dates in the HT data. These are dates where a dot is used in place of a specific date (e.g. 14..); sometimes a specific year is given in the year attribute (e.g. 1432), while at other times a more general year is given (e.g. 1400). We worked out a set of rules for dealing with these and I created a script to process them. I then reworked my script that extracts dates for all lexemes that match a specific date pattern (YYYY-YYYY, where the first year might be Old English and the last year might be ‘Current’) and sent this to Fraser so that the team can decide which of these dates should be used in the new version of the HT. Next week I’ll begin work on a new version of the HT website that uses an updated dataset so we can compare the original dates with the newly updated ones.
I needed a further two trips to the dentist this week, which lost me some time due to my dentist being the other side of the city from where I live (but very handy for my office at work that I’m not currently allowed to use). Despite these interruptions I managed to get a decent amount done this week. For the Books and Borrowing project I processed the images of a register from Westerkirk library. For this register I needed to stitch together the images of the left and right pages to make a single image, as each spread features a table that covers both pages. As we didn’t want to have to manually join hundreds of images I wrote a script that did this, leaving a margin between the two images as they don’t line up perfectly. I used the command-line tool Imagemagick to achieve this – firstly adding the margin to the left-hand image and secondly joining this to the right-hand image. I then needed to generate tilesets of the images using Zoomify, but when I came to do so the converter processed the images the wrong way round – treating them as portrait rather than landscape and resulting in tilesets that were all wrong. I realised that when joining the page images together the image metadata hadn’t been updated: two portrait images were joined together to make one landscape image, but the metadata still suggested that the image was portrait, which confused the Zoomify converter. I therefore had to run the images through Imagemagick again to strip out all of the metadata and then rotate the images 90 degrees clockwise, which resulted in a set of images I could then upload to the server.
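The stitching steps described above could be scripted along the following lines. This is only a sketch that builds the command lines: the margin width, the white background, the temporary file name and the use of the ImageMagick 6 style `convert` binary (version 7 installs typically use `magick`) are all assumptions, not details of the actual script.

```python
import subprocess

def stitch_commands(left, right, out, margin_px=40):
    """Build ImageMagick commands to join a left and right page into one spread."""
    padded = out + ".left.png"  # hypothetical temporary file for the padded left page
    return [
        # Step 1: add a blank margin to the right-hand edge of the left page.
        ["convert", left, "-background", "white", "-gravity", "east",
         "-splice", f"{margin_px}x0", padded],
        # Step 2: join the padded left page and the right page side by side.
        ["convert", padded, right, "+append", out],
        # Step 3: strip the stale metadata, then rotate 90 degrees clockwise.
        ["convert", out, "-strip", "-rotate", "90", out],
    ]

def stitch(left, right, out, margin_px=40):
    """Run the commands in order; requires ImageMagick to be installed."""
    for command in stitch_commands(left, right, out, margin_px):
        subprocess.run(command, check=True)
```

Looping `stitch()` over each pair of page images would then produce one joined image per spread.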
Also this week I made some further tweaks to Matthew Sangster’s pilot project featuring the Glasgow Student data, which we will be able to go live with soon. This involved adding in a couple of missing page images, fixing some encoding issues with Greek characters in a few book titles, fixing a bug that was preventing the links to pages from the frequency lists working, ensuring any rows that are to be omitted from searches were actually being omitted and adding in tooltips for the table column headers to describe what the columns mean.
I also made some progress with the redevelopment of the Anglo-Norman Dictionary. I had a Zoom meeting with the editors on Wednesday, which went very well, and resulted in me making some changes to the system documentation I had previously written. I also worked on an initial structure for the new dictionary website, setting up WordPress for the ancillary pages and figuring out how to create a WordPress theme that is based on Bootstrap. This was something I hadn’t done before and it was a good learning experience. It mostly went pretty smoothly, but getting a WordPress menu to use Bootstrap’s layout was a little tricky. Thankfully someone has already solved the issue and has made the code available to use (see https://github.com/wp-bootstrap/wp-bootstrap-navwalker) so I could just integrate this with my theme.
I completed work on the theme and generated placeholder pages and menu items for all the various parts of the site. The page design is just my temporary page design for now, looking very similar to the Books and Borrowing CMS design, but this will be replaced with something more suitable in time. With this in place I regenerated the XML data from the existing CMS based on the final ‘entry_hash’ data I had. This was even more successful than my first attempt with an earlier version of the data last week and resulted in all but 35 of the 54,025 dictionary entries being generated. This XML has the same structure as the files being used by the editors, so we will now be able to standardise on this structure.
With the new data imported I then started work on an API for the site. This will process all requests for data and will then return the data in either JSON or CSV format (with the front-end using JSON). I created the endpoints necessary to make the ‘browse’ panel work – returning a section of the dictionary as headwords and links either based on entry ‘slugs’ (the URL-safe versions of headwords) or headword text, depending on whether the ‘browse up/down’ option or the ‘jump to’ option is chosen. I also created an endpoint for displaying an entry, which returns all of the data for an entry including its full XML record.
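To give an idea of what returning ‘a section of the dictionary’ involves, here is a small Python sketch of the lookup, assuming the slugs are held in an alphabetically sorted list. The function name, the window sizes and the sample slugs are invented for illustration; the real endpoints query the database.

```python
from bisect import bisect_left

def browse_section(slugs, target, before=2, after=2):
    """Return the slice of a sorted slug list centred on (or nearest to) target."""
    if not slugs:
        return []
    # bisect_left finds the target itself, or the slot where it would sit,
    # which is what a 'jump to' on partially typed text needs.
    pos = min(bisect_left(slugs, target), len(slugs) - 1)
    start = max(0, pos - before)
    return slugs[start:pos + after + 1]
```

Browsing up or down would then amount to calling this again with the first or last slug of the section currently on screen.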
I then began work on the ‘entry’ page in the front-end, focussing initially on the ‘browse’ feature. By the end of the week this was fully operational, allowing the user to scroll up and down the list, select an item to load it or enter text into the ‘jump to’ box. There’s also a pop-up where info about how to use the browse can be added. The ‘jump to’ still needs some work as if you type fast into it it sometimes gets confused as to what content to show. I haven’t done anything about displaying the entry yet, other than displaying the headword. Currently the full versions of both the editor’s and the existing system XML are displayed. Below is a screenshot of how things currently look:
My last task of the week for the AND was to write a script to extract all of the headwords, variants and deviants from the entries to enable the quick search to work. I set the script running and by the time it had finished executing there were more than 150,000 entries in the ‘forms’ table I’d created.
Also this week I helped Rob Maslen to migrate his ‘City of Lost Books’ blog to a new URL, had a chat with the DSL people about updates to the search results page based on the work I did last week and had a chat with Thomas Clancy about three upcoming place-names projects.
I also returned to the Historical Thesaurus project and our ongoing attempts to extract dates from the Oxford English Dictionary in order to update the dates of attestation in the Historical Thesaurus. Firstly, I noticed that there were some issues with end dates for ranged dates before 1000 and I’ve fixed these (there were about 50 or so). Secondly, I noticed there are about 20 dates that don’t have a ‘year’ as presumably the ‘year’ attribute in the XML was empty. Some of these I can fix (and I have), but others also have an empty ‘fullyear’ too, meaning the date tag was presumably empty in the XML and I therefore deleted these.
We still needed to figure out how to handle OED dates that have a dot in them. These are sometimes used (well, used about 4,000 times) to show roughly where a date comes so that it is placed correctly in the sequence of dates (e.g. ’14..’ is given the year ‘1400’). But sometimes a date has a dot and a specific year (e.g. ’14..’ but ‘1436’). We figured out that this is to ensure the correct ordering of the date after an earlier specific date. Fraser therefore wanted these dates to be ‘ante’ the next known date. I therefore wrote a script that finds all lexemes that have at least one date that has a dot and a specific year, then for each of these lexemes it gets all of the dates in order. Each date is displayed, with the ‘fullyear’ displayed first and the ‘year’ in brackets. If the date is a ‘.’ date then it is highlighted in yellow. For each of these the script then tries to find the next date in sequence that isn’t another ‘.’ date (as sometimes there are several). If it finds one then the date becomes this row’s ‘year’ plus ‘a’. If it doesn’t find one (e.g. if the ‘.’ date is the last date for the lexeme) then it retains the year from the ‘.’ date but with ‘a’ added. Next week I will run this script to actually update the data and we will then move on to using the new OED data with the HT’s lexemes.
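The rule for the ‘.’ dates can be sketched in Python as follows. This is an illustration rather than the script itself: the data is assumed to be a list of (fullyear, year) pairs already in sequence order, and writing the ‘ante’ marker as a prefix (e.g. ‘a1450’) is an assumption about the final format.

```python
def resolve_dot_dates(dates):
    """dates: list of (fullyear, year) pairs for one lexeme, in sequence order."""
    resolved = []
    for i, (fullyear, year) in enumerate(dates):
        if "." not in fullyear:
            resolved.append(str(year))
            continue
        # Look ahead for the next date that is not itself a '.' date.
        next_year = next((y for f, y in dates[i + 1:] if "." not in f), None)
        # Use that year if one exists, otherwise keep the '.' date's own year.
        resolved.append("a" + str(next_year if next_year is not None else year))
    return resolved
```

For a lexeme with the dates 1420, 14.. (1436), 14.. (1437), 1450 and 15.. (1500) this gives 1420, a1450, a1450, 1450 and a1500.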
I lost most of Tuesday this week to root canal surgery, which was uncomfortable and exhausting but thankfully not too painful. Unfortunately my teeth are still not right and I now have a further appointment booked for next week, but at least the severe toothache that I had previously has now stopped.
I continued to work on the requirements document for the redevelopment of the Anglo-Norman Dictionary this week, and managed to send a first completed version of it to Heather Pagan for feedback. It will no doubt need some further work, but it’s good to have a clearer picture of how the new version of the website will function. Also this week I investigated another bizarre situation with the AND’s data. I have access to the full dataset as is used to power the existing website as a single XML file containing all of the entries. The Editors are also working on individual entries as single XML files that are then uploaded to the existing website using a content management system. What we didn’t realise up until now is that the structure of the XML files is transformed when an entry is ingested into the online system. For example, the ‘language’ tag is changed from <language lang="M.E."/> to <lbl type="lang">M.E.</lbl>. Similarly, part of speech has been transformed from <pos type="s."/> to <gramGrp><pos>s.</pos></gramGrp>. We have no idea why the developer of the system chose to do this, as it seems completely unnecessary and it’s a process that doesn’t appear to be documented anywhere. The crazy thing is that the transformed XML still then needs to be further transformed to HTML for display so what appears on screen is two steps removed from the data the editors work with. It also means that I don’t have access to the data in the form that the editors are working with, meaning I can’t just take their edits and use them in the new site.
As we ideally want to avoid the situation where we have two structurally different XML datasets for the dictionary I wanted to try and find a way to transform the data I have into the structure used by the editors. I attempted to do this by looking at the code for the existing content management system to try to decipher where the XML is getting transformed. There is an option for extracting an entry from the online system for offline editing and this transforms the XML into the format used by the editors. I figured that if I can understand how this process works and replicate it then I will be able to apply this to the full XML dictionary file and then I will have the complete dataset in the same format as the editors are working with and we can just use this in the redevelopment.
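To make the two structural differences mentioned above concrete, here is a small Python sketch that reverses them using the standard library’s ElementTree. It covers only the language and part-of-speech examples given earlier, the function names and the wrapping `<entry>` element in the usage example are made up, and it is not the XSLT-based process used by the CMS.

```python
import xml.etree.ElementTree as ET

def _collapse(element, new_tag, attr, value):
    """Turn an element into an empty one holding its old text in an attribute."""
    tail = element.tail          # clear() also wipes the tail, so keep it
    element.clear()
    element.tag = new_tag
    element.set(attr, value)
    element.tail = tail

def to_editor_format(xml_text):
    """Convert online-system markup back to the editors' structure (two rules only)."""
    root = ET.fromstring(xml_text)
    # <lbl type="lang">M.E.</lbl>  ->  <language lang="M.E."/>
    for lbl in list(root.iter("lbl")):
        if lbl.get("type") == "lang":
            _collapse(lbl, "language", "lang", lbl.text or "")
    # <gramGrp><pos>s.</pos></gramGrp>  ->  <pos type="s."/>
    for gram in list(root.iter("gramGrp")):
        pos = gram.find("pos")
        if pos is not None and len(gram) == 1:
            _collapse(gram, "pos", "type", pos.text or "")
    return ET.tostring(root, encoding="unicode")
```

Passing in `<entry><lbl type="lang">M.E.</lbl><gramGrp><pos>s.</pos></gramGrp></entry>` returns an entry containing an empty `language` element with `lang="M.E."` and an empty `pos` element with `type="s."`.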
It was not easy to figure out what the system is up to, but I managed to ascertain that when you enter a headword for export this then triggers a Perl script and this in turn uses an XSLT stylesheet, which I managed to track down a version of that appears to have been last updated in 2014. I then wrote a little script that takes the XML of the entry for ‘padlock’ as found in online data and applies this stylesheet to it, in the hope that it would give me an XML file identical to the one exported by the CMS.
The script successfully executed, but the resulting XML was not quite identical to the file exported by the CMS. There was no ‘doctype’ and DTD reference, the ‘attestation’ ID was the entry ID with an auto-incrementing ‘C’ number appended to it (AND-201-02592CE7-42F65840-3D2007C6-27706E3A-C001) rather than the ID of the <cit> element (C-11c4b015), and <dateInfo> was not processed, and only the contents of the tags within <dateInfo> were being displayed.
I’m not sure why these differences exist. It’s possible I only have access to an older version of the XSLT file. I’m guessing this must be the case because the missing or differently formatted data does not appear to be instated elsewhere (e.g. in the Perl script). What I then did was to modify the XSLT file to ensure that the changes are applied: the doctype is added in, the ‘attestation’ ID is correct and the <dateInfo> section contains the full data.
I could try applying this script to every entry in the full data file I have, although I suspect there will be other situations that the XSLT file I have is not set up to successfully process.
I therefore tried to investigate another alternative, which was to write a script that will pass the headword of every dictionary item to the ‘Retrieve an entry for editing’ script in the CMS, saving the results of each. I considered that might be more likely to work reliably for every entry, but that we may run into issues with the server refusing so many requests. After a few test runs, I set the script loose on all 53,000 or so entries in the system and although it took several hours to run, the process did appear to work for the most part. I now have the data in the same structure as the editors work with, which should mean we can standardise on this format and abandon the XML structure used by the existing online system.
Also this week I fixed an issue with links through to the Bosworth and Toller Old English dictionary from the Thesaurus of Old English. Their site has been redeveloped and they’ve changed the way their URLs work without putting redirects from the old URLs, meaning all our links for words in the TOE to words on their site are broken. URLs for their entries now just use a unique ID rather than the word (e.g. http://bosworthtoller.com/28286), which seems like a bit of a step backwards. They’ve also got rid of length marks and are using acute accents on characters instead, which is a bit strange. The change to an ID in the URL means we can no longer link to a specific entry as we can’t possibly know what IDs they’re using for each word. However, we can link to their search results page instead, e.g. http://bosworthtoller.com/search?q=sōfte works and I updated TOE to use such links.
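Building such a search link is straightforward; the only wrinkle is that characters with length marks need to be percent-encoded in the query string. A minimal Python sketch, with a made-up function name:

```python
from urllib.parse import urlencode

def bosworth_toller_search_url(word):
    """Build a link to the Bosworth and Toller search results for an Old English word."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes.
    return "http://bosworthtoller.com/search?" + urlencode({"q": word})
```

Browsers display the encoded form of ‘sōfte’ (`s%C5%8Dfte`) as the readable word, so the link looks the same to the user.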
I also continued with the processing of OED dates for use in the Historical Thesaurus, after my date extraction script finished executing over the weekend. This week I investigated OED dates that have a dot in them instead of a full date. There are 4,498 such dates and most of them have the lower date as the one recorded in the ‘year’ attribute by the OED. For example, ‘138.’ is 1380 and ‘17..’ is 1700. However, sometimes a specific date is given in the ‘year’ attribute despite the presence of a full stop in the date tag. For example, one entry has ‘1421’ in the ‘year’ attribute but ’14..’ in the date tag. There are just over a thousand dates where there are two dots but the ‘year’ given does not end in ‘00’. Fraser reckons this is to do with ordering the dates in the OED and I’ll need to do some further work on this next week.
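A check for these cases might look something like the following Python sketch. It assumes the date text consists only of digits and dots (real OED dates can carry prefixes such as ‘c’ or ‘?’, which would need stripping first), and the function names are invented.

```python
def dot_date_floor(date_text):
    """Lowest year a dotted date can stand for, e.g. '138.' -> 1380."""
    return int(date_text.replace(".", "0"))

def has_specific_year(date_text, year):
    """True when a dotted date carries a 'year' other than its lowest possible year."""
    return "." in date_text and year != dot_date_floor(date_text)
```

Running `has_specific_year("14..", 1421)` flags the example above, while `has_specific_year("14..", 1400)` does not.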
In addition to the above I continued to work on the Books and Borrowing project. I made some tweaks to the CMS to make it easier to edit records. When a borrowing record is edited the page automatically scrolls down to the record that was edited. This also happens for books and borrowers when accessed and edited from the ‘Books’ and ‘Borrowers’ tabs in a library. I also wrote an initial script that will help to merge some of the duplicate author records we have in the system due to existing data with different formats being uploaded from different libraries. What it does is strip all of the non-alpha characters from the forename and surname fields, make them lower case and then join them together. So for example, author ID (AID) 111 has ‘Arthur’ as forename and ‘Bedford’ as surname while AID 1896 has nothing for forename and ‘Bedford, Arthur, 1668-1745’ as surname. When stripped and joined together these both become ‘bedfordarthur’ and we have a match.
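The matching approach can be sketched in Python as below. The join order (surname followed by forename) is inferred from the ‘bedfordarthur’ example, and the function names and tuple layout are assumptions rather than the actual script.

```python
import re
from collections import defaultdict

def match_key(forename, surname):
    """Lower-case both fields, drop everything that is not a-z, then join them."""
    def strip(value):
        return re.sub(r"[^a-z]", "", (value or "").lower())
    return strip(surname) + strip(forename)

def find_duplicates(authors):
    """authors: iterable of (aid, forename, surname). Returns groups sharing a key."""
    groups = defaultdict(list)
    for aid, forename, surname in authors:
        groups[match_key(forename, surname)].append(aid)
    return {key: aids for key, aids in groups.items() if len(aids) > 1}
```

Note that the `[^a-z]` pattern also discards accented letters, which is one reason accented names fail to match under this kind of approach.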
There are 162 matches that have been identified, some consisting of more than two matched author records. I exported these as a spreadsheet. Each row includes the author’s AID, title, forename, surname, othername, born and died (each containing ‘c’ where given), a count of the number of books the record is associated with and the AID of the record that is set to be retained for the match. This defaults to the first record, which also appears in bold, to make it easier to see where a new batch of duplicates begins.
The editors can then go through this spreadsheet and reassign the ‘AID to keep’ field to a different row. E.g. for Francis Bacon the AID to keep is given as 1460. If the second record for Francis Bacon should be kept instead the editor would just need to change the value in this column for all three Francis Bacons to the AID for this row, which is 163. Similarly, if something has been marked as a duplicate and it’s wrong, then set the ‘AID to keep’ accordingly. E.g. There are four ‘David Hume’ records, but looking at the dates at least one of these is a different person. To keep the record with AID 1610 separate, replace the AID 1623 in the ‘AID to keep’ column with 1610. It is likely that this spreadsheet will be used to manually split up the imported authors that just have all their data in the surname column. Someone could, for example take the record that has ‘Hume, David, 1560?-1630?’ in the surname column and split this into the correct columns.
I also generated a spreadsheet containing all of the authors that appear to be unique. This will also need checking for other duplicates that haven’t been picked up as there are a few. For example AID 1956 ‘Heywood, Thomas, d. 1641’ and 1570 ‘Heywood, Thomas, -1641.’ haven’t been matched because of that ‘d’. Similarly, AID 1598 ‘Buffon, George Louis Leclerc, comte de, 1707-1788’ and 2274 ‘Buffon, Georges Louis Leclerc, comte de, 1707-1788.’ haven’t been matched up because one is ‘George’ and the other ‘Georges’. Accented characters have also not been properly matched, e.g. AID 1457 ‘Beze, Theodore de, 1519-1605’ and 397 ‘Bèze, Théodore de, 1519-1605.’. I could add in a Levenshtein test that matches up things that are one character different and update the script to properly take into account accented characters for matching purposes, or these are things that could just be sorted manually.
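The two possible improvements mentioned above (accent folding and a Levenshtein test) could look roughly like this in Python. It is a sketch of the idea only, using the standard library’s `unicodedata` module and a textbook edit-distance routine; the function names are made up.

```python
import re
import unicodedata

def fold(text):
    """Strip accents, lower-case, and drop everything that is not a-z."""
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z]", "", unaccented.lower())

def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def probably_same(name_a, name_b, max_distance=1):
    """True when two names are at most max_distance edits apart after folding."""
    return levenshtein(fold(name_a), fold(name_b)) <= max_distance
```

A threshold of one edit would catch all three examples above, although it would also need manual checking, since genuinely different authors can differ by a single letter.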
Ann Fergusson of the DSL got back to me this week after having rigorously tested the search facilities of our new test versions of the DSL API (V2 containing data from the original API and V3 containing data that has been edited since the original API was made). Ann had spotted some unexpected behaviour in some of the searches and I spent some time investigating these and fixing things where possible. There were some cases where incorrect results were being returned when a ‘NOT’ search was performed on a selected source dictionary, due to the positioning of the source dictionary in the query. This was thankfully easy to fix. There was also an issue with some exact searches of the full text failing to find entries. When the full text is ingested into Solr all of the XML tags are stripped out. If there are no spaces between tagged words then words have ended up squashed together. For example: ‘Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>’. With the tags (and punctuation) stripped out we’re left with ‘WestminsterB’. So an exact search for ‘westminster’ fails to find this entry. A search for ‘westminsterb’ finds the entry, which confirms this. I suspect this situation is going to crop up quite a lot, so I will need to update the script that prepares content for Solr to add spaces after tags before stripping tags and then removes multiple spaces between words.
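The planned fix amounts to replacing each tag with a space rather than with nothing, then collapsing the resulting runs of whitespace. A minimal Python sketch (the function name is hypothetical, and the real preparation script may well be written in another language):

```python
import re

def prepare_for_index(xml_fragment):
    """Replace every tag with a space, then collapse repeated whitespace."""
    spaced = re.sub(r"<[^>]+>", " ", xml_fragment)
    return re.sub(r"\s+", " ", spaced).strip()
```

Applied to the example above this yields ‘Westminster B . Attrib’, so ‘Westminster’ is indexed as a word in its own right. One caveat is that a tag appearing in the middle of a word would now split that word in two, so such cases would need checking.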
I was back at work this week after spending two weeks on holiday, during which time we went to Skye, Crinan and Peebles. It was really great to see some different places after being cooped up at home for 19 weeks and I feel much better for having been away. Unfortunately during this week I developed a pretty severe toothache and had to make an emergency appointment with the dentist on Thursday morning. It turns out I need root canal surgery and am now booked in to have this next Tuesday, but until then I need to try and just cope with the pain, which has been almost unbearable at times, despite regular doses of both ibuprofen and paracetamol. This did affect my ability to work a bit on Thursday afternoon and Friday, but I managed to struggle through.
On my return to work from my holiday on Monday I spent some time catching up with emails that had accumulated whilst I was away, including replying to Roslyn Potter in Scottish Literature about a project website, replying to Jennifer Smith about giving a group of students access to the SCOSYA data and making changes to the Berwickshire Place-names website to make it more attractive to the REF reviewers based on feedback passed on by Jeremy Smith. I also created a series of high-resolution screenshots of the resource for Carole Hough for a publication and had an email chat with Luca Guariento about linked open data.
I also fixed some issues with the Galloway Glens projects that Thomas Clancy had spotted, including an issue with the place-name element page which was not ordering accented characters properly – all accented characters were being listed at the end rather than with their non-accented versions. It turned out that while the underlying database orders accented characters correctly, for the elements list I need to get a list of elements used in place-names and a list of elements used in historical forms and then I have to combine these lists and reorder the resulting single list. This part of the process was not dealing with all accented characters, only a limited set that I’d created for Berwickshire that also dealt with ashes and thorns. Instead I added in a function taken from WordPress that converts all accented characters to their unaccented equivalent for the purposes of ordering and this ensured the order of the elements list was correct.
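The same idea (sorting on an unaccented version of each element while displaying the original) can be shown in a few lines of Python. This is an analogue of the approach, not the WordPress function itself; the mapping for ashes, thorns and eths and the sample elements are assumptions for illustration.

```python
import unicodedata

# Letters that do not decompose into base letter + accent need an explicit mapping.
SPECIAL = str.maketrans({"æ": "ae", "Æ": "AE", "þ": "th", "Þ": "Th", "ð": "d", "Ð": "D"})

def sort_key(element):
    """Unaccented, lower-case version of an element, used only for ordering."""
    decomposed = unicodedata.normalize("NFKD", element.translate(SPECIAL))
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def order_elements(place_name_elements, historical_form_elements):
    """Combine the two element lists, remove duplicates and order them."""
    combined = set(place_name_elements) | set(historical_form_elements)
    return sorted(combined, key=sort_key)
```

With plain `sorted()` an element such as ‘àrd’ ends up after ‘dùn’; with `sort_key` it sits between ‘achadh’ and ‘baile’.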
The rest of my week was divided between three projects, the first of which was the Books and Borrowing project. For this I spent some time working with some of the digitised images of the register pages. We now have access to the images from Westerkirk library and in these the records appear in a table that spreads across both recto and verso pages, but we have images of the individual pages. The project RA who is transcribing the records is treating both recto and verso as a single ‘page’ in the system, which makes sense. We therefore need to stitch the r and v images together into one single image to be associated with this ‘page’. I downloaded all of the images and have found a way to automatically join two page images together. However, there is rather a lot of overlap in the images, meaning the book appears to have two joins and some columns are repeated. I could possibly try to automatically crop the images before joining them, but there is quite a bit of variation in the size of the overlap so this is never going to be perfect and may result in some information getting lost. The other alternative would be to manually crop and join the images, which I did some experimentation with. It’s still not perfect due to the angle of the page changing between shots, but it’s a lot better. The downside with this approach is that someone would have to do the task. There are about 230 images, so about 115 joins, each one taking 2-3 minutes to create, so maybe about 5 or so hours of effort. I’ve left it with the PI and Co-I to decide what to do about this. I also downloaded the images for Volume 1 of the register for Innerpeffray library and created tilesets for these that will allow the images to be zoomed and panned. I also fixed a bug relating to adding new book items to a record and responded to some feedback about the CMS.
My second major project of the week was the Anglo-Norman Dictionary. This week I began writing a high-level requirements document for the new AND website that I will be developing. This meant going through the existing site in detail and considering which features will be retained, how things might be handled better, and how I might develop the site. I made good progress with the document, and by the end of the week I’d covered the main site. Next week I need to consider the new API for accessing the data and the staff pages for uploading and publishing new or newly edited entries. I also responded to a few questions from Heather Pagan of the AND about the searches and read through and gave feedback on a completed draft of the AHRC proposal that the team are hoping to submit next month.
My final major project of the week was the Historical Thesaurus, for which I updated and re-executed my OED date extraction script based on feedback from Fraser and Marc. It was a long and complicated process to update the script as there are literally millions of dates and some issues only appear a handful of times, so tracking them down and testing things is tricky. However, I made the following changes: I added a ‘sortdate_new’ column to the main OED lexeme table that holds the sortdate value from the new XML files, which may differ from the original value. I’ve done some testing and rather strangely there are many occasions where the new sortdate differs from the old, but the ‘revised’ flag is not set to ‘true’. I also updated the new OED date table to include a new column where the full date text is contained, as I thought this would be useful for tracing back issues. E.g. if the OED date is ‘?c1225’ this is stored here. The actual numeric year in my table now comes from the ‘year’ attribute in the XML instead. This always contains the numeric value in the OED date, e.g. <q year="1330"><date>c1330</date></q>. New lexemes in the data are now getting added into the OED lexeme table and are also having their dates processed. I’ve added a new column called ‘newaugust2020’ to track these new lexemes. We’ll possibly have to try and match them up with existing HT lexemes at some point, unless we can consider them all to be ‘new’, meaning they’ll have no matches. The script also now stores all of the various OE dates, rather than one single OE date of 650 being added for all. I set the script running on Thursday and by Sunday it had finished executing, resulting in 3,912,109 dates being added and 4,061 new words.
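As an illustration of pulling both values out of a quotation element, here is a short Python sketch using the standard library. The function name and the dictionary keys are invented, and only the `<q>` / `<date>` structure shown in the example above is assumed.

```python
import xml.etree.ElementTree as ET

def extract_date(q_xml):
    """Return the numeric year (from the attribute) and the full date text (from the tag)."""
    q = ET.fromstring(q_xml)
    return {
        "year": int(q.get("year")),        # always numeric, e.g. 1330
        "fulldate": q.findtext("date"),    # text as printed, e.g. 'c1330' or '?c1225'
    }
```

Keeping the printed text alongside the numeric year makes it possible to trace a stored year back to the date as it appears in the OED.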
Week 19 of Lockdown, and it was a short week for me as the Monday was the Glasgow Fair holiday. I spent a couple of days this week continuing to add features to the content management system for the Books and Borrowing project. I have now implemented the ‘normalised occupations’ part of the CMS. Originally occupations were just going to be a set of keywords, allowing one or more keywords to be associated with a borrower. However, we have been liaising with another project that has already produced a list of occupations and we have agreed to share their list. This is slightly different as it is hierarchical, with a top-level ‘parent’ containing multiple main occupations. E.g. ‘Religion and Clergy’ features ‘Bishop’. However, for our project we needed a third hierarchical level to differentiate types of minister/priest, so I’ve had to add this in too. I’ve achieved this by means of a parent occupation ID in the database, which is ‘null’ for top-level occupations and contains the ID of the parent category for all other occupations.
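The parent ID approach can be pictured with a short Python sketch that turns flat rows into a nested structure. The row layout, the IDs and the third-level occupation name are placeholders for illustration; only ‘Religion and Clergy’ and ‘Bishop’ come from the shared list mentioned above.

```python
# (id, name, parent_id) rows; parent_id is None for top-level occupations.
ROWS = [
    (1, "Religion and Clergy", None),
    (2, "Bishop", 1),
    (3, "Minister/Priest", 1),
    (4, "Example minister type", 3),   # placeholder third-level occupation
]

def build_tree(rows):
    """Nest occupations under their parents using the parent ID column."""
    children = {}
    for occupation_id, name, parent_id in rows:
        children.setdefault(parent_id, []).append((occupation_id, name))

    def subtree(parent_id):
        return [{"id": occupation_id, "name": name, "children": subtree(occupation_id)}
                for occupation_id, name in children.get(parent_id, [])]

    return subtree(None)
```

A single self-referencing column like this supports any number of levels, so the third level needs no changes to the table structure.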
I completed work on the page to browse occupations, arranging the hierarchical occupations in a nested structure that features a count of the number of borrowers associated with the occupation to the right of the occupation name. These are all currently zero, but once some associations are made the numbers will go up and you’ll be able to click on the count to bring up a list of all associated borrowers, with links through to each borrower. If an occupation has any child occupations a ‘+’ icon appears beside it. Press on this to view the child occupations, which also have counts. The counts for ‘parent’ occupations tally up all of the totals for the child occupations, and clicking on one of these counts will display all borrowers assigned to all child occupations. If an occupation is empty there is a ‘delete’ button beside it. As the list of occupations is going to be fairly fixed I didn’t add in an ‘edit’ facility – if an occupation needs editing I can do it directly through the database, or it can be deleted and a new version created. Here’s a screenshot showing some of the occupations in the ‘browse’ page:
I also created facilities to add new occupations. You can enter an occupation name and optionally specify a parent occupation from a drop-down list. Doing so will add the new occupation as a child of the selected category, either at the second level if a top level parent is selected (e.g. ‘Agriculture’) or at the third level if a second level parent is selected (e.g. ‘Farmer’). If you don’t include a parent the occupation will become a new top-level grouping. I used this feature to upload all of the occupations, and it worked very well.
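The parent ID approach and the rolled-up counts can be sketched in Python as follows. This is only an illustration using in-memory dictionaries rather than the real database; the function names are hypothetical, ‘Tenant farmer’ is an invented third-level occupation, and the sketch simply sums child totals without checking whether the same borrower is counted twice.

```python
def add_occupation(occupations, occ_id, name, parent_id=None):
    """Add an occupation; parent_id is None for a top-level grouping."""
    if parent_id is not None and parent_id not in occupations:
        raise ValueError("unknown parent occupation")
    occupations[occ_id] = {"name": name, "parent_id": parent_id}

def level(occupations, occ_id):
    """Return 1 for top-level occupations, 2 for their children, and so on."""
    depth = 1
    parent = occupations[occ_id]["parent_id"]
    while parent is not None:
        depth += 1
        parent = occupations[parent]["parent_id"]
    return depth

def borrower_count(occupations, links, occ_id):
    """Count borrowers linked to an occupation plus all of its descendants."""
    total = sum(1 for _, linked_occ in links if linked_occ == occ_id)
    for child_id, occ in occupations.items():
        if occ["parent_id"] == occ_id:
            total += borrower_count(occupations, links, child_id)
    return total

occupations = {}
add_occupation(occupations, 1, "Agriculture")          # top level
add_occupation(occupations, 2, "Farmer", 1)            # second level
add_occupation(occupations, 3, "Tenant farmer", 2)     # third level (invented)

# (borrower_id, occupation_id) pairs; the borrower IDs are made up
links = [(101, 2), (102, 3), (103, 3)]

print(level(occupations, 3))                  # 3
print(borrower_count(occupations, links, 1))  # 3
```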
I then updated the ‘Borrowers’ tab in the ‘Browse Libraries’ page to add ‘Normalised Occupation’ to the list of columns in the table. The ‘Add’ and ‘Edit’ borrower facilities also now feature ‘Normalised Occupation’, which replicates the nested structure from the ‘browse occupations’ page, but with checkboxes beside each main occupation. You can select any number of occupations for a borrower and when you press the ‘Upload’ or ‘Edit’ button your choice will be saved. Deselecting all ticked checkboxes will clear all occupations for the borrower. If you edit a borrower who has one or more occupations selected, in addition to the relevant checkboxes being ticked, the occupations with their full hierarchies also appear above the list of occupations, so you can easily see what is already selected. I also updated the ‘Add’ and ‘Edit’ borrowing record pages so that whenever a borrower appears in the forms the normalised occupations feature also appears.
I also added in the option to view page images. Currently the only ledgers that have page images are the three Glasgow ones, but more will be added in due course. When viewing a page in a ledger that includes a page image you will see the ‘Page Image’ button above the table of records. Press on this and a new browser tab will open. It includes a link through to the full-size image of the page if you want to open this in your browser or download it to open in a graphics package. It also features the ‘zoom and pan’ interface that allows you to look at the image in the same manner as you’d look at a Google Map. You can also view this full screen by pressing on the button in the top right of the image.
Also this week I made further tweaks to the script I’d written to update lexeme start and end dates in the Historical Thesaurus based on citation dates in the OED. I’d sent a sample output of 10,000 rows to Fraser last week and he got back to me with some suggestions and observations. I’m going to have to rerun the script I wrote to extract the more than 3 million citation dates from the OED as some of the data needs to be processed differently, but as this script will take several days to run and I’m on holiday next week this isn’t something I can do right now. However, I managed to change the way the date matching script runs to fix some bugs and make the various processes easier to track. I also generated a list of all of the distinct labels in the OED data, with counts of the number of times these appear. Labels are associated with specific citation dates, thankfully. Only a handful are actually used lots of times, and many of the others appear to be used as a ‘notes’ field rather than as a more general label.
In addition to the above I also had a further conversation with Heather Pagan about the data management plan for the AND’s new proposal, responded to a query from Kathryn Cooper about the website I set up for her at the end of last year, responded to a couple of separate requests from post-grad students in Scottish Literature, spoke to Thomas Clancy about the start date for his Place-Names of Iona project, which got funded recently, helped with some issues with Matthew Creasy’s Scottish Cosmopolitanism website and spoke to Carole Hough about making a few tweaks to the Berwickshire Place-names website for REF.
I’m going to be on holiday for the next two weeks, so there will be no further updates from me for a while.
This was week 18 of Lockdown, which is now definitely easing here. I’m still working from home, though, and will be for the foreseeable future. I took Friday off this week, so it was a four-day week for me. I spent about half of this time on the Books and Borrowing project, during which time I returned to adding features to the content management system, after spending recent weeks importing datasets. I added a number of indexes to the underlying database which should speed up the loading of certain pages considerably. E.g. the browse books, borrowers and author pages. I then updated the ‘Books’ tab when viewing a library (i.e. the page that lists all of the book holdings in the library) so that it now lists the number of book holdings in the library above the table. The table itself now has separate columns for all additional fields that have been created for book holdings in the library and it is now possible to order the table by any of the headings (pressing on a heading a second time reverses the ordering). The count of ‘Borrowing records’ for each book in the table is now a button and pressing on it brings up a popup listing all of the borrowing records that are associated with the book holding record, and from this pop-up you can then follow a link to view the borrowing record you’re interested in. I then made similar changes to the ‘Borrowers’ tab when viewing a library (i.e. the page that lists all of the borrowers the library has). It also now displays the total number of borrowers at the top. This table already allowed the reordering by any column, so that’s not new, but as above, the ‘Borrowing records’ count is now a link that when clicked on opens a list of all of the borrowing records the borrower is associated with.
The big new feature I implemented this week was borrower cross references. These can be added via the ‘Borrowers’ tab within a library when adding or editing a borrower on this page. When adding or editing a borrower there is now a section of the form labelled ‘Cross-references to other borrowers’. If there are any existing cross references these will appear here, with a checkbox beside each that you can tick if you want to delete the cross reference (the user can tick the box then press ‘Edit’ to edit the borrower and the reference will be deleted). Any number of new cross references can be added by pressing on the ‘Add a cross-reference’ button (multiple times, if required). Doing so adds two fields to the form, one for a ‘description’, which is the text that shows how the current borrower links to the referenced borrowing record, and one for ‘referenced borrower’, which is an auto-complete. Type in a name or part of a name and any borrower that matches in any library will be listed. The library appears in brackets after the borrower’s name to help differentiate records. Select a borrower and then when the ‘Add’ or ‘Edit’ button is pressed for the borrower the cross reference will be made.
Cross-references work in both directions – if you add a cross reference from Borrower A to Borrower B you don’t then need to load up the record for Borrower B to add a reference back to Borrower A. The description text will sit between the borrower whose form you make the cross reference on and the referenced borrower you select, so if you’re on the edit form for Borrower A and link to Borrower B and the description is ‘is the son of’ then the cross reference will appear as ‘Borrower A is the son of Borrower B’. If you then view Borrower B the cross reference will still be written in this order. I also updated the table of borrowers to add in a new ‘X-Refs’ column that lists all cross-references for a borrower.
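A minimal Python sketch of how a single stored cross reference can be shown from both sides is given below. The data structures and function names are assumptions for illustration and do not reflect the actual CMS tables.

```python
# Hypothetical in-memory stand-ins for the borrower and cross-reference tables
borrowers = {1: "Borrower A", 2: "Borrower B"}
cross_references = []  # rows of (from_borrower_id, description, to_borrower_id)

def add_cross_reference(from_id, description, to_id):
    """Store the reference once, in the direction it was entered."""
    cross_references.append((from_id, description, to_id))

def cross_references_for(borrower_id):
    """List references where the borrower is on either side, in stored order."""
    return [
        f"{borrowers[from_id]} {description} {borrowers[to_id]}"
        for from_id, description, to_id in cross_references
        if borrower_id in (from_id, to_id)
    ]

add_cross_reference(1, "is the son of", 2)
print(cross_references_for(1))  # ['Borrower A is the son of Borrower B']
print(cross_references_for(2))  # ['Borrower A is the son of Borrower B']
```

Because only one row is stored and the lookup checks both columns, there is no need to add a separate reference back from Borrower B.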
I spent the remainder of my working week completing smaller tasks for a variety of projects, such as updating the spreadsheet output of duplicate child entries for the DSL people, getting an output of the latest version of the Thesaurus of Old English data for Fraser, advising Eleanor Lawson on ‘.ac.uk’ domain names and having a chat with Simon Taylor about the pilot Place-names of Fife project that I worked on with him several years ago. I also wrote a Data Management Plan for a new AHRC proposal the Anglo-Norman Dictionary people are putting together, which involved a lengthy email correspondence with Heather Pagan at Aberystwyth.
Finally, I returned to the ongoing task of merging data from the Oxford English Dictionary with the Historical Thesaurus. We are currently attempting to extract citation dates from OED entries in order to update the dates of usage that we have in the HT. This process uses the new table I recently generated from the OED XML dataset which contains every citation date for every word in the OED (more than 3 million dates). Fraser had prepared a document listing how he and Marc would like the HT dates to be updated (e.g. if the first OED citation date is earlier than the HT start date by 140 years or more then use the OED citation date as the suggested change). Each rule was to be given its own type, so that we could check through each type individually to make sure the rules were working ok.
It took about a day to write an initial version of the script, which I ran on the first 10,000 HT lexemes as a test. I didn’t split the output into different tables depending on the type, but instead exported everything to a spreadsheet so Marc and Fraser could look through it.
In the spreadsheet if there is no ‘type’ for a row it means it didn’t match any of the criteria, but I included these rows anyway so we can check whether there are any other criteria the rows should match. I also included all the OED citation dates (rather than just the first and last) for reference. I noted that Fraser’s document doesn’t seem to take labels into consideration. There are some labels in the data, and sometimes there’s a new label for an OED start or end date when nothing else is different, e.g. htid 1479 ‘Shore-going’: This row has no ‘type’ but does have new data from the OED.
Another issue I spotted is that the same ‘type’ variable is set both when a start date matches the criteria and when an end date matches the criteria, so the ‘type’ set for the start date is then replaced with the ‘type’ for the end date. I think, therefore, that we might have to split the start and end processes up, or append the end process type to the start process type rather than replacing it (so e.g. type 2-13 rather than type 2 being replaced by type 13). I also noticed that there are some lexemes where the HT has ‘current’ but the OED has a much earlier last citation date (e.g. htid 73 ‘temporal’ has 9999 in the HT but 1832 in the OED). Such cases are not currently considered.
Finally, according to the document, Antes and Circas are only considered for update if the OED and HT date is the same, but there are many cases where the start / end OED date is picked to replace the HT date (because it’s different) and it has an ‘a’ or ‘c’ and this would then be lost. Currently I’m including the ‘a’ or ‘c’ in such cases, but I can remove this if needs be (e.g. HT 37 ‘orb’ has HT start date 1601 (no ‘a’ or ‘c’) but this is to be replaced with OED 1550 that has an ‘a’). Clearly the script will need to be tweaked based on feedback from Marc and Fraser, but I feel like we’re finally making some decent progress with this after all of the preparatory work that was required to get to this point.
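To make the ‘type’ problem concrete, here is a small Python sketch of the 140-year start date rule mentioned earlier and of appending the end type to the start type instead of overwriting it. The function names and the sample years are hypothetical, and the sketch does not attempt to reproduce the other rules in Fraser’s document.

```python
def suggest_start_date(ht_start, oed_first, threshold=140):
    """Return the OED first citation year as a suggested change when it is
    earlier than the HT start date by `threshold` years or more, else None."""
    if ht_start - oed_first >= threshold:
        return oed_first
    return None

def combine_types(start_type, end_type):
    """Join the start and end rule types so that neither overwrites the other."""
    parts = [str(t) for t in (start_type, end_type) if t is not None]
    return "-".join(parts) if parts else None

print(suggest_start_date(1601, 1450))  # 1450 (151 years earlier)
print(suggest_start_date(1601, 1550))  # None (only 51 years earlier)
print(combine_types(2, 13))            # 2-13
print(combine_types(2, None))          # 2
```

One limitation of the joined form is that a lone ‘13’ does not say whether it came from the start or the end process, which is an argument for storing the two types in separate columns instead.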
Next Monday is the Glasgow Fair holiday, so I won’t be back to work until the Tuesday.
This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day. I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects. I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records. This week I began by focussing on the data from St Andrews University library.
The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting. However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.
The data I’ve got are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100. The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers. Unfortunately the CSV version of the data doesn’t include the links or the linked to data, and as I wanted to try and pull in the data found on the linked pages I therefore needed to process the HTML instead.
I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn. From the filenames my script could ascertain the ledger volume, its dates and the page number. For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10. The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
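The filename parsing can be sketched in Python with a regular expression, as below. The pattern is an assumption based on the single example filename above (including the guess that page numbers are purely numeric) and may well not cover every file in the directory.

```python
import re

# Pattern inferred from one example filename; the group names are my own.
FILENAME_RE = re.compile(
    r"^Page_(?P<shelfmark>[A-Za-z0-9]+)_(?P<ledger>\d+)_(?P<title>.+)"
    r"_(?P<start>\d{4})-(?P<end>\d{4})\.djvu_(?P<page>\d+)\.html$"
)

def parse_filename(filename):
    """Return ledger number, date span and page number, or None if no match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {
        "shelfmark": match.group("shelfmark"),
        "ledger": int(match.group("ledger")),
        "title": match.group("title"),
        "start_year": int(match.group("start")),
        "end_year": int(match.group("end")),
        "page": int(match.group("page")),
    }

print(parse_filename("Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html"))
```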
The actual data in the file posed further problems. As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system. Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows). I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically. The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’). I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.
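The row-merging logic can be sketched in Python as follows. The row dictionaries and the function name are assumptions for the example; the first three rows use the ‘Rollins’ text quoted above, while the final row and its code are invented to show where a new record begins.

```python
def group_rows(rows):
    """Start a new record at each row with a code; append code-less rows to it."""
    records = []
    for row in rows:
        code = row["code"].strip()
        text = row["text"].strip()
        if code:
            records.append({"code": code, "transcription": text})
        elif records:
            records[-1]["transcription"] += " " + text
        # Code-less rows before the first coded row are ignored in this sketch.
    return records

rows = [
    {"code": "J.5.2", "text": "Rollins belles Lettres Vol 2d"},
    {"code": "", "text": "on profr Shaws order by"},
    {"code": "", "text": "James Key"},
    {"code": "K.1.1", "text": "Another book"},  # invented row
]

for record in group_rows(rows):
    print(record)
```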
I also used this approach to set up books and borrowers too. If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to know which is which. However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows. I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code. There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.
The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system. I created four new ‘Additional Fields’ for St Andrews as follows:
- Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
- Code: This contains the full text of the second column (e.g. J.5.2)
- Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
- Original returned text: This contains the full text of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)
In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links. Where subsequent rows contain data in this column but no code, this data is then appended to the transcription. E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.
The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required. Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes. E.g. for the above the notes field contains:
‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit & au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. ) <a href="http://library.st-andrews.ac.uk/record=b2447402~S1">http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>
----- profr_Shaws| <p><a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>
----- James_Key| <p>Possibly James Kay: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>’
As mentioned earlier, the script also generates book and borrower records based on the linked pages. I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews. In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with spaces (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field. One book item is created for each holding to be used to link to the corresponding borrowing records.
For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. ‘<p>Possibly Thomas Duncan: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’). All borrowers are linked to records as ‘Main’ borrowers.
During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table. I therefore updated my script to check for the existence of this heading row, and if it exists my script then grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page. After my script had finished running we had 11,147 borrowing records, 996 borrowers and 6,395 book holding records for St Andrews in the system.
I then moved onto looking at the data for Selkirk library. This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books and borrowers and books connected to borrowings via unique identifiers. Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records and these will need to be manually generated. The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.
As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records). There are 612 Holding records for Selkirk. I also processed the borrower records, resulting in 86 borrower records being added. I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’ and the only other additional field is in the Holding records for ‘Subject’, which will eventually be merged with our ‘Genre’ when this feature becomes available.
Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project. I recommended that the filenames shouldn’t have spaces in them as these can be troublesome on some operating systems and that we’d want a character to use as a delimiter between the parts of the filename that wouldn’t appear elsewhere in the filename so it’s easy to split up the filename. I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included. Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.
Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’) but if we’re using a specific delimiting character to separate parts of the filename then using these characters wouldn’t be necessary and I suggested it would be better to not use ‘L’ as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’ which might confuse future human users.
Instead I suggested using a ‘-‘ instead of spaces and a ‘_’ as a delimiter and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filename – no apostrophes, commas, colons, semi-colons, ampersands etc and to make sure the ‘-‘ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office. This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.
Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case. Following these guidelines the filenames would look something like this:
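As a hedged illustration of the guidelines, the Python sketch below builds a filename from its parts. The choice of parts (library, ledger number, page number), the ‘jpg’ extension and the function name are all assumptions for the example, and the names it produces are illustrative only rather than the project’s actual filenames.

```python
import re

def make_filename(parts, extension="jpg"):
    """Join filename parts with '_' after cleaning each part:
    lower case, '-' instead of spaces, no other punctuation."""
    cleaned = []
    for part in parts:
        part = str(part).strip().lower()
        part = re.sub(r"[\u2013\u2014]", "-", part)  # fancy dashes -> plain '-'
        part = re.sub(r"\s+", "-", part)             # spaces -> '-'
        part = re.sub(r"[^a-z0-9-]", "", part)       # drop everything else
        cleaned.append(part)
    return "_".join(cleaned) + "." + extension

print(make_filename(["St Andrews", 2, "23r"]))  # st-andrews_2_23r.jpg
```

Because ‘_’ never appears inside a part, splitting the name on ‘_’ recovers the library, ledger and page number without any prefix letters such as ‘L’ or ‘P’.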
In addition to the Books and Borrowing project I worked on a number of other projects this week. I gave Matthew Creasy some further advice on using forums in his new project website, and the ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.
I also worked a bit more on using dates from the OED data in the Historical Thesaurus. Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract these dates so that we could use them to update the dates associated with the lexemes in the HT. I needed to extract the quotation dates as these have ‘ante’ and ‘circa’ notes, plus labels. I noted that in addition to ‘a’ and ‘c’ a question mark is also used, sometimes with an ‘a’ or ‘c’ and sometimes without. I decided to process things as follows:
- ?a will just be ‘a’
- ?c will just be ‘c’
- ? without an ‘a’ or ‘c’ will be ‘c’.
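These three rules can be sketched in Python as follows. The regular expression and function name are my own for the purposes of the example, and the sketch only handles a single year with an optional ‘?’, ‘a’ or ‘c’ in front of it.

```python
import re

# Optional '?', optional 'a' or 'c', then a three or four digit year
DATE_RE = re.compile(r"^(\?)?([ac])?(\d{3,4})$")

def normalise_prefix(date_text):
    """Return (marker, year) where marker is 'a', 'c' or '' for a plain year."""
    match = DATE_RE.match(date_text.strip())
    if match is None:
        return None
    question, prefix, year = match.groups()
    if prefix:
        marker = prefix   # '?a' -> 'a', '?c' -> 'c', and plain 'a' / 'c' kept
    elif question:
        marker = "c"      # '?' on its own is treated as circa
    else:
        marker = ""
    return marker, int(year)

for example in ["?a1400", "?c1225", "?1500", "c1330", "1601"]:
    print(example, normalise_prefix(example))
```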
I also noticed that a date may sometimes be a range (e.g. 1795-8) so I needed to include a second date column in my data structure to accommodate this. I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary depending on the date – sometimes the content is ‘OE’ and other times ‘lOE’ or ‘eOE’. I decided to process any OE dates for a lexeme as being 650 and to have only one OE date stored, so as to align with how OE dates are stored in the HT database (we don’t differentiate between dates for OE words).
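Here is a rough Python sketch of filling the two date columns. It assumes that a shortened second year (the ‘8’ in 1795-8) borrows its leading digits from the first year, which is my reading of the example rather than something the data documents, and the function name is hypothetical.

```python
import re

OE_YEAR = 650                      # single value used for all OE dates
OE_MARKERS = {"OE", "eOE", "lOE"}

def parse_years(text):
    """Return (first_year, second_year); second_year is None unless a range."""
    text = text.strip()
    if text in OE_MARKERS:
        return OE_YEAR, None
    match = re.match(r"^(\d{3,4})-(\d{1,4})$", text)
    if match:
        start, end = match.groups()
        if len(end) < len(start):
            # Assumption: '1795-8' means 1795 to 1798
            end = start[: len(start) - len(end)] + end
        return int(start), int(end)
    if text.isdigit():
        return int(text), None
    return None

print(parse_years("1795-8"))  # (1795, 1798)
print(parse_years("lOE"))     # (650, None)
```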
While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted. This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table. This will no doubt mean we’ll need to update our online OED table, but as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes would be retained without issue (hopefully), but any matching processes we performed would need to be done again for the new lexemes.
I set my extraction script running on the OED XML files on Wednesday and processing took a long time. The script didn’t complete until sometime during Friday night, but after it had finished it had processed 238,699 categories, 754,285 lexemes, generating 3,893,341 date rows. It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.
I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project. The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category. Variant spellings are also removed (these were all tagged with <form> and I removed all of these). I also created a new field to store only the ‘NN_’ tagged words and remove all others.
The scripts generated three datasets, which I saved as spreadsheets for Fraser. The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria. The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a postagged word fully matches the HT category heading. The final (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a postagged word fully matches a lexeme in the HT category. For this table I’ve also added in the full list of lexemes for each HT category too.
I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU and arranged for a PhD student to get access to the TextGrid files that were generated for the audio records for the SCOTS Corpus project.
Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data. This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.
I have the data that was extracted from Thomas’s script last July as two XML files (dost.xml and snd.xml) and I looked through these for the examples Ann had sent. The entry for snd13897 contains the following URLs:
The first is the ID for the main entry and the other two are child entries. If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged. But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content). The pulling in of data into a webpage in the V3 site uses URLs stored in a table linked to entry IDs. These were generated from the URLs in the entries in the XML file (see the <url> tags above). For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sndns2217. But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.
The entry for dost16606 contains the following URLs:
(in addition to headword URLs). Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content). As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.
What we need to do is remove the child entries that still exist as separate entries in the data. To do this I could write a script that would go through each entry in the dost.xml and snd.xml files. It would then pick out every <url> that is not the same as the entry ID and search the file to see if any entry exists with this ID. If it does then presumably this is a duplicate that should then be deleted. I’m waiting to hear back from the DSL people to see how we should proceed with this.
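A minimal Python sketch of such a script is given below. The element and attribute names (<entries>, <entry id="...">) are assumptions, as the real structure of dost.xml and snd.xml may differ; the sample uses the three SND IDs discussed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: only the <url> tag name and the IDs come from the data
SAMPLE = """<entries>
  <entry id="snd13897">
    <url>snd13897</url><url>snds3788</url><url>sndns2217</url>
  </entry>
  <entry id="sndns2217"><url>sndns2217</url></entry>
</entries>"""

def find_duplicate_children(xml_text):
    """Return (parent_id, child_id) pairs where a child URL listed in one
    entry also still exists as a separate entry in the file."""
    root = ET.fromstring(xml_text)
    entries = root.findall("entry")
    entry_ids = {entry.get("id") for entry in entries}
    duplicates = []
    for entry in entries:
        for url in entry.findall("url"):
            target = (url.text or "").strip()
            if target != entry.get("id") and target in entry_ids:
                duplicates.append((entry.get("id"), target))
    return duplicates

print(find_duplicate_children(SAMPLE))  # [('snd13897', 'sndns2217')]
```

The output would only be a list of candidates for deletion, to be checked before anything is actually removed.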
As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.
This was week 14 of Lockdown and I spent most of it continuing to work on the Books and Borrowing project. Last week I’d planned to migrate the CMS from my test server at Glasgow to the official project server at Stirling, but during the process some discrepancies between PHP versions on the servers meant that the code which worked fine at Glasgow was giving errors at Stirling. As mentioned in last week’s post, on the Stirling server calling a function while passing fewer than the required number of variables resulted in a fatal error, plus database ‘warnings’ (e.g. an empty string rather than a numeric zero being inserted into an integer field) were being treated as fatal errors too. It took most of Monday to go through my scripts and identify all the places such issues cropped up, but by the end of the day I had the CMS set up and fully usable at Stirling and had asked the team to start using it.
I then spent some further time working on the public website for the project, installing a theme, working with fonts and colour schemes, selecting header images, adding logos to the footer and other such matters. I made six different versions of the interface and emailed screenshots to the team for comment. We all agreed on the interface and I then made some further tweaks to it, during which time team member Kit Baston was adding content to the pages. On Thursday the website went live and you can access it here: https://borrowing.stir.ac.uk/. Here’s a screenshot too:
I also continued to make improvements to the CMS this week, adding new functionality to the pages for browsing book editions, book works and authors. The table of Book Works now includes a column listing the number of Holdings each Work is associated with and now includes the options of ordering the listed Works by any of the columns in the table. When a book work row is expanded and its associated editions loads in, this table also now features the number of holdings an edition is associated with and allows the table to be ordered by any of the columns. I then made the number of holdings and records listed for each Work and Edition a link (so long as the number is greater than 0). Pressing on the link brings up a popup that lists the holdings and records. Each item in the list features an ‘eye’ icon and pressing on this will take you to the record in question (either in the library’s list of holdings or the page that the borrowing record appears on) with the page opening at the item in question.
On Friday I had a Zoom call with Project PI Katie Halsey and Co-I Matt Sangster to discuss my work on the project and to decide where I should focus my attention next. We agreed that it would be good to get all of the sample data into the system now, so that the team can see what’s already there and begin the process of merging records and rationalising the data. Therefore I’ll be spending a lot of next week writing import scripts for the remaining datasets.
I worked on a number of additional projects this week as well. On Tuesday I had a Zoom call with Jane Stuart-Smith, Eleanor Lawson of QMU and Joanne Cleland of Strathclyde to discuss a new project that they’re putting together. I can’t say too much about it at this stage, but I’ll probably be doing the technical work for the project, if it gets funding. I also spoke with Thomas Clancy about another place-names project that has been funded, for which I’ll need to adapt my existing place-names system. This will probably be starting in September and involves a part of East Ayrshire. I also added some forum software to Matthew Creasy’s new project website that I recently put together for him. He’s hoping to launch this next week so I will probably add in a link to it then.
I also managed to spend some time this week looking into the Historical Thesaurus’s new dates system. My scripts to generate the new HT date structure completed over the weekend and I then had to manually fix the 60 or so label errors that Fraser had previously identified in his spreadsheet. I then wrote a further script to check that the original fulldate, the new fulldate and a fulldate generated on the fly from the new date table all matched for each lexeme. This brought up about a thousand lexemes where the match wasn’t identical. Most of these were due to ‘b’ dates not being recorded in a consistent manner in the original data (sometimes two digits e.g. 1781/86 and sometimes one digit e.g. 1781/6). There were some other issues with dates that had both labels and slashes as connectors, whereby the label ended up associated with both dates rather than just one. There were also some issues with bracketed dates sometimes being recorded with the brackets and sometimes not, plus a few that had a dash before the date instead. I went through the 1000 or so rows and fixed the ones that actually needed fixing (maybe about 50). I then imported the new lexeme_dates table into the online database. There are 1,381,772 rows in it. I also attempted to import the updated lexeme database (which includes a new fulldate column plus new firstdate and lastdate fields). Unfortunately the file contains too much data to be uploaded and the process timed out. I contacted Arts IT Support and they managed to increase the execution time on the server and I was then able to get this second table uploaded too.
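The inconsistency in ‘b’ dates described above can be ironed out before comparing the three versions of a fulldate. The following is a minimal Python sketch rather than the script I actually ran: the function names are hypothetical, and it assumes a ‘b’ date appears in the fulldate string as a year, a slash and then one to four digits.

```python
import re

# Assumed shape of a 'b' date: a 3-4 digit year, a slash, then 1-4 digits.
B_DATE = re.compile(r"(\d{3,4})/(\d{1,4})")


def expand_b_dates(fulldate: str) -> str:
    """Rewrite shortened 'b' dates (1781/6, 1781/86) in full (1781/1786)."""

    def expand(match: re.Match) -> str:
        first, second = match.group(1), match.group(2)
        if len(second) < len(first):
            # Borrow the missing leading digits from the first year.
            second = first[: len(first) - len(second)] + second
        return f"{first}/{second}"

    return B_DATE.sub(expand, fulldate)


def fulldates_match(*fulldates: str) -> bool:
    """True if every fulldate is identical once 'b' dates are expanded."""
    return len({expand_b_dates(d) for d in fulldates}) == 1
```

With this in place ‘1781/86’ and ‘1781/6’ both become ‘1781/1786’, so a row that differs only in how the ‘b’ date was written would no longer be flagged as a mismatch.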
Fraser had sent around a document listing the next steps in the data update process and I read through this and began to think things through. Fraser noted that the unique date types list didn’t appear to include ‘a’ and ‘c’ for firstdates. I checked my script that generated the date types (way back in April last year) and spotted an error – the script was looking for a column called ‘oefirstdac’ where it should have been looking for ‘firstdac’. What this means is any lexeme that has an ‘a’ or ‘c’ with its first date has been rolled into the count for regular first dates, but it turns out that this is what Fraser wanted to happen anyway, so no harm was done there.
Before I can make a start on getting all HT lexemes that are XXXX-XXXX, OE-XXXX and XXXX-Current and are matched to an OED lexeme and grabbing the OED date information I’ll need to find a way to actually get the new OED date information. Fraser noted that we can’t just use the OED ‘sortdate’ and ‘enddate’ fields but instead need to use the first and last citation dates as these have ‘a’ and ‘c’. I’m going to need to get access to the most recent version of all of the OED XML files and to write a script that goes through all of the quotations data, such as:
<quotations><q year="1200"><date>?c1200</date></q><q year="1392"><date>a1393</date></q><q year="1450"><date>c1450</date></q><q year="1481"><date>1481</date></q><q year="1520"><date>?1520</date></q><q year="1530"><date>1530</date></q><q year="1556"><date>1556</date></q><q year="1608"><date>1608</date></q><q year="1647"><date>1647</date></q><q year="1690"><date>1690</date></q><q year="1709"><date>1709</date></q><q year="1728"><date>1728</date></q><q year="1755"><date>1755</date></q><q year="1804"><date>1804</date></q><q year="1882"><date>1882</date></q><q year="1967"><date>1967</date></q><q year="2007"><date>2007</date></q></quotations>
And then picks out the first date and the last date, plus any ‘a’, ‘c’ and ‘?’ value. This is going to be another long process, but I can’t begin it until I can get my hands on the full OED dataset, which I don’t have with me at home.
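As a rough idea of what such a script might look like, here is a minimal Python sketch. It is not the real script (which I can’t write until I have the data) and it makes some assumptions: that each date is an optional ‘?’, then an optional ‘a’ or ‘c’, then a three or four digit year, and that the first and last quotations in document order are the ones we want. The function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed date shape: optional '?', optional 'a' or 'c', then the year.
DATE_PATTERN = re.compile(r"^(\??)([ac]?)(\d{3,4})$")


def parse_date(text):
    """Split a date such as '?c1200' into year, prefix and uncertainty."""
    match = DATE_PATTERN.match(text.strip())
    if match is None:
        return None  # Unrecognised formats are skipped rather than guessed at.
    uncertain, prefix, year = match.groups()
    return {"year": int(year), "prefix": prefix, "uncertain": uncertain == "?"}


def first_and_last_dates(quotations_xml):
    """Return the first and last parsable <date> values in document order."""
    root = ET.fromstring(quotations_xml)
    dates = [parse_date(d.text or "") for d in root.iter("date")]
    dates = [d for d in dates if d is not None]
    if not dates:
        return None, None
    return dates[0], dates[-1]
```

Run against the quotations block above, this would give a first date of 1200 with a ‘c’ prefix flagged as uncertain, and a last date of 2007 with no prefix.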
This was week 13 of Lockdown, with still no end in sight. I spent most of my time on the Books and Borrowing project, as there is still a huge amount to do to get the project’s systems set up. Last week I’d imported several thousand records into the database and had given the team access to the Content Management System to test things out. One thing that cropped up was that the autocomplete that is used for selecting existing books, borrowers and authors was sometimes not working, or if it did work on selection of an item the script that then populates all of the fields about the book, borrower or author was not working. I realised that this was because there were invisible line break characters (\n or \r) in the imported data and the data is passed to the autocomplete via a JSON file. Unescaped line break characters are not allowed inside the strings in a JSON file and therefore the autocomplete couldn’t access the data. I spent some time writing a script that would clean the data of all offending characters and after running this the autocomplete and pre-population scripts worked fine. However, a further issue cropped up with the text editors in the various forms in the CMS. These use the TinyMCE widget to allow formatting to be added to the text area, which works great. However, whenever a new line is created this adds in HTML paragraphs (‘<p></p>’, which is good) but the editor also adds a hidden line break character (‘\r’ or ‘\n’, which is bad). When this field is then used to populate a form via the selection of an autocomplete value the line break makes the data invalid and the form fails to populate. After identifying this issue I ensured all such characters are stripped out of any uploaded data and that fixed the issue.
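To illustrate the problem and the fix, here is a small Python sketch. It isn’t the CMS code itself, the function names are hypothetical, and replacing each run of line breaks with a single space (rather than nothing at all) is my assumption, made so that words on adjacent lines don’t get joined together.

```python
import json
import re

LINE_BREAKS = re.compile(r"[\r\n]+")


def strip_line_breaks(value: str) -> str:
    """Replace each run of \\r / \\n characters with a single space."""
    return LINE_BREAKS.sub(" ", value).strip()


def clean_record(record: dict) -> dict:
    """Clean every string field of a record, leaving other values alone."""
    return {
        key: strip_line_breaks(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


def is_valid_json(text: str) -> bool:
    """Report whether a piece of text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

A string such as ‘<p>One</p>’ followed by a hidden ‘\r\n’ and then ‘<p>Two</p>’ fails `is_valid_json` if it is written into a JSON file as it stands, but passes once it has been through `strip_line_breaks`. An alternative would be to let a JSON encoder escape the line breaks instead of removing them, but as the characters serve no purpose in this data it is simpler to strip them.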
I had to spend some time fixing a few more bugs that the team had uncovered during the week. The ‘delete borrower’ option was not appearing, even when a borrower was associated with no records, and I fixed this. There was also an issue with autocompletes not working in certain situations (e.g. when trying to add an existing borrower to a borrowing record that was initially created without a borrower). I tracked down and fixed these. Another issue involved the record page order incrementing whenever the record was edited, even when this had not been manually changed, while another involved book edition data not getting saved in some cases when a borrowing record was created. I tracked down and fixed these issues too.
With these fixes in place I then moved on to adding new features to the CMS, specifically facilities to add and browse the book works, editions and authors that are used across the project. Pressing on the ‘Add Book’ menu item now loads a page through which you can choose to add a Book Work or a Book Edition (with associated Work, if required). You can also associate authors with the Works and Editions. Pressing on the ‘Browse Books’ option now loads a page that lists all of the Book Works in a table, with counts of the number of editions and borrowing records associated with each. There’s also a row for all editions that don’t currently have a work. There are currently 1925 such editions so most of the data appears in this section, but this will change.
Through the page you can edit a work (including associating authors) by pressing on the ‘edit’ button. You can delete a work so long as it isn’t associated with an Edition. You can bring up a list of all editions in the work by pressing on the eye icon. Once loaded, the editions are displayed in a table. I may need to change this as there are so many fields relating to editions that the table is very wide. It’s usable if I make my browser take up the full width of my widescreen monitor, but for people using a smaller screen it’s probably going to be a bit unwieldy. From the list of editions you can press the ‘edit’ button to edit one of them – for example assigning one of the ‘no work’ editions to a work (existing or newly created via the edit form). You can also delete an edition if it’s not associated with anything. The Edition table includes a list of borrowing records, but I’ll also need to find a way to add in an option to display a list of all of the associated records for each, as I imagine this will be useful.
Pressing on the ‘Add Author’ menu item brings up a form allowing a new author to be added, which will then be available to associate with books throughout the CMS, while pressing on the ‘Browse Authors’ menu item brings up a list of authors. At the moment this table (and the book tables) can’t be reordered by their various columns. This is something else I still need to implement. You can delete an author if it’s not associated with anything and also edit the author details. As with the book tables I also need to add in a facility to bring up a list of all records the author is associated with, in addition to just displaying counts. I also noticed that there seems to be a bug somewhere that is resulting in blank authors occasionally being generated, and I’ll need to look into this.
I then spent some time setting up the project’s server, which is hosted at Stirling University. I was given access details by Stirling’s IT Support people and managed to sign into the Stirling VPN and get access to the server and the database. There was an issue getting write access to the server, but after that was resolved I was able to upload all of the CMS files, set up the WordPress instance that will be the main project website and migrate the database.
I was hoping I’d be able to get the CMS up and running on the new server without issue, but unfortunately this did not prove to be the case. It turns out that the Stirling server uses a different (and newer) version of the PHP scripting language than the Glasgow server and some of the functionality is different. For example, on the Glasgow server you can call a function with fewer parameters than it is set up to require (e.g. addAuthor(1) when the function is set up to take two parameters, as in addAuthor(1,2)). The version on the Stirling server doesn’t allow this and instead the script breaks and a blank page is displayed. It took a bit of time to figure out what was going on, and now I know what the issue is I’m going to have to go through every script and check how every function is called, and this is going to be my priority next week.
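Going through every script by hand is tedious, so a rough checker could help narrow things down. The following is a sketch in Python (not the PHP the CMS is written in) and is deliberately naive: the function names are hypothetical, it only understands plain ‘function name(...)’ definitions, and it ignores method calls, calls whose arguments contain brackets and arguments containing commas inside strings. It would flag candidates to look at rather than give a definitive answer.

```python
import re

# Matches simple PHP definitions such as: function addAuthor($a, $b = null)
DEFINITION = re.compile(r"function\s+(\w+)\s*\(([^)]*)\)")


def required_counts(php_source):
    """Map each function name to its number of required parameters."""
    counts = {}
    for name, params in DEFINITION.findall(php_source):
        parts = [p.strip() for p in params.split(",") if p.strip()]
        # Parameters with a default value or a variadic marker are optional.
        counts[name] = sum(1 for p in parts if "=" not in p and "..." not in p)
    return counts


def find_short_calls(php_source, counts):
    """List (line, name, given, required) for calls with too few arguments."""
    problems = []
    for line_no, line in enumerate(php_source.splitlines(), start=1):
        for name, required in counts.items():
            # Skip the line that defines the function itself.
            if re.search(r"function\s+" + re.escape(name) + r"\b", line):
                continue
            call = re.compile(r"\b" + re.escape(name) + r"\s*\(([^()]*)\)")
            for match in call.finditer(line):
                args = match.group(1).strip()
                given = args.count(",") + 1 if args else 0
                if given < required:
                    problems.append((line_no, name, given, required))
    return problems
```

Given a file that defines addAuthor with two required parameters and then calls addAuthor(1), the second function would report that line as having one argument where two are required.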
I also spent a bit of time finalising the website for the project’s pilot project, which deals with borrowing records at Glasgow. This was managed by Matt Sangster, and he’d sent me a list of things we wanted to sort; I spent a few hours going through this, and we’re just about at the point where the website can be made publicly available.
I had intended to spend Friday working on the new way of managing dates for the Historical Thesaurus. The script I’d created to generate the dates for all 790,000-odd lexemes completed during last Friday night and over the weekend I wrote another script that would then shift the connectors up one (so a dash would be associated with the date before the dash rather than the one after it, for example). This script then took many hours to run. Unfortunately I didn’t get a chance to look further into this until Thursday, when I found a bit of time to analyse the output, at which point I realised that while the generation of the new fulldate field had worked successfully, the insertion of bracketed dates into the new dates table had failed, as the column was set as an integer and I’d forgotten to strip out the brackets. Due to this problem I had to set my scripts running all over again. The first one completed at lunchtime on Friday, but the second didn’t complete until Saturday so I didn’t manage to work on the HT this week. However, this did mean that I was able to return to a Scots Thesaurus data processing task that Fraser asked me to look into at the start of May, so it’s not all bad news.
Fraser’s task required me to set up the Stanford Part of Speech tagger on my computer, which meant configuring Java and other such tasks that took a bit of time. I then wrote a script that took the output of a script I’d written over a year ago that contained monosemous headwords in the DOST data, ran their definitions through the Part of Speech tagger and then outputted this to a new table. This may sound straightforward, but it took quite some time to get everything working, and then another couple of hours for the script to process around 3,000 definitions. But I was able to send the output to Fraser on Friday evening.
Also this week I gave advice to a few members of staff, such as speaking to Matthew Creasy about his new Scottish Cosmopolitanism project, Jane Stuart-Smith about a new project that she’s putting together with QMU, Heather Pagan of the Anglo-Norman Dictionary about a proposal she’s putting together, Rhona Alcorn about the Scots School Dictionary app and Gerry McKeever about publicising his interactive map.
This was week 12 of Lockdown and on Monday I arranged to get access to my office at work in order to copy some files from my work PC. There were some scripts that I needed for the Historical Thesaurus, Fraser’s Scots Thesaurus and the Books and Borrowing projects so I reckoned it was about time to get access. It all went pretty smoothly, thankfully. My train into Central was very quiet – I think there were only about five people in my carriage, and none of them were near me. I walked to the West End and called security to let them know I’d arrived, then got into my office and spent about an hour and a half copying files and doing some work tasks. It was a bit strange to be back in my office after so long, with my calendar still showing March. Once the files were all copied I left the building, checked out with security and walked back through a still deserted town to Central. My train carriage was completely empty on the way back home.
I spent most of the rest of the week continuing with my work on the Books and Borrowing project. My main task was importing sample data into the content management system. Matt had sent me the latest copy of the Glasgow Student data over the weekend, and once I had the data processing scripts from the PC at work I could then process his spreadsheet and upload it to the pilot project database. Processing the Glasgow Student data was not entirely straightforward as the transcriber had used Microsoft Office formatting in the spreadsheet cells to replicate features such as superscript text and strikethroughs. It is a bit of a pain to export an Excel spreadsheet as plain text while retaining such formatting, but thankfully I’d solved that issue previously and my script was able to take an Excel file that had been saved as HTML and then pick out the formatting to keep whilst ditching all of the horrible HTML formatting that Microsoft adds in to Office files that are saved in that format.
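The general approach of keeping some formatting while ditching the rest can be sketched as follows. This is a Python illustration rather than my actual script, and it assumes the formatting to keep appears as ‘sup’, ‘s’ or ‘strike’ tags inside each cell; real Office output may express strikethroughs through CSS classes instead, which this sketch does not handle. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags assumed to carry the transcriber's formatting; everything else goes.
KEEP = {"sup", "s", "strike"}


class FormattingFilter(HTMLParser):
    """Collect text plus a whitelist of tags, discarding all other markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.parts.append(f"<{tag}>")  # Attributes are dropped on purpose.

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def keep_formatting(cell_html: str) -> str:
    """Return the cell's text with only whitelisted tags, whitespace tidied."""
    parser = FormattingFilter()
    parser.feed(cell_html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

This works on the HTML of one cell at a time, so the style blocks that Office adds at the top of the file would need to be skipped before the cells are processed.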
Once the Glasgow Student data had been uploaded to the pilot project website I could then migrate it to the Books and Borrowing data structure. It took the best part of a day to write a script that processed the data, dealing with issues like multiple book levels, additional fields and generating ledgers and pages. After the migration there were 3 ledgers, 403 pages and 8191 borrowing records, with associations to 832 borrowers and 1080 books. With this in place I then began to import sample data from a previous study of Innerpeffray library. This was also in a spreadsheet, but was structured very differently and I needed to write a separate data import script to process it. There were some additional complications due to the character encoding the spreadsheet uses, which resulted in lots of hidden special characters being embedded in the text when the spreadsheet was converted to a plain text file for upload. This really messed up the upload process and took some time to get to the bottom of. Also, there is variation in page numbering (e.g. sometimes ‘3r’, sometimes ‘3 r’) and this resulted in multiple pages being created for each variation before I spotted the issue. Also, the spreadsheet is not always listed in page order – there were records from earlier pages added in amongst later pages. This also messed up the upload process before I spotted the issue and updated my script to take this into consideration. There were also some issues of data failing to upload when it contained accented characters, but I think I got to the bottom of that.
As with the Glasgow data, I created editions from holdings. I did add in a check to see whether any of the Glasgow editions matched the titles of the Innerpeffray titles, and used the existing Glasgow edition if this situation arose, but due to the differences in transcription I don’t think any existing editions have been used. This will need some manual correction at some point. Similarly, there may be some existing Glasgow authors that might be used rather than repeating the same information from Innerpeffray but due to differences in transcription I don’t think this will have happened either. As before, author data has for now just been uploaded into the ‘surname’ field and will need to be manually split up further and some Glasgow and Innerpeffray authors will need to be merged. For example, in the Glasgow data we have ‘Cave, William, 1637-1713.’ whereas in Innerpeffray we have ‘Cave, William, 1637-1713’. Because of the full stop at the end of the Glasgow author these have ended up being inserted as separate authors. After the upload process was complete there were 6550 borrowing records for Innerpeffray, split over 340 pages in one ledger. A total of 1017 unique borrowers and 840 unique book holdings were added to the library.
I created user accounts for the rest of the team to access the CMS and test things out once the sample data for these two libraries was in place. The project PI, Katie Halsey, spotted an issue with the autocomplete for selecting an existing edition not working, so I spent some time investigating this. It turns out that there are more character encoding issues with the data that are resulting in the JSON file that is generated for use in the autocomplete failing to be valid. This is also happening with the AJAX script that populates the fields once an autocomplete option is selected. I only investigated this on Friday afternoon and didn’t have time to fix it, but I’m hoping that next week if I fix the character encoding issues and ensure all line break characters are removed from the data then things will be ok.
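The page numbering and page order issues can both be handled by normalising each page label before it is used to create or look up a page. Here is a minimal Python sketch of the idea, not my actual import script: the function names are hypothetical, and it assumes a label is a number optionally followed by ‘r’ or ‘v’ (the source data only shows me the ‘r’ case, so the ‘v’ is a guess at verso pages).

```python
import re

# Assumed label shape: a number, optional spaces, then an optional r or v.
PAGE = re.compile(r"^\s*(\d+)\s*([rv]?)\s*$", re.IGNORECASE)


def normalise_page(label: str) -> str:
    """Turn variants such as '3 r' or '3R' into the single form '3r'."""
    match = PAGE.match(label)
    if match is None:
        return label.strip()  # Leave unrecognised labels for manual checking.
    return f"{int(match.group(1))}{match.group(2).lower()}"


def page_sort_key(label: str):
    """Sort by page number, then recto before verso; unknowns go last."""
    match = PAGE.match(label)
    if match is None:
        return (float("inf"), 0, label.strip())
    return (int(match.group(1)), 1 if match.group(2).lower() == "v" else 0, "")


def group_by_page(rows):
    """Group (page label, record) pairs under one entry per real page."""
    pages = {}
    for label, record in rows:
        pages.setdefault(normalise_page(label), []).append(record)
    return dict(sorted(pages.items(), key=lambda item: page_sort_key(item[0])))
```

Grouping all of the rows first and only then creating the pages means it no longer matters that records from earlier pages turn up amongst later ones in the spreadsheet.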
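One way to reduce the number of duplicates like the ‘Cave, William’ example would be to compare authors using a normalised key rather than the raw text. This is a small Python sketch under my own assumptions (ignore case, repeated spaces and trailing punctuation); the function names are hypothetical and it would not remove the need for manual merging where the transcriptions really do differ.

```python
def author_key(name: str) -> str:
    """Build a comparison key that ignores case, spacing and trailing dots."""
    collapsed = " ".join(name.split())
    return collapsed.rstrip(" .,;").casefold()


def find_or_add_author(index: dict, name: str) -> int:
    """Return the ID of a matching author, creating a new one if needed."""
    key = author_key(name)
    if key not in index:
        index[key] = len(index) + 1  # Stand-in for a database insert.
    return index[key]
```

The key is only used for matching, so the author name as originally transcribed would still be what gets stored and displayed.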
Other than the Books and Borrowing project, I spoke to Rhona Alcorn of the DSL this week to discuss timescales for DSL developments. I also fixed an issue with the Android version of the Scots School Dictionary app. I gave some advice to Cris Sarg, who is managing the data for the Glasgow Medical Humanities project, and I made some further tweaks to the ‘export data for publication’ facilities for Carole Hough’s REELS project.
I rounded off the week by working on sorting out the new way of storing dates for the Historical Thesaurus. Although we’d previously decided on a structure for the new dates system (which is much more rational and will allow labels to be associated with specific dates rather than the lexeme as a whole) I hadn’t generated the actual new date data. My earlier script (which I retrieved from my office on Monday) instead iterated through each lexeme, generated the new date information and only outputted data if the generated full date did not match the original full date. I’d saved this output as a spreadsheet and Fraser had gone through the rows and had identified any that needed fixed, updating the spreadsheet as required. I then wrote a script to fix the date columns that needed fixed in order for the new fulldate to be properly generated.
With that in place I then wrote a script to generate the new date information for each of the more than 700,000 lexemes in the system. I tried running this on the server initially, but this quickly timed out, meaning I had to run the script locally. I will then be able to import the table into the online database. The script took about 20 hours to run, but seems to have worked successfully, with almost 1.4 million date rows generated for the lexemes. Hopefully next week I’ll find the time to work on this some more.