Week Beginning 7th September 2020

This was a pretty busy week, involving lots of different projects.  I set up the systems for a new place-name project focusing on Ayrshire this week, based on the system that I initially developed for the Berwickshire project and that has subsequently been used for Kirkcudbrightshire and Mull.  It didn’t take too long to port the system over, but the PI also wanted the system to be populated with data from the GB1900 crowdsourcing project.  This project has transcribed every place-name on the GB1900 Ordnance Survey maps across the whole of the UK and is an amazing collection of data totalling some 2.5 million names.  I had previously extracted a subset of names for the Mull and Ulva project so thankfully had all of the scripts needed to get the information for Ayrshire.  Unfortunately what I didn’t have was the data in a database, as I’d previously extracted it to my PC at work.  This meant that I had to run the extraction script again on my home PC, which took about three days to work through all of the rows in the monstrous CSV file.  Once this was complete I could then extract the names found in the Ayrshire parishes that the project will be dealing with, resulting in almost 4,000 place-names.  However, this wasn’t the end of the process as while the extracted place-names had latitude and longitude they didn’t have grid references or altitude.  My place-names system is set up to automatically generate these values and I could customise the scripts to automatically apply the generated data to each of the 4,000 places.  Generating the grid reference was pretty straightforward but grabbing the altitude was less so, as it involved submitting a query to Google Maps and then inserting the returned value into my system using an AJAX call.
I ran into difficulties with my script exceeding the allowed number of Google Map queries and also the maximum number of page requests on our server, resulting in my PC getting blocked by the server and a ‘Forbidden’ error being displayed instead, but with some tweaking I managed to get everything working within the allowed limits.
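
The kind of tweak needed to stay within request limits can be sketched as a simple throttle that spaces calls out.  This is only a minimal Python illustration, not the actual script (which ran in the browser using AJAX); the names `throttled` and `fetch_altitude` and the one-second interval are assumptions made up for the example.

```python
import time


def throttled(func, min_interval, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so successive calls are at least min_interval seconds apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        now = clock()
        if last_call[0] is not None:
            wait = min_interval - (now - last_call[0])
            if wait > 0:
                sleep(wait)  # pause until the interval has elapsed
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper


def fetch_altitude(lat, lng):
    # Hypothetical placeholder: a real version would query an elevation service.
    return 0.0


# At most one lookup per second (the interval is an assumed value).
polite_fetch = throttled(fetch_altitude, min_interval=1.0)
```

Passing the clock and sleep functions in as parameters is purely a convenience so the wrapper can be checked without actually waiting.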

I also continued to work on the Second Edition of the Historical Thesaurus.  I set up a new version of the website that we will work on for the Second Edition, and created new versions of the database tables that this new site connects to.  I also spent some time thinking about how we will implement some kind of changelog or ‘history’ feature to track changes to the lexemes, their dates and corresponding categories.  I had a Zoom call with Marc and Fraser on Wednesday to discuss the developments and we realised that the date matching spreadsheets I’d generated last week could do with some additional columns from the OED data, namely links through to the entries on the OED website and also a note to say whether the definition contains ‘(a)’ or ‘(also’ as these would suggest the entry has multiple senses that may need a closer analysis of the dates.

I then started to update the new front-end to use the new date structure that we will use for the Second Edition (with dates stored in a separate date table rather than split across almost 20 different date fields in the lexeme table).  I updated the timeline visualisations (mini and full) to use this new date table, and although this took quite some time to get my head around the resulting code is MUCH less complicated than the horrible code I had to write to deal with the old 20-odd date columns.  For example, the code to generate the data for the mini timelines is about 70 lines long now as opposed to over 400 previously.

The timelines use the new data tables in the category browse and the search results.  I also spotted some dates weren’t working properly with the old system but are working properly now.  I then updated the ‘label’ autocomplete in the advanced search to use the labels in the new date table.  What I still need to do is update the search to actually search for the new labels and also to search the new date tables for both ‘simple’ and ‘complex’ year searches.  This might be a little tricky, and I will continue on this next week.

Also this week I gave Gerry McKeever some advice about preserving the data of his Regional Romanticism project, spoke to the DSL people about the wording of the search results page, gave feedback on and wrote some sections for Matthew Creasy’s Chancellor’s Fund proposal, gave feedback to Craig Lamont regarding the structure of a spreadsheet for holding data about the correspondence of Robert Burns and gave some advice to Rob Maslen about the stats for his ‘City of Lost Books’ blog.  I also made a couple of tweaks to the content management system for the Books and Borrowers project based on feedback from the team.

I spent the remainder of the week working on the redevelopment of the Anglo-Norman dictionary.  I updated the search results page to style the parts of speech to make it clearer where one ends and the next begins.  I also reworked the ‘forms’ section to add in a cut-off point for entries that have a huge number of forms.  In such cases the long list is cut off and an ellipsis is added in, together with an ‘expand’ button.  Pressing on this scrolls down the full list of forms and the button is replaced with a ‘collapse’ button.  I also updated the search so that it no longer includes cross references (these are to be used for the ‘Browse’ list only) and the quick search now defaults to an exact match search whether you select an item from the auto-complete or not.  Previously it performed an exact match if you selected an item but defaulted to a partial match if you didn’t.  Now if you search for ‘mes’ (for example) and press enter or the search button your results are for “mes” (exactly).  I suspect most people will select ‘mes’ from the list of options, which already did this, though.  It is also still possible to use the question mark wildcard with an ‘exact’ search, e.g. “m?s” will find 14 entries that have three letter forms beginning with ‘m’ and ending in ‘s’.
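
To illustrate how an ‘exact’ search can still honour the question mark wildcard, here is a minimal Python sketch.  It is not the dictionary’s actual search code; the function names and the sample list of forms are made up for the example.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a search term with '?' wildcards into a regular expression."""
    # '?' stands for exactly one character; everything else is literal.
    parts = ["." if ch == "?" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def exact_match(pattern, forms):
    """Return the forms that match the whole pattern, not just part of it."""
    regex = wildcard_to_regex(pattern)
    return [form for form in forms if regex.fullmatch(form)]


forms = ["mes", "mas", "mis", "mees", "messe", "me"]
print(exact_match("m?s", forms))  # ['mes', 'mas', 'mis']
print(exact_match("mes", forms))  # ['mes']
```

Using `fullmatch` is what makes the search ‘exact’: a partial match search would instead look for the pattern anywhere inside the form.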

I also updated the display of the parts of speech so that they are in order of appearance in the XML rather than alphabetically and I’ve updated the ‘v.a.’ and ‘v.n.’ labels as the editor requested.  I also updated the ‘entry’ page to make the ‘results’ tab load by default when reaching an entry from the search results page or when choosing a different entry in the search results tab.  In addition, the search result navigation buttons no longer appear in the search tab if all the results fit on the page and the ‘clear search’ button now works properly.  Also, on the search results page the pagination options now only appear if there is more than one page of results.

On Friday I began to process the entry XML for display on the entry page, which was pretty slow going, wading through the XSLT file that is used to transform the XML to HTML for display.  Unfortunately I can’t just use the existing XSLT file from the old site because we’re using the editor’s version of the XML and not the system version, and the two are structurally very different in places.

So far I’ve been dealing with forms and have managed to get the forms listed, with grammatical labels displayed where available and commas separating forms and semi-colons separating groups of forms.  Deviant forms are surrounded by brackets.  Where there are lots of forms the area is cut off as with the search results.  I still need to add in references where these appear, which is what I’ll tackle next week.  Hopefully now I’ve started to get my head around the XML a bit progress with the rest of the page will be a little speedier, but there will undoubtedly be many more complexities that will need to be dealt with.

Week Beginning 10th August 2020

I was back at work this week after spending two weeks on holiday, during which time we went to Skye, Crinan and Peebles.  It was really great to see some different places after being cooped up at home for 19 weeks and I feel much better for having been away.  Unfortunately during this week I developed a pretty severe toothache and had to make an emergency appointment with the dentist on Thursday morning.  It turns out I need root canal surgery and am now booked in to have this next Tuesday, but until then I need to try and just cope with the pain, which has been almost unbearable at times, despite regular doses of both ibuprofen and paracetamol.  This did affect my ability to work a bit on Thursday afternoon and Friday, but I managed to struggle through.

On my return to work from my holiday on Monday I spent some time catching up with emails that had accumulated whilst I was away, including replying to Roslyn Potter in Scottish Literature about a project website, replying to Jennifer Smith about giving a group of students access to the SCOSYA data and making changes to the Berwickshire Place-names website to make it more attractive to the REF reviewers based on feedback passed on by Jeremy Smith.  I also created a series of high-resolution screenshots of the resource for Carole Hough for a publication and had an email chat with Luca Guariento about linked open data.

I also fixed some issues with the Galloway Glens projects that Thomas Clancy had spotted, including an issue with the place-name element page which was not ordering accented characters properly – all accented characters were being listed at the end rather than with their non-accented versions.  It turned out that while the underlying database orders accented characters correctly, for the elements list I need to get a list of elements used in place-names and a list of elements used in historical forms and then I have to combine these lists and reorder the resulting single list.  This part of the process was not dealing with all accented characters, only a limited set that I’d created for Berwickshire that also dealt with ashes and thorns.  Instead I added in a function taken from WordPress that converts all accented characters to their unaccented equivalent for the purposes of ordering and this ensured the order of the elements list was correct.
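
As an illustration of ordering by unaccented equivalents, here is a minimal Python sketch.  Note that it swaps in a different technique from the one used on the site: rather than the WordPress function (which works from a lookup table), it uses Unicode decomposition to strip accents.  The small table for ashes and thorns and the sample element list are assumptions made up for the example.

```python
import unicodedata

# Letters with no Unicode decomposition need explicit replacements.
# This small table is an assumption for illustration, not the project's list.
SPECIAL = {"æ": "ae", "Æ": "AE", "þ": "th", "Þ": "Th", "ð": "d", "Ð": "D"}


def sort_key(text):
    """Return an unaccented, lower-case version of text for ordering only."""
    replaced = "".join(SPECIAL.get(ch, ch) for ch in text)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


elements = ["àrd", "zetland", "beinn", "èaglais", "abhainn", "dùn", "æcer"]
print(sorted(elements, key=sort_key))
# ['abhainn', 'æcer', 'àrd', 'beinn', 'dùn', 'èaglais', 'zetland']
```

The key is only used for sorting, so the elements themselves keep their accents when displayed.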

The rest of my week was divided between three projects, the first of which was the Books and Borrowing project.  For this I spent some time working with some of the digitised images of the register pages.  We now have access to the images from Westerkirk library and in these the records appear in a table that spreads across both recto and verso pages but we have images of the individual pages.  The project RA who is transcribing the records is treating both recto and verso as a single ‘page’ in the system, which makes sense.  We therefore need to stitch the r and v images together into one single image to be associated with this ‘page’.  I downloaded all of the images and have found a way to automatically join two page images together.  However, there is rather a lot of overlap in the images, meaning the book appears to have two joins and some columns are repeated.  I could possibly try to automatically crop the images before joining them, but there is quite a bit of variation in the size of the overlap so this is never going to be perfect and may result in some information getting lost.  The other alternative would be to manually crop and join the images, which I did some experimentation with.  It’s still not perfect due to the angle of the page changing between shots, but it’s a lot better.  The downside with this approach is that someone would have to do the task.  There are about 230 images, so about 115 joins, each one taking 2-3 minutes to create, so maybe about 5 or so hours of effort.  I’ve left it with the PI and Co-I to decide what to do about this.  I also downloaded the images for Volume 1 of the register for Innerpeffray library and created tilesets for these that will allow the images to be zoomed and panned.  I also fixed a bug relating to adding new book items to a record and responded to some feedback about the CMS.

My second major project of the week was the Anglo-Norman Dictionary.  This week I began writing a high-level requirements document for the new AND website that I will be developing.  This meant going through the existing site in detail and considering which features will be retained, how things might be handled better, and how I might develop the site.  I made good progress with the document, and by the end of the week I’d covered the main site.  Next week I need to consider the new API for accessing the data and the staff pages for uploading and publishing new or newly edited entries.  I also responded to a few questions from Heather Pagan of the AND about the searches and read through and gave feedback on a completed draft of the AHRC proposal that the team are hoping to submit next month.

My final major project of the week was the Historical Thesaurus, for which I updated and re-executed my OED date extraction script based on feedback from Fraser and Marc.  It was a long and complicated process to update the script as there are literally millions of dates and some issues only appear a handful of times, so tracking them down and testing things is tricky.  However, I made the following changes: I added a ‘sortdate_new’ column to the main OED lexeme table that holds the sortdate value from the new XML files, which may differ from the original value.  I’ve done some testing and rather strangely there are many occasions where the new sortdate differs from the old, but the ‘revised’ flag is not set to ‘true’.  I also updated the new OED date table to include a new column where the full date text is contained, as I thought this would be useful for tracing back issues.  E.g. if the OED date is ‘?c1225’ this is stored here.  The actual numeric year in my table now comes from the ‘year’ attribute in the XML instead.  This always contains the numeric value in the OED date, e.g. <q year="1330"><date>c1330</date></q>.  New lexemes in the data are now getting added into the OED lexeme table and are also having their dates processed.  I’ve added a new column called ‘newaugust2020’ to track these new lexemes.  We’ll possibly have to try and match them up with existing HT lexemes at some point, unless we can consider them all to be ‘new’, meaning they’ll have no matches.  The script also now stores all of the various OE dates, rather than one single OE date of 650 being added for all.  I set the script running on Thursday and by Sunday it had finished executing, resulting in 3,912,109 dates being added and 4061 new words.
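
The split between the numeric year and the full date text can be sketched as follows.  This is a minimal Python illustration rather than the actual extraction script; the `<entry>` wrapper element, the function name and the dictionary keys are assumptions, with only the `<q year="…"><date>…</date></q>` shape taken from the data described above.

```python
import xml.etree.ElementTree as ET


def extract_dates(xml_text):
    """Collect the numeric year and the full date text for each citation."""
    root = ET.fromstring(xml_text)
    rows = []
    for q in root.iter("q"):
        year = q.get("year")   # numeric value, e.g. "1330"
        date = q.find("date")  # full text, e.g. "c1330"
        if year is None or date is None:
            continue           # skip citations without both parts
        rows.append({"year": int(year), "full_date": (date.text or "").strip()})
    return rows


# The <entry> wrapper is a hypothetical container for the sample citations.
sample = """<entry>
  <q year="1225"><date>?c1225</date></q>
  <q year="1330"><date>c1330</date></q>
</entry>"""

for row in extract_dates(sample):
    print(row["year"], row["full_date"])
# 1225 ?c1225
# 1330 c1330
```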

Week Beginning 20th July 2020

Week 19 of Lockdown, and it was a short week for me as the Monday was the Glasgow Fair holiday.  I spent a couple of days this week continuing to add features to the content management system for the Books and Borrowing project.  I have now implemented the ‘normalised occupations’ part of the CMS.  Originally occupations were just going to be a set of keywords, allowing one or more keywords to be associated with a borrower.  However, we have been liaising with another project that has already produced a list of occupations and we have agreed to share their list.  This is slightly different as it is hierarchical, with a top-level ‘parent’ containing multiple main occupations. E.g. ‘Religion and Clergy’ features ‘Bishop’.  However, for our project we needed a third hierarchical level to differentiate types of minister/priest, so I’ve had to add this in too.  I’ve achieved this by means of a parent occupation ID in the database, which is ‘null’ for top-level occupations and contains the ID of the parent category for all other occupations.

I completed work on the page to browse occupations, arranging the hierarchical occupations in a nested structure that features a count of the number of borrowers associated with the occupation to the right of the occupation name.  These are all currently zero, but once some associations are made the numbers will go up and you’ll be able to click on the count to bring up a list of all associated borrowers, with links through to each borrower.  If an occupation has any child occupations a ‘+’ icon appears beside it.  Press on this to view the child occupations, which also have counts.  The counts for ‘parent’ occupations tally up all of the totals for the child occupations, and clicking on one of these counts will display all borrowers assigned to all child occupations.  If an occupation is empty there is a ‘delete’ button beside it.  As the list of occupations is going to be fairly fixed I didn’t add in an ‘edit’ facility – if an occupation needs editing I can do it directly through the database, or it can be deleted and a new version created.

[Screenshot: some of the occupations in the ‘browse’ page]

I also created facilities to add new occupations.  You can enter an occupation name and optionally specify a parent occupation from a drop-down list.  Doing so will add the new occupation as a child of the selected category, either at the second level if a top level parent is selected (e.g. ‘Agriculture’) or at the third level if a second level parent is selected (e.g. ‘Farmer’).  If you don’t include a parent the occupation will become a new top-level grouping.  I used this feature to upload all of the occupations, and it worked very well.
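
The parent ID approach and the way parent counts tally up their children can be sketched as follows.  This is a minimal Python illustration, not the CMS code; the third-level occupation name, the borrower associations and the function names are all made up for the example.

```python
from collections import Counter, defaultdict

# (id, name, parent_id) rows; parent_id is None for top-level occupations.
occupations = [
    (1, "Agriculture", None),
    (2, "Farmer", 1),
    (3, "Tenant farmer", 2),  # hypothetical third-level entry
    (4, "Religion and Clergy", None),
    (5, "Bishop", 4),
]

# (borrower_id, occupation_id) associations; made-up sample data.
borrower_occupations = [(101, 2), (102, 3), (103, 3), (104, 5)]


def build_children(rows):
    """Map each parent ID to the IDs of its direct children."""
    children = defaultdict(list)
    for occ_id, _name, parent_id in rows:
        children[parent_id].append(occ_id)
    return children


def total_count(occ_id, children, direct_counts):
    """Associations attached to occ_id plus those of all its descendants."""
    return direct_counts.get(occ_id, 0) + sum(
        total_count(child, children, direct_counts)
        for child in children.get(occ_id, [])
    )


children = build_children(occupations)
direct = Counter(occ_id for _borrower, occ_id in borrower_occupations)
for occ_id, name, parent_id in occupations:
    if parent_id is None:
        print(name, total_count(occ_id, children, direct))
# Agriculture 3
# Religion and Clergy 1
```

One thing this simple tally does not handle is a borrower assigned to both a parent and one of its children, who would be counted twice.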

I then updated the ‘Borrowers’ tab in the ‘Browse Libraries’ page to add ‘Normalised Occupation’ to the list of columns in the table.  The ‘Add’ and ‘Edit’ borrower facilities also now feature ‘Normalised Occupation’, which replicates the nested structure from the ‘browse occupations’ page, but with checkboxes beside each main occupation.  You can select any number of occupations for a borrower and when you press the ‘Upload’ or ‘Edit’ button your choice will be saved.  Deselecting all ticked checkboxes will clear all occupations for the borrower.  If you edit a borrower who has one or more occupations selected, in addition to the relevant checkboxes being ticked, the occupations with their full hierarchies also appear above the list of occupations, so you can easily see what is already selected. I also updated the ‘Add’ and ‘Edit’ borrowing record pages so that whenever a borrower appears in the forms the normalised occupations feature also appears.

I also added in the option to view page images.  Currently the only ledgers that have page images are the three Glasgow ones, but more will be added in due course.  When viewing a page in a ledger that includes a page image you will see the ‘Page Image’ button above the table of records.  Press on this and a new browser tab will open.  It includes a link through to the full-size image of the page if you want to open this in your browser or download it to open in a graphics package.  It also features the ‘zoom and pan’ interface that allows you to look at the image in the same manner as you’d look at a Google Map.  You can also view this full screen by pressing on the button in the top right of the image.

Also this week I made further tweaks to the script I’d written to update lexeme start and end dates in the Historical Thesaurus based on citation dates in the OED.  I’d sent a sample output of 10,000 rows to Fraser last week and he got back to me with some suggestions and observations.  I’m going to have to rerun the script I wrote to extract the more than 3 million citation dates from the OED as some of the data needs to be processed differently, but as this script will take several days to run and I’m on holiday next week this isn’t something I can do right now.  However, I managed to change the way the date matching script runs to fix some bugs and make the various processes easier to track.  I also generated a list of all of the distinct labels in the OED data, with counts of the number of times these appear.  Labels are associated with specific citation dates, thankfully.  Only a handful are actually used lots of times, and many of the others appear to be used as a ‘notes’ field rather than as a more general label.

In addition to the above I also had a further conversation with Heather Pagan about the data management plan for the AND’s new proposal, responded to a query from Kathryn Cooper about the website I set up for her at the end of last year, responded to a couple of separate requests from post-grad students in Scottish Literature, spoke to Thomas Clancy about the start date for his Place-Names of Iona project, which got funded recently, helped with some issues with Matthew Creasy’s Scottish Cosmopolitanism website and spoke to Carole Hough about making a few tweaks to the Berwickshire Place-names website for REF.

I’m going to be on holiday for the next two weeks, so there will be no further updates from me for a while.

Week Beginning 13th July 2020

This was week 18 of Lockdown, which is now definitely easing here.  I’m still working from home, though, and will be for the foreseeable future.  I took Friday off this week, so it was a four-day week for me.  I spent about half of this time on the Books and Borrowing project, during which time I returned to adding features to the content management system, after spending recent weeks importing datasets.  I added a number of indexes to the underlying database which should speed up the loading of certain pages considerably, e.g. the browse books, borrowers and author pages.  I then updated the ‘Books’ tab when viewing a library (i.e. the page that lists all of the book holdings in the library) so that it now lists the number of book holdings in the library above the table.  The table itself now has separate columns for all additional fields that have been created for book holdings in the library and it is now possible to order the table by any of the headings (pressing on a heading a second time reverses the ordering).  The count of ‘Borrowing records’ for each book in the table is now a button and pressing on it brings up a popup listing all of the borrowing records that are associated with the book holding record, and from this pop-up you can then follow a link to view the borrowing record you’re interested in.  I then made similar changes to the ‘Borrowers’ tab when viewing a library (i.e. the page that lists all of the borrowers the library has). It also now displays the total number of borrowers at the top.  This table already allowed the reordering by any column, so that’s not new, but as above, the ‘Borrowing records’ count is now a link that when clicked on opens a list of all of the borrowing records the borrower is associated with.

The big new feature I implemented this week was borrower cross references.  These can be added via the ‘Borrowers’ tab within a library when adding or editing a borrower on this page.  When adding or editing a borrower there is now a section of the form labelled ‘Cross-references to other borrowers’.  If there are any existing cross references these will appear here, with a checkbox beside each that you can tick if you want to delete the cross reference (the user can tick the box then press ‘Edit’ to edit the borrower and the reference will be deleted).  Any number of new cross references can be added by pressing on the ‘Add a cross-reference’ button (multiple times, if required).  Doing so adds two fields to the form, one for a ‘description’, which is the text that shows how the current borrower links to the referenced borrower, and one for ‘referenced borrower’, which is an auto-complete.  Type in a name or part of a name and any borrower that matches in any library will be listed.  The library appears in brackets after the borrower’s name to help differentiate records.  Select a borrower and then when the ‘Add’ or ‘Edit’ button is pressed for the borrower the cross reference will be made.

Cross-references work in both directions – if you add a cross reference from Borrower A to Borrower B you don’t then need to load up the record for Borrower B to add a reference back to Borrower A.  The description text will sit between the borrower whose form you make the cross reference on and the referenced borrower you select, so if you’re on the edit form for Borrower A and link to Borrower B and the description is ‘is the son of’ then the cross reference will appear as ‘Borrower A is the son of Borrower B’.  If you then view Borrower B the cross reference will still be written in this order.  I also updated the table of borrowers to add in a new ‘X-Refs’ column that lists all cross-references for a borrower.
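
One way to get cross-references that work in both directions is to store each one once and look it up from either end.  The following is a minimal Python sketch under that assumption; it is not the CMS code, and the data structures and function name are made up for the example.

```python
# Each cross-reference is stored once, in the direction it was entered:
# (from_borrower_id, description, to_borrower_id)
cross_references = [
    (1, "is the son of", 2),
]

borrowers = {1: "Borrower A", 2: "Borrower B"}


def cross_refs_for(borrower_id, refs, names):
    """All cross-references mentioning borrower_id, written in stored order."""
    return [
        f"{names[src]} {description} {names[dst]}"
        for src, description, dst in refs
        if borrower_id in (src, dst)  # match on either end of the reference
    ]


print(cross_refs_for(1, cross_references, borrowers))
# ['Borrower A is the son of Borrower B']
print(cross_refs_for(2, cross_references, borrowers))
# ['Borrower A is the son of Borrower B']
```

Because the text is always built from the stored direction, viewing Borrower B gives the same wording as viewing Borrower A.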

I spent the remainder of my working week completing smaller tasks for a variety of projects, such as updating the spreadsheet output of duplicate child entries for the DSL people, getting an output of the latest version of the Thesaurus of Old English data for Fraser, advising Eleanor Lawson on ‘.ac.uk’ domain names and having a chat with Simon Taylor about the pilot Place-names of Fife project that I worked on with him several years ago.  I also wrote a Data Management Plan for a new AHRC proposal the Anglo-Norman Dictionary people are putting together, which involved a lengthy email correspondence with Heather Pagan at Aberystwyth.

Finally, I returned to the ongoing task of merging data from the Oxford English Dictionary with the Historical Thesaurus.  We are currently attempting to extract citation dates from OED entries in order to update the dates of usage that we have in the HT.  This process uses the new table I recently generated from the OED XML dataset which contains every citation date for every word in the OED (more than 3 million dates).  Fraser had prepared a document listing how he and Marc would like the HT dates to be updated (e.g. if the first OED citation date is earlier than the HT start date by 140 years or more then use the OED citation date as the suggested change).  Each rule was to be given its own type, so that we could check through each type individually to make sure the rules were working ok.

It took about a day to write an initial version of the script, which I ran on the first 10,000 HT lexemes as a test.  I didn’t split the output into different tables depending on the type, but instead exported everything to a spreadsheet so Marc and Fraser could look through it.

In the spreadsheet if there is no ‘type’ for a row it means it didn’t match any of the criteria, but I included these rows anyway so we can check whether there are any other criteria the rows should match.  I also included all the OED citation dates (rather than just the first and last) for reference.  I noted that Fraser’s document doesn’t seem to take labels into consideration.  There are some labels in the data, and sometimes there’s a new label for an OED start or end date when nothing else is different, e.g. htid 1479 ‘Shore-going’:  This row has no ‘type’ but does have new data from the OED.

Another issue I spotted is that as the same ‘type’ variable is set when a start date matches the criteria and then when an end date matches the criteria, the ‘type’ as set during start date is then replaced with the ‘type’ for end date.  I think, therefore, that we might have to split the start and end processes up, or append the end process type to the start process type rather than replacing it (so e.g. type 2-13 rather than type 2 being replaced by type 13).  I also noticed that there are some lexemes where the HT has ‘current’ but the OED has a much earlier last citation date (e.g. htid 73 ‘temporal’ has 9999 in the HT but 1832 in the OED).  Such cases are not currently considered.
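
The idea of appending the end type to the start type rather than replacing it can be sketched as follows.  This is a minimal Python illustration and not the actual script: only the 140-year start date rule comes from Fraser’s document as described above, while the type numbers attached to each rule, the end date rule and the sample years are assumptions made up for the example.

```python
def start_type(ht_start, oed_first):
    # Rule described above: the first OED citation is at least 140 years
    # earlier than the HT start date.  The type number 2 is an assumed label.
    if ht_start - oed_first >= 140:
        return 2
    return None


def end_type(ht_end, oed_last):
    # Hypothetical end date rule, included only to show how types combine.
    # 9999 is the HT value for 'current'; such rows are left alone here.
    if ht_end != 9999 and oed_last - ht_end >= 140:
        return 13
    return None


def combined_type(ht_start, ht_end, oed_first, oed_last):
    """Keep both the start and end types instead of overwriting one."""
    types = [
        t
        for t in (start_type(ht_start, oed_first), end_type(ht_end, oed_last))
        if t is not None
    ]
    return "-".join(str(t) for t in types) or None


print(combined_type(1601, 1700, 1450, 1850))  # 2-13
print(combined_type(1601, 9999, 1450, 1832))  # 2
print(combined_type(1601, 1700, 1600, 1700))  # None
```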

Finally, according to the document, Antes and Circas are only considered for update if the OED and HT date is the same, but there are many cases where the start / end OED date is picked to replace the HT date (because it’s different) and it has an ‘a’ or ‘c’ and this would then be lost.  Currently I’m including the ‘a’ or ‘c’ in such cases, but I can remove this if needs be (e.g. HT 37 ‘orb’ has HT start date 1601 (no ‘a’ or ‘c’) but this is to be replaced with OED 1550 that has an ‘a’).  Clearly the script will need to be tweaked based on feedback from Marc and Fraser, but I feel like we’re finally making some decent progress with this after all of the preparatory work that was required to get to this point.

Next Monday is the Glasgow Fair holiday, so I won’t be back to work until the Tuesday.

Week Beginning 6th July 2020

Week 16 of Lockdown and still working from home.  I continued working on the data import for the Books and Borrowers project this week.  I wrote a script to import data from Haddington, which took some time due to the large number of additional fields in the data (15 across Borrowers, Holdings and Borrowings), but executing it resulted in a further 5,163 borrowing records across 2 ledgers and 494 pages being added, including 1399 book holding records and 717 borrowers.

I then moved onto the datasets from Leighton and Wigtown.  Leighton was a much smaller dataset, with just 193 borrowing records over 18 pages in one ledger and involving 18 borrowers and 71 books.  As before, I have just created book holding records for these (rather than project-wide edition records), although in this case there are authors for books too, which I have also created.  Wigtown was another smaller dataset.  The spreadsheet has three sheets, the first is a list of borrowers, the second a list of borrowings and the third a list of books.  However, no unique identifiers are used to connect the borrowers and books to the information in the borrowings sheet and there’s no other field that matches across the sheets to allow the data to be automatically connected up.  For example, in the Books sheet there is the book ‘History of Edinburgh’ by author ‘Arnot, Hugo’ but in the borrowings tab author surname and forename are split into different columns (so ‘Arnot’ and ‘Hugo’) and book titles don’t match (in this case the book appears as simply ‘Edinburgh’ in the borrowings).  Therefore I’ve not been able to automatically pull in the information from the books sheet.  However, as there are only 59 books in the books sheet it shouldn’t take too much time to manually add the necessary data when creating Edition records.  It’s a similar issue with Borrowers in the first sheet – they appear with name in one column (e.g. ‘Douglas, Andrew’) but in the Borrowings sheet the names are split into separate forename and surname columns.  There are also instances of people with the same name (e.g. ‘Stewart, John’) but without unique identifiers there’s no way to differentiate these.  There are only 110 people listed in the Borrowers sheet, and only 43 in the actual borrowing data, so again, it’s probably better if any details that are required are added in manually.

I imported a total of 898 borrowing records for Wigtown.  As there is no page or ledger information in the data I just added these all to one page in a made-up ledger.  It does however mean that the page can take quite a while to load in the CMS.  There are 43 associated borrowers and 53 associated books, which again have been created as Holding records only and have associated authors.  However, there are multiple Book Items created for many of these 53 books – there are actually 224 book items.  This is because the spreadsheet contains a separate ‘Volume’ column and a book may be listed with the same title but a different volume.  In such cases a Holding record is made for the book (e.g. ‘Decline and Fall of Rome’) and an Item is made for each Volume that appears (in this case 12 items for the listed volumes 1-12 across the dataset).  With these datasets imported I have now processed all of the existing data I have access to, other than the Glasgow Professors borrowing records, but these are still being worked on.
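
The way a single Holding ends up with several Items can be sketched as follows.  This is a minimal Python illustration, not the import script; the column names and sample rows are made up, and treating a row with no volume as a single Item is an assumption.

```python
# Made-up sample borrowing rows; 'title' and 'volume' are assumed column names.
rows = [
    {"title": "Decline and Fall of Rome", "volume": "1"},
    {"title": "Decline and Fall of Rome", "volume": "2"},
    {"title": "Decline and Fall of Rome", "volume": "1"},
    {"title": "Edinburgh", "volume": ""},
]


def build_holdings(rows):
    """Create one Holding per title and one Item per distinct volume."""
    holdings = {}
    for row in rows:
        volumes = holdings.setdefault(row["title"], [])
        if row["volume"] not in volumes:
            volumes.append(row["volume"])  # a new Item for an unseen volume
    return holdings


for title, volumes in build_holdings(rows).items():
    print(title, len(volumes))
# Decline and Fall of Rome 2
# Edinburgh 1
```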

I did some other tasks for the project this week as well, including reviewing the digitisation policy document for the project, which lists guidelines for the team to follow when they have to take photos of ledger pages themselves in libraries where no professional digitisation service is available.  I also discussed how borrower occupations will be handled in the system with Katie.

In addition to the Books and Borrowers project I found time to work on a number of other projects this week too.  I wrote a Data Management Plan for an AHRC Networking proposal that Carolyn Jess-Cooke in English Literature is putting together and I had an email conversation with Heather Pagan of the Anglo-Norman Dictionary about the Data Management Plan she wants me to write for a new AHRC proposal that Glasgow will be involved with.  I responded to a query about a place-names project from Thomas Clancy, a query about App certification from Brian McKenna in IT Services and a query about domain name registration from Eleanor Lawson at QMU.  Also (outside of work time) I’ve been helping my brother-in-law set up Beacon Genealogy, through which he offers genealogy and family history research services.

Also this week I worked with Jennifer Smith to make a number of changes to the content of the SCOSYA website (https://scotssyntaxatlas.ac.uk/) to provide more information about the project for REF purposes and I added a new dataset to the interactive map of Burns Suppers that I’m creating for Paul Malgrati in Scottish Literature.  I also went through all of the WordPress sites I manage and upgraded them to the most recent version of WordPress.

Finally, I spent some time writing scripts for the DSL people to help identify child entries in the DOST and SND datasets that haven’t been properly merged with main entries when exported from their editing software.  In such cases the child entries have been added to the main entries, but then they haven’t been removed as separate entries in the output data, meaning the child entries appear twice.  When attempting to process the SND data I discovered there were some errors in the XML file (mismatched tags) that prevented my script from processing the file, so I had to spend some time tracking these down and fixing them.  But once this had been done my script could go through the entire dataset, look for an ID that appeared as a URL in one entry and as an ID of another entry and in such cases pull out the IDs and the full XML of each entry and export it into an HTML table.  There were about 180 duplicate child entries in DOST but a lot more in SND (the DOST file is about 1.5MB, the SND one is about 50MB).  Hopefully once the DSL people have analysed the data we can then strip out the unnecessary child entries and have a better dataset to import into the new editing system the DSL is going to be using.

 

Week Beginning 29th June 2020

This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day.  I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects.  I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records.  This week I began by focussing on the data from St Andrews University library.

The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting.  However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.

The data I’ve got are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100.  The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers.  Unfortunately the CSV version of the data doesn’t include the links or the linked to data, and as I wanted to try and pull in the data found on the linked pages I therefore needed to process the HTML instead.

I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn.  From the filenames my script could ascertain the ledger volume, its dates and the page number.  For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10.  The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
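
As a rough illustration of the filename handling described above (not the actual script), the following Python sketch parses the one example filename quoted in this post and builds ‘previous’ and ‘next’ links for the pages of a ledger.  The regular expression is inferred from that single example, and the names parse_page_filename and link_pages are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the single example filename quoted in the post;
# other files in the directory may not follow it exactly (assumption).
FILENAME_RE = re.compile(
    r"^Page_UYLY205_(?P<ledger>\d+)_Receipt_Book_"
    r"(?P<start>\d{4})-(?P<end>\d{4})\.djvu_(?P<page>\d+)\.html$"
)


class PageInfo(NamedTuple):
    ledger: int
    start_year: int
    end_year: int
    page: int


def parse_page_filename(filename: str) -> Optional[PageInfo]:
    """Return ledger, date range and page number, or None if no match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return PageInfo(int(m["ledger"]), int(m["start"]), int(m["end"]), int(m["page"]))


def link_pages(page_numbers):
    """Map each page number in one ledger to its (previous, next) pages."""
    ordered = sorted(page_numbers)
    links = {}
    for i, page in enumerate(ordered):
        previous = ordered[i - 1] if i > 0 else None
        following = ordered[i + 1] if i + 1 < len(ordered) else None
        links[page] = (previous, following)
    return links
```

Sorting numerically before linking matters here, because a plain alphabetical file listing would put page 10 before page 2.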

The actual data in the file posed further problems.  As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system.  Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows).  I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically.  The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’).  I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.
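
The row-merging rule above can be sketched in Python as follows.  This is an illustration under stated assumptions rather than the real import code: each row is taken to be a list of cell strings, with the shelf code in the second cell and the borrowed text in the third, and the function name merge_rows is hypothetical.

```python
def merge_rows(rows, code_col=1, text_col=2):
    """Start a new record at each row with a shelf code.

    Rows without a code are appended to the most recent record.
    Column positions are assumptions based on the page layout
    described in the post.
    """
    records = []
    for row in rows:
        code = row[code_col].strip()
        text = row[text_col].strip()
        if code:
            records.append({"code": code, "transcription": text})
        elif records and text:
            # Continuation row: belongs to the record opened above it.
            records[-1]["transcription"] += " " + text
        # Code-less rows before the first record are skipped here;
        # a real import would probably want to log them for checking.
    return records
```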

I also used this approach to set up books and borrowers too.  If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to know which is which.  However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows.  I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code.  There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.

The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system.  I created four new ‘Additional Fields’ for St Andrews as follows:

  • Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
  • Code: This contains the full text of the second column (e.g. J.5.2)
  • Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
  • Original returned text: This contains the full text of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)

In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links.  Where subsequent rows contain data in this column but no code, this data is then appended to the transcription.  E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.

The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required.  Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes.  E.g. for the above the notes field contains:

‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit &amp; au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. [1745]) <a href="http://library.st-andrews.ac.uk/record=b2447402~S1">http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>

----- profr_Shaws| <p><a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>

----- James_Key| <p>Possibly James Kay: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>

-----’

As mentioned earlier, the script also generates book and borrower records based on the linked pages.  I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews.  In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with spaces (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field.  One book item is created for each holding to be used to link to the corresponding borrowing records.

For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. ‘<p>Possibly Thomas Duncan: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’).  All borrowers are linked to records as ‘Main’ borrowers.

During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table.  I therefore updated my script to check for the existence of this heading row, and if it exists my script then grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page.  After my script had finished running we had 11,147 borrowing records, 996 borrowers and 6,395 book holding records for St Andrews in the system.

I then moved on to looking at the data for Selkirk library.  This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books, and with borrowers and books connected to borrowings via unique identifiers.  Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records and these will need to be manually generated.  The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.

As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records).  There are 612 Holding records for Selkirk.  I also processed the borrower records, resulting in 86 borrower records being added.  I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’ and the only other additional field is in the Holding records for ‘Subject’, which will eventually be merged with our ‘Genre’ when this feature becomes available.

Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project.  I recommended that the filenames shouldn’t have spaces in them as these can be troublesome on some operating systems and that we’d want a character to use as a delimiter between the parts of the filename that wouldn’t appear elsewhere in the filename so it’s easy to split up the filename.  I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included.  Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.

Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’) but if we’re using a specific delimiting character to separate parts of the filename then using these characters wouldn’t be necessary and I suggested it would be better to not use ‘L’ as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’ which might confuse future human users.

Instead I suggested using a ‘-’ in place of spaces and a ‘_’ as a delimiter, and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filename (no apostrophes, commas, colons, semi-colons, ampersands etc.) and make sure the ‘-’ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office.  This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.

Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case.  Following these guidelines the filenames would look something like this:

  • jpg
  • dumfries-presbytery_2_3v.jpg
  • standrews-ul_9_300r.jpg
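
A naming convention like this is easy to check automatically.  The Python sketch below is only an illustration: it assumes that every filename has exactly three underscore-separated parts (library, ledger and page) followed by ‘.jpg’, which is inferred from the examples above rather than stated anywhere as a rule, and the function name split_image_filename is hypothetical.

```python
import re

# Lower-case letters and digits only, '-' inside a part, '_' between parts.
# The three-part library_ledger_page shape is an assumption based on the
# example filenames, not a documented project rule.
IMAGE_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*_[a-z0-9]+_[a-z0-9]+\.jpg")


def split_image_filename(name):
    """Return (library, ledger, page), or None if the name breaks the rules."""
    if not IMAGE_RE.fullmatch(name):
        return None
    library, ledger, page = name[: -len(".jpg")].split("_")
    return library, ledger, page
```

Because the pattern only accepts the plain ASCII hyphen, a filename containing an en dash pasted from Word is rejected rather than silently accepted.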

In addition to the Books and Borrowing project I worked on a number of other projects this week.  I gave Matthew Creasy some further advice on using forums in his new project website, and the ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.

I also worked a bit more on using dates from the OED data in the Historical Thesaurus.  Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract these dates so that we could use them to update the dates associated with the lexemes in the HT.  I needed to extract the quotation dates as these have ‘ante’ and ‘circa’ notes, plus labels.  I noted that in addition to ‘a’ and ‘c’ a question mark is also used, sometimes with an ‘a’ or ‘c’ and sometimes without.  I decided to process things as follows:

  • ?a will just be ‘a’
  • ?c will just be ‘c’
  • ? without an ‘a’ or ‘c’ will be ‘c’.
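
The three rules above amount to a small mapping, which could be sketched in Python like this.  It is an illustration rather than the extraction script itself: the function name normalise_prefix is hypothetical and the pattern only covers simple values such as ‘?c1200’ or ‘1481’.

```python
import re

# Optional '?', optional 'a' or 'c', then a three or four digit year.
DATE_RE = re.compile(r"^(?P<q>\?)?(?P<ac>[ac])?(?P<year>\d{3,4})$")


def normalise_prefix(date_text):
    """Return (prefix, year), where prefix is 'a', 'c' or ''."""
    m = DATE_RE.match(date_text.strip())
    if m is None:
        return None  # ranges, OE labels etc. need separate handling
    if m["ac"]:
        prefix = m["ac"]      # '?a' becomes 'a' and '?c' becomes 'c'
    elif m["q"]:
        prefix = "c"          # a bare '?' is treated as circa
    else:
        prefix = ""
    return prefix, int(m["year"])
```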

I also noticed that a date may sometimes be a range (e.g. 1795-8) so I needed to include a second date column in my data structure to accommodate this.  I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary depending on the date: sometimes the content is ‘OE’ and other times ‘lOE’ or ‘eOE’.  I decided to process any OE dates for a lexeme as being 650 and to have only one OE date stored, so as to align with how OE dates are stored in the HT database (we don’t differentiate between dates for OE words).
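
To make the range and Old English handling concrete, here is a hedged Python sketch.  It assumes ranges are written with a plain hyphen and an abbreviated second year (as in ‘1795-8’), and that the only OE labels are the three mentioned above; the function name parse_date_value is hypothetical.

```python
OE_LABELS = {"OE", "eOE", "lOE"}
OE_YEAR = 650  # single stand-in year used for every Old English date


def parse_date_value(text):
    """Return (first_year, second_year_or_None) for a date string."""
    text = text.strip()
    if text in OE_LABELS:
        return OE_YEAR, None
    start, sep, end = text.partition("-")
    first = int(start)
    if not sep:
        return first, None
    if len(end) >= len(start):
        second = int(end)  # second year written out in full
    else:
        # Abbreviated second year: borrow the leading digits of the first.
        second = int(start[: len(start) - len(end)] + end)
    return first, second
```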

While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted.  This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table.  This will no doubt mean we’ll need to update our online OED table, but as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes would be retained without issue (hopefully), but any matching processes we performed would need to be done again for the new lexemes.

I set my extraction script running on the OED XML files on Wednesday and processing took a long time.  The script didn’t complete until sometime during Friday night, but after it had finished it had processed 238,699 categories, 754,285 lexemes, generating 3,893,341 date rows.  It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.

I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project.  The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category.  Variant spellings are also removed (these were all tagged with <form> and I removed all of these).  I also created a new field to store only the ‘NN_’ tagged words and remove all others.

The scripts generated three datasets, which I saved as spreadsheets for Fraser.  The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria. The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a postagged word fully matches the HT category heading.  The final (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a postagged word fully matches a lexeme in the HT category.  For this table I’ve also added in the full list of lexemes for each HT category too.

I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU and arranged for a PhD student to get access to the TextGrid files that were generated for the audio records for the SCOTS Corpus project.

Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data.  This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.

I have the data that was extracted from Thomas’s script last July as two XML files (dost.xml and snd.xml) and I looked through these for the examples Ann had sent.  The entry for snd13897 contains the following URLs:

<url>snd13897</url>

<url>snds3788</url>

<url>sndns2217</url>

The first is the ID for the main entry and the other two are child entries.  If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged.  But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content).  The pulling in of data into a webpage in the V3 site uses URLs stored in a table linked to entry IDs. These were generated from the URLs in the entries in the XML file (see the <url> tags above).  For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sndns2217.  But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.

The entry for dost16606 contains the following URLs:

<url>dost16606</url>

<url>dost50272</url>

(in addition to headword URLs).  Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content).  As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.

What we need to do is remove the child entries that still exist as separate entries in the data.  To do this I could write a script that would go through each entry in the dost.xml and snd.xml files.  It would then pick out every <url> that is not the same as the entry ID and search the file to see if any entry exists with this ID.  If it does then presumably this is a duplicate that should then be deleted.  I’m waiting to hear back from the DSL people to see how we should proceed with this.
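
The check described above could look something like the Python sketch below.  The structure of the DSL XML isn’t shown in this post, so the element name ‘entry’ and the attribute name ‘id’ are assumptions for illustration only, as is the function name find_duplicate_children; only the <url> tag comes from the examples above.

```python
import xml.etree.ElementTree as ET


def find_duplicate_children(xml_text, entry_tag="entry", id_attr="id"):
    """List (parent_id, child_id) pairs where a child still exists separately.

    entry_tag and id_attr are hypothetical names; adjust them to match
    the real dost.xml and snd.xml structure.
    """
    root = ET.fromstring(xml_text)
    entries = {e.get(id_attr): e for e in root.iter(entry_tag)}
    duplicates = []
    for entry_id, entry in entries.items():
        for url in entry.iter("url"):
            target = (url.text or "").strip()
            # A URL that differs from the entry's own ID but matches
            # another entry's ID points at an unmerged child entry.
            if target and target != entry_id and target in entries:
                duplicates.append((entry_id, target))
    return duplicates
```

Building the ID lookup once, rather than searching the whole file for every URL, keeps this workable on a file the size of snd.xml.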

As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.

Week Beginning 22nd June 2020

This was week 14 of Lockdown and I spent most of it continuing to work on the Books and Borrowing project.  Last week I’d planned to migrate the CMS from my test server at Glasgow to the official project server at Stirling, but during the process some discrepancies between PHP versions on the servers meant that the code which worked fine at Glasgow was giving errors at Stirling.  As mentioned in last week’s post, on the Stirling server calling a function while passing less than the required number of variables resulted in a fatal error, plus database ‘warnings’ (e.g. an empty string rather than a numeric zero being inserted into an integer field) were being treated as fatal errors too.  It took most of Monday to go through my scripts and identify all the places such issues cropped up, but by the end of the day I had the CMS set up and fully usable at Stirling and had asked the team to start using it.

I then spent some further time working on the public website for the project, installing a theme, working with fonts and colour schemes, selecting header images, adding logos to the footer and other such matters.  I made six different versions of the interface and emailed screenshots to the team for comment.  We all agreed on the interface and I then made some further tweaks to it, during which time team member Kit Baston was adding content to the pages.  On Thursday the website went live and you can access it here: https://borrowing.stir.ac.uk/.  Here’s a screenshot too:

I also continued to make improvements to the CMS this week, adding new functionality to the pages for browsing book editions, book works and authors.  The table of Book Works now includes a column listing the number of Holdings each Work is associated with and now includes the option of ordering the listed Works by any of the columns in the table.  When a book work row is expanded and its associated editions load in, this table also now features the number of holdings an edition is associated with and allows the table to be ordered by any of the columns.  I then made the number of holdings and records listed for each Work and Edition a link (so long as the number is greater than 0).  Pressing on the link brings up a popup that lists the holdings and records.  Each item in the list features an ‘eye’ icon and pressing on this will take you to the record in question (either in the library’s list of holdings or the page that the borrowing record appears on) with the page opening at the item in question.

I updated the ‘browse authors’ page in a similar way, adding in the option of ordering the table by any of the columns and adding in counts of associated works, editions, holdings and items, which are also now links that open up a popup containing all related items.  Each of these features an ‘eye’ icon and you can press on one of these to be taken to the record in question.  Holdings and Items will open in the corresponding library’s list of book holdings while works and editions will load the ‘Browse books’ page.  Linking to an edition was a bit tricky as editions are dynamically loaded into the page via JavaScript when a book work row is expanded.  I had to pass variables to the page that flagged that one work should be open on page load, triggered the loading in of the editions and then scrolled the page to the correct location once the editions had loaded.  If the edition has no work then the ‘no work specified’ section needs to open, which currently takes a long time due to there being 1911 such editions at present.  There isn’t currently a ‘loading’ icon or anything but things do load in the background and the page will eventually jump down to the correct place.  I also fixed a bug whereby if you disassociated a book holding from a record the edition and work autocompletes stopped working for that record.

On Friday I had a Zoom call with Project PI Katie Halsey and Co-I Matt Sangster to discuss my work on the project and to decide where I should focus my attention next.  We agreed that it would be good to get all of the sample data into the system now, so that the team can see what’s already there and begin the process of merging records and rationalising the data.  Therefore I’ll be spending a lot of next week writing import scripts for the remaining datasets.

I worked on a number of additional projects this week as well.  On Tuesday I had a Zoom call with Jane Stuart-Smith, Eleanor Lawson of QMU and Joanne Cleland of Strathclyde to discuss a new project that they’re putting together.  I can’t say too much about it at this stage, but I’ll probably be doing the technical work for the project, if it gets funding.  I also spoke with Thomas Clancy about another place-names project that has been funded, for which I’ll need to adapt my existing place-names system.  This will probably be starting in September and involves a part of East Ayrshire.  I also added in some forum software to Matthew Creasy’s new project website that I recently put together for him.  He’s hoping to launch this next week so I’ll probably add in a link to it then.

I also managed to spend some time this week looking into the Historical Thesaurus’s new dates system.  My scripts to generate the new HT date structure completed over the weekend and I then had to manually fix the 60 or so label errors that Fraser had previously identified in his spreadsheet.  I then wrote a further script to check that the original fulldate, the new fulldate and a fulldate generated on the fly from the new date table all matched for each lexeme.  This brought up about a thousand lexemes where the match wasn’t identical.  Most of these were due to ‘b’ dates not being recorded in a consistent manner in the original data (sometimes two digits e.g. 1781/86 and sometimes one digit e.g. 1781/6).  There were some other issues with dates that had both labels and slashes as connectors, whereby the label ended up associated with both dates rather than just one.  There were also some issues with bracketed dates sometimes being recorded with the brackets and sometimes not, plus a few that had a dash before the date instead.  I went through the 1000 or so rows and fixed the ones that actually needed fixing (maybe about 50).  I then imported the new lexeme_dates table into the online database.  There are 1,381,772 rows in it.  I also attempted to import the updated lexeme database (which includes a new fulldate column plus new firstdate and lastdate fields).  Unfortunately the file contains too much data to be uploaded and the process timed out.  I contacted Arts IT Support and they managed to increase the execution time on the server and I was then able to get this second table uploaded too.

Fraser had sent around a document listing the next steps in the data update process and I read through this and began to think things through.  Fraser noted that the unique date types list didn’t appear to include ‘a’ and ‘c’ for firstdates.  I checked my script that generated the date types (way back in April last year) and spotted an error – the script was looking for a column called ‘oefirstdac’ where it should have been looking for ‘firstdac’.  What this means is any lexeme that has an ‘a’ or ‘c’ with its first date has been rolled into the count for regular first dates, but it turns out that this is what Fraser wanted to happen anyway, so no harm was done there.

Before I can make a start on getting all HT lexemes that are XXXX-XXXX, OE-XXXX and XXXX-Current and are matched to an OED lexeme and grabbing the OED date information I’ll need to find a way to actually get the new OED date information.  Fraser noted that we can’t just use the OED ‘sortdate’ and ‘enddate’ fields but instead need to use the first and last citation dates as these have ‘a’ and ‘c’.  I’m going to need to get access to the most recent version of all of the OED XML files and to write a script that goes through all of the quotations data, such as:

<quotations><q year="1200"><date>?c1200</date></q><q year="1392"><date>a1393</date></q><q year="1450"><date>c1450</date></q><q year="1481"><date>1481</date></q><q year="1520"><date>?1520</date></q><q year="1530"><date>1530</date></q><q year="1556"><date>1556</date></q><q year="1608"><date>1608</date></q><q year="1647"><date>1647</date></q><q year="1690"><date>1690</date></q><q year="1709"><date>1709</date></q><q year="1728"><date>1728</date></q><q year="1755"><date>1755</date></q><q year="1804"><date>1804</date></q><q year="1882"><date>1882</date></q><q year="1967"><date>1967</date></q><q year="2007"><date>2007</date></q></quotations>

And then picks out the first date and the last date, plus any ‘a’, ‘c’ and ‘?’ value.  This is going to be another long process, but I can’t begin it until I can get my hands on the full OED dataset, which I don’t have with me at home.
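
As a sketch of what such a script might do for a single <quotations> block, the Python below picks out the earliest and latest quotation and records any ‘?’, ‘a’ or ‘c’ marker.  It is an illustration only: it assumes every <q> element has a numeric ‘year’ attribute and that straight quotes are used in the XML, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Optional '?' followed by an optional 'a' or 'c' at the start of the date.
PREFIX_RE = re.compile(r"^(\?)?([ac])?")


def describe_quotation(q):
    """Summarise one <q> element as a small dictionary."""
    text = (q.findtext("date") or "").strip()
    m = PREFIX_RE.match(text)
    return {
        "year": int(q.get("year")),        # assumes the attribute is present
        "text": text,
        "question": m.group(1) is not None,
        "ac": m.group(2),                  # 'a', 'c' or None
    }


def first_and_last_dates(quotations_xml):
    """Return (first, last) quotation summaries, ordered by 'year'."""
    root = ET.fromstring(quotations_xml)
    dates = [describe_quotation(q) for q in root.iter("q")]
    if not dates:
        return None
    first = min(dates, key=lambda d: d["year"])
    last = max(dates, key=lambda d: d["year"])
    return first, last
```

Note that the ‘year’ attribute and the visible date can differ (the sample above has year 1392 for ‘a1393’), so which of the two should be stored is a decision the real script would need to make.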

 

Week Beginning 1st June 2020

During week 11 of Lockdown I continued to work on the Books and Borrowing project, but also spent a fair amount of time catching up with other projects that I’d had to put to one side due to the development of the Books and Borrowing content management system.  This included reading through the proposal documentation for Jennifer Smith’s follow-on funding application for SCOSYA, writing a new version of the Data Management Plan based on this updated documentation and making some changes to the ‘export data for print publication’ facility for Carole Hough’s REELS project.  I also spent some time creating a new export facility to format the place-name elements and any associated place-names for print publication too.

During this week a number of SSL certificates expired for a bunch of websites, which meant browsers were displaying scary warning messages when people visited the sites.  I had to spend a bit of time tracking these down and passing the details over to Arts IT Support for them to fix as it is not something I have access rights to do myself.  I also liaised with Mike Black to migrate some websites over from the server that houses many project websites to a new server.  This is because the old server is running out of space and is getting rather temperamental and freeing up some space should address the issue.

I also made some further tweaks to Paul Malgrati’s interactive map of Burns’ Suppers and created a new WordPress-powered project website for Matthew Creasy’s new ‘Scottish Cosmopolitanism at the Fin de Siècle’ project.  This included the usual choosing a theme, colour schemes and fonts, adding in header images and footer logos and creating initial versions of the main pages of the site.  I’d also received a query from Jane Stuart-Smith about the audio recordings in the SCOTS Corpus so I did a bit of investigation about that.

Fraser Dallachy had got back to me with some further tasks for me to carry out on the processing of dates for the Historical Thesaurus, and I had intended to spend some time on this towards the end of the week, but when I began to look into this I realised that the scripts I’d written to process the old HT dates (comprising 23 different fields) and to generate the new, streamlined date system that uses a related table with just 6 fields were sitting on my PC in my office at work.  Usually all the scripts I work on are located on a server, meaning I can easily access them from anywhere by connecting to the server and downloading them.  However, sometimes I can’t run the scripts on the server as they may need to be left running for hours (or sometimes days) if they’re processing large amounts of data or performing intensive tasks on the data.  In these cases the scripts run directly on my office PC, and this was the situation with the dates script.  I realised I would need to get into my office at work to retrieve the scripts, so I put in a request to be allowed into work.  Staff are not currently allowed to just go into work.  Instead you need to get approval from your Head of School and then arrange a time that suits security.  Thankfully it looks like I’ll be able to go in early next week.

Other than these issues, I spent my time continuing to work for the Books and Borrowing project.  On Tuesday we had a Zoom call with all six members of the core project team, during which I demonstrated the CMS as it currently stands.  This gave me an opportunity to demonstrate the new Author association facilities I had created last week.  The demonstration all went very smoothly and I think the team are happy with how the system works, although no doubt once they actually begin to use it there will be bugs to fix and workflows to tweak.  I also spent some time before the meeting testing the system again, and fixing some issues that were not quite right with the author system.

I spent the remainder of my time on the project completing work on the facility to add, edit and view book holding records directly via the library page, as opposed to doing so whilst adding / editing a borrowing record.  I also implemented a similar facility for borrowers.  Next week I will begin to import some of the sample data from various libraries into the system and will allow the team to access the system to test it out.

Week Beginning 25th May 2020

We’ve now reached week 10 of Lockdown, and I spent it in much the same way as previous weeks, dividing my time between work and homeschooling my son.  This week I continued to focus on the development of the content management system for the Books and Borrowing project.  On Tuesday I had a Zoom meeting to demonstrate the system as it currently stands to the project PI Katie Halsey and Co-I Matt Sangster.  Monday was a bank holiday but I decided to work it and take the day off at a later date in order to prepare a walkthrough and undertake a detailed testing of the system, which uncovered a number of bugs that I then tracked down and fixed.  My walkthrough went through all of the features that are so far in place:  creating, editing and deleting libraries, viewing libraries, adding ledgers and additional fields to libraries, viewing, editing and deleting these ledgers and additional fields, adding pages to ledgers, editing and deleting them, viewing a page, the automated approach to constructing navigation between pages, viewing records on pages and then the big thing: adding and editing borrowing records.  This latter process can involve adding data about the borrowing (e.g. lending date), one or more borrowers (which may be new borrowers or ones already in the system), a new or existing book holding, which may consist of one or more book items (e.g. volumes 1 and 3 of a book) and may be connected to one or more new or existing project-wide book edition records which may have a new or existing top-level book work record.

The walkthrough via Zoom went well, with me sharing my screen with Katie and Matt so they could follow my actions as I used the CMS.  I was a bit worried they would think the add / edit borrowing record form would be too complicated but although it does look rather intimidating, most of the information is optional and many parts of it will be automatically populated by linking to existing records via autocomplete drop-downs, so once there is a critical mass of existing data in the system (e.g. existing book and borrower records) the process of adding new borrowing records will be much quicker and easier.

The only major change that I needed to make following the walkthrough was to add a new ‘publication end date’ field to book edition and book work records as some books are published in parts over multiple years (especially books comprised of multiple volumes).  I implemented this after the meeting and then spent most of the remainder of the week continuing to implement further aspects of the CMS.  I made a start on the facility to view a list of all book holding records that have been created for a library, through which the project team will be able to bring up a list of all borrowing records that involve the book.  I got as far as getting a table listing the book holdings in place, but as the project team will be starting next week I figured it would make more sense to try and tackle the last major part of the system that still needed to be implemented:  creating and associating author records with the four levels of book record.

A book may have any number of authors and their associations with a book record cascade down through the levels.  For example, if an author is associated with a book via its top-level ‘book work’ record then the author will automatically be associated with a related ‘book edition’ record, any ‘book holding’ records this edition is connected to and any ‘book item’ records belonging to the book holding.  But we need to be able to associate an author not just with ‘book works’ but with any level of book record, as a book may have a different author at one of these levels (e.g. a particular volume may be attributed to a different author) or the same author may be referred to by a different alias in a particular edition.  Therefore I had to update the already complicated add / edit borrowing record form to enable authors to be created, associated and disassociated with any book level.  Plus I needed to add in an autocomplete facility to enable authors already in the system to be attached to records and to ensure that the author sections clear and reset themselves if the user removes the book from the borrowing record.  It took a long time to implement this system, but by the end of the week I’d got an initial version working.  It will need a lot of testing and no doubt some fixing next week, but it’s a relief to get this major part of the system in place.  I also added in a little feature that keeps the user’s CMS session going for as long as the browser is on a page of the CMS, which is very important as the complicated forms may take a long time to complete and it would be horrible if the sessions timed out before the user was able to submit the form.
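
To illustrate how this kind of cascade could work, here is a minimal Python sketch.  It is not the CMS code: the BookRecord class, its fields and the effective_authors function are hypothetical names made up for the example, each record is assumed to have a single parent (whereas a real holding may link to several editions), and it assumes that authors attached directly to a lower level take precedence over inherited ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BookRecord:
    """One level of book record: work, edition, holding or item (hypothetical model)."""
    level: str
    title: str
    parent: Optional["BookRecord"] = None
    # Authors attached directly at this level; an empty list means "inherit from above".
    authors: List[str] = field(default_factory=list)


def effective_authors(record: BookRecord) -> List[str]:
    """Walk up the levels and return the first directly attached authors found."""
    current: Optional[BookRecord] = record
    while current is not None:
        if current.authors:
            return list(current.authors)
        current = current.parent
    return []


# Example data (invented): the author is attached to the work only,
# except for volume 3, which is attributed to someone else.
work = BookRecord("work", "Example Work", authors=["Author A"])
edition = BookRecord("edition", "Example Work, 2nd edition", parent=work)
holding = BookRecord("holding", "Example Work (library copy)", parent=edition)
volume_1 = BookRecord("item", "Volume 1", parent=holding)
volume_3 = BookRecord("item", "Volume 3", parent=holding, authors=["Author B"])
```

With this data, effective_authors(volume_1) falls back to the author on the work, while effective_authors(volume_3) returns the author attached to that volume.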

I didn’t have time to do much else this week.  I was supposed to have a Zoom call about the Historical Thesaurus on Friday but this has been postponed as we’re all pretty busy with other things.  One of the servers that hosts a lot of project websites has been experiencing difficulties this week so I had to deal with emails from staff about this and to contact Arts IT Support to ask them to fix things as it’s not something I have access to myself.  The server appears to be down again as I’m writing this, unfortunately.

The interactive map I’d created for Gerry McKeever’s Regional Romanticism project was launched this week, and can now be accessed here: https://regionalromanticism.glasgow.ac.uk/paul-jones/ but be aware that this is one of the sites currently affected by the server issue so the map, or parts of the site in general, may be unavailable.

Next week the project team for the Books and Borrowing project start work and I will be giving them a demonstration of the CMS on Tuesday, so no doubt I will be spending a lot of time continuing to work on this then.

Week Beginning 18th May 2020

I spent week 9 of Lockdown continuing to implement the content management system for the Books and Borrowing project.  I was originally hoping to have completed an initial version of the system by the end of this week, but this was unfortunately not possible due to having to juggle work and home-schooling, commitments to other projects and the complexity of the project’s data.  It took several days to complete the scripts for uploading a new borrowing record due to the interrelated nature of the data structure.  A borrowing record can be associated with one or more borrowers, and each of these may be new borrower records or existing ones, meaning data needs to be pulled in via an autocomplete to prepopulate the section of the form.  Books can also be new or existing records but can also have one or more new or existing book item records (as a book may have multiple volumes) and may be linked to one or more project-wide book edition records which may already exist or may need to be created as part of the upload process, and each of these may be associated with a new or existing top-level book work record.  Therefore the script for uploading a new borrowing record needs to incorporate the ‘add’ and ‘edit’ functionality for a lot of associated data as well.  However, as I have implemented all of these aspects of the system now it will make it quicker and easier to develop the dedicated pages for adding and editing borrowers and the various book levels once I move onto this.  I still haven’t worked on the facilities to add in book authors, genres or borrower occupations, which I intend to move onto once the main parts of the system are in place.
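
The ‘new or existing’ pattern that runs through all of these record types can be sketched in a few lines of Python.  This is only an illustration under assumptions, not the actual upload script: resolve_record, the in-memory store dictionary and the "id" key are hypothetical stand-ins for the database tables and the hidden form fields an autocomplete would fill in.

```python
def resolve_record(store, submitted):
    """Return the id of an existing record, or create a new one.

    store     -- dict mapping record id to a dict of fields (stand-in for a table)
    submitted -- dict of form fields; a non-empty "id" means an existing record
                 was picked via autocomplete (hypothetical convention)
    """
    record_id = submitted.get("id")
    if record_id is not None:
        if record_id not in store:
            raise KeyError(f"No record with id {record_id}")
        return record_id
    # No id supplied, so treat the submitted fields as a brand new record.
    new_id = max(store, default=0) + 1
    store[new_id] = {k: v for k, v in submitted.items() if k != "id"}
    return new_id


# Example: one borrower already exists, the other is created on upload.
borrowers = {1: {"name": "Existing Borrower"}}
borrower_ids = [
    resolve_record(borrowers, {"id": 1}),
    resolve_record(borrowers, {"id": None, "name": "New Borrower"}),
]
```

The same helper could in principle be applied in turn to book holdings, items, editions and works before the borrowing record itself is saved with the resulting ids.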

After completing the scripts for processing the display of the ‘add borrowing’ form and the storing of all of the uploaded data I moved onto the script for viewing all of the borrowing records on a page.  Due to the huge number of potential fields I’ve had to experiment with various layouts, but I think I’ve got one that works pretty well, which displays all of the data about each record in a table split into four main columns (Borrowing, Borrower, Book Holding / Items, Book Edition / Works).  I’ve also added in a facility to delete a record from the page.  I then moved on to the facility to edit a borrowing record, which I’ve added to the ‘view’ page rather than linking out to a separate page.  When the ‘edit’ button is pressed for a record its row in the table is replaced with the ‘edit’ form, which is identical in style and functionality to the ‘add’ form, but is prepopulated with all of the record’s data.  As with the ‘add’ form, it’s possible to associate multiple borrowers and book items and editions, and also to manage the existing associations using this script.  The processing of the form uses the same logic as the ‘add’ script so thankfully didn’t require much time to implement.

What I still need to do is add authors and borrower occupations to the ‘view page’, ‘add record’ and ‘edit record’ facilities, add the options to view / edit / add / delete a library’s book holdings and borrowers independently of the borrowing records, plus facilities to manage book editions / works, authors, genres and occupations at the top level as opposed to when working on a record.  I also still need to add in the facilities to view / zoom / pan a page image and add in facilities to manage borrower cross-references.  This is clearly quite a lot, but the core facilities of adding, editing and deleting borrowing, borrower and book records are now in place, which I’m happy about.  Next week I’ll continue to work on the system ahead of the project’s official start date at the beginning of June.

Also this week I made a few tweaks to the interface for the Place-names of Mull and Ulva project, spoke to Matthew Creasy some more about the website for his new project, spoke to Jennifer Smith about the follow-on funding proposal for the SCOSYA project and investigated an issue that was affecting the server that hosts several project websites (basically it turned out that the server had run out of disk space).

I also spent some time working on scripts to process data from the OED for the Historical Thesaurus.  Fraser is working on incorporating new dates from the OED and needs to work out which dates in the HT data we want to replace and which should be retained.  The script makes groups of all of the distinct lexemes in the OED data.  If the group has two or more lexemes it then checks that at least one of them is revised.  It then makes subgroups of all of the lexemes that have the same date (so for example all the ‘Strike’ words with the same ‘sortdate’ and ‘lastdate’ are grouped together).  If one word in the whole group is ‘revised’ and at least two words have the same date then the words with the same dates are displayed in the table.  The script also checks for matches in the HT lexemes (based on catid, refentry, refid and lemmaid fields).  If there is a match this data is also displayed.  I then further refined the output based on feedback from Fraser, firstly highlighting in green those rows where at least two of the HT dates match, and secondly splitting the table into three separate tables, one with the green rows, one containing all other OED lexemes that have a matching HT lexeme and a third containing OED lexemes that (as yet) do not have a matching HT lexeme.
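
The grouping logic described above can be sketched roughly as follows in Python.  This is an illustration rather than the script itself: the ‘sortdate’ and ‘lastdate’ fields are the ones mentioned above, but the "lemma", "revised" and "id" keys, the rows_to_display function and the sample rows are all assumptions invented for the example, and the HT matching and green highlighting steps are left out.

```python
from collections import defaultdict


def rows_to_display(oed_rows):
    """Return OED rows that share a date with another row of the same lexeme.

    A lexeme group only qualifies if it has two or more rows and at least
    one of them is marked as revised.
    """
    by_lexeme = defaultdict(list)
    for row in oed_rows:
        by_lexeme[row["lemma"]].append(row)

    selected = []
    for rows in by_lexeme.values():
        if len(rows) < 2 or not any(r["revised"] for r in rows):
            continue
        # Subgroup the rows of this lexeme by their (sortdate, lastdate) pair.
        by_date = defaultdict(list)
        for r in rows:
            by_date[(r["sortdate"], r["lastdate"])].append(r)
        for same_date in by_date.values():
            if len(same_date) >= 2:
                selected.extend(same_date)
    return selected


# Invented sample rows: two 'strike' rows share a date and one is revised,
# so they qualify; the 'hit' group has no revised row and is skipped.
sample = [
    {"id": 1, "lemma": "strike", "sortdate": 1300, "lastdate": 1500, "revised": True},
    {"id": 2, "lemma": "strike", "sortdate": 1300, "lastdate": 1500, "revised": False},
    {"id": 3, "lemma": "strike", "sortdate": 1600, "lastdate": 1700, "revised": False},
    {"id": 4, "lemma": "hit", "sortdate": 1300, "lastdate": 1500, "revised": False},
    {"id": 5, "lemma": "hit", "sortdate": 1300, "lastdate": 1500, "revised": False},
]
selected_ids = [row["id"] for row in rows_to_display(sample)]
```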