Week Beginning 29th June 2020

This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day.  I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects.  I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records.  This week I began by focussing on the data from St Andrews University library.

The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting.  However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.

The data I’ve got consists of CSV and HTML representations of transcribed pages from an existing website, with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100.  The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers.  Unfortunately the CSV version of the data doesn’t include the links or the linked-to data, and as I wanted to pull in the data found on the linked pages I needed to process the HTML instead.

I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn.  From the filenames my script could ascertain the ledger volume, its dates and the page number.  For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10.  The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
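To give a rough idea of how this works, here’s a minimal Python sketch of the filename parsing, assuming the filenames always follow the pattern in the example above; the regular expression and field names are illustrative rather than taken from the actual script.

import re

# Illustrative pattern for filenames like
# 'Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html':
# shelfmark, ledger number, date range and page number.
FILENAME_PATTERN = re.compile(
    r'^Page_(?P<shelfmark>[A-Z0-9]+)_(?P<ledger>\d+)_Receipt_Book_'
    r'(?P<start_year>\d{4})-(?P<end_year>\d{4})\.djvu_(?P<page>\d+)\.html$'
)

def parse_filename(filename):
    """Work out the ledger volume, its dates and the page number from a filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return {
        'shelfmark': match.group('shelfmark'),
        'ledger': int(match.group('ledger')),
        'start_year': int(match.group('start_year')),
        'end_year': int(match.group('end_year')),
        'page': int(match.group('page')),
    }

print(parse_filename('Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html'))
# {'shelfmark': 'UYLY205', 'ledger': 2, 'start_year': 1748, 'end_year': 1753, 'page': 10}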

The actual data in the file posed further problems.  As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system.  Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows).  I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically.  The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’).  I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.

I used this approach to set up books and borrowers too.  If you look again at the page linked to above you’ll see that the links are not categorised: some lead to books and others to borrowers, with no obvious way to know which is which.  However, it’s pretty much always the case that it’s a book that is linked to in the row with the code and it’s people that are linked to in the other rows.  I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code.  There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.
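As a rough illustration of this logic, here’s a minimal Python sketch, assuming each page has already been parsed into simple row structures with the code, the row text and any links picked out; the structure and the names are hypothetical rather than those of the actual import script.

def group_rows_into_records(rows):
    """Group parsed page rows into borrowing records.

    Each row is assumed to be a dict like
    {'code': 'J.5.2', 'text': 'Rollins belles Lettres Vol 2d', 'links': [...]},
    where 'code' holds whatever appeared in the second column (empty if nothing did).
    """
    records = []
    current = None
    for row in rows:
        if row['code']:
            # A shelf-number code marks the start of a new borrowing record,
            # and links in this row are taken to be books.
            current = {
                'code': row['code'],
                'transcription': row['text'],
                'book_links': list(row['links']),
                'borrower_links': [],
            }
            records.append(current)
        elif current is not None:
            # Rows without a code continue the previous record: their text is
            # appended to the transcription and their links treated as borrowers.
            current['transcription'] += ' ' + row['text']
            current['borrower_links'].extend(row['links'])
    return records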

The next thing I needed to do was to figure out where each piece of data from the St Andrews files should be stored in our system.  I created four new ‘Additional Fields’ for St Andrews as follows:

  • Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
  • Code: This contains the full text of the second column (e.g. J.5.2)
  • Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
  • Original returned text: This contains the full text of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)

In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links.  Where subsequent rows contain data in this column but no code, this data is then appended to the transcription.  E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.

The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required.  Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes.  E.g. for the above the notes field contains:

‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit &amp; au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. [1745]) <a href="http://library.st-andrews.ac.uk/record=b2447402~S1">http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>

----- profr_Shaws| <p><a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>

----- James_Key| <p>Possibly James Kay: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>

-----’

As mentioned earlier, the script also generates book and borrower records based on the linked pages.  I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews.  In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with spaces (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field.  One book item is created for each holding, to be used to link to the corresponding borrowing records.

For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. ‘<p>Possibly Thomas Duncan: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’).  All borrowers are linked to their borrowing records as ‘Main’ borrowers.

During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table.  I therefore updated my script to check for the existence of this heading row; if it exists the script grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page.  After my script had finished running we had 11,147 borrowing records, 996 borrowers and 6,395 book holding records for St Andrews in the system.
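As a small illustration of this extra check, building on the hypothetical row structures sketched earlier:

def page_borrower(rows):
    """Return the borrower named in a page-level heading row (ledger 4 style),
    or None if the page has no such row."""
    for row in rows:
        if row.get('is_heading'):        # hypothetical flag set when parsing the HTML
            return row['text'].strip()
    return None

def apply_page_borrower(records, rows):
    """If the page names a single borrower in a heading row, link that borrower
    to every borrowing record on the page."""
    name = page_borrower(rows)
    if name is not None:
        for record in records:
            record['borrower_links'] = [name]
    return records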

I then moved on to looking at the data for Selkirk library.  This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books, and with borrowers and books connected to borrowings via unique identifiers.  Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records; these will need to be created manually.  The import script took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.

As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records).  There are 612 Holding records for Selkirk.  I also processed the borrower records, resulting in 86 borrower records being added.  I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’, and the only other additional field is a ‘Subject’ field in the Holding records, which will eventually be merged with our ‘Genre’ classification when that feature becomes available.

Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project.  I recommended that the filenames shouldn’t contain spaces, as these can be troublesome on some operating systems, and that we’d want a delimiter character between the parts of the filename that doesn’t appear anywhere else in the filename, so that the filename is easy to split up.  I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included.  Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.

Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’), but if we’re using a specific delimiting character to separate the parts of the filename then these extra characters wouldn’t be necessary.  I also suggested it would be better not to use ‘L’, as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’, which might confuse future human users.

Instead I suggested using a ‘-’ in place of spaces and a ‘_’ as the delimiter, and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filenames: no apostrophes, commas, colons, semi-colons, ampersands and so on.  We should also make sure the ‘-’ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office.  This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.

Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case.  Following these guidelines the filenames would look something like this:

  • dumfries-presbytery_2_3v.jpg
  • standrews-ul_9_300r.jpg
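A quick sanity check of image filenames against these rules could look something like the following Python sketch; the pattern is just my reading of the convention above (lower case, hyphens instead of spaces, underscores as the delimiter between library, ledger and page) rather than anything final.

import re

# library_ledger_page.jpg: all lower case, hyphens instead of spaces,
# underscores used only as the delimiter between the three parts.
FILENAME_RULE = re.compile(r'^[a-z0-9-]+_[a-z0-9-]+_[0-9]+[rv]?\.jpg$')

def filename_ok(filename):
    """Return True if an image filename follows the agreed convention."""
    return FILENAME_RULE.match(filename) is not None

print(filename_ok('dumfries-presbytery_2_3v.jpg'))   # True
print(filename_ok('St Andrews UL 9 300r.jpg'))       # False: spaces and upper case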

In addition to the Books and Borrowing project I worked on a number of other projects this week.  I gave Matthew Creasy some further advice on using forums in his new project website, and the ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.

I also worked a bit more on using dates from the OED data in the Historical Thesaurus.  Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract the dates so that we could use them to update the dates associated with the lexemes in the HT.  I needed to extract the quotation dates as these have ‘ante’ and ‘circa’ notes, plus labels.  I noted that in addition to ‘a’ and ‘c’ a question mark is also used, sometimes with an ‘a’ or ‘c’ and sometimes without.  I decided to process these as follows:

  • ?a will just be ‘a’
  • ?c will just be ‘c’
  • ? without an ‘a’ or ‘c’ will be ‘c’.

I also noticed that a date may sometimes be a range (e.g. 1795-8), so I needed to include a second date column in my data structure to accommodate this.  I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary depending on the date – sometimes the content is ‘OE’ and other times ‘lOE’ or ‘eOE’.  I decided to process any OE dates for a lexeme as being 650 and to have only one OE date stored, so as to align with how OE dates are stored in the HT database (we don’t differentiate between dates for OE words).
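A minimal sketch of this prefix, range and OE handling might look like the following, assuming the raw quotation date strings come in forms like ‘?a1500’, ‘c1450’, ‘1795-8’ and ‘lOE’; it’s illustrative rather than the actual extraction script.

import re

OE_DATE = 650  # all Old English dates are stored as 650, matching the HT database

def normalise_oed_date(raw):
    """Return (qualifier, year, end_year) for a raw OED quotation date string.

    '?a' becomes 'a', '?c' becomes 'c' and a bare '?' becomes 'c'; 'OE', 'lOE'
    and 'eOE' are all treated as a single OE date of 650; ranges like '1795-8'
    produce a second year.
    """
    raw = raw.strip()
    if raw in ('OE', 'lOE', 'eOE'):
        return ('', OE_DATE, None)

    qualifier = ''
    if raw.startswith('?a') or raw.startswith('a'):
        qualifier = 'a'
    elif raw.startswith('?c') or raw.startswith('c') or raw.startswith('?'):
        qualifier = 'c'
    raw = raw.lstrip('?ac')

    match = re.match(r'^(\d{3,4})(?:-(\d{1,4}))?$', raw)
    if not match:
        return (qualifier, None, None)
    year = int(match.group(1))
    end = match.group(2)
    # Expand abbreviated range endings, e.g. '1795-8' -> 1798.
    end_year = int(str(year)[:-len(end)] + end) if end else None
    return (qualifier, year, end_year)

print(normalise_oed_date('?a1500'))  # ('a', 1500, None)
print(normalise_oed_date('1795-8'))  # ('', 1795, 1798)
print(normalise_oed_date('lOE'))     # ('', 650, None)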

While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted.  This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table, and we’ll no doubt need to update that table at some point.  However, as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields, if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes should (hopefully) be retained without issue, although any matching processes we performed would need to be run again for the new lexemes.

I set my extraction script running on the OED XML files on Wednesday and processing took a long time.  The script didn’t complete until sometime during Friday night, but by the time it had finished it had processed 238,699 categories and 754,285 lexemes, generating 3,893,341 date rows.  It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.

I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project.  The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category.  Variant spellings are also removed (these were all tagged with <form>, so I stripped those out).  I also created a new field that stores only the ‘NN_’ tagged words and discards all others.
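As a rough sketch of the kind of filtering involved (the ‘word_TAG’ token format and the tag values here are assumptions rather than the actual data):

import re

def strip_forms(entry_xml):
    """Remove variant spellings, which are tagged with <form>...</form>."""
    return re.sub(r'<form>.*?</form>', '', entry_xml, flags=re.DOTALL)

def keep_nn_words(postagged):
    """Keep only the NN-tagged words from a POS-tagged string, e.g.
    'the_AT stane_NN1 of_IO destiny_NN1' -> 'stane destiny'."""
    kept = []
    for token in postagged.split():
        if '_' in token:
            word, tag = token.rsplit('_', 1)
            if tag.startswith('NN'):
                kept.append(word)
    return ' '.join(kept)

print(keep_nn_words('the_AT stane_NN1 of_IO destiny_NN1'))  # stane destiny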

The scripts generated three datasets, which I saved as spreadsheets for Fraser.  The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria.  The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a POS-tagged word fully matches the HT category heading.  The final one (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a POS-tagged word fully matches a lexeme in the HT category.  For this table I’ve also added in the full list of lexemes for each HT category.

I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU, and arranged for a PhD student to get access to the TextGrid files that were generated for the audio recordings for the SCOTS Corpus project.

Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data.  This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.

I have the data that Thomas’s script extracted last July as two XML files (dost.xml and snd.xml), and I looked through these for the examples Ann had sent.  The entry for snd13897 contains the following URLs:

<url>snd13897</url>

<url>snds3788</url>

<url>sndns2217</url>

The first is the ID for the main entry and the other two are child entries.  If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged.  But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content).  When the V3 site pulls entry data into a page it uses URLs stored in a table linked to entry IDs; these were generated from the URLs in the entries in the XML file (see the <url> tags above).  For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sndns2217.  But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.

The entry for dost16606 contains the following URLs:

<url>dost16606</url>

<url>dost50272</url>

(in addition to headword URLs).  Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content).  As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.

What we need to do is remove the child entries that still exist as separate entries in the data.  To do this I could write a script that would go through each entry in the dost.xml and snd.xml files.  It would then pick out every <url> that is not the same as the entry ID and search the file to see if any entry exists with this ID.  If it does then presumably this is a duplicate that should be deleted.  I’m waiting to hear back from the DSL people to see how we should proceed with this.
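Something along the following lines would do it, though this is just a sketch: the element and attribute names are guesses at the structure of dost.xml and snd.xml and would need checking against the real files.

import xml.etree.ElementTree as ET

def find_unmerged_children(xml_file):
    """Find child entries that still exist as separate entries in the file.

    For each entry, every <url> that differs from the entry's own ID is checked
    against the full set of entry IDs; if a separate entry exists with that ID
    it is presumably an unmerged duplicate that should be deleted.
    """
    tree = ET.parse(xml_file)
    entries = tree.getroot().findall('.//entry')     # guessed element name
    ids = {entry.get('id') for entry in entries}     # guessed attribute name
    duplicates = []
    for entry in entries:
        entry_id = entry.get('id')
        for url in entry.findall('.//url'):
            if url.text and url.text != entry_id and url.text in ids:
                duplicates.append((entry_id, url.text))
    return duplicates

# e.g. find_unmerged_children('snd.xml') might include ('snd13897', 'sndns2217')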

As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.