This was another busy week involving lots of projects. For the Books and Borrowing project I wrote an import script to process the Glasgow Professors borrowing records, comprising more than 7,000 rows in a spreadsheet. It was tricky to integrate this with the rest of the project’s data and it took about a day to write the necessary processing scripts. I can only run the scripts on the real data in the evening, as I need to take the CMS offline to do so; otherwise changes made to the database whilst I’m integrating the data would be lost. Unfortunately it took three attempts to get the import to work properly. There are a few reasons why this data has been particularly tricky. Firstly, it needed to be integrated with existing Glasgow data, rather than being a ‘fresh’ upload to a new library. This caused some problems, as my scripts that match up borrowing records and borrowers were getting confused with the existing Student borrowers. Secondly, the spreadsheet was not in page order for each register: the order appears to have been ‘10r’, ‘10v’, then ‘11r’ etc., and after ‘19v’ came ‘1r’. This is presumably to do with Excel ordering numbers as text. I tried reordering on the ‘sort order’ column but this also ordered things oddly (all the numbers beginning with 1, then all the numbers beginning with 2 etc.). I tried changing the data type of this field to a number rather than text, but that just resulted in Excel giving errors in all of the fields. This meant I needed to sort the data in my own script before I could use it (otherwise the ‘next’ and ‘previous’ page links would all have been wrong), and it took time to implement this. However, I got there in the end.
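The fix boils down to a natural sort: split each page label into its numeric and letter parts and sort on those rather than on the text. My actual script is PHP, but a minimal sketch in Python (with `folio_key` as an illustrative helper) looks like this:

```python
import re

def folio_key(label):
    """Split a folio label like '10r' into (10, 'r') so pages sort
    numerically rather than as text ('1r' before '10r', not after '19v')."""
    match = re.match(r"(\d+)([a-z]*)", label)
    return (int(match.group(1)), match.group(2))

# The order the spreadsheet arrived in:
pages = ["10r", "10v", "11r", "19v", "1r", "1v", "2r"]
pages.sort(key=folio_key)
print(pages)  # ['1r', '1v', '2r', '10r', '10v', '11r', '19v']
```

With the pages in the correct order, generating the ‘next’ and ‘previous’ links is just a matter of walking the sorted list.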
I also continued working on the Historical Thesaurus database and front-end to allow us to use the new date fields and to keep track of which lexemes and categories have been updated in the new Second Edition. I have now fully migrated my second edition test site to the new date system, including the advanced search for labels and both the ‘simple’ and ‘advanced’ date search. I have also created the database structure for dealing with second edition updates. As we agreed at the last team meeting, the lexeme and category tables have each been given two new fields: ‘last_updated’, which holds a human-readable date (YYYY-MM-DD) that is automatically populated when a row is updated, and ‘changelogcode’, which holds the ID of the row in the new ‘changelog’ table that applies to the lexeme or category. This new table consists of an ID, a ‘type’ (lexeme or category) and the text of the changelog. I’ve created two changelogs for test purposes: ‘This word was antedated in the second edition’ and ‘This word was postdated in the second edition’. I’ve realised that this structure means only one changelog can be associated with a lexeme, with a new one overwriting the old one. A more robust system would record all of the changelogs that have been applied to a lexeme or category and the dates on which these were applied, and depending on what Marc and Fraser think I may update the system with an extra joining table that would allow this paper trail to be recorded.
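To make the trade-off concrete, the joining-table version could look something like the following. This is a SQLite sketch with illustrative table and column names, not the HT’s actual schema:

```python
import sqlite3

# Hypothetical sketch: a joining table lets one lexeme accumulate
# multiple changelog entries over time, instead of a single
# 'changelogcode' column being overwritten with each change.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE changelog (
    id INTEGER PRIMARY KEY,
    type TEXT,            -- 'lexeme' or 'category'
    changelog TEXT
);
CREATE TABLE lexeme_changelog (
    lexeme_id INTEGER,
    changelog_id INTEGER REFERENCES changelog(id),
    applied DATE          -- when the change was applied (YYYY-MM-DD)
);
""")
db.execute("INSERT INTO changelog VALUES (1, 'lexeme', 'This word was antedated in the second edition')")
db.execute("INSERT INTO changelog VALUES (2, 'lexeme', 'This word was postdated in the second edition')")
# The same (hypothetical) lexeme can now carry both changelogs:
db.execute("INSERT INTO lexeme_changelog VALUES (42, 1, '2020-06-01')")
db.execute("INSERT INTO lexeme_changelog VALUES (42, 2, '2020-06-15')")
rows = db.execute("""
    SELECT c.changelog, lc.applied
    FROM lexeme_changelog lc
    JOIN changelog c ON c.id = lc.changelog_id
    WHERE lc.lexeme_id = 42
    ORDER BY lc.applied
""").fetchall()
```

The query returns the full history for the lexeme in the order the changes were applied, which is exactly the paper trail the single-column design loses.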
For now I’ve updated two lexemes in category 1 to use the two changelogs for test purposes. I’ve updated the category browser in the front end to add a ‘2’ in a circle wherever a ‘second edition’ changelog ID is present. These have tooltips that display the changelog text on hover, as the following screenshot demonstrates:
I haven’t added these circles to the search results yet or the full timeline visualisations, but it is likely that they will need to appear there too.
I also spent some time working on a new script for Fraser’s Scots Thesaurus project. This script allows a user to select an HT category to bring back all of the words contained in it. It then queries the DSL for each of these words and returns a list of those entries that contain at least two of the category’s words somewhere in the entry text. The script outputs the name of the category that was searched for, a list of returned HT words so you can see exactly what is being searched for, and the DSL entries that feature at least two of the words in a table that contains fields such as source dictionary, parts of speech, a link through to the DSL entry, headword etc. I may have to tweak this further next week, but it seems to be working pretty well.
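The core of the matching rule (keep only entries that contain at least two of the category’s words) can be sketched as follows. The entry IDs and texts here are stand-ins, the real script queries the DSL rather than a dictionary, and the naive substring check is cruder than the actual matching:

```python
def entries_with_two_matches(category_words, entries):
    """Return entries whose text contains at least two distinct
    words from the category (naive case-insensitive substring test)."""
    matches = {}
    words = [w.lower() for w in category_words]
    for entry_id, text in entries.items():
        text_lower = text.lower()
        found = {w for w in words if w in text_lower}
        if len(found) >= 2:
            matches[entry_id] = sorted(found)
    return matches

# Invented example data:
entries = {
    "dost123": "A kind of sleet or hail falling in showers",
    "dost456": "A shower of rain",
}
result = entries_with_two_matches(["sleet", "hail", "shower"], entries)
# only dost123 contains two or more of the category's words
```

The returned word lists are what lets the output show exactly which of the category’s words each entry matched.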
I spent most of the rest of the week working on the redevelopment of the Anglo-Norman Dictionary. We had a bit of a shock at the start of the week because the entire old site was offline and inaccessible. It turned out that the domain name subscription had expired, and thankfully it was possible to renew it and the site became available again. I spent a lot of time this week continuing to work on the entry page, trying to untangle the existing XSLT script and work out how to apply the necessary rules to the editors’ version of the XML, which differs from the system version of the XML that was generated by an incomprehensible and undocumented series of processes in the old system.
I started off with the references located within the variant forms. In the existing site these link through to source texts, with information appearing in a pop-up when the reference is clicked on. To get these working I needed to figure out where the list of source texts was being stored and also how to make the references appear properly. The Editors’ XML and the System XML differ in structure, and only the latter actually contains the text that appears as the link. So, for example, while the latter has:
<cit> <bibl siglum="Secr_waterford1" loc="94.787"><i>Secr</i> <sc>waterford</sc><sup>1</sup> 94.787</bibl> </cit>
The former only has:
<varref> <reference><source siglum="Secr_waterford1" target=""><loc>94.787</loc></source></reference> </varref>
This meant that the text to display and its formatting (<i>Secr</i> <sc>waterford</sc><sup>1</sup>) were not available to me. Thankfully I managed to track down an XML file containing the list of texts, which includes this formatting and also all of the information that should appear in the pop-up that is opened when the link is clicked, e.g.
<item id="Secr_waterford1" cits="552">
<siglum><i>Secr</i> <span class="sc">WATERFORD</span><sup>1</sup></siglum>
<bibl>Yela Schauwecker, <i>Die Diätetik nach dem ‘Secretum secretorum’ in der Version von Jofroi de Waterford: Teiledition und lexikalische Untersuchung</i>, Würzburger medizinhistorische Forschungen 92, Würzburg, 2007</bibl>
<date>c.1300 (text and MS)</date>
</item>
I then turned my attention to the cognate references section, and there were also some issues here with the Editors’ XML not including information that is in the System XML. The structure of the cognate references in the System XML is like this:
<xr_group type="cognate" linkable="yes"> <xr><ref siglum="FEW" target="90/page/231" loc="9,231b">posse</ref></xr> </xr_group>
Note that there is a ‘target’ attribute that provides a link. The Editor’s XML does not include this information – here’s the same reference:
<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>
There’s nothing in there that I can use to ascertain the correct link to add in. However, I found a ‘hash’ file called ‘cognate_hash’ which, when extracted, contains a list of cognate references and targets. These don’t include entry identifiers so I’m not sure how they were connected to entries, but by combining the ‘siglum’ and the ‘loc’ it looks like it might be possible to find the target, e.g.:
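If the hash data does line up this way, resolving a link becomes a simple lookup keyed on the siglum/loc pair, trying an asterisked variant of the loc as a fallback. A sketch with invented hash contents (I still don’t know what the asterisk signifies):

```python
# Stand-in for the extracted 'cognate_hash' data, keyed on
# (siglum, loc) -> target. Contents here are invented.
cognate_hash = {
    ("FEW", "*9,231b"): "90/page/231",
}

def find_target(siglum, loc):
    """Try the plain loc first, then the asterisked variant."""
    for key in ((siglum, loc), (siglum, "*" + loc)):
        if key in cognate_hash:
            return cognate_hash[key]
    return None

target = find_target("FEW", "9,231b")  # resolves via the '*' variant
```

Any pairs that fail to resolve could be logged for manual checking, which would also help work out what the asterisk means.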
<xr_group type="cognate" linkable="yes">
<ref siglum="FEW" target="90/page/231" loc="*9,231b">posse</ref>
</xr_group>
I’m not sure why there’s an asterisk, though. I also found another hash file called ‘commentary_hash’ that I guess contains the commentaries that appear in some entries but not in their XML. We’ll probably need to figure out whether we want to properly integrate these with the editor’s XML as well.
I completed work on the ‘cognate references’ section, omitting the links out for now (I’ll add these in later) and then moved on to the ‘summary’ box that contains links through to lower sections of the entry. Unfortunately the ‘sense’ numbers are something else that is not present in any form in the Editor’s XML. In the System XML each sense has a number, e.g. ‘<sense n="1">’, but in the Editor’s XML there is no such number. I spent quite a bit of time trying to increment a number in XSLT and apply it to each sense, but it turns out that XSLT variables are immutable, so you can’t simply increment a counter inside a loop as you would in other languages.
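For what it’s worth, XSLT 1.0 can still produce the numbers declaratively, via position() inside a for-each or via xsl:number, since the limitation is that variables can’t be reassigned rather than that counting is impossible. And outside XSLT the numbering is trivial. A Python sketch that stamps an incrementing n attribute onto each sense, mimicking the System XML’s ‘<sense n="1">’ convention (the stub entry is far simpler than the real markup):

```python
import xml.etree.ElementTree as ET

# The Editor's XML carries no sense numbers, so they need generating.
# A bare-bones stand-in for an entry with three senses:
entry = ET.fromstring("<entry><sense/><sense/><sense/></entry>")

# Stamp an incrementing n attribute onto each sense in document order:
for n, sense in enumerate(entry.iter("sense"), start=1):
    sense.set("n", str(n))

print([s.get("n") for s in entry.iter("sense")])  # ['1', '2', '3']
```

Pre-processing the Editor’s XML this way before it reaches the XSLT would also sidestep the problem entirely.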
I still need to add in the non-locution xrefs, labels and some other things, but overall I’m very happy with the progress I’ve made this week. Below is an example of an entry in the old site, with the entry as it currently looks in the new test site I’m working on (be aware that the new interface is only a placeholder). Before:
This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day. I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects. I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records. This week I began by focussing on the data from St Andrews University library.
The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting. However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.
The data I’ve got are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100. The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers. Unfortunately the CSV version of the data doesn’t include the links or the linked-to data, and as I wanted to pull in the data found on the linked pages I needed to process the HTML instead.
I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn. From the filenames my script could ascertain the ledger volume, its dates and the page number. For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10. The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
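My script is PHP, but the filename parsing can be sketched in Python like so, assuming every file follows the naming scheme in the example:

```python
import re

# 'Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html'
#  -> ledger 2, dates 1748-1753, page 10
pattern = re.compile(
    r"Page_UYLY205_(\d+)_Receipt_Book_(\d{4})-(\d{4})\.djvu_(\d+)\.html"
)

def parse_filename(name):
    """Extract ledger number, date range and page number, or None
    if the filename doesn't follow the expected scheme."""
    m = pattern.match(name)
    if m is None:
        return None
    ledger, start, end, page = m.groups()
    return {"ledger": int(ledger), "from": int(start),
            "to": int(end), "page": int(page)}

info = parse_filename("Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html")
```

Sorting the parsed results by ledger then page number gives the sequence needed for the ‘next’ and ‘previous’ links.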
The actual data in the file posed further problems. As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system. Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows). I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically. The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’). I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.
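That merging rule can be sketched like this, with (code, text) pairs standing in for the parsed HTML rows:

```python
def merge_rows(rows):
    """A row with a shelf code starts a new borrowing record;
    code-less rows are appended to the record in progress."""
    records = []
    for code, text in rows:
        if code:                      # new record starts here
            records.append({"code": code, "transcription": text})
        elif records:                 # continuation of the previous record
            records[-1]["transcription"] += " " + text
    return records

# Rows as they appear on the example page:
rows = [
    ("J.5.2", "Rollins belles Lettres Vol 2d"),
    ("", "on profr Shaws order by"),
    ("", "James Key"),
    ("J.6.1", "Another book"),
]
records = merge_rows(rows)
# two records; the first concatenates all three Rollins rows
```

The guard on `records` simply drops any stray code-less rows before the first coded row, which would otherwise have nothing to attach to.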
I also used this approach to set up books and borrowers too. If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to know which is which. However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows. I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code. There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.
The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system. I created four new ‘Additional Fields’ for St Andrews as follows:
- Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
- Code: This contains the full text of the second column (e.g. J.5.2)
- Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
- Original returned text: This contains the full text of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)
In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links. Where subsequent rows contain data in this column but no code, this data is then appended to the transcription. E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.
The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required. Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes. E.g. for the above the notes field contains:
‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit & au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. ) <a href="http://library.st-andrews.ac.uk/record=b2447402~S1">http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>
----- profr_Shaws| <p><a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>
----- James_Key| <p>Possibly James Kay: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>’
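Should these notes ever need unpacking again, the format splits cleanly: links are separated by five dashes and each link is ‘page|content’. A hypothetical Python sketch:

```python
def parse_notes(notes):
    """Split an editors-notes field back into its linked pages.
    Links are separated by '-----' and each is 'page|content'."""
    links = []
    for chunk in notes.split("-----"):
        chunk = chunk.strip()
        if not chunk:
            continue
        page, _, content = chunk.partition("|")
        links.append({"page": page.strip(), "content": content.strip()})
    return links

# Abbreviated version of the example field above:
notes = ("Rollins_belles_Lettres| <p>Possibly: ...</p>\n"
         "----- profr_Shaws| <p>...</p>\n"
         "----- James_Key| <p>Possibly James Kay: ...</p>")
links = parse_notes(notes)
# three links: Rollins_belles_Lettres, profr_Shaws, James_Key
```

Using `partition` rather than `split` keeps any further ‘|’ characters inside the page content intact.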
As mentioned earlier, the script also generates book and borrower records based on the linked pages. I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews. In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with spaces (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field. One book item is created for each holding, to be used to link to the corresponding borrowing records.
For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. ‘<p>Possibly Thomas Duncan: <a href="https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372">https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’). All borrowers are linked to records as ‘Main’ borrowers.
During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table. I therefore updated my script to check for the existence of this heading row; if it exists the script grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page. After my script had finished running we had 11,147 borrowing records, 996 borrowers and 6,395 book holding records for St Andrews in the system.
I then moved on to looking at the data for Selkirk library. This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books, and with borrowers and books connected to borrowings via unique identifiers. Unfortunately the dates were still transcribed as written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records; these will need to be manually generated. The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.
As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records). There are 612 Holding records for Selkirk. I also processed the borrower records, resulting in 86 borrower records being added. I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’; the only other additional field is ‘Subject’ in the Holding records, which will eventually be merged with our ‘Genre’ classification when this feature becomes available.
Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project. I recommended that the filenames shouldn’t have spaces in them as these can be troublesome on some operating systems and that we’d want a character to use as a delimiter between the parts of the filename that wouldn’t appear elsewhere in the filename so it’s easy to split up the filename. I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included. Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.
Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’) but if we’re using a specific delimiting character to separate parts of the filename then using these characters wouldn’t be necessary and I suggested it would be better to not use ‘L’ as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’ which might confuse future human users.
Instead I suggested using a ‘-‘ instead of spaces and a ‘_’ as a delimiter and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filename – no apostrophes, commas, colons, semi-colons, ampersands etc and to make sure the ‘-‘ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office. This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.
Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case. Following these guidelines the filenames would look something like this:
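Putting the guidelines together (all lower case, ‘-’ in place of spaces, ‘_’ as the field delimiter, and the page number exactly as entered in the database), building and validating such filenames can be sketched as follows. The field layout and the ‘.jpg’ extension are my assumptions rather than the project’s settled convention:

```python
import re

def make_filename(library, ledger, page):
    """Join the parts with '_', lower-cased, spaces replaced by '-'."""
    name = "_".join(p.lower().replace(" ", "-") for p in (library, ledger, page))
    return name + ".jpg"

def is_valid(filename):
    # only lower-case letters, digits, '-' and '_' before the extension;
    # this rejects spaces, upper case, apostrophes, fancy dashes etc.
    return re.fullmatch(r"[a-z0-9_-]+\.jpg", filename) is not None

name = make_filename("St Andrews", "ledger-2", "23r")
print(name)  # st-andrews_ledger-2_23r.jpg
```

A validator like `is_valid` could be run over a delivered batch of images to catch any stray characters before they cause trouble.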
In addition to the Books and Borrowing project I worked on a number of other projects this week. I gave Matthew Creasy some further advice on using forums in his new project website, and the ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.
I also worked a bit more on using dates from the OED data in the Historical Thesaurus. Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract the dates so that we could use them to update the dates associated with the lexemes in the HT. I needed to extract the quotation dates, as these have ‘ante’ and ‘circa’ notes, plus labels. I noted that in addition to ‘a’ and ‘c’ a question mark is also used, sometimes with an ‘a’ or ‘c’ and sometimes without. I decided to process things as follows:
- ?a will just be ‘a’
- ?c will just be ‘c’
- ? without an ‘a’ or ‘c’ will be ‘c’.
I also noticed that a date may sometimes be a range (e.g. 1795-8) so I needed to include a second date column in my data structure to accommodate this. I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary: sometimes the content is ‘OE’ and at other times ‘lOE’ or ‘eOE’. I decided to process any OE dates for a lexeme as 650 and to store only one OE date, to align with how OE dates are stored in the HT database (we don’t differentiate between dates for OE words).
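The rules above can be sketched as follows. The output field names are mine, and the real extraction script (which is PHP) handles more edge cases than this:

```python
import re

def parse_quotation_date(raw):
    """Normalise an OED quotation date: '?a' -> 'a', '?c' -> 'c',
    bare '?' -> 'c', OE variants -> the fixed year 650, and ranges
    like '1795-8' expanded into a second year."""
    if raw in ("OE", "lOE", "eOE"):
        return {"prefix": "", "year": 650, "year2": None}
    m = re.match(r"(\?a|\?c|\?|a|c)?(\d{3,4})(?:-(\d{1,4}))?$", raw)
    if not m:
        return None
    prefix, year, year2 = m.groups()
    prefix = {"?a": "a", "?c": "c", "?": "c"}.get(prefix, prefix or "")
    year = int(year)
    if year2:  # expand '1795-8' -> 1798, '1795-98' -> 1798
        year2 = int(str(year)[: -len(year2)] + year2)
    return {"prefix": prefix, "year": year, "year2": year2}
```

For example, `parse_quotation_date("?a1300")` gives prefix ‘a’ and year 1300, while `parse_quotation_date("1795-8")` fills the second year column with 1798.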
While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted. This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table. This will no doubt mean we’ll need to update our online OED table, but as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes would be retained without issue (hopefully), but any matching processes we performed would need to be done again for the new lexemes.
I set my extraction script running on the OED XML files on Wednesday and processing took a long time. The script didn’t complete until sometime during Friday night, but by the time it finished it had processed 238,699 categories and 754,285 lexemes, generating 3,893,341 date rows. It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.
I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project. The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category. Variant spellings, which were all tagged with <form>, are also removed. I also created a new field that stores only the ‘NN_’ tagged words, with all others removed.
The scripts generated three datasets, which I saved as spreadsheets for Fraser. The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria. The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a postagged word fully matches the HT category heading. The final (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a postagged word fully matches a lexeme in the HT category. For this table I’ve also added in the full list of lexemes for each HT category too.
I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU and arranged for a PhD student to get access to the TextGrid files that were generated for the audio records for the SCOTS Corpus project.
Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data. This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.
I have the data that was extracted from Thomas’s script last July as two XML files (dost.xml and snd.xml) and I looked through these for the examples Ann had sent. The entry for snd13897 contains the following URLs:
The first is the ID for the main entry and the other two are child entries. If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged. But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content). The pulling in of data into a webpage in the V3 site uses URLs stored in a table linked to entry IDs. These were generated from the URLs in the entries in the XML file (see the <url> tags above). For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sndns2217. But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.
The entry for dost16606 contains the following URLs:
(in addition to headword URLs). Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content). As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.
What we need to do is remove the child entries that still exist as separate entries in the data. To do this I could write a script that would go through each entry in the dost.xml and snd.xml files. It would pick out every <url> that is not the same as the entry ID and search the file to see if an entry exists with this ID. If one does then presumably it is a duplicate that should be deleted. I’m waiting to hear back from the DSL people to see how we should proceed with this.
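A sketch of that proposed script follows; the element and attribute names are guesses at the shape of dost.xml and snd.xml rather than the files’ actual markup:

```python
import xml.etree.ElementTree as ET

def find_duplicate_children(root):
    """For each entry, collect <url> values other than the entry's own
    ID and flag any that still exist as separate top-level entries."""
    ids = {e.get("id") for e in root.findall("entry")}
    duplicates = set()
    for entry in root.findall("entry"):
        for url in entry.findall(".//url"):
            child_id = url.text
            if child_id != entry.get("id") and child_id in ids:
                duplicates.add(child_id)
    return duplicates

# Invented mini-file mirroring the snd13897 example:
root = ET.fromstring("""
<dictionary>
  <entry id="snd13897"><url>snd13897</url><url>snds3788</url><url>sndns2217</url></entry>
  <entry id="sndns2217"><url>sndns2217</url></entry>
</dictionary>
""")
dupes = find_duplicate_children(root)
# sndns2217 still exists as its own entry; snds3788 was merged cleanly
```

Rather than deleting automatically, the flagged IDs could be output as a list for the DSL editors to check first.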
As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.
This was week 13 of Lockdown, with still no end in sight. I spent most of my time on the Books and Borrowing project, as there is still a huge amount to do to get the project’s systems set up. Last week I’d imported several thousand records into the database and had given the team access to the Content Management System to test things out. One thing that cropped up was that the autocomplete that is used for selecting existing books, borrowers and authors was sometimes not working, or, if it did work, on selection of an item the script that then populates all of the fields about the book, borrower or author was not working. I realised that this was because there were invisible line break characters (\n or \r) in the imported data, and the data is passed to the autocomplete via a JSON file. Line break characters are not allowed in a JSON string and therefore the autocomplete couldn’t access the data. I spent some time writing a script to clean the data of all offending characters, and after running this the autocomplete and pre-population scripts worked fine. However, a further issue cropped up with the text editors in the various forms in the CMS. These use the TinyMCE widget to allow formatting to be added to the text area, which works great. However, whenever a new line is created the editor adds HTML paragraphs (‘<p></p>’, which is good) but also a hidden line break character (‘\r’ or ‘\n’, which is bad). When this field is then used to populate a form via the selection of an autocomplete value, the line break makes the data invalid and the form fails to populate. After identifying this issue I ensured that all such characters are stripped out of any uploaded data, and that fixed the problem.
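The underlying problem is that raw control characters are invalid inside JSON string literals. A Python illustration (the CMS itself builds its JSON in PHP):

```python
import json

def clean(value):
    """Collapse \r, \n and runs of whitespace to single spaces."""
    return " ".join(value.split())

# A hand-built JSON payload containing a raw line break fails to parse:
bad = '{"title": "A Tour through\nScotland"}'
try:
    json.loads(bad)
    parsed = True
except json.JSONDecodeError:
    parsed = False  # the raw \n is rejected as an invalid control character

# Cleaning the value before the JSON is assembled fixes it:
good = '{"title": "%s"}' % clean("A Tour through\nScotland")
title = json.loads(good)["title"]
```

Proper JSON encoders escape these characters automatically, which is why the bug only bites when strings are spliced into JSON by hand.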
I had to spend some time fixing a few more bugs that the team had uncovered during the week. The ‘delete borrower’ option was not appearing, even when a borrower was associated with no records, and I fixed this. There was also an issue with autocompletes not working in certain situations (e.g. when trying to add an existing borrower to a borrowing record that was initially created without a borrower). I tracked down and fixed these. Another issue involved the record page order incrementing whenever the record was edited, even when this had not been manually changed, while another involved book edition data not getting saved in some cases when a borrowing record was created. I tracked down and fixed these issues too.
With these fixes in place I then moved on to adding new features to the CMS, specifically facilities to add and browse the book works, editions and authors that are used across the project. Pressing on the ‘Add Book’ menu item now loads a page through which you can choose to add a Book Work or a Book Edition (with associated Work, if required). You can also associate authors with the Works and Editions. Pressing on the ‘Browse Books’ option now loads a page that lists all of the Book Works in a table, with counts of the number of editions and borrowing records associated with each. There’s also a row for all editions that don’t currently have a work. There are currently 1,925 such editions so most of the data appears in this section, but this will change.
Through the page you can edit a work (including associating authors) by pressing on the ‘edit’ button. You can delete a work so long as it isn’t associated with an Edition. You can bring up a list of all editions in the work by pressing on the eye icon. Once loaded, the editions are displayed in a table. I may need to change this as there are so many fields relating to editions that the table is very wide. It’s usable if I make my browser take up the full width of my widescreen monitor, but for people using a smaller screen it’s probably going to be a bit unwieldy. From the list of editions you can press the ‘edit’ button to edit one of them – for example assigning one of the ‘no work’ editions to a work (existing or newly created via the edit form). You can also delete an edition if it’s not associated with anything. The Edition table includes a count of borrowing records, but I’ll also need to find a way to add in an option to display a list of all of the associated records for each, as I imagine this will be useful.
Pressing on the ‘Add Author’ menu item brings up a form allowing a new author to be added, which will then be available to associate with books throughout the CMS, while pressing on the ‘Browse Authors’ menu item brings up a list of authors. At the moment this table (and the book tables) can’t be reordered by their various columns. This is something else I still need to implement. You can delete an author if it’s not associated with anything and also edit the author details. As with the book tables I also need to add in a facility to bring up a list of all records the author is associated with, in addition to just displaying counts. I also noticed that there seems to be a bug somewhere that is resulting in blank authors occasionally being generated, and I’ll need to look into this.
I then spent some time setting up the project’s server, which is hosted at Stirling University. I was given access details by Stirling’s IT Support people and managed to sign into the Stirling VPN and get access to the server and the database. There was an issue getting write access to the server, but after that was resolved I was able to upload all of the CMS files, set up the WordPress instance that will be the main project website and migrate the database.
I was hoping I’d be able to get the CMS up and running on the new server without issue, but unfortunately this did not prove to be the case. It turns out that the Stirling server runs a different (and newer) version of PHP than the Glasgow server, and some of the behaviour is different. For example, on the Glasgow server you can call a function with fewer parameters than it is defined to require (e.g. addAuthor(1) when the function is set up to take two parameters, e.g. addAuthor(1,2)). The version on the Stirling server doesn’t allow this; instead the script breaks and a blank page is displayed. It took a bit of time to figure out what was going on, and now that I know what the issue is I’m going to have to go through every script and check how every function is called. This is going to be my priority next week.
I also spent a bit of time finalising the website for the project’s pilot project, which deals with borrowing records at Glasgow. This was managed by Matt Sangster, and he’d sent me a list of things we wanted to sort; I spent a few hours going through this, and we’re just about at the point where the website can be made publicly available.
I had intended to spend Friday working on the new way of managing dates for the Historical Thesaurus. The script I’d created to generate the dates for all 790,000-odd lexemes finished running last Friday night, and over the weekend I wrote another script to shift the connectors up one (so a dash would be associated with the date before the dash rather than the one after it, for example). This script then took many hours to run. Unfortunately I didn’t get a chance to look further into this until Thursday, when I found a bit of time to analyse the output, at which point I realised that while the generation of the new fulldate field had worked successfully, the insertion of bracketed dates into the new dates table had failed: the column was set as an integer and I’d forgotten to strip out the brackets. Due to this problem I had to set my scripts running all over again. The first one completed at lunchtime on Friday, but the second didn’t complete until Saturday, so I didn’t manage to work on the HT this week. However, this did mean that I was able to return to a Scots Thesaurus data processing task that Fraser asked me to look into at the start of May, so it’s not all bad news.
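The bracket problem itself is a one-line fix; a minimal sketch (in Python rather than the PHP the real scripts use, with a hypothetical helper name) of the cleaning step the re-run needed:

```python
import re

def clean_year(raw):
    """Strip brackets from a date such as '(1475)' so the value can be
    stored in an integer column; returns None if nothing numeric is left.
    Hypothetical helper illustrating the fix, not the project's code."""
    digits = re.sub(r"[()\[\]]", "", raw).strip()
    return int(digits) if digits.isdigit() else None
```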
Fraser’s task required me to set up the Stanford Part of Speech tagger on my computer, which meant configuring Java and other such tasks that took a bit of time. I then wrote a script that took the output of a script I’d written over a year ago containing monosemous headwords in the DOST data, ran their definitions through the Part of Speech tagger and then outputted the results to a new table. This may sound straightforward, but it took quite some time to get everything working, and then another couple of hours for the script to process around 3,000 definitions. But I was able to send the output to Fraser on Friday evening.
Also this week I gave advice to a few members of staff, such as speaking to Matthew Creasy about his new Scottish Cosmopolitanism project, Jane Stuart-Smith about a new project that she’s putting together with QMU, Heather Pagan of the Anglo-Norman Dictionary about a proposal she’s putting together, Rhona Alcorn about the Scots School Dictionary app and Gerry McKeever about publicising his interactive map.
Week seven of lockdown continued in much the same fashion as the preceding weeks, the only difference being that Friday was a holiday to mark the 75th anniversary of VE Day. I spent much of the four working days on the development of the content management system for the Books and Borrowing project. The project RAs will start using the system in June and I’m aiming to get everything up and running before then, so this is my main focus at the moment. I also had a Zoom meeting with project PI Katie Halsey and Co-I Matt Sangster on Tuesday to discuss the requirements document I’d completed last week and the underlying data structures I’d defined in the weeks before. Both Katie and Matt were very happy with the document, although Matt had a few changes he wanted made to the underlying data structures and the CMS. I made the necessary changes to the data design / requirements document and the project’s database that I’d set up last week. The changes were:
- Borrowing spans have been removed from libraries; these will instead be inferred automatically from the start and end dates of the ledger records held in each library.
- Ledgers now have a new ‘ledger type’ field, which currently allows a choice of ‘Professorial’, ‘Student’ or ‘Town’. This field will allow borrowing spans for libraries to be altered based on a selected ledger type.
- The way occupations for borrowers are recorded has been updated so that both original occupations from the records and a normalised list of occupations can be stored. Borrowers may not have an original occupation but might still have a standardised one, so I’ve decided to use the occupations table as previously designed to hold information about standardised occupations, and a borrower may have multiple standardised occupations. I have also added a new ‘original occupation’ field to the borrower record, where any number of occupations found for the borrower in the original documentation (e.g. river watcher) can be added if necessary.
- The book edition table now has an ‘other authority URL’ field and an ‘other authority type’ field, which can be used if ESTC is not appropriate. The ‘type’ currently features ‘Worldcat’, ‘CERL’ and ‘Other’.
- ‘Language’ has been moved from Holding to Edition.
- Finally, in Book Holding the short title is now the original title and the long title is now the standardised title, while the place and date of publication fields have been removed, as the comparable fields at Edition level will be sufficient.
In terms of the development of the CMS, I created a Bootstrap-based interface for the system, which currently just uses the colour scheme I used for Matt’s pilot 18th Century Borrowing project. I created the user authentication scripts and the menu structure and then started to create the actual pages. So far I’ve created a page to add a new library record and all of the information associated with a library, such as any number of sources. I then created the facility to browse and delete libraries and the main ‘view library’ page, which will act as a hub through which all book and borrowing records associated with the library will be managed. This page has a further tab-based menu with options to allow the RA to view / add ledgers, additional fields, books and borrowers, plus the option to edit the main library information. So far I’ve completed the page to edit the library information and have started work on the page to add a ledger. I’m making pretty good progress with the CMS, but there is still a lot left to do. Here’s a screenshot of the CMS if you’re interested in how it looks:
Also this week I had a Zoom meeting with Marc Alexander and Fraser Dallachy to discuss updates to the Historical Thesaurus as we head towards a second edition. This will include adding in new words from the OED and new dates for existing words. My new date structure will also go live, so there will need to be changes to how the timelines work. Marc is hoping to go live with the new updates in August. We also discussed the ‘guess the category’ quiz, with Marc and Fraser having some ideas about limiting the quiz to certain categories, or excluding categories that might feature inappropriate content. We may also introduce a difficulty level based on date, with an ‘easy’ version only containing words that were in use for a decent span of time in the past 200 years.
Other work I did this week included making some tweaks to the data for Gerry McKeever’s interactive map, fixing an issue with videos continuing to play after the video overlay was closed for Paul Malgrati’s Burns Supper map, replying to a query from Alasdair Whyte about his Place-names of Mull and Ulva project and looking into an issue for Fraser’s Scots Thesaurus project, which unfortunately I can’t do anything about as the scripts I’d created for this (which needed to be left running for several days) are on the computer in my office. If this lockdown ever ends I’ll need to tackle this issue then.
Last week was a full five-day strike and the end of the current period of UCU strike action. This week I returned to work, but the Coronavirus situation, which had been gradually getting worse over the past few weeks, ramped up considerably, with the University closed for teaching and many staff working from home. I came into work from Monday to Wednesday, but the West End was deserted and there didn’t seem much point in using public transport to come into my office when there was no-one else around, so from Thursday onwards I began to work from home, as I will be doing for the foreseeable future.
Despite all of these upheavals and also suffering from a pretty horrible cold I managed to get a lot done this week. Some of Monday was spent catching up with emails that had come in whilst I had been on strike last week, including a request from Rhona Alcorn of SLD to send her the data and sound files from the Scots School Dictionary and responding to Alan Riach from Scottish Literature about some web pages he wanted updated (these were on the main University site and this is not something I am involved with updating). I also noticed that the version of this site that was being served up was the version on the old server, meaning my most recent blog posts were not appearing. Thankfully Raymond Brasas in Arts IT Support was able to sort this out. Raymond had also emailed me about some WordPress sites I manage that had out of date versions of the software installed. There were a couple of sites that I’d forgotten about, a couple that were no longer operational and a couple that had legitimate reasons for being out of date, so I got back to him about those, and also updated my spreadsheet of WordPress sites I manage to ensure the ones I’d forgotten about would not be overlooked again. I also became aware of SSL certificate errors on a couple of websites that were causing the sites to display scary warning messages before anyone could reach the sites, so asked Raymond to fix these. Finally, Fraser Dallachy, who is working on a pilot for a new Scots Thesaurus, contacted me to see if he could get access to the files that were used to put together the first version of the Concise Scots Dictionary. We had previously established that any electronic files relating to the printed Scots Thesaurus have been lost and he was hoping that these old dictionary files may contain data that was used in this old thesaurus. I managed to track the files down, but alas there appeared to be no semantic data in the entries found therein.
I also had a chat with Marc Alexander about a little quiz he would like to develop for the Historical Thesaurus.
I spoke to Jennifer Smith on Monday about the follow-on funding application for her SCOSYA project and spent a bit of time during the week writing a first draft of a Data Management Plan for the application, after reviewing all of the proposal materials she had sent me. Writing the plan raised some questions and I will no doubt have to revise the plan before the proposal is finalised, but it was good to get a first version completed and sent off.
I also finished work on the interactive map for Gerry McKeever’s Regional Romanticism project this week. Previously I’d started to use a new plugin to get nice curved lines between markers and all appeared to be working well. This week I began to integrate the plugin with my map, but unfortunately I’m still encountering unusable slowdown with the new plugin. Everything works fine to begin with, but after a bit of scrolling and zooming, especially round an area with lots of lines, the page becomes unresponsive. I wondered whether the issue might be related to the midpoint of the curve being dynamically generated from a function I took from another plugin so instead made a version that generated and then saved these midpoints that could then be used without needing to be calculated each time. This would also have meant that we could have manually tweaked the curves to position them as desired, which would have been great as some lines were not ideally positioned (e.g. from Scotland to the US via the North Pole), but even this seems to have made little impact on the performance issues. I even tried turning everything else off (e.g. icons, popups, the NLS map) to see if I could identify another cause of the slowdown but nothing has worked. I unfortunately had to admit defeat and resort to using straight lines after all. These are somewhat less visually appealing, but they result in no performance issues. Here’s a screenshot of this new version:
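For reference, the precomputation idea itself was straightforward; a rough sketch (in Python, with a simple perpendicular offset rather than the plugin’s actual maths) of generating a storable control point for each curve:

```python
def curve_midpoint(lat1, lng1, lat2, lng2, offset=0.2):
    """Compute a control point for a curved line between two markers,
    offset perpendicular to the straight line between them, so it can
    be saved (and hand-tweaked) rather than recalculated on every
    redraw. A rough sketch; the plugin derives its curves differently."""
    mid_lat = (lat1 + lat2) / 2
    mid_lng = (lng1 + lng2) / 2
    return (mid_lat + offset * (lng2 - lng1),
            mid_lng - offset * (lat2 - lat1))
```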
With these updates in place I made a version of the map that would run directly on the desktop and sent Gerry some instructions on how to update the data, meaning he can continue to work on it and see how it looks. But my work on this is now complete for the time being.
I was supposed to meet with Paul Malgrati from Scottish Literature on Wednesday to discuss an interactive map of Burns Suppers he would like me to create. We decided to cancel our meeting due to the Coronavirus, but continued to communicate via email. Paul had sent me a spreadsheet containing data relating to the Burns Suppers and I spent some time working on some initial versions of the map, reusing some of the code from the Regional Romanticism map, which in turn used code from the SCOSYA map.
I migrated the spreadsheet to an online database and then wrote a script that exports this data in the JSON format that can be easily read into the map. The initial version uses OpenStreetMap.HOT as a basemap rather than the .DE one that Paul had selected as the latter displays all place-names in German where these are available (e.g. Großbritannien). The .HOT map is fairly similar, although for some reason parts of South America look like they’re underwater. We can easily change to an alternative basemap in future if required. In my initial version all locations are marked with red icons displaying a knife and fork. We can use other colours or icons to differentiate types if or when these are available. The map is full screen with an introductory panel in the top right. Hovering over an icon displays the title of the event while clicking on it replaces the introductory panel with a panel containing the information about the supper. The content is generated dynamically and only displays fields that contain data (e.g. very few include ‘Dress Code’). You can always return to the intro by clicking on the ‘Introduction’ button at the top.
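The export step can be sketched roughly as follows (Python rather than the PHP actually used, and with invented field names), including the ‘only display fields that contain data’ behaviour:

```python
import json

# Hypothetical rows as they might come back from the suppers database;
# the real field names in Paul's spreadsheet will differ.
rows = [
    {"title": "Reykjavik Burns Supper", "lat": 64.1466, "lng": -21.9426,
     "dress_code": None},
]

def to_map_json(rows):
    """Export the data in a JSON form the map can read, keeping only
    the fields that actually contain data (as the pop-up does)."""
    features = []
    for row in rows:
        props = {k: v for k, v in row.items()
                 if k not in ("lat", "lng") and v not in (None, "")}
        features.append({"lat": row["lat"], "lng": row["lng"],
                         "properties": props})
    return json.dumps(features)
```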
I spotted a few issues with the latitude and longitude of some locations that will need fixed. E.g. St Petersburg has Russia as the country but it is positioned in St Petersburg in Florida while Bogota Burns night in Colombia is positioned in South Sudan. I also realised that we might want to think about grouping icons as when zoomed out it’s difficult to tell where there are multiple closely positioned icons – e.g. the two in Reykjavik and the two in Glasgow. However, grouping may be tricky if different locations are assigned different icons / types.
After further email discussions with Paul (and being sent a new version of the spreadsheet) I created an updated version of my initial map. This version incorporates the data from the spreadsheet and incorporates the new ‘Attendance’ field into the pop-up where applicable. It is also now possible to zoom further out, and also scroll past the international dateline and still see the data (in the previous version if you did this the data would not appear). I also integrated the Leaflet Plugin MarkerCluster (see https://github.com/Leaflet/Leaflet.markercluster) that very nicely handles clustering of markers. In this new version of my map markers are now grouped into clusters that split apart as you zoom in. I also added in an option to hide and show the pop-up area as on small screens (e.g. mobile phones) the area takes up a lot of space, and if you click on a marker that is already highlighted this now deselects the marker and closes the popup. Finally, I added a new ‘Filters’ section in the introduction that you can show or hide. This contains options to filter the data by period. The three periods are listed (all ‘on’ by default) and you can deselect or select any of them. Doing so automatically updates the map to limit the markers to those that meet the criteria. This is ‘remembered’ as you click on other markers and you can update your criteria by returning to the introduction. I did wonder about adding a summary of the selected filters to the popup of every marker, but I think this will just add too much clutter, especially when viewing the map on smaller screens (these days most people access websites on tablets or phones). Here is an example of the map as it currently looks:
The main things left to do are adding more filters and adding in images and videos, but I’ll wait until Paul sends me more data before I do anything further. That’s all for this week. I’ll just need to see how work progresses over the next few weeks as with the schools now shut I’ll need to spend time looking after my son in addition to tackling my usual work.
I met with Fraser Dallachy on Monday to discuss his ongoing pilot Scots Thesaurus project. It’s been a while since I’ve been asked to do anything for this project and it was good to meet with Fraser and talk through some of the new automated processes he wanted me to try out. One thing he wanted to try was tagging the DSL dictionary definitions for part of speech to see if we could then automatically pick out word forms that we could query against the Historical Thesaurus to try and place the headword within a category. I adapted a previous script I’d created that picked out random DSL entries. This script targeted main entries (i.e. not supplements) that were nouns, were monosemous and had one sense, had fewer than 5 variant spellings, single-word headwords and ‘short’ definitions, with the option to specify what is meant by ‘short’ in terms of the number of characters. I updated the script to bring back all DOST entries that met these criteria and had definitions that were less than 1000 characters in length, which resulted in just under 18,000 rows being returned (but I will rerun the script with a smaller character count if Fraser wants to focus on shorter entries). The script also stripped out all citations and tags from the definition to prepare it for POS tagging. With this dataset exported as a CSV I then began experimenting with a POS Tagger. I decided to use the Stanford POS Tagger (https://nlp.stanford.edu/software/tagger.html) which can be run at the command line, and I created a PHP script that went through each row of the CSV, passed the prepared definition text to the Tagger, pulled in the output and stored it in a database. I left the process running overnight and it had completed the following morning. I then outputted the rows as a spreadsheet and sent them on to Fraser for feedback. Fraser also wanted to see about using the data from the Scots School Dictionary so I sent that on to him too.
I also did a little bit of work for the DSL, investigating why some geographically tagged information was not being displayed in the citations, and replied to a few emails from Heather Pagan of the Anglo-Norman Dictionary as she began to look into uploading new data to their existing and no longer supported dictionary management system. I also gave some feedback on a proposal written by Rachel Douglas, a lecturer in French. Although this is not within Critical Studies and should be something Luca Guariento looks at, he is currently on holiday so I offered to help out. I also set up an initial WordPress site for Matthew Creasey’s new project. This still needs some further work, but I’ll need further information from Matthew before I can proceed. On Wednesday I met with Jennifer Smith and E Jamieson to discuss a possible follow-on project for the Scots Syntax Atlas. We talked through some of the possibilities and I think the project has huge potential. I’ll be helping to write the Data Management Plan and other such technical things for the proposal in due course.
I met with Marc and Fraser on Friday to discuss our plans for updating the way dates are stored in the Historical Thesaurus, which will make it much more easy to associate labels with specific dates and to update the dates in future as we align the data with revisions from the OED. I’d previously written a script that generated the new dates and from these generated a new ‘full date’ field which I then matched against the original ‘full date’ to spot errors. The script identified 1,116 errors, but this week I updated my script to change the way it handled ‘b’ dates. These are the dates that appear after a slash and where the date after the slash is in the same decade as the main date only one digit should be displayed (e.g. 1975/6), but this is not done so consistently, with dates sometimes appearing as 1975/76. Where this happened my script was noting the row as an error, but Marc wanted these to be ignored. I updated my script to take this into consideration, and this has greatly reduced the number of rows that will need to be manually checked, reducing the output to just 284 rows.
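A sketch of the comparison logic (in Python, with a hypothetical function name): expanding both renderings of a ‘b’ date to a full year makes 1975/6 and 1975/76 compare equal, so such rows can be ignored rather than flagged as errors:

```python
def normalise_slash_date(fulldate):
    """Expand the part after the slash to a full year, so that
    '1975/6' and '1975/76' both become '1975/1976'. A hypothetical
    sketch of the check, not the script itself."""
    main, _, b = fulldate.partition("/")
    if not b:
        return fulldate
    # borrow the leading digits of the main date to pad out the 'b' date
    return main + "/" + main[: len(main) - len(b)] + b
```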
I spent the rest of my time this week working on the Books and Borrowers project. Although this doesn’t officially begin until the summer I’m designing the data structure at the moment (as time allows) so that when the project does start the RAs will have a system to work with sooner rather than later. I mapped out all of the fields in the various sample datasets in order to create a set of ‘core’ fields, mapping the fields from the various locations to these ‘core’ fields. I also designed a system for storing additional fields that may only be found at one or two locations, are not ‘core’ but still need to be recorded. I then created the database schema needed to store the data in this format and wrote a document that details all of this which I sent to Katie Halsey and Matt Sangster for feedback.
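The shape of that design can be illustrated with a toy record (all names invented for illustration): the ‘core’ fields live in fixed columns, while location-specific extras are stored as open-ended key/value pairs rather than getting their own columns:

```python
# Toy borrowing record illustrating the two-tier design. The 'core'
# fields are shared by every library; anything recorded at only one or
# two locations goes into the open-ended 'additional' list instead of
# its own column. Field names here are invented, not the project's schema.
borrowing = {
    "core": {
        "borrower": "Charles Wilson",
        "lent": "22 Mar 1757",
        "returned": "10 May 1757",
    },
    "additional": [
        {"field": "Professor", "value": "Mr Smith"},
    ],
}
```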
Matt also sent me a new version of the Glasgow Student borrowings spreadsheet he had been working on, and I spent several hours on Friday getting this uploaded to the pilot online resource I’m working on. I experimented with a new method of extracting the data from Excel to try and minimise the number of rows that were getting garbled due to Excel’s horrible attempts to save files as HTML. As previously documented, the spreadsheet uses formatting in a number of columns (e.g. superscript, strikethrough). This formatting is lost if the contents of the spreadsheet are copied in a plain text way (so no saving as a CSV, or opening the file in Google Docs or just copying the contents). The only way to extract the formatting in a way that can be used is to save the file as HTML in Excel and then work with that. But the resulting HTML produced by Excel is awful, with hundreds of tags and attributes scattered across the file used in an inconsistent and seemingly arbitrary way.
For example, this is the HTML for one row:
<tr height=23 style='height:17.25pt'>
<td height=23 width=64 style='height:17.25pt;width:48pt'></td>
<td width=143 style='width:107pt'>Charles Wilson</td>
<td width=187 style='width:140pt'>Charles Wilson</td>
<td width=86 style='width:65pt'>Charles</td>
<td width=158 style='width:119pt'>Wilson<span
<td width=88 style='width:66pt'>Nat. Phil.</td>
<td width=129 style='width:97pt'>Natural Philosophy</td>
<td width=64 style='width:48pt'>B</td>
<td class=xl69 width=81 style='width:61pt'>10</td>
<td class=xl70 width=81 style='width:61pt'>3</td>
<td width=250 style='width:188pt'>Wells Xenophon vol. 3<font class="font6"><sup>d</sup></font></td>
<td width=125 style='width:94pt'>Mr Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'>Adam Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=89 style='width:67pt'>22 Mar 1757</td>
<td width=89 style='width:67pt'>10 May 1757</td>
<td align=right width=56 style='width:42pt'>2</td>
<td width=64 style='width:48pt'>4r</td>
<td class=xl71 width=64 style='width:48pt'>1</td>
<td class=xl70 width=64 style='width:48pt'>007</td>
<td class=xl65 width=325 style='width:244pt'><a
<td width=293 style='width:220pt'>Xenophon.</td>
<td width=392 style='width:294pt'>Opera quae extant omnia; unà cum
chronologiâ Xenophonteâ <span style='display:none'>cl. Dodwelli, et quatuor
tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</span></td>
<td width=110 style='width:83pt'>Wells, Edward, 16<span style='display:none'>67-1727.</span></td>
<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>
<td align=right width=64 style='width:48pt'>1</td>
<td align=right width=121 style='width:91pt'>1</td>
<td width=64 style='width:48pt'>T111427</td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
Previously I tried to fix this by running through several ‘find and replace’ passes to try and strip out all of the rubbish, while retaining what I needed, which was <tr>, <td> and some formatting tags such as <sup> for superscript.
This time I found a regular expression that removes all attributes from HTML tags, so for example <td width=64 style='width:48pt'> becomes <td> (see it here: https://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag). I could then pass the resulting contents of every <td> through PHP’s strip_tags function to remove any remaining tags that were not required (e.g. <span>) while specifying the tags to retain (e.g. <sup>).
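The attribute-stripping step is easy to reproduce; here’s a Python equivalent (the real script is PHP, using that Stack Overflow regular expression plus strip_tags):

```python
import re

def strip_attributes(html):
    """Remove all attributes from opening HTML tags, so that
    <td width=64 style='width:48pt'> becomes plain <td>. Closing
    tags are left alone. Python version of the approach described
    above, not the project's actual code."""
    return re.sub(r"<\s*([a-zA-Z0-9]+)(?:\s[^>]*)?>", r"<\1>", html)
```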
This approach seemed to work very well until I analysed the resulting rows and realised that the columns of many rows were all out of synchronisation, meaning any attempt at programmatically extracting the data and inserting it into the correct field in the database would fail. After some further research I realised that Excel’s save as HTML feature was to blame yet again. Without there being any clear reason, Excel sometimes expands a cell into the next cell or cells if these cells are empty. An example of this can be found above and I’ve extracted it here:
<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>
The ‘colspan’ attribute means that the cell will stretch over multiple columns, in this case 2 columns, but elsewhere in the output file it was 3 and sometimes 4 columns. Where this happens the following cells simply don’t appear in the HTML. As my regular expression removed all attributes this ‘colspan’ was lost and the row ended up with subsequent cells in the wrong place.
Once I’d identified this I could update my script to check for the existence of ‘colspan’ before removing attributes, adding in the required additional empty cells as needed (so in the above case an extra <td></td>).
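In Python the colspan repair might look like this (the actual script is PHP, and the regular expression here is my own sketch rather than the one it uses):

```python
import re

def expand_colspans(row_html):
    """Replace each cell carrying colspan=N with the cell itself plus
    N-1 empty <td></td> cells, so every row keeps the same number of
    columns once attributes are stripped. Run before the
    attribute-removing pass; a sketch of the fix described above."""
    def pad(match):
        n = int(match.group(1))
        return match.group(0) + "<td></td>" * (n - 1)
    return re.sub(r"<td[^>]*\bcolspan=['\"]?(\d+)[^>]*>.*?</td>",
                  pad, row_html, flags=re.S)
```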
With all of this in place the resulting HTML was much cleaner. Here is the above row after my script had finished:
<td>Wells Xenophon vol. 3<sup>d</sup></td>
<td>22 Mar 1757</td>
<td>10 May 1757</td>
<td>Opera quae extant omnia; unà cum
chronologiâ Xenophonteâ cl. Dodwelli, et quatuor
tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</td>
<td>Wells, Edward, 1667-1727.</td>
I then updated my import script to pull in the new fields (e.g. normalised class and professor names), set it up so that it would not import any rows that had ‘Yes’ in the first column, and updated the database structure to accommodate the new fields too. The upload process then ran pretty smoothly and there are now 8145 records in the system. After that I ran the further scripts to generate dates, students, professors, authors, book names, book titles and classes and updated the front end as previously discussed. I still have the old data stored in separate database tables as well, just in case we need it, but I’ve tested out the front-end and it all seems to be working fine to me.
After meeting with Fraser to discuss his Scots Thesaurus project last Friday I spent some time on Monday this week writing a script that returns some random SND or DOST entries that met certain criteria, so as to allow him to figure out how these might be placed into HT categories. The script brings back main entries (as opposed to supplements) that are nouns, are monosemous (i.e. no other noun entries with the same headword), have only one sense (i.e. not multiple meanings within the entry), have fewer than 5 variant spellings, have single-word headwords and have definitions that are relatively short (100 characters or less). Whilst writing the script I realised that database queries are somewhat limited on the server and if I try to extract the full SND or DOST dataset to then select rows that meet the criteria in my script these limits are reached and the script just displays a blank page. So what I had to do was set the script up to bring back a random sample of 5000 main entry nouns that don’t have multiple words in their headword in the selected dictionary, and then apply the other checks on this set of 5000 random entries. This can mean that the number of outputted entries ends up being less than the 200 that Fraser was hoping for, but it still provides a good selection of data. The output is currently an HTML table, with IDs linking through to the DSL website, and I’ve given the option of setting the desired number of returned rows (up to 1000) and the number of characters that should be considered a ‘short’ definition (up to 5000). Fraser seemed pretty happy with how the script is working.
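The sample-then-filter workaround can be sketched like this (Python, with invented names; the real PHP script pulls its 5,000-row sample from the database):

```python
import random

def sample_then_filter(rows, predicate, sample_size=5000, wanted=200):
    """Work around server-side query limits: pull a random sample
    first, then apply the remaining criteria in code. The final
    count may fall short of 'wanted', as noted above."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    return [row for row in sample if predicate(row)][:wanted]
```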
Also this week I made some further updates to the new song story for RNSN and I spent a large amount of time on Friday preparing for my upcoming PDR session. On Tuesday I met with Luca to have a bit of a catch-up, which was great. I also fixed a few issues with the Thesaurus of Old English data for Jane Roberts and responded to a request for developer effort from a member of staff who is not in the College of Arts. I also returned to working on the Books and Borrowing pilot system for Matthew Sangster, going through the data I’d uploaded in June, exporting rows with errors and sending these to Matthew for further checking. Although there are still quite a lot of issues with the data, in terms of its structure things are pretty fixed, so I’m going to begin work on the front-end for the data next week, the plan being that I will work with the sample data as it currently stands and then replace it with a cleaner version once Matthew has finished working with it.
I divided the rest of my time this week between DSL and SCOSYA. For the DSL I integrated the new APIs that I was working on last week with the ‘advanced search’ facilities on both the ‘new’ (v2 data) and ‘sienna’ (v3 data) test sites. As previously discussed, the ‘headword match type’ from the live site has been removed in favour of just using wildcard characters (*?”). Full-text searches, quotation searches and snippets should all be working, in addition to headword searches. I’ve increased the maximum number of full-text / quotation results from 400 to 500 and I’ve updated the warning messages so they tell you how many results your query would have returned if the total number is greater than this. I’ve tested both new versions out quite a bit and things are looking good to me, and I’ve contacted Ann and Rhona to let them know about my progress. I think that’s all the DSL work I can do for now, until the bibliography data is made available.
For SCOSYA I engaged in an email conversation with Jennifer and others about how to cover the costs of MapBox in the event of users exceeding the free provision of 200,000 map loads a month after the site launches next month. I also continued to work on the public atlas interface based on discussions we had at a team meeting last Wednesday. The main thing was replacing the ‘Home’ map, which previously just displayed the questionnaire locations, with a new map that highlights certain locations that have sound clips demonstrating an interesting feature. The plan is that this will then lead users on to finding out more about these features in the stories, whilst also showing people where some of the locations the project visited are. This meant creating facilities in the CMS to manage this data, updating the database, updating the API and updating the front-end, so it was a fairly major undertaking.
I updated the CMS to include a page to manage the markers that appear on the new ‘Home’ map. Once logged into the CMS click on the ‘Browse Home Map Clips’ menu item to load the page. From here staff can see all of the locations and add / edit the information for a location (adding an MP3 file and the text for the popup). I added the data for a couple of sample locations that E had sent me. I then added a new endpoint to the API that brings back the information about the Home clips and updated the public atlas to replace the old ‘Home’ map with the new one. Markers are still the bright blue colour and drop into the map. I haven’t included the markers for locations that don’t have clips. We did talk at the meeting about including these, but I think they might just clutter the map up and confuse people.
I also reordered and relabelled the menu, and have changed things so that you can now click on an open section to close it. Currently doing so still triggers the map reload for certain menu items (e.g. Home). I’ll try to stop it doing so, but I haven’t managed to yet.
I also implemented the ‘Full screen’ slide type, although I think we might need to change the style of this. Currently it takes up about 80% of the map width, pinned to the right-hand edge (which it needs to be for the animated transitions between slides to work). It’s only as tall as the content of the slide needs it to be, though, so the map is not really being obscured, which is what Jennifer wanted the slide to do. Although I could set it so that the slide is taller, this would then shift the navigation buttons down to the bottom of the map, and if people haven’t scrolled the map fully into view they might not notice the buttons. I’m not sure what the best approach might be here, and this needs further discussion.
I also changed the way location data is returned from the API this week, to ensure that the GeoJSON area data is only returned from the API when it is specifically asked for, rather than by default. This means such data is only requested and used in the front-end when a user selects the ‘area’ map in the ‘Explore’ menu. The reason for doing this is to make things load quicker and to reduce the amount of data that was being downloaded unnecessarily. The GeoJSON data was rather large (several megabytes) and requesting this each time a map loaded meant the maps took some time to load on slower connections. With the areas removed the stories and ‘explore’ maps that are point based are much quicker to load. I did have to update a lot of code so that things still work without the area data being present, and I also needed to update all API URLs contained in the stories to specifically exclude GeoJSON data, but I think it’s been worth spending the time doing this.
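The change can be summarised with a small sketch. This is not the project’s actual code, just the shape of the idea in Python (the field names are mine): the heavy polygon data is attached to a location only when the caller explicitly asks for it.

```python
def build_locations_response(locations, want_geojson=False):
    """Assemble the API payload for a list of location records.

    The GeoJSON area polygons run to several megabytes, so they are
    only included when 'want_geojson' is set -- i.e. when the user
    has selected the 'area' map -- rather than on every map load.
    """
    payload = []
    for loc in locations:
        row = {"id": loc["id"], "name": loc["name"],
               "lat": loc["lat"], "lng": loc["lng"]}
        if want_geojson:
            # Only the 'area' map in the 'Explore' menu needs this.
            row["area"] = loc["area_geojson"]
        payload.append(row)
    return payload
```

Point-based stories and ‘explore’ maps then download only the small rows, which is where the speed-up comes from.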
I’d taken Tuesday off this week to cover the last day of the school holidays so it was a four-day week for me. It was a pretty busy four days, though, involving many projects. I had some app related duties to attend to, including setting up a Google Play developer account for people in Sports and Recreation and meeting with Adam Majumdar from Research and Innovation about plans for commercialising apps in future. I also did some further investigation into locating the Anglo-Norman Dictionary data, created a new song story for RNSN and read over Thomas Clancy’s Iona proposal materials one last time before the documents are submitted. I also met with Fraser Dallachy to discuss his Scots Thesaurus plans and will spend a bit of time next week preparing some data for him.
Other than these tasks I split my remaining time between SCOSYA and DSL. For SCOSYA we had a team meeting on Wednesday to discuss the public atlas. There is only about a month left to complete all development work on the project and I was hoping that the public atlas that I’d been working on recently was more or less complete, which would then enable me to move on to the other tasks that still need to be completed, such as the experts interface and the facilities to manage access to the full dataset. However, the team have once again changed their minds about how they want the public atlas to function and I’m therefore going to have to devote more time to this task than I had anticipated, which is rather frustrating at this late stage. I made a start on some of the updates towards the end of the week, but there is still a lot to be done.
For DSL we finally managed to sort out the @dsl.ac.uk email addresses, meaning the DSL people can now use their email accounts again. I also investigated and fixed an issue with the ‘v3’ version of the API which Ann Ferguson had spotted. This version was not working with exact searches, which use quotation marks. After some investigation I discovered that the problem was being caused by the ‘v3’ API code missing a line that was present in the ‘v2’ API code. The server automatically escapes quotes in URLs by adding a preceding backslash (\). The ‘v2’ code was stripping this backslash before processing the query, meaning it correctly identified exact searches. As the ‘v3’ code didn’t strip the backslashes it wasn’t finding the quotation mark and was not treating the query as an exact search.
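The fix amounts to stripping those escaping backslashes before testing for the quotation marks. A Python sketch of the logic (the real API is written differently, and the function name is mine):

```python
def is_exact_search(raw_term: str) -> bool:
    # The server escapes quotes in incoming URLs, so a search for
    # "burn" actually arrives as \"burn\". Strip the escaping
    # backslashes first -- the line the 'v3' code was missing --
    # otherwise the leading quotation mark is never seen and the
    # exact-search branch is silently skipped.
    term = raw_term.replace('\\"', '"')
    return len(term) > 1 and term.startswith('"') and term.endswith('"')
```

With the backslash-stripping line in place an incoming `\"burn\"` is correctly recognised as an exact search; without it the function never sees the leading quote, which is exactly the ‘v3’ bug.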
I also investigated why some DSL entries were missing from the output of my script that prepared data for Solr. I’d previously run the script on my laptop, but running it on my desktop instead seemed to output the full dataset including the rows I’d identified as being missing from the previous execution of the script. Once I’d outputted the new dataset I sent it on to Raymond for import into Solr and then I set about integrating full-text searching into both ‘v2’ and ‘v3’ versions of the API. This involved learning how Solr uses wildcard characters and Boolean searches, running some sample queries via the Solr interface and then updating my API scripts to connect to the Solr interface, format queries in a way that Solr could work with, submit the query and then deal with the results that Solr outputs, integrating these with fields taken from the database as required.
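The query-formatting step can be sketched like this in Python; the field name ‘fulltext’ and the parameters here are illustrative, not the real Solr schema.

```python
import urllib.parse

def build_solr_query(term, field="fulltext", rows=500):
    """Format a user search term as a Solr select query string.

    Solr natively understands the site's * and ? wildcards and treats
    double-quoted terms as exact phrases, so the term mostly passes
    through unchanged; AND / OR / NOT act as Boolean operators.
    (The field name and row limit are assumptions for illustration.)
    """
    return urllib.parse.urlencode({
        "q": f"{field}:({term})",
        "rows": rows,
        "wt": "json",
    })
```

The resulting string is appended to the Solr core’s `/select?` endpoint, and the `numFound` value in the JSON response is what lets the API warn when a query exceeds the 500-result cap.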
Other than the bibliography side of things I think that’s the work on the API more or less complete now (I still need to reorder the ‘browse’ output). What I haven’t done yet is update the advanced search pages of the ‘new’ and ‘sienna’ versions of the website to actually use the new APIs, so as of yet you can’t perform any free-text searches through these interfaces, only directly through the APIs. Connecting the front-ends fully to the APIs is my next task, which I will try to start on next week.
I spent quite a bit of time this week helping members of staff with research proposals. Last week I met with Ophira Gamliel in Theology to discuss a proposal she’s putting together and this week I wrote an initial version of a Data Management Plan for her project, which took a fair amount of time as it’s a rather multi-faceted project. I also met with Kirsteen McCue in Scottish Literature to discuss a proposal she’s putting together, and I spent some time after our meeting looking through some of the technical and legal issues that the project is going to encounter.
I also added three new pages to Matthew Creasey’s transcription / translation case study for his Decadence and Translation project (available here: https://dandtnetwork.glasgow.ac.uk/recreations-postales/), sorted out some user account issues for the Place-names of Kirkcudbrightshire project and prepared an initial version of my presentation for the conference I’m speaking at in Bergamo the week after next.
I also helped Fraser to get some data for the new Scots Thesaurus project he’s running. This is going to involve linking data from the DSL to the OED via the Historical Thesaurus, so we’re exploring ways of linking up DSL headwords to HT lexemes initially, as this will then give us a pathway to specific OED headwords once we’ve completed the HT/OED linking process.
My first task was to create a script that returned all of the monosemous forms in the DSL, which Fraser suggested would be words that have only one ‘sense’ in their entries. The script I wrote goes through the DSL data and picks out all of the entries that have a single <sense> tag in their XML. For each of these it then generates a ‘stripped’ form using the same algorithm that I created for the HT stripped fields (e.g. removing non-alphanumeric characters). It then looks through the HT lexemes for an exact match on the HT lexeme ‘stripped’ field. If there is exactly one match then data about the DSL word and the matching HT word is added to the table.
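The matching logic can be sketched roughly as follows; the function names and data shapes are hypothetical, and the stripping is a simplified stand-in for the HT algorithm.

```python
import re
from collections import Counter

def strip_form(word: str) -> str:
    # Simplified version of the HT 'stripped' fields: lower-case
    # and drop everything that is not a letter or digit.
    return re.sub(r"[^a-z0-9]", "", word.lower())

def match_monosemous(dsl_entries, ht_lexemes):
    """dsl_entries: (headword, sense_count) pairs; ht_lexemes: HT forms.

    Keep DSL entries with exactly one <sense>, then keep only those
    whose stripped form matches exactly one HT lexeme.
    """
    ht_counts = Counter(strip_form(w) for w in ht_lexemes)
    matches = []
    for headword, sense_count in dsl_entries:
        if sense_count != 1:
            continue
        if ht_counts[strip_form(headword)] == 1:  # monosemous in the HT too
            matches.append(headword)
    return matches
```

A headword whose stripped form appears twice in the HT (or not at all) is discarded, which is what makes the surviving rows candidates for linking.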
For DOST there are 42177 words with one sense, and of these 2782 are monosemous in the HT; for SND there are 24085 words with one sense, and of these 1541 are monosemous in the HT. However, there are a couple of things to note. Firstly, I have not added in a check for part of speech, as the DSL POS field is rather inconsistent, often doesn’t even contain data, and where there are multiple POSes there is no consistent way to split them up: sometimes a comma is used, sometimes a space. A POS generally ends with a full stop, but not in forms like ‘n.1’ and ‘n.2’. Also, the DSL uses very different terms to the HT for POS, so without lots of extra work mapping out which corresponds to which it’s not possible to automatically match up an HT and a DSL POS. But as there are only a few thousand rows it should be possible to manually pick out the good ones.
Secondly, a word might have one sense but have two completely separate entries in the same POS, so as things currently stand the returned rows are not necessarily ‘monosemous’. See for example ‘bile’ (http://dsl.ac.uk/results/bile) which has four separate entries in SND that are nouns, plus three supplemental entries, so even though an individual entry for ‘bile’ contains one sense it is clearly not monosemous. After further discussions with Fraser I updated my script to count the number of times a DSL headword with one sense appears as a separate headword in the data. If the word is a DOST word and it appears more than once in DOST this number is highlighted in red. If it appears at all in SND the number is highlighted in red. For SND words it’s the same but reversed. There is rather a lot of red in the output, so I’m not sure how useful the data is going to be, but it’s a start. I also generated lists of DSL entries that contain the text ‘comb.’ and ‘attrb.’ as these will need to be handled differently.
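The red-flag logic from that update can be sketched like so (the names are mine, not the real script’s):

```python
from collections import Counter

def flag_ambiguous(headwords_dost, headwords_snd, candidate, source):
    """Return True (highlight red) when a one-sense headword is not
    actually monosemous across the data: a DOST word is flagged if it
    appears more than once in DOST or at all in SND, and vice versa
    for an SND word.
    """
    dost = Counter(headwords_dost)
    snd = Counter(headwords_snd)
    if source == "dost":
        return dost[candidate] > 1 or snd[candidate] > 0
    return snd[candidate] > 1 or dost[candidate] > 0
```

Under this rule a word like ‘bile’, with several separate SND noun entries, is flagged even though each individual entry contains only one sense.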
All of the above took up most of the week, but I did have a bit of time to devote to HT/OED linking issues, including writing up my notes and listing action items following last Friday’s meeting and beginning to tick off a few of the items from this list. Pretty much all I managed to do related to the issue of HT lexemes with identical details appearing in multiple categories, and updating the output of an existing script to make it more useful.
Point 2 on my list was “I will create a new version of the non-unique HT words (where a word with the same ‘word’, ‘startd’ and ‘endd’ in multiple categories) to display how many of these are linked to OED words and how many aren’t“. I updated the script to add in a yes/no column for where there are links. I’ve also added in additional columns that display the linked OED lexeme’s details. Of the 154428 non-unique words 129813 are linked.
Point 3 was “I will also create a version of the script that just looks at the word form and ignores dates”. I’ve decided against doing this as just looking at word form without dates is going to lead to lots of connections being made where they shouldn’t really exist (e.g. all the many forms of ‘strike’).
Point 4 was “I will also create a version of the script that notes where one of the words with the same details is matched and the other isn’t, to see whether the non-matched one can be ticked off” and this has proved both tricky to implement and pretty useful. Tricky because a script can’t just compare the outputted forms sequentially – each identical form needs to be compared with every other. But as I say, it’s given some good results. There are 9056 words that aren’t matched but probably should be, which could potentially be ticked off. Of course, this isn’t going to affect the OED ‘ticked off’ stats, but rather the HT stats. I’ve also realised that this script currently doesn’t take POS into consideration – it just looks at word form, firstd and lastd, so it might need further work.
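Rather than literally comparing every pair, grouping by the shared fields achieves the same result in one pass; a sketch of the approach (hypothetical row shape, and deliberately ignoring POS, as the script currently does):

```python
from collections import defaultdict

def find_tickable(ht_rows):
    """ht_rows: (lexeme_id, word, firstd, lastd, matched) tuples.

    Group lexemes sharing word + firstd + lastd; wherever at least
    one member of a group is already matched to an OED word, report
    the unmatched members as candidates for ticking off. Grouping
    avoids comparing each identical form with every other directly.
    """
    groups = defaultdict(list)
    for lex_id, word, firstd, lastd, matched in ht_rows:
        groups[(word, firstd, lastd)].append((lex_id, matched))
    tickable = []
    for members in groups.values():
        if any(m for _, m in members):
            tickable.extend(i for i, m in members if not m)
    return tickable
```

Adding POS to the grouping key would be the obvious way to tighten this up later.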
I’m going to be on holiday next week and away at a conference for most of the following week, so this is all from me for a while.
This was a week of many different projects, most of which needed fairly small jobs doing, though a few took up the bulk of my time. I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app. I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on. I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and had a chat with Rachel Macdonald about some further updates to the SPADE website. I made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies, had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly. I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley, and made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.
I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website. There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through. I also need to look into developing an overarching timeline with contextual events.
I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock. Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image. I updated the site so that users can no longer right click on an image to save or view it. This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images. If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image. I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.
I also spent a bit of time continuing to work on the Bilingual Thesaurus. I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously. This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’. I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me. I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”. My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”). I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one. If needs be I’ll change this. Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus. I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this. I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout. Next week I’ll see about starting to integrate the data itself.
This just leaves the big project of the week to discuss: the ongoing work to align the HT and OED datasets. I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week. Last week I ‘dematched’ 2412 categories that have the same parent category but don’t have a perfect lexeme-count match. I created a further script that checks how many lexemes in these potentially matched categories are the same. This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped). A percentage of the number of HT words that are matched is also displayed. If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green. If the number of HT words is the same as the total number of matches and the count of OED words differs from the number of HT words by 1 this is also considered a match, as is the reverse: the number of OED words equalling the total number of matches with the count of HT words differing from the number of OED words by 1. In total 1154 of the 2412 rows are considered matches.
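These colour-coding rules can be condensed into a small sketch (hypothetical names and a simplified stripping step; the real script works on database rows):

```python
def category_match(ht_words, oed_words):
    """Compare stripped lexeme sets of a candidate HT/OED category pair.

    'full' when every word on both sides matches; also accepted when
    one side fully matches and the other side's count differs by at
    most one, tolerating a single added or removed word.
    """
    strip = lambda w: "".join(c for c in w.lower() if c.isalnum())
    ht = {strip(w) for w in ht_words}
    oed = {strip(w) for w in oed_words}
    total = len(ht & oed)  # identical (stripped) words
    if total == len(ht) == len(oed):
        return "full"
    if total == len(ht) and abs(len(oed) - len(ht)) <= 1:
        return "ht-full"
    if total == len(oed) and abs(len(ht) - len(oed)) <= 1:
        return "oed-full"
    return "no"
```

Anything other than "no" would be counted among the 1154 accepted rows; the ±1 tolerance is what lets categories differing by a single word still pass.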
I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process. There are 1407 manual matches in the system. Of these:
- 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1 and 100% of HT words match OED words, or the categories are empty)
- There are 205 rows where all words match or the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1, or the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1
- There are 122 rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
- There are 18 part of speech mismatches
- There are 267 rows where nothing matches
I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches. The following patterns were attempted:
- inhabitant of the -> inhabitant
- inhabitant of -> inhabitant
- relating to -> pertaining to
- spec. -> specific
- spec -> specific
- specific -> specifically
- assoc. -> associated
- esp. -> especially
- north -> n.
- south -> s.
- january -> jan.
- march -> mar.
- august -> aug.
- september -> sept.
- october -> oct.
- november -> nov.
- december -> dec.
- Levenshtein difference of 1
- Adding ‘ing’ onto the end
The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum. Where there is a matching category number, the lexeme-count / last-lexeme / total-matched-lexeme checks described above are applied and rows are colour coded accordingly.
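Putting the pieces together, the matching cascade might be sketched like this in Python; the pattern list is abbreviated and all the names are mine, not the real script’s.

```python
# Abbreviated rewrite patterns -- the full list is given above.
PATTERNS = [
    ("inhabitant of the", "inhabitant"),
    ("inhabitant of", "inhabitant"),
    ("relating to", "pertaining to"),
    ("spec.", "specific"),
    ("esp.", "especially"),
]

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def headings_match(oed_heading, ht_heading):
    """Try exact, pattern-rewritten, distance-1 and '-ing' matches in turn."""
    if oed_heading == ht_heading:
        return True
    for old, new in PATTERNS:
        if oed_heading.replace(old, new) == ht_heading:
            return True
    if levenshtein(oed_heading, ht_heading) == 1:
        return True
    return oed_heading + "ing" == ht_heading  # e.g. 'wash' vs 'washing'
```

Each OED heading is run through the cascade against the HT heading at the same catnum, and the kind of match found determines which tally (pattern, Levenshtein or ‘ing’) it lands in.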
On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week. It feels like real progress is being made.