This was a four-day week for me as I’d taken Friday off: it was an in-service day at my son’s school before next week’s half-term, which I’ve also taken off. I had rather a lot to get done before my holiday so it was a pretty intense week, split mostly between the Historical Thesaurus and the Anglo-Norman Dictionary.
For the Historical Thesaurus I continued with the preparations for the second edition, starting off by creating a little statistics page that lists all of the words and categories that have been updated for the second edition and the changelog codes that have been applied to them. Marc had sent a list of all of the category number sequences that we have updated so I then spent a bit of time updating the database to apply the changelog codes to all of these categories. It turns out that almost 200,000 categories have been revised and relocated (out of about 235,000) so it’s pretty much everything. At our meeting last week we had proposed updating the ‘new OED words’ script I’d written last week to separate out some potential candidates into an eighth spreadsheet (these are words that have a slash in them, which now get split up on the slash and each part is compared against the HT’s search words table to see whether they already exist). Whilst working through some of the other tasks I realised that I hadn’t included the unique identifiers for OED lexemes in the output, which was going to make it a bit difficult to work with the files programmatically, especially since there are some occasions where the OED has two identical lexemes in a category. I therefore updated my script and regenerated the output to include the lexeme ID, making it possible to differentiate identical lexemes and also making it easier to grab dates for the lexeme in question.
The issue of there being multiple identical lexemes in an OED category was a strange one. For example, one category had two ‘Amber pudding’ lexemes. I wrote a script that extracted all of these duplicates and there are possibly a hundred or so of them, and also other OED lexemes that appear to have no associated dates. I passed these over to Marc and Fraser for them to have a look at. After that I worked on a script to go through each of the almost 12,000 lexemes that we have identified as OED lexemes that are definitely not present in the HT data, extract their OED dates and then format these as HT dates.
The script generates date entries as they would be added to the HT lexeme dates table (used for timelines), the HT fulldate field (used for display) and the HT firstdate and lastdate fields (used for searching). Dates earlier than 1150 are stored as their actual values in the dates table, but are stored as ‘650’ in the ‘firstdate’ field and are displayed as ‘OE’ in the ‘fulldate’. Dates after 1945 are stored as ‘9999’ in both the dates table and the ‘lastdate’ field. Where there is a ‘yearend’ in the OED date (i.e. the date is a range) this is stored as the ‘year_b’ in the HT date and appears after a slash in the ‘fulldate’, following the rules for slashes. If the date is the last date then the ‘year_b’ is used as the HT lastdate. If the ‘year_b’ is after 1945 but the ‘year’ isn’t then ‘9999’ is used. So for example ‘maiden-skate’ has a last date of ‘1880/1884’, which appears in the ‘fulldate’ as ‘1880/4’ and the ‘lastdate’ is ‘1884’. Where there is a gap of more than 150 years between dates the connector between dates is a dash and where the gap is less than this it is a plus. One thing that needed further work was how we handle multiple post-1945 dates. In my initial script if there are multiple post-1945 dates then only one of these is carried over as an HT date, and it’s set to ‘9999’. This is because all post-1945 dates are stored as ‘9999’ and having several of these didn’t seem to make sense and confused the generation of the fulldate. There was also an issue with some OED lexemes only having dates after 1945. In my first version of the script these ended up with only one HT date entry of 9999 and 9999 as both firstdate and lastdate, and a fulldate consisting of just a dash, which was not right. After further discussion with Marc I updated the script so that in such cases the date information that is carried over is the first date (even if it’s after 1945) and a dash to show that it is current.
For example, ‘ecoregion’ previously had a ‘full date’ of ‘-’, one HT date of ‘9999’ and a start date of ‘9999’ and in the updated output has a full date of ‘1962-’, two HT dates and a start date of 1962. Where a lexeme has a single date this also now has a specific end date rather than it being ‘9999’. I passed the output of the script over to Marc and Fraser for them to work with whilst I was on holiday.
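To make the basic date rules above a little more concrete, here is a minimal Python sketch (it is not the actual script). The function names are invented for illustration, and two things are assumptions: that the ‘rules for slashes’ amount to dropping the leading digits the two years share (which reproduces the ‘1880/4’ example but may not cover every case), and that a gap of exactly 150 years gets a plus. It deliberately leaves out the special handling of multiple or exclusively post-1945 dates.

```python
def format_year(year, year_b=None):
    """Format a single date as it would appear in the 'fulldate' string."""
    if year < 1150:
        return "OE"  # pre-1150 dates are displayed as 'OE'
    if year_b is None:
        return str(year)
    a, b = str(year), str(year_b)
    # Assumption: the slash rule drops the leading digits both years share,
    # always keeping at least the final digit of year_b.
    i = 0
    while i < len(a) and i < len(b) - 1 and a[i] == b[i]:
        i += 1
    return f"{a}/{b[i:]}"


def connector(previous_year, next_year):
    """Dash for a gap of more than 150 years, otherwise a plus."""
    return "-" if next_year - previous_year > 150 else "+"


def search_first_date(year):
    """Value for the 'firstdate' search field: pre-1150 dates become 650."""
    return 650 if year < 1150 else year


def search_last_date(year, year_b=None):
    """Value for the 'lastdate' search field: post-1945 dates become 9999."""
    last = year_b if year_b is not None else year
    return 9999 if last > 1945 else last


print(format_year(1880, 1884), search_last_date(1880, 1884))
```

With the ‘maiden-skate’ last date of 1880/1884 this gives a display form of ‘1880/4’ and a ‘lastdate’ of 1884, matching the example above.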
For the Anglo-Norman Dictionary I continued to work on the entry page. I added in the cognate references (i.e. references to other dictionaries), which proved to be rather tricky due to the way they have been structured in the Editors’ XML files (in the current live site the cognate references are stored in a separate hash file and are somehow injected into the entry page when it is generated, but we wanted to rationalise this so that the data that appears on the site is all contained in the Editors’ XML where possible). The main issue was with how the links to other online dictionaries were stored, as it was not entirely clear how to generate actual links to specific pages in these resources from them. This was especially true for links to FEW (I have no idea what FEW stands for as the acronym doesn’t appear to be expanded anywhere, even on the FEW website).
They appear in the Editors’ XML like this:
<FEW_refs siglum="FEW" linkable="yes"><link_form>A</link_form><link_loc>24,1a</link_loc></FEW_refs>
Which ends up linking to here: https://apps.atilf.fr/lecteurFEW/lire/volume/240/page/1
<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>
Which ends up linking to here: https://apps.atilf.fr/lecteurFEW/lire/volume/90/page/231
Based on this my script for generating links needed to:
- Store the base URL https://apps.atilf.fr/lecteurFEW/lire/volume
- Split the <link_loc> on the comma
- Multiply the part before the comma by 10 (so 24 becomes 240, 9 becomes 90 etc)
- Strip out any non-numeric character from the part after the comma (i.e. getting rid of ‘a’ and ‘b’)
- Generate the full URL, such as https://apps.atilf.fr/lecteurFEW/lire/volume/240/page/1 using these two values.
After discussion with Heather and Geert at the AND it turned out to be even more complicated than this, as some of the references are further split into subvolumes using a slash and a Roman numeral, so we have things like ‘15/i,108b’ which then needs to link to https://apps.atilf.fr/lecteurFEW/lire/volume/151/page/108. It took some time to write a script that could cope with all of these quirks, but I got there in the end.
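The link-generation steps, including the subvolume quirk, can be sketched in a few lines of Python. This is an illustration rather than the script itself: the function name and the Roman numeral lookup table are invented, and the rule that the subvolume number is simply added to the volume number multiplied by ten is inferred from the single ‘15/i’ example, so it may not hold for every reference.

```python
import re

FEW_BASE_URL = "https://apps.atilf.fr/lecteurFEW/lire/volume"

# Assumption: subvolumes only use small Roman numerals; extend as needed.
ROMAN_SUBVOLUMES = {"i": 1, "ii": 2, "iii": 3}


def few_url(link_loc):
    """Build a FEW reader URL from a <link_loc> value such as '24,1a'."""
    volume_part, page_part = link_loc.split(",", 1)
    # Split off an optional subvolume, e.g. '15/i' -> ('15', 'i')
    volume, _, subvolume = volume_part.partition("/")
    volume_number = int(volume) * 10
    if subvolume:
        # Assumption: '15/i' maps to 151, i.e. volume * 10 + subvolume
        volume_number += ROMAN_SUBVOLUMES[subvolume.strip().lower()]
    # Strip column letters such as 'a' and 'b' from the page reference
    page = re.sub(r"\D", "", page_part)
    return f"{FEW_BASE_URL}/{volume_number}/page/{page}"


for loc in ("24,1a", "9,231b", "15/i,108b"):
    print(loc, "->", few_url(loc))
```

An unrecognised subvolume raises a `KeyError` here, which seems preferable to silently generating a link to the wrong volume.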
Also this week I updated the citation dates so they now display their full information with ‘MS:’ where required and superscript text. I then finished work on the commentaries, adding in all of the required formatting (bold, italic, superscript etc) and links to other AND entries and out to other dictionaries. Where the commentaries are longer than a few lines they are cut off and an ‘expand’ button is shown. I also updated the ‘Results’ tab so it shows you the number of results in the tab header and have added in the ‘entry log’ feature that tracks which entries you have looked at in a session. The number of these also appears in the tab header and I’m personally finding it a very useful feature as I navigate around the entries for test purposes. The log entries appear in the order you opened them and there is no scrolling of entries as I would imagine most people are unlikely to have more than 20 or so listed. You can always clear the log by pressing on the ‘Clear’ button. I also updated the entry page so that the cross references in the ‘browse’ now work. If the entry has a single cross reference then this is automatically displayed when you click on its headword in the ‘browse’, with a note at the top of the page stating it’s a cross reference. If the entry has multiple cross references these are not all displayed but instead links to each entry are displayed. There are two reasons for this: Firstly, displaying multiple entries can result in long and complicated pages that may be hard to navigate; secondly, the entry page as it currently stands was designed to display one entry, and uses HTML IDs to identify certain elements. An HTML ID must be unique on a page so if multiple entries were displayed things would break. There is still a lot of work to do on the site, but the entry page is at least nearing completion. Below is a screenshot showing the entry log, the cognate references and the commentary complete with formatting and the ‘Expand’ option:
I also worked on some other projects this week. For Books and Borrowing I set up a user account for a volunteer and talked her through getting access to the system. For the Mull / Ulva site I automatically generated historical forms for all of the place-names that had come from the GB1900 crowdsourced data. These are now associated with the ‘OS 6 inch 2nd edn’ source and about 1670 names have been updated, although many of these are abbreviations like ‘F.P.’. I also updated the database and the CMS to incorporate a new field for deciding which ‘front-end’ the place-name will be displayed on. This is a drop-down list that can be selected when adding or editing a place-name, allowing you to choose from ‘Both’, ‘Mull / Ulva only’ and ‘Iona only’. There is still a further option for stating whether the place-name appears on the website or not (‘on website: Y/N’) so it will be possible to state that a place-name is associated with one project but shouldn’t appear on that project’s website. I also updated the search option on the ‘Browse placenames’ page to allow a user to limit the displayed placenames to those that have ‘front-end display’ set to one of the options. Currently all place-names are set to ‘Mull / Ulva only’. With this all in place I then created user accounts for the CMS for all of the members of the Iona project team who will be using this CMS to work with the data. I also made a few further tweaks to the search results page of the DSL. After all of this I was very glad to get away for a holiday.
I’d taken Thursday and Friday off this week as it was the Glasgow September Weekend holiday, meaning this was a three-day week for me. It was a week where focussing on any development tasks was rather tricky as I had four Zoom calls and a dentist’s appointment on the other side of the city during my three working days.
On Monday I had a call with the Historical Thesaurus people to discuss the ongoing task of integrating content from the OED for the second edition. There’s still rather a lot to be done for this, and we’re needing to get it all complete during October, so things are a little stressful. After the meeting I made some further updates to the display of icons signifying a second edition update. I updated the database and front-end to allow categories / subcats to have a changelog (in addition to words). These appear in a transparent circle with a white border and a white number, right aligned. I also updated the display of the icon for words. These also appear as a transparent circle, right aligned, but have the teal colour for a border and the number. I also realised I hadn’t added in the icons for words in subcats, so put these in place too.
After that I set about updating the dates of HT lexemes based on some rules that Fraser had developed. I created and ran scripts that updated the start dates of 91,364 lexemes based on OED dates and then ran a further script that updated the end dates of 157,156 lexemes. These took quite a while to run (the latter I was dealing with during my time off) but it’s good that progress is being made.
My second Zoom call of the week was for the Books and Borrowing project, and was with the project PI and Co-I and someone who is transcribing library records from a private library that we’re now intending to incorporate into the project’s system. We discussed the data and the library and made a plan for how we’re going to work with the data in future. My third and fourth Zoom calls were for the new Place-names of Iona project that is just starting up. It was a good opportunity to meet the rest of the project team (other than the RA who has yet to be appointed) and discuss how and when tasks will be completed. We’ve decided that we’ll use the same content management system as the one I already set up for the Mull and Ulva project, as this already includes Iona data from the GB1900 project. I’ll need to update the system so that we can differentiate place-names that should only appear on the Iona front-end, the Mull and Ulva front-end or both. This is because for Iona we are going to be going into much more detail, down to individual monuments and other ‘microtoponyms’ whereas the names in the Mull and Ulva project are much more high level.
For the rest of my available time this week I made some further updates to the script I wrote last week for Fraser’s Scots Thesaurus project, ordering the results by part of speech and ensuring that hyphenated words are properly searched for (as opposed to being split into separate words joined by an ‘or’). I also spent some time working for the DSL people, firstly updating the text on the search results page and secondly tracking down the certificate for the Android version of the School Dictionary app. This was located on my PC at work, so I had arranged to get access to my office whilst I was already in the West End for my dentist’s appointment. Unfortunately what I thought was the right file turned out to be the certificate for an earlier version of the app, meaning I had to travel all the way back to my office again later in the week (when I was on holiday) to find the correct file.
I also managed to find a little time to continue to work on the new Anglo-Norman Dictionary site, continuing to work on the display of the ‘entry’ page. I updated my XSLT to ensure that ‘parglosses’ are visible and that cross reference links now appear. Explanatory labels are also now in place. These currently appear with a grey background but eventually these will be links to the label search results page. Semantic labels are also now in place and also currently have a grey background but will be links through to search results. However, the System XML notes whether certain semantic labels should be shown or not. So, for example <label type="sem" show="no">med.</label> doesn’t get shown. Unfortunately there is nothing comparable in the Editors’ XML (it’s just <semantic value="med."/>) so I can’t hide such labels. Finally, the initials of the editor who made the last update now appear in square brackets to the right of the end of the entry.
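As an aside, the ‘show’ check that the System XML makes possible is easy to illustrate. The sketch below is in Python rather than the XSLT the site actually uses, and the wrapping `<entry>` element, the function name and the second label are invented for the example; only the `<label type="sem" show="no">` form comes from the System XML described above.

```python
import xml.etree.ElementTree as ET


def visible_semantic_labels(entry_xml):
    """Return the text of semantic labels that are not marked show="no"."""
    root = ET.fromstring(entry_xml)
    return [
        label.text
        for label in root.iter("label")
        if label.get("type") == "sem" and label.get("show") != "no"
    ]


# Hypothetical fragment: only the first label's form is taken from the post.
sample = (
    "<entry>"
    '<label type="sem" show="no">med.</label>'
    '<label type="sem">law</label>'
    "</entry>"
)
print(visible_semantic_labels(sample))
```

Because the Editors’ XML only has `<semantic value="med."/>`, with no equivalent of the ‘show’ attribute, a filter like this has nothing to test against there.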
Also, my new PC was delivered on Thursday and I spent a lot of time over the weekend transferring all of my data and programs across from my old PC.
I spent week 9 of Lockdown continuing to implement the content management system for the Books and Borrowing project. I was originally hoping to have completed an initial version of the system by the end of this week, but this was unfortunately not possible due to having to juggle work and home-schooling, commitments to other projects and the complexity of the project’s data. It took several days to complete the scripts for uploading a new borrowing record due to the interrelated nature of the data structure. A borrowing record can be associated with one or more borrowers, and each of these may be new borrower records or existing ones, meaning data needs to be pulled in via an autocomplete to prepopulate the section of the form. Books can also be new or existing records but can also have one or more new or existing book item records (as a book may have multiple volumes) and may be linked to one or more project-wide book edition records which may already exist or may need to be created as part of the upload process, and each of these may be associated with a new or existing top-level book work record. Therefore the script for uploading a new borrowing record needs to incorporate the ‘add’ and ‘edit’ functionality for a lot of associated data as well. However, as I have implemented all of these aspects of the system now it will make it quicker and easier to develop the dedicated pages for adding and editing borrowers and the various book levels once I move onto this. I still haven’t worked on the facilities to add in book authors, genres or borrower occupations, which I intend to move onto once the main parts of the system are in place.
After completing the scripts for processing the display of the ‘add borrowing’ form and the storing of all of the uploaded data I moved onto the script for viewing all of the borrowing records on a page. Due to the huge number of potential fields I’ve had to experiment with various layouts, but I think I’ve got one that works pretty well, which displays all of the data about each record in a table split into four main columns (Borrowing, Borrower, Book Holding / Items, Book Edition / Works). I’ve also added in a facility to delete a record from the page. I then moved on to the facility to edit a borrowing record, which I’ve added to the ‘view’ page rather than linking out to a separate page. When the ‘edit’ button is pressed for a record its row in the table is replaced with the ‘edit’ form, which is identical in style and functionality to the ‘add’ form, but is prepopulated with all of the record’s data. As with the ‘add’ form, it’s possible to associate multiple borrowers and book items and editions, and also to manage the existing associations using this script. The processing of the form uses the same logic as the ‘add’ script so thankfully didn’t require much time to implement.
What I still need to do is add authors and borrower occupations to the ‘view page’, ‘add record’ and ‘edit record’ facilities, add the options to view / edit / add / delete a library’s book holdings and borrowers independently of the borrowing records, plus facilities to manage book editions / works, authors, genres and occupations at the top level as opposed to when working on a record. I also still need to add in the facilities to view / zoom / pan a page image and add in facilities to manage borrower cross-references. This is clearly quite a lot, but the core facilities of adding, editing and deleting borrowing, borrower and book records are now in place, which I’m happy about. Next week I’ll continue to work on the system ahead of the project’s official start date at the beginning of June.
Also this week I made a few tweaks to the interface for the Place-names of Mull and Ulva project, spoke to Matthew Creasy some more about the website for his new project, spoke to Jennifer Smith about the follow-on funding proposal for the SCOSYA project and investigated an issue that was affecting the server that hosts several project websites (basically it turned out that the server had run out of disk space).
I also spent some time working on scripts to process data from the OED for the Historical Thesaurus. Fraser is working on incorporating new dates from the OED and needs to work out which dates in the HT data we want to replace and which should be retained. The script makes groups of all of the distinct lexemes in the OED data. If the group has two or more lexemes it then checks that at least one of them is revised. It then makes subgroups of all of the lexemes that have the same date (so for example all the ‘Strike’ words with the same ‘sortdate’ and ‘lastdate’ are grouped together). If one word in the whole group is ‘revised’ and at least two words have the same date then the words with the same dates are displayed in the table. The script also checks for matches in the HT lexemes (based on catid, refentry, refid and lemmaid fields). If there is a match this data is also displayed. I then further refined the output based on feedback from Fraser, firstly highlighting in green those rows where at least two of the HT dates match, and secondly splitting the table into three separate tables: one with the green rows, one containing all other OED lexemes that have a matching HT lexeme and a third containing OED lexemes that (as yet) do not have a matching HT lexeme.
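The grouping logic is probably easier to follow as code, so here is a rough Python sketch of the first part (the grouping and subgrouping, not the HT matching or the table output). It is not the script itself: the function name is invented, the `lemma` and `revised` keys are hypothetical stand-ins for whatever the OED data actually calls those fields, and the sample rows are made up. Only ‘sortdate’ and ‘lastdate’ are names taken from the description above.

```python
from collections import defaultdict


def shared_date_groups(lexemes):
    """Find lexemes that share a form and dates, where one form is revised.

    Returns {lemma: [subgroup, ...]}, each subgroup being a list of two or
    more lexemes with the same (sortdate, lastdate) pair.
    """
    by_lemma = defaultdict(list)
    for lexeme in lexemes:
        by_lemma[lexeme["lemma"]].append(lexeme)

    result = {}
    for lemma, group in by_lemma.items():
        # Need at least two lexemes, at least one of which is revised
        if len(group) < 2 or not any(lex["revised"] for lex in group):
            continue
        by_dates = defaultdict(list)
        for lex in group:
            by_dates[(lex["sortdate"], lex["lastdate"])].append(lex)
        shared = [sub for sub in by_dates.values() if len(sub) >= 2]
        if shared:
            result[lemma] = shared
    return result


# Made-up sample rows, purely for illustration
sample = [
    {"lemma": "strike", "sortdate": 1300, "lastdate": 1850, "revised": True},
    {"lemma": "strike", "sortdate": 1300, "lastdate": 1850, "revised": False},
    {"lemma": "strike", "sortdate": 1500, "lastdate": 1900, "revised": False},
    {"lemma": "blow", "sortdate": 1400, "lastdate": 1700, "revised": False},
    {"lemma": "blow", "sortdate": 1400, "lastdate": 1700, "revised": False},
]
print(shared_date_groups(sample))
```

In the sample only ‘strike’ is returned: the two ‘blow’ rows share dates but neither is revised, so that group is skipped.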
This was week 8 of Lockdown and I spent the majority of it working on the content management system for the Books and Borrowing project. The project is due to begin at the start of June and I’m hoping to have the CMS completed and ready to use by the project team by then, although there is an awful lot to try and get into place. I can’t really go into too much detail about the CMS, but I have completed the pages to add a library and to browse a list of libraries with the option of deleting a library if it doesn’t have any ledgers. I’ve also done quite a lot with the ‘View library’ page. It’s possible to edit a library record, add a ledger and add / edit / delete additional fields for a library. You can also list all of the ledgers in a library with options to edit the ledger, delete it (if it contains no pages) and add a new page to it. You can also display a list of pages in a ledger, with options to edit the page or delete it (if it contains no records). You can also open a page in the ledger and browse through the next and previous pages.
At the moment I’m in the middle of creating the facility to add a new borrowing record to the page. This is the most complex part of the system as a record may have multiple borrowers, each of which may have multiple occupations, and multiple books, each of which may be associated with higher level book records. Plus the additional fields for the library need to be taken into consideration too. By the end of the week I was at the point of adding in an auto-complete to select an existing borrower record and I’ll continue with this on Monday.
In addition to the B&B project I did some work for other projects as well. For Thomas Clancy’s Place-names of Kirkcudbrightshire project (now renamed Place-names of the Galloway Glens) I had a few tweaks and updates to put in place before Thomas launched the site on Tuesday. I added a ‘Search place-names’ box to the right-hand column of every non-place-names page which takes you to the quick search results page and I added a ‘Place-names’ menu item to the site menu, so users can access the place-names part of the site. Every place-names page now features a sub-menu with access to the place-names pages (Browse, element glossary, advanced search, API, quick search). To return to the place-name introductory page you can press on the ‘Place-names’ link in the main menu bar. I had unfortunately introduced a bug to the ‘edit place-name’ page in the CMS when I changed the ordering of parishes to make KCB parishes appear first. This was preventing any place-names in BMC from having their cross references, feature type and parishes saved when the form was submitted. This has now been fixed. I also added Google Analytics to the site. The virtual launch on Tuesday went well and the site can now be accessed here: https://kcb-placenames.glasgow.ac.uk/.
I also added in links to the DSL’s email and Instagram accounts to the footer of the DSL site and added some new fields to the database and CMS of the Place-names of Mull and Ulva site. I also created a new version of the Burns Supper map for Paul Malgrati that included more data and a new field for video dimensions that the video overlay now uses. I also replied to Matthew Creasy about a query regarding the website for his new Scottish Cosmopolitanism project and a query from Jane Roberts about the Thesaurus of Old English and made a small tweak to the data of Gerry McKeever’s interactive map for Regional Romanticism.
Week seven of lockdown continued in much the same fashion as the preceding weeks, the only difference being Friday was a holiday to mark the 75th anniversary of VE day. I spent much of the four working days on the development of the content management system for the Books and Borrowing project. The project RAs will start using the system in June and I’m aiming to get everything up and running before then so this is my main focus at the moment. I also had a Zoom meeting with project PI Katie Halsey and Co-I Matt Sangster on Tuesday to discuss the requirements document I’d completed last week and the underlying data structures I’d defined in the weeks before. Both Katie and Matt were very happy with the document, although Matt had a few changes he wanted made to the underlying data structures and the CMS. I made the necessary changes to the data design / requirements document and the project’s database that I’d set up last week. The changes were:
- Borrowing spans have now been removed from libraries and these will instead be automatically inferred based on the start and end dates of ledger records held in these libraries.
- Ledgers now have a new ‘ledger type’ field which currently allows the choice of ‘Professorial’, ‘Student’ or ‘Town’. This field will allow borrowing spans for libraries to be altered based on a selected ledger type.
- The way occupations for borrowers are recorded has been updated to enable both original occupations from the records and a normalised list of occupations to be recorded. Borrowers may not have an original occupation but still might have a standardised occupation so I’ve decided to use the occupations table as previously designed to hold information about standardised occupations. A borrower may have multiple standardised occupations. I have also added a new ‘original occupation’ field to the borrower record where any number of occupations found for the borrower in the original documentation (e.g. river watcher) can be added if necessary.
- The book edition table now has an ‘other authority URL’ field and an ‘other authority type’ field which can be used if ESTC is not appropriate. The ‘type’ currently features ‘Worldcat’, ‘CERL’ and ‘Other’.
- ‘Language’ has been moved from Holding to Edition.
- In Book Holding the short title is now original title and long title is now standardised title, while the place and date of publication fields have been removed as the comparable fields at Edition level will be sufficient.
In terms of the development of the CMS, I created a Bootstrap-based interface for the system, which currently just uses the colour scheme I used for Matt’s pilot 18th Century Borrowing project. I created the user authentication scripts and the menu structure and then started to create the actual pages. So far I’ve created a page to add a new library record and all of the information associated with a library, such as any number of sources. I then created the facility to browse and delete libraries and the main ‘view library’ page, which will act as a hub through which all book and borrowing records associated with the library will be managed. This page has a further tab-based menu with options to allow the RA to view / add ledgers, additional fields, books and borrowers, plus the option to edit the main library information. So far I’ve completed the page to edit the library information and have started work on the page to add a ledger. I’m making pretty good progress with the CMS, but there is still a lot left to do. Here’s a screenshot of the CMS if you’re interested in how it looks:
Also this week I had a Zoom meeting with Marc Alexander and Fraser Dallachy to discuss updates to the Historical Thesaurus as we head towards a second edition. This will include adding in new words from the OED and new dates for existing words. My new date structure will also go live, so there will need to be changes to how the timelines work. Marc is hoping to go live with new updates in August. We also discussed the ‘guess the category’ quiz, with Marc and Fraser having some ideas about limiting the quiz to certain categories, or excluding other categories that might feature inappropriate content. We may also introduce a difficulty level based on date, with an ‘easy’ version only containing words that were in use for a decent span of time in the past 200 years.
Other work I did this week included making some tweaks to the data for Gerry McKeever’s interactive map, fixing an issue with videos continuing to play after the video overlay was closed for Paul Malgrati’s Burns Supper map, replying to a query from Alasdair Whyte about his Place-names of Mull and Ulva project and looking into an issue for Fraser’s Scots Thesaurus project which unfortunately I can’t do anything about as the scripts I’d created for this (which needed to be left running for several days) are on the computer in my office. If this lockdown ever ends I’ll need to tackle this issue then.
This was a three-day week for me as I was participating in the UCU strike action on Thursday and Friday. I spent the majority of Monday to Wednesday tying up loose ends and finishing off items on my ‘to do’ list. The biggest task I tackled was to relaunch the Digital Humanities at Glasgow site. This involved removing all of the existing site and moving the new site from its test location to the main URL. I had to write a little script that changed all of the image URLs so they would work in the new location and I needed to update WordPress so it knew where to find all of the required files. Most of the migration process went pretty smoothly, but there were some slightly tricky things, such as ensuring the banner slideshow continued to work. I also needed to tweak some of the static pages (e.g. the ‘Team’ and ‘About’ pages) and I added in a ‘contact us’ form. I also put in redirects from all of the old pages so that any bookmarks or Google links will continue to work. As of yet I’ve had no feedback from Marc or Lorna about the redesign of the site, so I can only assume that they are happy with how things are looking. The final task in setting up the new site was to migrate my blog over from its old standalone URL to become integrated in the new DH site. I exported all of my blog posts and categories and imported them into the new site using WordPress’s easy to use tools, and that all went very smoothly. The only thing that didn’t transfer over using this method was the media. All of the images embedded in my blog posts still pointed to image files located on the old server so I had to manually copy these images over and then I wrote a script that went through every blog post and found and replaced all image URLs. All now appears to be working, and this is the first blog post I’m making using the new site so I’ll know for definite once I add this post to the site.
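For anyone facing a similar migration, the find-and-replace step for image URLs might look something like the following Python sketch. It is only an illustration of the idea: the function name and both base URLs are invented, it only handles double-quoted `src` attributes on `<img>` tags, and the real script worked through the posts stored in WordPress rather than on a string like this.

```python
import re


def rewrite_image_urls(post_html, old_base, new_base):
    """Point <img> src attributes at new_base instead of old_base.

    Other links (e.g. <a href>) are deliberately left untouched.
    """
    pattern = re.compile(r'(<img\b[^>]*\bsrc=")' + re.escape(old_base))
    # A function replacement avoids backslash-escape surprises in new_base
    return pattern.sub(lambda match: match.group(1) + new_base, post_html)


# Hypothetical URLs, for illustration only
OLD = "https://old-blog.example/wp-content/uploads/"
NEW = "https://new-site.example/wp-content/uploads/"

post = (
    '<p><img class="wp" src="https://old-blog.example/wp-content/uploads/a.png">'
    ' <a href="https://old-blog.example/wp-content/uploads/a.png">full size</a></p>'
)
print(rewrite_image_urls(post, OLD, NEW))
```

Whether links to the full-size images should also be rewritten depends on whether the old server will stay online; the sketch leaves them alone.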
Other than setting up this new resource I made a further tweak to the new data for Matthew Sangster’s 18th century student borrowing records that I was working on last week. I had excluded rows from his spreadsheet that had ‘Yes’ in the ‘omit’ column, assuming that these were not to be processed by my upload script, but actually Matt wanted these to be in the system and displayed when browsing pages, but omitted from any searches. I therefore updated the online database to include a new ‘omit’ column, updated my upload script to only process these omitted rows and then changed the search facilities to ignore any rows that have a ‘Y’ in the ‘omit’ column.
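The browse / search distinction can be shown with a tiny self-contained example. This is a sketch using Python’s built-in sqlite3 module and an invented `records` table with made-up rows; the real system’s database, table and column names (other than ‘omit’) are not given above, so everything else here is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " id INTEGER PRIMARY KEY,"
    " page INTEGER NOT NULL,"
    " title TEXT NOT NULL,"
    " omit TEXT NOT NULL DEFAULT 'N')"
)
# Made-up rows: the second is flagged for omission from searches
conn.executemany(
    "INSERT INTO records (page, title, omit) VALUES (?, ?, ?)",
    [
        (1, "History of Scotland", "N"),
        (1, "History of England", "Y"),
        (2, "Sermons", "N"),
    ],
)


def browse_page(conn, page):
    """Browsing shows every row on the page, omitted or not."""
    rows = conn.execute(
        "SELECT title FROM records WHERE page = ? ORDER BY id", (page,)
    )
    return [row[0] for row in rows]


def search_titles(conn, term):
    """Searching ignores any row with 'Y' in the omit column."""
    rows = conn.execute(
        "SELECT title FROM records"
        " WHERE title LIKE ? AND omit <> 'Y' ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


print(browse_page(conn, 1))
print(search_titles(conn, "History"))
```

Keeping the flag as a column rather than dropping the rows at upload time means the decision can be reversed later without re-importing the spreadsheet.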
I also responded to a query from Alasdair Whyte regarding parish boundaries for his Place-names of Mull and Ulva project, and investigated an issue that Marc Alexander was experiencing with one of the statistics scripts for the HT / OED matching process (it turned out to be an issue with a bad internet connection rather than an issue with the script). I’d had a request from Paul Malgrati in Scottish Literature about creating an online resource that maps Burns Suppers so I wrote a detailed reply discussing the various options.
I also spent some time fixing the issue of citation links not displaying exactly as the DSL people were wanting in the DSL website. This was surprisingly difficult to implement because the structure can vary quite a bit. The ID needed for the link is associated with the ‘cref’ that wraps the whole reference, but the link can’t be applied to the full contents as only authors and titles should be links, not geo and date tags or other non-tagged text that appears in the element. There may be multiple authors or no author so sometimes the link needs to start before the first (and only the first) author whereas other times the link needs to start before the title. As there is often text after the last element that needs to be linked the closing tag of the link can’t just be appended to the text but instead the script needs to find where this last element ends. However, it looks like I’ve figured out a way to do it that appears to work.
I devoted a few spare hours on Wednesday to investigating the Anglo-Norman Dictionary. Heather was trying to figure out where on the server the dataset that the website uses is located, and after reading through the documentation I managed to figure out that the data is stored in a Berkeley DB and it looks like the data the system uses is stored in a file called ‘entry_hash’. There is a file with this name in ‘/var/data’ and it’s just over 200MB in size, which suggests it contains a lot of data. Software that can read this file can be downloaded from here: https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html and the Java edition should work on a Windows PC if it has Java installed. Unfortunately you have to register to access the files and I haven’t done so yet, but I’ve let Heather know about this.
I then experimented with setting up a simple AND website using the data from ‘all.xml’, which (more or less) contains all of the data for the dictionary. My test website consists of a database, one script that goes through the XML file and inserts each entry into the database, and one script that displays a page allowing you to browse and view all entries. My AND test is really very simple (and is purely for test purposes) – the ‘browse’ from the live site (which can be viewed here: http://www.anglo-norman.net/gate/) is replicated, only it currently displays in its entirety, which makes the page a little slow to load. Cross references are in yellow, main entries in white. Click on a cross reference and it loads the corresponding main entry, click on a regular headword to load its entry. Currently only some of the entry page is formatted, and some elements don’t always display correctly (e.g. the position of some notes). The full XML is displayed below the formatted text. Here’s an example:
Clearly there would still be a lot to do to develop a fully functioning replacement for AND, but it wouldn’t take a huge amount of time (I’d say it could be measured in days not weeks). It just depends whether the AND people want to replace their old system.
I’ll be on strike again next week from Monday to Wednesday.
I worked on several different projects this week. One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus. Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels. This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates. I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database. Where this happens I added the text ‘full date mismatch’ with a red background at the end of the date’s section in my script.
Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field. Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
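The round-trip check described above boils down to something like the following. This is a minimal Python sketch rather than the real script: `generate_fulldate` stands in for the date-processing code (which is not reproduced here) and the dictionary keys are assumptions about how a lexeme row might be represented.

```python
from typing import Callable, Dict, Iterable, List


def find_fulldate_mismatches(
    lexemes: Iterable[Dict[str, str]],
    generate_fulldate: Callable[[Dict[str, str]], str],
) -> List[Dict[str, str]]:
    """Return one report row per lexeme whose regenerated fulldate differs.

    generate_fulldate is a hypothetical callable that rebuilds the display
    string from the individual date fields of a lexeme.
    """
    mismatches = []
    for lexeme in lexemes:
        generated = generate_fulldate(lexeme)
        # Only an exact string match counts as success.
        if generated != lexeme["fulldate"]:
            mismatches.append(
                {
                    "htid": lexeme["htid"],
                    "word": lexeme["word"],
                    "original": lexeme["fulldate"],
                    "generated": generated,
                }
            )
    return mismatches
```

Requiring an exact match is deliberately strict: it flags errors in the stored date fields as well as bugs in the generator.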
The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say. Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question. I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels). The remaining rows can mostly be split into the following groups:
- Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
- Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number, another lexeme has ‘1340c’ instead of ‘c1340’.
- Corrections made to the original fulldate that were not replicated in the actual date columns. E.g. ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-’ in the ‘con’ column.
- Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
- Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E.g. a lexeme with the fulldate ‘1865 + 1865 rare’.
- Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’. This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date). I tried fixing this but ended up breaking other things so this is something that will need manual intervention. I don’t think it occurs very often, though. It’s a shame the same symbol was used to mean two different things.
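For the first group in the list above (strings that look identical but don't compare as equal), one way to track down the culprit is to compare the two strings character by character and print the Unicode name of anything that differs. The sketch below is a Python illustration of that idea, not something taken from my script, and the non-breaking space in the usage example is just a guess at the sort of character that might be involved.

```python
import unicodedata
from typing import List, Optional


def describe_char(ch: Optional[str]) -> str:
    """Render a character as its code point and Unicode name."""
    if ch is None:
        return "<end of string>"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


def describe_differences(a: str, b: str) -> List[str]:
    """List every position at which two strings differ."""
    report = []
    for i in range(max(len(a), len(b))):
        ca = a[i] if i < len(a) else None
        cb = b[i] if i < len(b) else None
        if ca != cb:
            report.append(f"{i}: {describe_char(ca)} vs {describe_char(cb)}")
    return report
```

For example, `describe_differences("1513(2) Scots", "1513(2)\u00a0Scots")` reports a single difference at position 7, where an ordinary space in one string lines up with a non-breaking space in the other.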
It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes. Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields). Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates. After that we’ll be ready to generate the new date fields for real.
I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project. I went through all of the sample data and compiled a list of all of the fields found in each. This is a first step towards identifying a core set of fields and of mapping the analogous fields across different datasets. I also included the GU students and professors from Matthew’s pilot project but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing. With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.
I also continued to work on the Place-Names of Mull and Ulva project. I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text. I also began to work on the project’s API and front end. By the end of the week I managed to get an ‘in development’ version of the quick search working. Markers appear with labels and popups and you can change base map or marker type. Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates). The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.
Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network resource and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the video page of SCOSYA.
I divided most of my time between three projects this week. For the Place-Names of Mull and Ulva my time was spent working with the GB1900 dataset. On Friday last week I’d created a script that would go through the entire 2.5 million row CSV file and extract each entry, adding it to a database for easier querying. This process had finished on Monday, but unfortunately things had gone wrong during the processing. I was using the PHP function ‘fgetcsv’ to extract the data a line at a time. This splits the CSV up based on a delimiting character (in this case a comma) and adds each part into an array, thus allowing the data to be inserted into my database. Unfortunately some of the data contained commas. In such cases the data was enclosed in double quotes, which is the standard way of handling such things, and I had thought the PHP function would automatically handle this, but alas it didn’t, meaning whenever a comma appeared in the data the row was split up into incorrect chunks and the data was inserted incorrectly into the database. After realising this I added another option to the ‘fgetcsv’ command to specify a character to be identified as the ‘enclosure’ character and set the script off running again. It had completed the insertion by Wednesday morning, but when I came to query the database again I realised that the process had still gone wrong. Further investigation revealed the cause to be the GB1900 CSV file itself, which was encoded with UCS-2 character encoding rather than the more usual UTF-8. I’m not sure why the data was encoded in this way, as it’s not a current standard and it results in a much larger file size than using UTF-8. It also meant that my script was not properly identifying the double quote characters, which is why my script failed a second time.
However, after identifying this issue I converted the CSV to UTF-8, picked out a section with commas in the data, tested my script, discovered things were working this time, and let the script loose on the full dataset yet again. Thankfully it proved to be ‘third time lucky’ and all 2.5 million rows had been successfully inserted by Friday morning.
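The fix amounts to two steps: re-encode the file as UTF-8, then parse it with a CSV reader that respects double-quoted fields. The sketch below shows the idea in Python rather than the PHP I actually used. It assumes the source file can be decoded as ‘utf-16’ (the closest standard codec to UCS-2, and one that expects a byte order mark), and the function names are purely illustrative.

```python
import csv
from typing import Iterator, List


def convert_to_utf8(src_path: str, dst_path: str, src_encoding: str = "utf-16") -> None:
    """Re-encode a text file as UTF-8, streaming it line by line.

    src_encoding is an assumption; a file without a byte order mark may
    need 'utf-16-le' or 'utf-16-be' instead.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


def read_rows(path: str) -> Iterator[List[str]]:
    """Yield rows from a UTF-8 CSV, keeping quoted commas inside one field."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        # csv.reader treats double quotes as the enclosure character by default.
        yield from csv.reader(handle)
```

Streaming a line at a time keeps memory use flat, which matters for a file of this size. Testing on a small extract containing quoted commas before running the full import, as I eventually did, would have caught the problem much earlier.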
After that I was then able to extract all of the place-names for the three parishes we’re interested in, which is a rather more modest 3,908 rows. I then wrote another script that would take this data and insert it into the project’s place-name table. The place-names are a mixture of Gaelic and English (e.g. ‘A’ Mhaol Mhòr’ is pretty clearly Gaelic while ‘Argyll Terrace’ is not) and for now I set the script to just add all place-names to the ‘English’ rather than ‘Gaelic’ field. The script also inserts the latitude and longitude values from the GB1900 data, and associates the appropriate parish. I also found a bit of code that takes latitude and longitude figures and generates a 6 figure OS grid reference from them. I tested this out and it seemed pretty accurate, so I also added this to my script, meaning all names also have the grid reference field populated.
The other thing I tried to do was to grab the altitude for each name via the Google Maps service. This proved to be a little tricky as the service blocks you if you make too many requests all at once. Also, our server was blacklisting my computer for making too many requests in a short space of time too, meaning for a while afterwards I was unable to access any page on the site or the database. Thankfully Arts IT Support managed to stop me getting blocked and I managed to set the script to query Google Maps at a rate that was acceptable to it, so I was able to grab the altitudes for all 3908 place-names (although 16 of them are at 0m so may look like it’s not worked for these). I also added in a facility to upload, edit and delete one or more sound files for each place-name, together with optional captions for them in English and Gaelic. Sound files must be in the MP3 format.
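Spacing out the altitude requests can be sketched along the following lines. This is a generic Python illustration, not my actual script: the `fetch` callable stands in for whatever request is made to the elevation service and is entirely hypothetical, as is the default interval.

```python
import time
from typing import Any, Callable, Iterable, List, Tuple


def fetch_throttled(
    items: Iterable[Any],
    fetch: Callable[[Any], Any],
    min_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[Any, Any]]:
    """Call fetch once per item, leaving at least min_interval seconds
    between the start of consecutive calls."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append((item, fetch(item)))
    return results
```

If `fetch` returns `None` when a lookup fails, a genuine altitude of 0m can be told apart from a request that didn't work.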
The second project I worked on this week was my redevelopment of the ‘Digital Humanities at Glasgow’ site. I have now finished going through the database of DH projects, trimming away the irrelevant or broken ones and creating new banners, icons, screenshots, keywords and descriptions for the rest. There are now 75 projects listed, including 15 that are currently set as ‘Showcase’ projects, meaning they appear in the banner slideshow and on the ‘showcase’ page. I also changed the site header font and fixed an issue with the banner slideshow and images getting too small on narrow screens. I’ve asked Marc Alexander and Lorna Hughes to give me some feedback on the new site and I hope to be able to launch it in two weeks or so.
My third major project of the week was the Historical Thesaurus. Marc, Fraser and I met last Friday to discuss a new way of storing dates that I’ve been wanting to implement for a while, and this week I began sorting this out. I managed to create a script that can process any date, including associating labels with the appropriate date. Currently the script allows you to specify a category (or to load a random category) and the dates for all lexemes therein are then processed and displayed on screen. As of yet nothing is inserted into the database. I have also updated the structure of the (currently empty) dates table to remove the ‘date order’ field. I have also changed all date fields to integers rather than varchars to ensure that ordering of the columns is handled correctly. At last Friday’s meeting we discussed replacing ‘OE’ and ‘_’ with numerical values. We had mentioned using ‘0000’ for OE, but I’ve realised this isn’t a good idea as ‘0’ can easily be confused with null. Instead I’m using ‘1100’ for OE and ‘9999’ for ‘current’. I’ve also updated the lexeme table to add in new fields for ‘firstdate’ and ‘lastdate’ that will be the cached values of the first and last dates stored in the new dates table.
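To illustrate why integers and sentinel values help here, below is a small Python sketch. The 1100 and 9999 values are the ones mentioned above; the function names are hypothetical, and it assumes each date value is either ‘OE’, ‘_’ or a plain year.

```python
from typing import Iterable, Tuple

OE_YEAR = 1100       # sentinel used for Old English dates
CURRENT_YEAR = 9999  # sentinel used for 'current' (stored as '_' previously)


def date_to_int(value: str) -> int:
    """Map a stored date value to an integer year."""
    if value == "OE":
        return OE_YEAR
    if value == "_":
        return CURRENT_YEAR
    return int(value)


def first_and_last(values: Iterable[str]) -> Tuple[int, int]:
    """Compute the cached firstdate and lastdate for a lexeme's dates."""
    years = [date_to_int(value) for value in values]
    return min(years), max(years)
```

With the dates held as text, ‘1100’ sorts before ‘900’ because strings are compared character by character; as integers 900 correctly comes first. For a lexeme with the dates ‘OE’, ‘1614’ and ‘_’, `first_and_last` returns 1100 and 9999.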
The script displays each lexeme in a category with its ‘full date’ column. It then displays what each individual entry in the new ‘date’ table would hold for the lexeme in boxes beneath this, and then finishes off with displaying what the new ‘firstdate’ and ‘lastdate’ fields would contain. Processing all of the date variations turned out to be somewhat easier than it was for generating timeline visualisations, as the former can be treated as individual dates (an OE, a first, a mid, a last, a current) while the latter needed to transform the dates into ranges, meaning the script had to check how each individual date connected to the next, had to possibly use ‘b’ dates etc.
I’ve tested the script out and I have so far only encountered one issue, and that is there are 10 rows that have first dates and mid dates and last dates but instead of the ‘firmidcon’ field joining the first and the mid dates together the ‘firlastcon’ field is used instead. Then the ‘midlascon’ field is used to join the mid date to the last. This is an error as ‘firlastcon’ should not be used to join first and mid dates. An example of this happening is htid 28903 in catid 8880 where the ‘full date’ is ‘1459–1642/7 + 1856’. There may be other occasions where the wrong joining column has been used, but I haven’t checked for these so far.
After getting the script to sort out the dates I then began to look at labels. I started off using the ‘label’ field in order to figure out where in the ‘full date’ the label appeared. However, I noticed that where there are multiple labels these appear all joined together in the label field, meaning in such cases the contents of the label field will never be matched to any text in the ‘full date’ field. E.g. htid 6463 has the full date ‘1611 Dict. + 1808 poet. + 1826 Dict.’ and the label field is ‘Dict. poet. Dict.’ which is no help at all.
Instead I abandoned the ‘label’ field and just used the ‘full date’ field. Actually, I still use the ‘label’ field to check whether the script needs to process labels or not. Here’s a description of the logic for working out where a label should be added:
The dates are first split up into their individual boxes. Then, if there is a label for the lexeme I go through each date in turn. I split the full date field and look at the part after the date. I go through each character of this in turn. If the character is a ‘+’ then I stop. If I have yet to find label text (they all start with an a-z character) and the character is a ‘-’ and the following character is a number then I stop. Otherwise if the character is a-z I note that I’ve started the label. If I’ve started the label and the current character is a number then I stop. Otherwise I add the current character to the label and proceed to the next character until all remaining characters are processed or a ‘stop’ criterion is reached. After that if there is any label text it’s added to the date. This process seems to work. I did, however, have to fix how labels applied to ‘current’ dates are processed. For a current date my algorithm was adding the label to the final year and not the current date (stored as 9999) as the label is found after the final year and ‘9999’ isn’t found in the full date string. I added a further check for ‘current’ dates after the initial label processing that moves labels from the penultimate date to the current date in such cases.
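The character-by-character scan described above can be sketched in Python as follows. This is my reading of the logic rather than the real script, and it makes a few assumptions that should be treated as such: letters are matched regardless of case (labels such as ‘Dict.’ start with a capital), both the hyphen and the en dash count as the range connector, and characters are only collected once the label has started. The follow-up step for ‘current’ dates is not included.

```python
DASHES = "-\u2013"  # hyphen and en dash (assumed to both act as connectors)


def label_after(fulldate: str, date_text: str) -> str:
    """Return the label that follows the first occurrence of date_text."""
    index = fulldate.find(date_text)
    if index == -1:
        return ""
    rest = fulldate[index + len(date_text):]
    label = []
    started = False
    for pos, ch in enumerate(rest):
        if ch == "+":
            break  # the next date begins here
        following = rest[pos + 1] if pos + 1 < len(rest) else ""
        if not started and ch in DASHES and following.isdigit():
            break  # a range connector leading straight into another date
        if ch.isascii() and ch.isalpha():
            started = True
        elif started and ch.isdigit():
            break  # a number after the label text belongs to the next date
        if started:
            label.append(ch)
    return "".join(label).strip()
```

Run against ‘1611 Dict. + 1808 poet. + 1826 Dict.’ this gives ‘Dict.’, ‘poet.’ and ‘Dict.’ for the three dates in turn. Because it always uses the first occurrence of the date text, it finds no label for a fulldate such as ‘1865 + 1865 rare’, where the same year appears twice.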
In addition to these three big projects I also had an email conversation with Jane Roberts about some issues she’d been having with labels in the Thesaurus of Old English, I liaised with Arts IT Support to get some server space set up for Rachel Smith and Ewa Wanat’s project, I gave some feedback on a job description for an RA for the Books and Borrowing project, helped Carole Hough with an issue with a presentation of the Berwickshire Place-names resource, gave the PI response for Thomas’s Iona project a final once-over, gave a report on the cost of Mapbox to Jennifer Smith for the SCOSYA project and arranged to meet Matthew Creasey next week to discuss his Decadence and Translation project.
I spent quite a bit of time this week continuing to work on the systems for the Place-names of Mull and Ulva project. The first thing I did was to figure out how WordPress sets the language code. It has a function called ‘get_locale()’ that brings back the code (e.g. ‘En’ or ‘Gd’). Once I knew this I could update the site’s footer to display a different logo and text depending on the language the page is in. So now if the page is in English the regular UoG logo and English text crediting the map and photo are displayed whereas if the page is in Gaelic the Gaelic UoG logo and credit text are displayed. I think this is working rather well.
I managed to get all of the new Gaelic fields added into the CMS and fully tested by Thursday and asked Alasdair to test things out. I also had a discussion with Rachel Opitz in Archaeology about incorporating LIDAR data into the maps and started to look at how to incorporate data from the GB1900 project for the parishes we are covering. GB1900 (http://www.gb1900.org/) was a crowdsourced project to transcribe every place-name that appears on OS maps from 1888-1914, which resulted in more than 2.5 million transcriptions. The dataset is available to download as a massive CSV file (more than 600MB). It includes place-names for the three parishes on Mull and Ulva and Alasdair wanted to populate the CMS with this data as a starting point. On Friday I started to investigate how to access the information. Extracting the data manually from such a large CSV file wasn’t feasible so instead I created a MySQL database and wrote a little PHP script that iterated through each line of the CSV and added it to the database. I left this running over the weekend and will continue to work with it next week.
Also this week I continued to add new project records to the new Digital Humanities at Glasgow site. I only have about 30 more sites to add now, and I think it’s shaping up to be a really great resource that we will hopefully be able to launch in the next month or so.
I also spent a bit of further time on the SCOSYA project. I’d asked the university’s research data management people whether they had any advice on how we could share our audio recording data with other researchers around the world. The dataset we have is about 117GB, and originally we’d planned to use the University’s file transfer system to share the files. However, this can only handle files that are up to 20GB in size, which meant splitting things up. And it turned out to take an awfully long time to upload the files, a process we would have to do each time the data was requested. The RDM people suggested we use the University’s OneDrive system instead. This is part of Office365 and gives each member of staff 1TB of space, and it’s possible to share uploaded files with others. I tried this out and the upload process was very swift. It was also possible to share the files with users based on their email addresses, and to set expiration dates and passwords for file access. It looks like this new method is going to be much better for the project and for any researchers who want to access our data. We also set up a record about the dataset in the Enlighten Research Data repository: http://researchdata.gla.ac.uk/951/ which should help people find the data.
Also for SCOSYA we ran into some difficulties with Google’s reCAPTCHA service, which we were using to protect the contact forms on our site from spam submissions. There was an issue with version 3 of Google’s reCAPTCHA system when integrated with the contact form plugin. It works fine if Google thinks you’re not a spammer but if you somehow fail its checks it doesn’t give you the option of proving you’re a real person, it just blocks the submission of the form. I haven’t been able to find a solution for this using v3, but thankfully there is a plugin that allows the contact form plugin to revert back to using reCAPTCHA v2 (the ‘I am not a robot’ tickbox). I got this working and have applied it to both the contact form and the spoken corpus form and it works for me as someone Google somehow seems to trust and for me when using IE via remote desktop, where Google makes me select features in images before the form submits.
Also this week I met with Marc and Fraser to discuss further developments for the Historical Thesaurus. We’re going to look at implementing the new way of storing and managing dates that I originally mapped out last summer and so we met on Friday to discuss some of the implications of this. I’m hoping to find some time next week to start looking into this.
We received the reviews for the Iona place-name project this week and I spent some time during the week and over the weekend going through the reviews, responding to any technical matters that were raised and helping Thomas Clancy with the overall response, which needed to be submitted the following Monday. I also spoke to Ronnie Young about the Burns Paper Database, which we may now be able to make publicly available, and made some updates to the NME digital ode site for Bryony Randall.
With work completed on the interactive map for the Regional Romanticism last week, I turned instead this week to some other outstanding items on my ‘to do’ list. I spent quite a bit of time continuing to work on the new version of the Digital Humanities Network site, going through all of the existing projects and deciding which to keep. I keep all projects that still have a functioning website that is in some way connected to digital humanities, and for each of these I then need to generate new icons, banner images and screenshots, associate developers and expand upon the available descriptive text. So far I have set up enhanced records for 48 projects, meaning I’m over halfway to completion. It is a time-consuming process but I believe it is worth it so we have a place to showcase these valuable assets.
I spent at least half the week working on the website for the Place-names of Mull and Ulva, a new project that has started up in Celtic and Gaelic for which I am adapting the place-names system I originally created for the Berwickshire Place-names project. The project PI is Alasdair Whyte and I met with him on Thursday to discuss developments. Before that I worked on the interface for the site, as Alasdair had sent me some images he wanted me to use and had decided which font he wanted the site to use. I made a nice header image using a photograph of Mull that Alasdair sent, blended with an image of a historical map from NLS. I also added the bottom part of the image to the footer, and added in the required logos and image credits. I also got the multilingual side of things, which I’d started working on last week, working properly, added in the required fonts, changed the way the site menus were displayed and did some other tweaking and refinement of the original theme. Below is an example of how things currently look:
I also made some updates to the content management system for the site. As of yet I haven’t added in full multilingual support, but I have added in the additional fields that Alasdair had requested. This includes a new facility to upload captioned images that can be associated with a place-name record, a new dedicated field for ‘translation’ and another new field for specifying which island a place-name is on. I also started to look into how a Lidar map layer might be added to the public maps interface, although this is going to need some further work.
Also this week I spoke to Craig Lamont about the Burns Scotland website, which needs some updating. I looked over the site and gave him some ideas as to what could possibly be done with it. I also helped out Rob Maslen with an issue relating to his ‘Fantasy’ blog.
On Friday afternoon I met with Matt Sangster and Katie Halsey to discuss their Books and Borrowers project. This is a major AHRC project that I helped write the proposal for. We heard before Christmas that the project has been funded, which is excellent news, so we met this week to discuss our next steps. The project doesn’t actually start until June, but I’m going to try and get some of the technical aspects in place before then in order to allow the project’s RAs to get started straight away. It’s all very exciting and hopefully it will be a great project to work on when the time comes.