It was another Data Management Plan heavy week. I created an initial version of a DMP for Kirsteen McCue’s project at the start of the week and then participated in a Zoom call with Kirsteen and other members of the proposed team on Thursday where the plan was discussed. I also continued to think through the technical aspects of the metaphor-related proposal involving Wendy and colleagues at Duncan of Jordanstone College of Art and Design at Dundee and reviewed another DMP that Katherine Forsyth in Celtic had asked me to look at.
Also this week I spent a bit of time working on the Books and Borrowing project, generating more page image tilesets and their corresponding pages for two more of the Edinburgh ledgers and adding an ‘Events’ page to the project website and giving more members of the project team permission to edit the site. I also had an email chat with Thomas Clancy about the Iona project and created a ‘Call for Papers’ page including submission form on the project website (it’s not live yet, though).
I spent the rest of my week continuing to work on the Anglo-Norman Dictionary. We received the excellent news this week that our AHRC application for funding to complete the remaining letters of the dictionary (and carry out more development work) was successful. This week I made some further tweaks to the new blog pages, adding in the first image in the blog post to the right of the blog snippet on the blog summary page. I also made the new blog pages live, and you can now access them here: https://anglo-norman.net/blog/.
I also made some updates to the bibliography system based on requests from the editors to separate out the display of links to the DEAF website from the actual URLs (previously just the URLs were displayed). I updated the database, the DMS and the new bibliography page to add in a new ‘DEAF link text’ field for both main source text records and items within source text records. I copied the contents of the DEAF field into this new field for all records, I updated the DMS to add in the new fields when adding / editing sources and I updated the new bibliography page so that the text that gets displayed for the DEAF link uses the new field, whereas the actual link through to the DEAF website uses the original field.
The scripts I had written when uploading the new ‘R’ dataset needed to make changes to the data to bring it into line with the data already in the system, as the ‘R’ data didn’t include some attributes that were necessary for the system to work with the XML files, namely:
In the <main_entry> tag: the attribute ‘lead’, which is used to display the editor’s initials in the front end (e.g. “gdw”), and the ‘id’ attribute, which although not used to uniquely identify the entries in my new system is still used in the XML for things like cross-references and therefore is required and must be unique.
In the <sense> tag: the attribute ‘n’, which increments from 1 within each part of speech and is used to identify senses in the front end.
In the <senseInfo> tag: the ID attribute, which is used in the citation and translation searches, and the POS attribute, which is used to generate the summary information at the top of each entry page.
In the <attestation> tag: the ID attribute, which is used in the citation search.
We needed to decide how these will be handled in future – whether they will be manually added to the XML as the editors work on them or whether the upload script needs to add them in at the point of upload. We also needed to consider updates to existing entries. If an editor downloads an entry and then works on it (e.g. adding in a new sense or attestation) then the exported file will already include all of the above attributes, except for any new sections that are added. In such cases should the new sections have the attributes added manually, or do I need to ensure my script checks for the existence of the attributes and only adds the missing ones as required?
We decided that I’d set up the systems to automatically check for the existence of the attributes and add them in if they’re not already present. It will take more time to develop such a system but it will make it more robust and hopefully will result in fewer errors. I’ll also add an option to specify the ‘lead’ initials for the batch of files that are being uploaded, but this will not overwrite the ‘lead’ attribute for any XML files in the batch that already have the attribute specified.
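As a rough illustration of the "only add what is missing" approach, here is a minimal Python sketch. It is not the actual upload script: the lower-case ‘id’ attribute name on <attestation>, the ‘att-N’ ID format and the function name are all assumptions made for the example, and the ‘n’ and POS handling is left out.

```python
import xml.etree.ElementTree as ET


def add_missing_attributes(xml_text, batch_lead=None):
    """Return the parsed entry with absent attributes filled in.

    Existing attribute values are never overwritten.
    """
    root = ET.fromstring(xml_text)

    # 'lead': apply the batch-level initials only where none are specified.
    for entry in root.iter("main_entry"):
        if batch_lead and entry.get("lead") is None:
            entry.set("lead", batch_lead)

    # Attestation IDs: keep existing ones, generate non-clashing new ones.
    # The 'att-N' format is hypothetical, purely for illustration.
    existing = {a.get("id") for a in root.iter("attestation") if a.get("id")}
    next_num = 1
    for attestation in root.iter("attestation"):
        if attestation.get("id") is None:
            while f"att-{next_num}" in existing:
                next_num += 1
            new_id = f"att-{next_num}"
            attestation.set("id", new_id)
            existing.add(new_id)

    return root
```

An edited entry that already carries ‘lead’ and some IDs would keep them, and only the newly added sections would receive generated values.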
I’ll hopefully get a chance to work on this next week. Thankfully this is the last week of home-schooling for us so I should have a bit more time from next week onwards.
I had a couple of Zoom meetings this week. The first, on Monday, was with the Historical Thesaurus team and members of the Oxford English Dictionary’s team to discuss how our two datasets will be aligned and updated in future. It was an interesting meeting, but there’s still a lot of uncertainty regarding how the datasets can be tracked and connected as future updates are made, at least some of which will probably only become apparent when we get new data to integrate.
My second Zoom meeting was on Tuesday with the Place-Names of Iona project to discuss how we will be working with the QGIS package that team members will be using to access some of the archaeological data and Lidar maps, and also to discuss the issue of 10 digit grid references and the potential change from the old OSGB-36 means of generating latitude and longitude from grid references to the new WGS84 method. It was a productive meeting and we decided that we would switch over to WGS84 and I would update the CMS to incorporate the new library for generating latitude and longitude from grid references.
Also this week I continued to work on the Books and Borrowing project, generating image tilesets for the scans of several volumes of ledgers from Edinburgh University Library and writing scripts to generate pages in the Content Management System, creating ‘next’ and ‘previous’ links as required and associating the relevant images. I also had an email correspondence about some of the querying methods we will develop for the data, such as collocation information.
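The page-generation step can be pictured with a small Python sketch. This is only an assumption about the general shape of such a script: the dictionary keys and the idea of using the image filename as the page identifier are invented for the example and are not the CMS’s real schema.

```python
def link_pages(image_files):
    """Build one page record per image, with previous/next references.

    image_files must already be in ledger order.
    """
    pages = []
    last_index = len(image_files) - 1
    for i, image in enumerate(image_files):
        pages.append({
            "image": image,
            # The first page has no 'previous', the last has no 'next'.
            "previous": image_files[i - 1] if i > 0 else None,
            "next": image_files[i + 1] if i < last_index else None,
        })
    return pages
```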
I also gave some feedback on a data management plan for a project I’m involved with, had a chat with Wendy Anderson about a possible future project she’s trying to set up and spent some time making updates to the underlying data of the Interactive Map of Burns Suppers that launched last month. I didn’t have the time to do a huge amount of work on the Anglo-Norman Dictionary this week, but I still managed to migrate some of the project’s old blog posts to our new site over the course of the week.
Finally, I made some updates to the bibliography system for the Dictionary of the Scots Language, updating the new system so it works in a similar manner to the live site. I added ‘Author’ and ‘Title’ to the drop-down items when searching for both to help differentiate them and a search for an item when the user ignores the drop-down options and manually submits the search now works as it does in the live site. I also fixed the issue with selecting ‘Montgomerie, Norah & William’ resulting in a 404 error. This was caused by the ampersand. There were some issues with other non-alphanumeric characters that I’ve fixed too, including slashes and apostrophes.
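To show why characters such as ampersands, slashes and apostrophes cause trouble in URLs, here is a short Python sketch using percent-encoding. The ‘/bibliography’ path is a made-up placeholder and the DSL site’s real URL scheme and fix may well differ.

```python
from urllib.parse import quote, unquote


def author_url(base_path, author_name):
    """Percent-encode an author name so it is safe as a single URL segment.

    safe="" makes sure '/', '&' and similar characters are encoded too.
    """
    return f"{base_path}/{quote(author_name, safe='')}"


def author_from_segment(segment):
    """Reverse the encoding when the request is handled."""
    return unquote(segment)
```

With this, ‘Montgomerie, Norah & William’ becomes a segment in which the ampersand appears as ‘%26’, so it can no longer be mistaken for a query-string separator.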
I spent quite a bit of time this week continuing to work on the Anglo-Norman Dictionary, creating a new ‘bibliography’ page that will replace the existing ‘source texts’ page and uses the new source text management scripts that I added to the new content management system recently. This required rather a lot of updates as I needed to update the API to use the new source texts table and also to incorporate source text items as required, which took some time. I then created the new ‘bibliography’ page which uses the new source text data. There is new introductory text and each item features the new fields as requested by the editors. ‘Dean’ references always appear, the title and author are in bold and ‘details’ and ‘notes’ appear when present. If a source text has one or more items these are listed in numeric order, in a slightly smaller font and indented. Brackets for page numbers are added in. I also had to change the way the source texts were ordered as previously the list was ordered by the ‘slug’ but with the updates to the data it sometimes happens that the ‘slug’ doesn’t begin with the same letter as the siglum text and this was messing up the order and the alphabetical buttons. Now the list is ordered by the siglum text stripped of any tags and all seems to be working fine. I will still need to update the links from dictionary items to the bibliography when the new page goes live, and update the search facilities too, but I’ll leave this until we’re ready to launch the new page.
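Ordering by the siglum text with tags removed could look something like the following Python sketch. The regular expression and the function names are assumptions for illustration; the real page does this within the API and database.

```python
import re

# Matches anything that looks like a tag, e.g. <i> or </sup>.
TAG_RE = re.compile(r"<[^>]+>")


def siglum_sort_key(siglum):
    """Strip tags, surrounding whitespace and case before comparing."""
    return TAG_RE.sub("", siglum).strip().lower()


def initial_letter(siglum):
    """Letter used for the alphabetical buttons, taken from the stripped text."""
    key = siglum_sort_key(siglum)
    return key[0].upper() if key else ""
```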
I also had to change the way items within bibliographical entries were ordered. These were previously ordered on the ‘numeral’ field, which contained a Roman numeral. I’d written a bit of a hack to ensure that these were ordered correctly up to 20, but it turns out that there are some entries with more than 60 items, and some of them have non-standard numerals, such as ‘IXa’. I decided that it would be too complicated to use the ‘numeral’ field for ordering as the contents are likely to be too inconsistent for a computer to automatically order successfully. I therefore created a new ‘itemorder’ column in the database that holds a numerical value that decides the order of the items. I wrote a little script that populates this field for the items already in the system and for any bibliographical entry with 20 or fewer items the order should be correct without manual intervention. For the handful of entries with more than 20 items the editors will have to manually update the order. I updated the DMS so that the new ‘item order’ field appears when you add or edit items, and this will need to be used for each item to rearrange the items into the order they should be in. The new bibliography page uses the new itemorder field so updates are reflected on this page.
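For what it’s worth, a numeral such as ‘IXa’ can be turned into a sortable value when it follows a predictable pattern. The Python sketch below is one possible way to pre-populate an order value, not the script described above; it assumes a Roman numeral optionally followed by lower-case letters and returns nothing for anything else, which would still need manual ordering.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
NUMERAL_RE = re.compile(r"^([IVXLCDM]+)([a-z]*)$")


def roman_to_int(roman):
    """Convert a Roman numeral using the usual subtractive rule."""
    total = 0
    for ch, nxt in zip(roman, roman[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (e.g. IX = 9).
        if nxt in ROMAN_VALUES and ROMAN_VALUES[nxt] > value:
            total -= value
        else:
            total += value
    return total


def numeral_sort_key(numeral):
    """Return (number, suffix), or None if the numeral is non-standard."""
    match = NUMERAL_RE.match(numeral.strip())
    if match is None:
        return None
    return (roman_to_int(match.group(1)), match.group(2))
```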
I also needed to update the system to correctly process multiple DEAF links, which I’d forgotten to do previously, made some changes to the ordering of items (e.g. so that entries with a number appear before entries with the same text but without a number) and added in an option to hide certain fields by adding a special character into the field. Also for the AND I updated the XML of an entry and continued to migrate blog posts from the old blog to our new system.
Beneath the XML section you can view all of the information that is extracted from the XML and used in the system for search and display purposes: forms, parts of speech, cross references, labels, citations and translations. This is to enable the editors to check that the data extracted and used by the system is correct. I could possibly add in options for you to edit this data, but any edits made would then be overwritten the next time an XML file is uploaded for the entry, so I’m not sure how useful this would be. I think it would be better to limit the editing of this information to via a new XML file upload only.
However, we may want to make some of the information in this page directly editable, specifically some of the fields in the first table on the page. The editors may want to change the lemma or homonym number, or the slug or entry order. Similarly the editors may want to manually override the earliest date for the entry (although this would then be overwritten when a new XML version is uploaded) or change the ‘phase’ information.
The scripts to upload a new XML entry are going to take some time to get working, but at least for now you can view and download entries as required. Here’s a screenshot of how the facility works:
Also this week I dealt with a few queries about the Symposium for Seventeenth-Century Scottish Literature, which was taking place online this week and for which I had set up the website. I also spoke to Arts IT Support about getting a test server set up for the Historical Thesaurus. I spent a bit of time working for the Books and Borrowing project, processing images for a ledger from Edinburgh University Library, uploading these to the server and generating page records and links between pages for the ledger. I also gave some advice to the Scots Language Policy RA about how to use the University’s VPN, spoke to Jennifer Smith about her SCOSYA follow-on funding proposal and had a chat with Thomas Clancy about how we will use GIS systems in the Iona project.
I was on holiday from Monday to Wednesday this week to cover the school half-term, so only worked on Thursday and Friday. On Thursday I had a Zoom call with the Historical Thesaurus team to discuss further imports of new data from the OED and how to export our data (such as the revised category hierarchy) in a format that the OED team would be able to use. We have a meeting with the OED the week after next so it was good to go over some of the issues and refresh my memory about where things were left off as it’s been several months since I last did any major work on the HT. As a result of the meeting I also did some further work, namely exporting the current version of the online database and making it available for Fraser to download and access on his own PC, and updating some of the earlier scripts I’d created to generate statistics about the unmatched categories and words so that they used the most recent versions of the database.
Also this week I made some further tweaks to the SCOSYA website and created a user account for a researcher who is going to work with some of the data that is only available in the project’s CMS rather than the public website. I also read through a new funding proposal that Wendy Anderson is involved with and gave her some feedback on that and reported a couple of issues with expired SSL certificates that were affecting some websites.
I spent some time on the Books and Borrowing project on two data-related tasks. First was to look through the new set of digitised images from Edinburgh University Library and decide what we should do with them. Each image is of an open book, featuring both recto and verso pages in one image. We may need to split these up into individual images, or we may just create page records that cover both pages. I alerted the project PI Katie Halsey to the issue and the team will make a decision about which approach to take next week. The second task was to look through the data from Selkirk library that another project had generated. We had previously imported data for Selkirk that another researcher had compiled a few years before our project began, but recently discovered that this data did not include several thousand borrowing records of French prisoners of war, as the focus of the researcher was on Scottish borrowers. We need these missing records and another project has agreed to let us use their data. I had intended to completely replace the database I’d previously ingested with this new data, but on closer inspection of the new data I have a number of reservations about doing so.
The data from the other project has been compiled in an Excel spreadsheet and as far as I can tell there is no record of the ledger volume or page that each borrowing record was originally located on. In the data we already have there is a column for ‘source ref’, containing the ledger volume (e.g. ‘volume 1’) and a column for ‘page number’, containing a unique ID for each page in the spreadsheet (e.g. ‘1010159r’). Looking through the various sheets in the new spreadsheet there is nothing comparable to this, which is vital for our project, as borrowing records must be associated with page records, which in turn must be associated with a ledger. It also would make it extremely difficult to trace a record back to the original physical record.
Another issue is that in our existing data the researcher has very handily used unique identifiers for readers (e.g. ‘brodie_james’), borrowing records (e.g. ‘1’) and books (e.g. ‘adam_view_religion’) that tie the various records together very nicely. The new project’s data does not appear to use any unique identifiers to connect bits of data together. For example, there are three ‘John Anderson’ borrowers and in the data we’re currently using these are differentiated by their IDs as ‘anderson_john’, ‘anderson_john2’ and ‘anderson_john3’. This means it’s easy to tell which borrower appears in the borrowing records. In the new project’s data three different fields are required to identify the borrower: surname, forename and residence. This data is stored in separate columns in the ‘All loans’ sheet (e.g. ‘Anderson’, ‘John’, ‘Cramalt’), but in the ‘Members’ sheet everything is joined together in one ‘Name’ field, e.g. ‘Anderson, John (Cramalt)’. This lack of unique identifiers combined with the inconsistent manner of recording name and place will make it very difficult to automatically join up records and I’ve flagged this up with Katie for further discussion with the team. It’s looking like we may want to try and identify the POW records from the new project’s data and amalgamate these with the data we already have, rather than replacing everything.
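To give an idea of what joining the two sheets would involve, here is a small Python sketch that reduces both representations to the same comparison key. The function names are invented, and it assumes the ‘Name’ field always follows the ‘Surname, Forename (Residence)’ pattern, which may well not hold for the real data.

```python
import re

# 'Surname, Forename (Residence)' with the residence part optional.
MEMBER_RE = re.compile(r"^\s*([^,]+),\s*([^(]+?)\s*(?:\(([^)]*)\))?\s*$")


def loan_key(surname, forename, residence):
    """Key built from the three separate columns of the 'All loans' sheet."""
    return (surname.strip().lower(), forename.strip().lower(), residence.strip().lower())


def member_key(name_field):
    """Key built from the combined 'Name' field of the 'Members' sheet.

    Returns None when the field does not match the expected pattern.
    """
    match = MEMBER_RE.match(name_field)
    if match is None:
        return None
    surname, forename, residence = match.groups()
    return loan_key(surname, forename, residence or "")
```

Any record for which the keys fail to match, or match more than one person, would still have to be resolved by hand.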
I also spent a bit of time on the Anglo-Norman Dictionary this week, making some changes to homonym numbers for a few entries and manually updating a couple of commentaries. I also worked for the Dictionary of the Scots Language, preparing the SND and DOST datasets for import into the new editing system that the project is now going to use. This was a little trickier than anticipated as initially I zipped up the data that I’d exported from the old editing system in November when I worked on the new ‘V4’ version of the online API, but we realised that this still contained duplicates that I’d stripped out when uploading the data into the new online database. So instead I exported the XML from the online database, but it turned out that during the upload process a section of the entry XML was being removed. This section (<meta>) contained all of the forms and URLs and my upload process exported these to a separate table and reformatted the XML so that it matched the structure that was defined during the creation of the first version of the API. However, the new editing system requires this <meta> section so that data I’d prepared was not usable. Instead I took the XML exported from the old editing system back in November and ran it through the script I’d written to strip out duplicates, then prepared the resulting XML dataset for transfer. It looks like this approach has worked, but I’ll find out more next week.
I had two Zoom calls this week, the first on Wednesday with Kirsteen McCue to discuss a new, small project to publish a selection of musical settings to Burns poems and the second on Friday with Joanna Kopaczyk and her RA on the Scots Language Policy project to give a tutorial on how to use WordPress.
The majority of my week was divided between the Anglo-Norman Dictionary, the Dictionary of the Scots Language and the Place-names of Iona projects. For the AND I made a few tweaks to the static content of the site and migrated some more blog posts across to the new site (these are not live yet). I also added commentaries to more than 260 entries, which took some time to test. I also worked on the DTD file that the editors reference from their XML editing software to ensure that all of the elements and attributes found within commentaries are ‘allowed’ in the XML. Without doing this it was possible to add the tags in, but this would give errors in the editing software. I also batch updated all of the entries on the site to reference the new DTD and exported all of the files, zipped them up and sent them to the editors so they can work on them as required. I also began to think about migrating the TextBase from the old site to the new one, and managed to source the XML files that comprise this system. It looks like it may be quite tricky to work with these as there are more than 70 book-length XML files to deal with and so far I have not managed to locate the XSLT that was originally used to process these files.
For the DSL I completed work on the new bibliography search pages that use the new ‘V4’ data. These pages allow the authors and titles of bibliographical items to be searched, results to be viewed and individual items to be displayed. I also made some minor tweaks to the live site and had a discussion with Ann Fergusson about transferring the project’s data to the people who have set up a new editing interface for them, something I’m hoping to be able to tackle next week.
For the Place-names of Iona project I had a discussion about implementing a new ‘work of the month’ feature and spent quite a bit of time investigating using 10-digit OS grid references in the project’s CMS. The team need to use up to 10-digit grid references to get 1m accuracy for individual monuments, but the library I use in the CMS to automatically generate latitude and longitude from the supplied grid reference will only work with a 6-digit NGR. The automatically generated latitude and longitude are then automatically passed to Google Maps to ascertain the altitude of the location and all of this information is stored in the database whenever a new place-name record is created or an existing record is edited.
As the library currently in use will only accept 6-digit NGRs I had to do a bit of research into alternative libraries, and I managed to find one that can accept NGRs of 2,4,6,8 or 10 digits. Information about the library, including text boxes where you can enter an NGR and see the results can be found here: http://www.movable-type.co.uk/scripts/latlong-os-gridref.html along with an awful lot of description about the calculations and some pretty scary looking formulae.
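To show what accepting NGRs of different lengths means in practice, here is a Python sketch of just the first step: turning a grid reference into a numeric easting and northing. It follows the standard OS letter-square scheme but is my own simplified version, not the library’s code, and it does not attempt the conversion to latitude and longitude.

```python
import re


def parse_grid_ref(grid_ref):
    """Convert an OS grid reference (e.g. 'NM 2866 2450') to metres.

    Accepts an even number of digits up to 10 and returns the
    (easting, northing) of the south-west corner of the square.
    """
    ref = re.sub(r"\s+", "", grid_ref.upper())
    match = re.fullmatch(r"([A-HJ-Z]{2})(\d+)", ref)
    if match is None or len(match.group(2)) % 2 or len(match.group(2)) > 10:
        raise ValueError(f"Invalid grid reference: {grid_ref!r}")
    letters, digits = match.groups()

    # Letter positions, skipping 'I', which the grid does not use.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # 100km square indexes measured from the false origin (square SV).
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Pad each half to 5 digits, so fewer digits means a coarser square.
    half = len(digits) // 2
    easting = int(digits[:half].ljust(5, "0"))
    northing = int(digits[half:].ljust(5, "0"))
    return e100km * 100_000 + easting, n100km * 100_000 + northing
```

A 6-digit reference resolves to a 100m square, whereas a 10-digit one resolves to a single metre, which is why the extra digits matter for individual monuments.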
This does mean the person filling out the form can see the generated latitude and longitude and also tweak it if required before submitting the form, which is a potentially useful thing. I may even be able to add a Google Map to the form so you can see (and possibly tweak) the point before submitting the form, but I’ll need to look into this further. I also still need to work on the format of the latitude and longitude as the new library generates them with a compass point (e.g. 6.420848° W) and we need to store them as a purely decimal value (e.g. -6.420848) with ‘W’ and ‘S’ figures being negatives.
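The format change is simple enough to sketch in Python. This assumes the value always looks like a decimal number, an optional degree sign and a single compass letter; the function name is made up for the example.

```python
import re

COORD_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*°?\s*([NSEW])\s*")


def to_signed_decimal(value):
    """Convert e.g. '6.420848° W' to -6.420848.

    South and west become negative; north and east stay positive.
    """
    match = COORD_RE.fullmatch(value.upper())
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {value!r}")
    number = float(match.group(1))
    return -number if match.group(2) in "SW" else number
```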
However, whilst researching this I discovered a potentially worrying thing that needs discussion with the wider team. The way the Ordnance Survey generates latitude and longitude from their grid references was changed in 2014. Information about this can be found in the page linked to above in the ‘Latitude/longitudes require a datum’ section. Previously the OS used ‘OSGB-36’ to generate latitude and longitude, but in 2014 this was changed to ‘WGS84’, which is used by GPS systems. The difference in the latitude / longitude figures generated by the two systems is about 100 metres, which is quite a lot if you’re intending to pinpoint individual monuments.
The new library has facilities to generate latitude and longitude using either the new or old systems, but defaults to the new system. I’ve checked the output of the library we currently use and it uses the old ‘OSGB-36’ system. This means all of the place-names in the system so far (and all those for the previous projects) have latitudes and longitudes generated using the now obsolete (since 2014) system. To give an example of the difference, the place-name A’ Mhachair in the CMS has this location: https://www.google.com/maps/place/56%C2%B019’33.2%22N+6%C2%B025’11.4%22Wemail@example.com,-6.422022,582m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d56.325885!4d-6.419828 and with the newer ‘WGS84’ system it would have this location: https://www.google.com/maps/place/56%C2%B019’32.7%22N+6%C2%B025’15.1%22Wfirstname.lastname@example.org,-6.4230367,582m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d56.325744!4d-6.420848
So what we need to decide before I replace the old library with the new one in the CMS is whether we switch to using ‘WGS84’ or we keep using ‘OSGB-36’. As I say, this will need further discussion before I implement any changes.
Also this week I responded to a query from Cris Sarg of the Medical Humanities Network project, spoke to Fraser Dallachy about future updates to the HT’s data from the OED, made some tweaks to the structure of the SCOSYA website for Jennifer Smith, added a plugin to the Editing Burns site for Craig Lamont and had a chat with the Books and Borrowing people about cleaning the authors data, importing the Craigston data and how to deal with a lot of borrowers that were excluded from the Selkirk data that I previously imported.
Next week I’ll be on holiday from Monday to Wednesday to cover the school half term.
This was something of an odd week as I tested positive for Covid. I’m not entirely sure how I managed to get it, but I’d noticed on Friday last week that I’d lost my sense of taste and thought it would be sensible to get tested and the result came back positive. I’d been feeling a bit under the weather last week and this continued throughout this week too, but thankfully the virus never affected my chest or throat and I managed to more or less work all week. However, with our household in full-on isolation our son was off school all week, and will be all next week, which did impact on the work I could do.
My biggest task of the week was to complete the work in preparation for the launch of the second edition of the Historical Thesaurus. This included fixing the full-size timelines to ensure that words that have been updated to have post-1945 end dates display properly. As we had changed the way these were stored to record the actual end date rather than ‘9999’ the end points of the dates on the timeline were stopping short and not having a pointy end to signify ‘current’. New words that only had post-1999 dates were also not displaying properly. Thankfully I managed to get these issues sorted. I also updated the search terms to fix some of the unusual characters that had not migrated over properly but had been replaced by question marks. I then updated the advanced search options to provide two checkboxes to allow a user to limit their search to new words or words that have been updated (or both), which is quite handy, as it means you can find out all of the new words in a particular decade, for example all of the new words that have a first date some time in the 1980s.
I also tweaked the text that appears beside the links to the OED and added the Thematic Heading codes to the drop-down section of the main category. We also had to do some last-minute renumbering of categories, which affected several hundred categories and subcategories in ’01.02’ and manually moved a couple of other categories to new locations, and after that we were all set for the launch. The new second edition is now fully available, as you can see from the above link.
Other than that I worked on a few other projects this week. I helped to migrate a WordPress site for Bryony Randall’s Imprints of New Modernist Editing project, which is now available here: https://imprintsarteditingmodernism.glasgow.ac.uk/ and responded to a query about software purchased from Lisa Kelly in TFTS.
I spent the rest of the week continuing with the redevelopment of the Anglo-Norman Dictionary website. I updated my script that extracts citations and their dates, which I’d started to work on last week. I figured out why my script was not extracting all citations (it was only picking out the citations from the first sense and subsense in each entry rather than all senses) and managed to get all citations out. With dates extracted for each entry I was then able to store the earliest date for each entry and update the ‘browse’ facility to display this date alongside the headword.
With this in place I moved on to looking at the advanced search options. I created the tab-based interface for the various advanced search options and implemented searches for headwords and citations. The headword search works in a very similar way to the quick search – you can enter a term and use wildcards or double quotes for an exact search. You can also combine this with a date search. This allows you to limit your results to only those entries that have a citation in the year or range of years you specify. I would imagine entering a range of years would be more useful than a single year. You can also omit the headword and just specify a citation year to find all entries with a citation in the year or range, e.g. all entries with a citation in 1210.
The citation search is also in place and this works rather differently. As mentioned in the overview document, this works in a similar (but not identical) way to the old ‘concordance search of citations’. You can search for a word or a part of a word using the same wildcards as for the headword and limit your search to particular citation dates. When you submit the search this then loads an intermediary page that lists all of the word forms in citations that your search matches, plus a count of the number of citations each form is in. From this page you can then select a specific form and view the results. So, for example, a search for words beginning with ‘tre’ with a citation date between 1200 and 1250 lists 331 forms in citations, and you can then choose a specific form, e.g. ‘tref’, to see the results. The citation results include all of the citations for an entry that include the word, with the word highlighted in yellow. I still need to think about how this might work better, as currently there is no quick way to get back to the intermediary list of forms. But progress is being made.
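The intermediary list of forms can be thought of as a filter followed by a count. The Python sketch below is only an approximation of that idea: the citation structure (a ‘year’ and a ‘text’), the use of ‘*’ as the wildcard and splitting words on whitespace are all assumptions, and the real search runs against the database rather than in memory.

```python
from collections import Counter
from fnmatch import fnmatchcase


def matching_forms(citations, pattern, year_from, year_to):
    """Count, for each matching word form, how many citations contain it.

    citations: iterable of dicts with 'year' (int) and 'text' (str).
    pattern: lower-case shell-style wildcard pattern, e.g. 'tre*'.
    """
    counts = Counter()
    for citation in citations:
        if not (year_from <= citation["year"] <= year_to):
            continue
        # A set, so a form repeated within one citation is counted once.
        forms = {
            word for word in citation["text"].lower().split()
            if fnmatchcase(word, pattern)
        }
        counts.update(forms)
    return counts
```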
I was back at work this week after having a lovely holiday the previous week. It was a pretty busy week, mostly spent continuing to work on the preparations for the second edition of the Historical Thesaurus, which needs to be launched before the end of the month. I updated the OED date extraction script that formats all of the OED dates as we need them in the HT, including making full entries in the HT dates table, generating the ‘full date’ text string that gets displayed on the website and generating cached first and last dates that are used for searching. I’d somehow managed to get the plus and dash connectors the wrong way round in my previous version of the script (a plus should be used where there is a gap of more than 150 years, otherwise it’s a dash) so I fixed this. I also stripped out dates that were within a 150 year time span, which really helped to make the full date text more readable. I also updated the category browser so that the category’s thematic heading is displayed in the drop-down section.
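The connector rule on its own is easy to express in Python. The sketch below covers only that rule (a plus for a gap of more than 150 years, otherwise a dash); the output format is simplified for illustration, and the real ‘full date’ strings handle many more cases than this.

```python
def connector(previous_year, next_year, gap_threshold=150):
    """Return '+' for a gap of more than gap_threshold years, else '-'."""
    return "+" if next_year - previous_year > gap_threshold else "-"


def full_date(years):
    """Join a chronological, non-empty list of years with connectors."""
    if not years:
        raise ValueError("At least one year is required")
    parts = [str(years[0])]
    for previous_year, next_year in zip(years, years[1:]):
        parts.append(connector(previous_year, next_year))
        parts.append(str(next_year))
    return "".join(parts)
```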
Fraser had made some suggested changes to the script I’d written to figure out whether an OED lexeme was new or already in the system so I made some changes to this and regenerated the output. I also made further tweaks to the date extraction script so that we record the actual final date in the system rather than converting it to ‘9999’ and losing this information that will no doubt be useful in future. I then worked on the post-1999 lexemes, which followed a similar set of processes.
With this all in place I could then run a script that would actually import the new lexemes and their associated dates into the HT database. This included changelog codes, new search terms and new dates (cached firstdate and lastdate, fulldate and individual entries in the dates table). A total of 11116 new words were added, although I subsequently noticed there were a few duplicates that had slipped through the net. With these stripped out we had a total of 804,830 lexemes in the HT, and it’s great to have broken through the 800,000 mark. Next week I need to fix a few things (e.g. the fullsize timelines aren’t set up to cope with post-1945 dates that don’t end in ‘9999’ if they’re current) but we’re mostly now good to launch the second edition.
Also this week I worked on setting up a website for the ‘Symposium for Seventeenth-Century Scottish Literature’ for Roslyn Potter in Scottish Literature and set up a subdomain for an art catalogue website for Bryony Randall’s ‘Imprints of the New Modernist Editing’ project. I also helped Megan Coyer out with an issue she was having in transcribing multi-line brackets in Word and travelled to the University to collect a new, higher-resolution monitor and some other peripherals to make working from home more pleasant. I also fixed a couple of bugs in the Books and Borrowing CMS, including one that was resulting in BC dates of birth and death for authors being lost when data was edited. I also spent some time thinking about the structure for the Burns Correspondence system for Pauline Mackay, resulting in a long email with a proposed database structure. I met with Thomas Clancy and Alasdair Whyte to discuss the CMS for the Iona place-names project (it now looks like this is going to have to be a completely separate system from Alasdair’s existing Mull / Ulva system) and replied to Simon Taylor about a query he had regarding the Place-names of Fife data.
I also found some time to continue with the redevelopment of the Anglo-Norman Dictionary website. I updated the way cognate references were processed to enable multiple links to be displayed for each dictionary. I also added in a ‘Cite this entry’ button, which now appears in the top right of the entry and, when clicked on, opens a pop-up where citation styles will appear (they’re not there yet). I updated the left-hand panel to make it ‘sticky’: if you scroll down a long entry the panel stays visible on screen (unless you’re viewing on a narrow screen like a mobile phone, in which case the left-hand panel appears full-width before the entry). I also added in a top bar that appears when you scroll down the screen and contains the site title, the entry headword and the ‘cite’ button. I then began working on extracting the citations, including their dates and text, which will be used for search purposes. I ran an extraction script that extracted about 60,0000 citations, but I realised that this was not extracting all of the citations and further work will be required to get this right next week.
This was a four-day week for me as I’d taken Friday off as it was an in-service day at my son’s school before next week’s half-term, which I’ve also taken off. I had rather a lot to try and get done before my holiday so it was a pretty intense week, split mostly between the Historical Thesaurus and the Anglo-Norman Dictionary.
For the Historical Thesaurus I continued with the preparations for the second edition, starting off by creating a little statistics page that lists all of the words and categories that have been updated for the second edition and the changelog codes that have been applied to them. Marc had sent a list of all of the category number sequences that we have updated so I then spent a bit of time updating the database to apply the changelog codes to all of these categories. It turns out that almost 200,000 categories have been revised and relocated (out of about 235,000) so it’s pretty much everything. At our meeting last week we had proposed updating the ‘new OED words’ script I’d written last week to separate out some potential candidates into an eighth spreadsheet (these are words that have a slash in them, which now get split up on the slash and each part is compared against the HT’s search words table to see whether they already exist). Whilst working through some of the other tasks I realised that I hadn’t included the unique identifiers for OED lexemes in the output, which was going to make it a bit difficult to work with the files programmatically, especially since there are some occasions where the OED has two identical lexemes in a category. I therefore updated my script and regenerated the output to include the lexeme ID, making it possible to differentiate identical lexemes and also making it easier to grab dates for the lexeme in question.
The issue of there being multiple identical lexemes in an OED category was a strange one. For example, one category had two ‘Amber pudding’ lexemes. I wrote a script that extracted all of these duplicates and there are possibly a hundred or so of them, and also other OED lexemes that appear to have no associated dates. I passed these over to Marc and Fraser for them to have a look at. After that I worked on a script to go through each of the almost 12,000 lexemes that we have identified as OED lexemes that are definitely not present in the HT data, extract their OED dates and then format these as HT dates.
The script generates date entries as they would be added to the HT lexeme dates table (used for timelines), the HT fulldate field (used for display) and the HT firstdate and lastdate fields (used for searching). Dates earlier than 1150 are stored as their actual values in the dates table, but are stored as ‘650’ in the ‘firstdate’ field and are displayed as ‘OE’ in the ‘fulldate’. Dates after 1945 are stored as ‘9999’ in both the dates table and the ‘lastdate’ field. Where there is a ‘yearend’ in the OED date (i.e. the date is a range) this is stored as the ‘year_b’ in the HT date and appears after a slash in the ‘fulldate’, following the rules for slashes. If the date is the last date then the ‘year_b’ is used as the HT lastdate. If the ‘year_b’ is after 1945 but the ‘year’ isn’t then ‘9999’ is used. So for example ‘maiden-skate’ has a last date of ‘1880/1884’, which appears in the ‘fulldate’ as ‘1880/4’ and the ‘lastdate’ is ‘1884’. Where there is a gap of more than 150 years between dates the connector between dates is a dash and where the gap is less than this it is a plus. One thing that needed further work was how we handle multiple post-1945 dates. In my initial script if there are multiple post-1945 dates then only one of these is carried over as an HT date, and it’s set to ‘9999’. This is because all post-1945 dates are stored as ‘9999’ and having several of these didn’t seem to make sense and confused the generation of the fulldate. There was also an issue with some OED lexemes only having dates after 1945. In my first version of the script these ended up with only one HT date entry of 9999 and 9999 as both firstdate and lastdate, and a fulldate consisting of just a dash, which was not right. After further discussion with Marc I updated the script so that in such cases the date information that is carried over is the first date (even if it’s after 1945) and a dash to show that it is current.
For example, ‘ecoregion’ previously had a ‘full date’ of ‘-’, one HT date of ‘9999’ and a start date of ‘9999’ and in the updated output has a full date of ‘1962-’, two HT dates and a start date of 1962. Where a lexeme has a single date this also now has a specific end date rather than it being ‘9999’. I passed the output of the script over to Marc and Fraser for them to work with whilst I was on holiday.
For the Anglo-Norman Dictionary I continued to work on the entry page. I added in the cognate references (i.e. references to other dictionaries), which proved to be rather tricky due to the way they have been structured in the Editors’ XML files (in the current live site the cognate references are stored in a separate hash file and are somehow injected into the entry page when it is generated, but we wanted to rationalise this so that the data that appears on the site is all contained in the Editors’ XML where possible). The main issue was with how the links to other online dictionaries were stored, as it was not entirely clear how to generate actual links to specific pages in these resources from them. This was especially true for links to FEW (I have no idea what FEW stands for as the acronym doesn’t appear to be expanded anywhere, even on the FEW website).
They appear in the Editors’ XML like this:
<FEW_refs siglum="FEW" linkable="yes"><link_form>A</link_form><link_loc>24,1a</link_loc></FEW_refs>
Which ends up linking to here:
<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>
Which ends up linking to here:
Based on this my script for generating links needed to:
- Store the base URL https://apps.atilf.fr/lecteurFEW/lire/volume
- Split the <link_loc> on the comma
- Multiply the part before the comma by 10 (so 24 becomes 240, 9 becomes 90 etc)
- strip out any non-numeric character from the part after the comma (i.e. getting rid of ‘a’ and ‘b’)
- generate the full URL, such as https://apps.atilf.fr/lecteurFEW/lire/volume/240/page/1 using these two values.
After discussion with Heather and Geert at the AND it turned out to be even more complicated than this, as some of the references are further split into subvolumes using a slash and a Roman numeral, so we have things like ‘15/i,108b’ which then needs to link to https://apps.atilf.fr/lecteurFEW/lire/volume/151/page/108. It took some time to write a script that could cope with all of these quirks, but I got there in the end.
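The link generation steps above, including the subvolume quirk, can be sketched in Python along these lines. This is an illustration rather than the script that runs on the site: the function name `few_url` and the `SUBVOLUMES` lookup are hypothetical, and the mapping of Roman numerals beyond ‘i’ onto the final digit is an assumption extrapolated from the single ‘15/i’ example.

```python
import re

BASE_URL = "https://apps.atilf.fr/lecteurFEW/lire/volume"

# Assumption: subvolume 'i' adds 1 to the volume code (15/i -> 151);
# 'ii' and 'iii' are extrapolated from that one example.
SUBVOLUMES = {"i": 1, "ii": 2, "iii": 3}


def few_url(link_loc: str) -> str:
    """Build a FEW reader URL from a <link_loc> value such as '24,1a'."""
    volume_part, page_part = link_loc.split(",", 1)

    # Handle an optional subvolume such as '15/i'
    if "/" in volume_part:
        volume, subvolume = volume_part.split("/", 1)
        offset = SUBVOLUMES[subvolume.strip().lower()]
    else:
        volume, offset = volume_part, 0

    # Multiply the volume by 10 (24 -> 240) and add any subvolume offset
    volume_code = int(volume) * 10 + offset

    # Strip non-numeric characters from the page (e.g. the column 'a' or 'b')
    page = re.sub(r"\D", "", page_part)

    return f"{BASE_URL}/{volume_code}/page/{page}"
```

For instance, `few_url("15/i,108b")` gives the volume/151/page/108 address quoted above. An unrecognised subvolume numeral raises a `KeyError`, which is probably preferable to silently generating a broken link.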
Also this week I updated the citation dates so they now display their full information with ‘MS:’ where required and superscript text. I then finished work on the commentaries, adding in all of the required formatting (bold, italic, superscript etc) and links to other AND entries and out to other dictionaries. Where the commentaries are longer than a few lines they are cut off and an ‘expand’ button is shown. I also updated the ‘Results’ tab so it shows you the number of results in the tab header and have added in the ‘entry log’ feature that tracks which entries you have looked at in a session. The number of these also appears in the tab header and I’m personally finding it a very useful feature as I navigate around the entries for test purposes. The log entries appear in the order you opened them and there is no scrolling of entries as I would imagine most people are unlikely to have more than 20 or so listed. You can always clear the log by pressing on the ‘Clear’ button. I also updated the entry page so that the cross references in the ‘browse’ now work. If the entry has a single cross reference then this is automatically displayed when you click on its headword in the ‘browse’, with a note at the top of the page stating it’s a cross reference. If the entry has multiple cross references these are not all displayed but instead links to each entry are displayed. There are two reasons for this: Firstly, displaying multiple entries can result in long and complicated pages that may be hard to navigate; secondly, the entry page as it currently stands was designed to display one entry, and uses HTML IDs to identify certain elements. An HTML ID must be unique on a page so if multiple entries were displayed things would break. There is still a lot of work to do on the site, but the entry page is at least nearing completion. Below is a screenshot showing the entry log, the cognate references and the commentary complete with formatting and the ‘Expand’ option:
I did also work on some other projects this week as well. For Books and Borrowing I set up a user account for a volunteer and talked her through getting access to the system. For the Mull / Ulva site I automatically generated historical forms for all of the place-names that had come from the GB1900 crowdsourced data. These are now associated with the ‘OS 6 inch 2nd edn’ source and about 1670 names have been updated, although many of these are abbreviations like ‘F.P.’. I also updated the database and the CMS to incorporate a new field for deciding which ‘front-end’ the place-name will be displayed on. This is a drop-down list that can be selected when adding or editing a place-name, allowing you to choose from ‘Both’, ‘Mull / Ulva only’ and ‘Iona only’. There is still a further option for stating whether the place-name appears on the website or not (‘on website: Y/N’) so it will be possible to state that a place-name is associated with one project but shouldn’t appear on that project’s website. I also updated the search option on the ‘Browse placenames’ page to allow a user to limit the displayed placenames to those that have ‘front-end display’ set to one of the options. Currently all place-names are set to ‘Mull / Ulva only’. With this all in place I then created user accounts for the CMS for all of the members of the Iona project team who will be using this CMS to work with the data. I also made a few further tweaks to the search results page of the DSL. After all of this I was very glad to get away for a holiday.
I was off on Monday this week for the September Weekend holiday. My four working days were split across many different projects, but the main ones were the Historical Thesaurus and the Anglo-Norman Dictionary.
For the HT I continued with the preparations for the second edition. I updated the front-end so that multiple changelog items are now checked for and displayed (these are the little tooltips that say whether a lexeme’s dates have been updated in the second edition). Previously only one changelog was being displayed but this approach wasn’t sufficient as a lexeme may have a changed start and end date. I also fixed a bug in the assigning of the ‘end date verified as after 1945’ code, which was being applied to some lexemes with much earlier end dates. My script set the type to 3 in all cases where the last HT date was 9999. What it needed to do was to only set it to type 3 if the last HT date was 9999 and the last OED date was after 1945. I wrote a little script to fix this, which affected about 7,400 lexemes.
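The corrected check for the ‘end date verified as after 1945’ code boils down to something like the following Python sketch. The function and constant names are made up for illustration, and the real fix was a script run against the database rather than a function like this.

```python
from typing import Optional

END_DATE_VERIFIED_AFTER_1945 = 3  # changelog type mentioned in the post
CURRENT = 9999  # value used for 'current' end dates


def end_date_changelog_type(last_ht_date: int, last_oed_date: int) -> Optional[int]:
    """Return type 3 only when the HT date is 'current' AND the OED agrees.

    The buggy version checked only the first condition.
    """
    if last_ht_date == CURRENT and last_oed_date > 1945:
        return END_DATE_VERIFIED_AFTER_1945
    return None
```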
I also wrote a script to check off a bunch of HT and OED categories that had been manually matched by an RA. I needed to make a few tweaks to the script after testing it out, but after running it on the data we had a further 846 categories matched up, which is great. Fraser had previously worked on a document listing a set of criteria for working out whether an OED lexeme was ‘new’ or not (i.e. unlinked to an HT lexeme). This was a pretty complicated document with many different stages, with the output of the various stages needing to go into seven different spreadsheets, and it took quite a long time to write and test a script that would handle all of these stages. However, I managed to complete work on it and after a while it finished executing and resulted in the seven CSV files, one for each code mentioned in the document. I was very glad that I had my new PC as I’m not sure my old one could have coped with it. For the Levenshtein tests, for example, every word in the HT had to be stored in memory throughout the script’s execution. On Friday I had a meeting with Marc and Fraser where we discussed the progress we’d been making and further tweaks to the script were proposed that I’ll need to implement next week.
For the Anglo-Norman Dictionary I continued to work on the ‘Entry’ page, implementing a mixture of major features and minor tweaks. I updated the way the editor’s initials were being displayed as previously these were the initials of the editor who made the most recent update in the changelog where what was needed were the initials of the person who created the record, contained in the ‘lead’ attribute of the main entry. I also attempted to fix an issue with references in the entry that were set to ‘YBB’. Unlike other references, these were not in the data I had as they were handled differently. I thought I’d managed to fix this, but it looks like ‘YBB’ is used to refer to many different sources so can’t be trusted to be a unique identifier. This is going to need further work.
Minor tweaks included changing the font colour of labels, making the ‘See Also’ header bigger and clearer, removing the final semi-colon from lists of items, adding in line breaks between parts of speech in the summary and other such things. I then spent quite a while integrating the commentaries. These were another thing that wasn’t properly integrated with the entries but was added in as some sort of hack. I decided it would be better to have them as part of the Editors’ XML rather than attempting to inject them into the entries when they were requested for display. I managed to find the commentaries in another hash file and thankfully managed to extract the XML from this using the Python script I’d previously written for the main entry hash file. I then wrote a script that identified which entry the commentary referred to, retrieved the entry and then inserted the commentary XML into the middle of it (underneath the closing </head> element).
It took somewhat longer than I expected to integrate the data as some of the commentaries contained Greek, and the underlying database was not set up to handle multi-byte UTF-8 characters (which Greek are), meaning these commentaries could not be added to the database. I needed to change the structure of the database and re-import all of the data as simply changing the character encoding of the columns gave errors. I managed to complete this process and import the commentaries and then begin the process of making them appear in the front-end. I still haven’t completely finished this (no formatting or links in the commentaries are working yet) and I’ll need to continue with this next week.
Also this week I added numbers to the senses. This also involved updating the Editors’ XML to add a new ‘n’ attribute to the <sense> tag, e.g. <sense id="AND-201-47B626E6-486659E6-805E33CE-A914EB1F-S001" n="1">. As with the current site, the sense numbers reset to 1 when a new part of speech begins. I also ensured that [sic] now appears, as does the language tag, with a question mark if the ‘cert’ attribute is present and not 100. Uncertain parts of speech are also now visible (again if ‘cert’ is present and not 100). I also increased the font size of the variant forms, and citation dates are now visible. There is still a huge amount of work to do, but progress is definitely being made.
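The numbering logic (restarting at 1 for each new part of speech) is roughly as follows. This Python sketch works on a simple list of (part of speech, sense ID) pairs, which is an assumption made to keep the example short. The real data is XML and the sense IDs and part of speech labels below are made up.

```python
def number_senses(senses):
    """Assign an 'n' value to each sense, resetting to 1 per part of speech.

    `senses` is a list of (part_of_speech, sense_id) tuples in document order.
    Returns a list of (sense_id, n) tuples.
    """
    numbered = []
    current_pos = None
    n = 0
    for pos, sense_id in senses:
        if pos != current_pos:
            # A new part of speech begins, so restart the counter
            current_pos = pos
            n = 0
        n += 1
        numbered.append((sense_id, n))
    return numbered
```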
Also this week I reviewed the transcriptions from a private library that we are hoping to incorporate into the Books and Borrowing project and tweaked the way ‘additional fields’ are stored to enable the RAs to enter HTML characters into them. I also created a spreadsheet template for recording the correspondence of Robert Burns for Craig Lamont and spoke to Eila Williamson about the design of the new Names Studies website. I updated the text on the homepage of this site, which Lorna Hughes sent me, and gave some advice to Luis Gomes about a data management plan he is preparing. I also updated the wording on the search results page for ‘V3’ of the DSL to bring it into line with ‘V2’ and participated in a Zoom call for the Iona project where we discussed the new website and images that might be used in the design.
I’d taken Thursday and Friday off this week as it was the Glasgow September Weekend holiday, meaning this was a three-day week for me. It was a week where focussing on any development tasks was rather tricky as I had four Zoom calls and a dentist’s appointment on the other side of the city during my three working days.
On Monday I had a call with the Historical Thesaurus people to discuss the ongoing task of integrating content from the OED for the second edition. There’s still rather a lot to be done for this, and we’re needing to get it all complete during October, so things are a little stressful. After the meeting I made some further updates to the display of icons signifying a second edition update. I updated the database and front-end to allow categories / subcats to have a changelog (in addition to words). These appear in a transparent circle with a white border and a white number, right aligned. I also updated the display of the icon for words. These also appear as a transparent circle, right aligned, but have the teal colour for a border and the number. I also realised I hadn’t added in the icons for words in subcats, so put these in place too.
After that I set about updating the dates of HT lexemes based on some rules that Fraser had developed. I created and ran scripts that updated the start dates of 91,364 lexemes based on OED dates and then ran a further script that updated the end dates of 157,156 lexemes. These took quite a while to run (the latter I was dealing with during my time off) but it’s good that progress is being made.
My second Zoom call of the week was for the Books and Borrowing project, and was with the project PI and Co-I and someone who is transcribing library records from a private library that we’re now intending to incorporate into the project’s system. We discussed the data and the library and made a plan for how we’re going to work with the data in future. My third and fourth Zoom calls were for the new Place-names of Iona project that is just starting up. It was a good opportunity to meet the rest of the project team (other than the RA who has yet to be appointed) and discuss how and when tasks will be completed. We’ve decided that we’ll use the same content management system as the one I already set up for the Mull and Ulva project, as this already includes Iona data from the GB1900 project. I’ll need to update the system so that we can differentiate place-names that should only appear on the Iona front-end, the Mull and Ulva front-end or both. This is because for Iona we are going to be going into much more detail, down to individual monuments and other ‘microtoponyms’ whereas the names in the Mull and Ulva project are much more high level.
For the rest of my available time this week I made some further updates to the script I wrote last week for Fraser’s Scots Thesaurus project, ordering the results by part of speech and ensuring that hyphenated words are properly searched for (as opposed to being split into separate words joined by an ‘or’). I also spent some time working for the DSL people, firstly updating the text on the search results page and secondly tracking down the certificate for the Android version of the School Dictionary app. This was located on my PC at work, so I had arranged to get access to my office whilst I was already in the West End for my dentist’s appointment. Unfortunately what I thought was the right file turned out to be the certificate for an earlier version of the app, meaning I had to travel all the way back to my office again later in the week (when I was on holiday) to find the correct file.
I also managed to find a little time to continue to work on the new Anglo-Norman Dictionary site, continuing to work on the display of the ‘entry’ page. I updated my XSLT to ensure that ‘parglosses’ are visible and that cross reference links now appear. Explanatory labels are also now in place. These currently appear with a grey background but eventually these will be links to the label search results page. Semantic labels are also now in place and also currently have a grey background but will be links through to search results. However, the System XML notes whether certain semantic labels should be shown or not. So, for example <label type="sem" show="no">med.</label> doesn’t get shown. Unfortunately there is nothing comparable in the Editors’ XML (it’s just <semantic value="med."/>) so I can’t hide such labels. Finally, the initials of the editor who made the last update now appear in square brackets to the right of the end of the entry.
Also, my new PC was delivered on Thursday and I spent a lot of time over the weekend transferring all of my data and programs across from my old PC.