It was a return to a full five-day week this week, after taking some days off to cover the Easter school holidays for the previous two weeks. The biggest task I tackled this week was to import the data from the Dictionary of the Scots Language’s new editing system into my online system. I’d received a sample of the data from the company responsible for the new editing system a couple of weeks ago, and we had agreed on a slightly updated structure after that. Last week I was sent the full dataset and I spent some time working with it this week. I set up a local version of the online system on my PC and tweaked the existing scripts I’d previously written to import the XML dataset generated by the old editing system. Thankfully the new XML was not massively different in structure to the old set. The main differences were a few new attributes, such as ‘oldid’, which references the old ID of each entry, and ‘typeA’ and ‘typeB’, which contain numerical codes that denote which text should be displayed to note when the entry was published. With changes made to the database to store these attributes and updates to the import script to process them I was ready to go, and all 80,432 DOST and SND entries were successfully imported, including extracting all forms and URLs for use in the system.
I had a conversation with the DSL team about whether my ‘browse order’ would still be required, as the entries now appear to be ordered nicely by their new IDs. Previously I ran a script to generate the dictionary order based on the alphanumeric characters in the headword and the ‘posnum’ that I generated based on the classification of parts of speech taken from a document written by Thomas Widmann when he worked for the DSL (e.g. all POS beginning ‘n.’ have a ‘posnum’ of 1, all POS beginning ‘ppl. adj.’ have a ‘posnum’ of 8). Although the new data is now nicely ordered by the new ID field I wanted to check whether I should still be generating and using my browse order columns or whether I should just order things by ID. I suggested that going forward it will not be possible to use the ID field as browse order, as whenever the editors add a new entry its ID will position it in the wrong place (unless the ID field is not static and is regenerated whenever a new entry is added). My assumption was correct and we agreed to continue using my generated browse order.
In a related matter my script extracts the headword of each entry from the XML and this is used in my system and also to generate the browse order. The headword is always taken to be the first <f> of type “form” within <meta> in the <entry>. However, I noticed that there are five entries that have no <f> of type “form” and are therefore missing a headword, and are appearing first in the ‘browseorder’ because of this. This is something that still needs to be addressed.
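As a rough illustration of the headword rule described above (this is not the actual import script), the following Python sketch takes the first <f> of type “form” inside <meta> as the headword and lists any entries that lack one. The sample XML, the ‘id’ attribute name, the ‘type’ attribute on <f> and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real DSL XML contains far more than this.
SAMPLE = """<entries>
  <entry id="snd00000001" oldid="snd1" typeA="1" typeB="2">
    <meta><f type="variant">pynt</f><f type="form">POINT</f><f type="form">PINT</f></meta>
  </entry>
  <entry id="snd00000002" oldid="snd2" typeA="1" typeB="3">
    <meta><f type="variant">x</f></meta>
  </entry>
</entries>"""


def extract_headword(entry):
    """Return the text of the first <f type="form"> within <meta>, or None."""
    meta = entry.find("meta")
    if meta is None:
        return None
    for f in meta.iter("f"):
        if f.get("type") == "form" and (f.text or "").strip():
            return f.text.strip()
    return None


def summarise(xml_text):
    """Return (rows, missing): one dict per entry, plus IDs with no headword."""
    root = ET.fromstring(xml_text)
    rows, missing = [], []
    for entry in root.iter("entry"):
        headword = extract_headword(entry)
        rows.append({
            "id": entry.get("id"),
            "oldid": entry.get("oldid"),
            "typeA": entry.get("typeA"),
            "typeB": entry.get("typeB"),
            "headword": headword,
        })
        if headword is None:
            missing.append(entry.get("id"))
    return rows, missing


rows, missing = summarise(SAMPLE)
print(rows[0]["headword"], missing)
```

Reporting the entries with no headword separately, rather than letting them sort to the top of the browse order, would make them easier to spot after an import.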
In our conversations, Ann Ferguson mentioned that my browse system wasn’t always getting the correct order where there were multiple identical headwords all within the same general part of speech. For example there are multiple noun ‘point’ entries in DOST: n. 1, n. 2 and n. 3. These were appearing in the ‘browse’ feature with n. 3 first. This is because (as per Thomas’s document) all entries with a POS starting with ‘n.’ are given a ‘posorder’ of 1. In cases such as ‘point’, where the headword is the same and there are several entries with a POS beginning ‘n.’, the order was then set to depend on the ID, and ‘Point n.3’ has the lowest ID, so it appeared first. I therefore updated the script that generates the browse order so that in such cases entries are ordered alphabetically by POS instead.
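The ordering logic can be sketched as a three-part sort key: the alphanumeric characters of the headword, then the POS number, then the POS string itself as the tie-breaker. This Python sketch is only an illustration under assumptions: the tuple layout, the IDs and the fallback value of 99 are made up, and only the two POS rules mentioned above are included.

```python
import re


def posnum(pos):
    """Map a POS label to its order number (only the two rules quoted above)."""
    if pos.startswith("ppl. adj."):
        return 8
    if pos.startswith("n."):
        return 1
    return 99  # hypothetical fallback for classifications not shown here


def browse_key(entry):
    """Sort key: normalised headword, then POS number, then POS alphabetically."""
    headword, pos, _entry_id = entry
    normalised = re.sub(r"[^a-z0-9]", "", headword.lower())
    return (normalised, posnum(pos), pos)


# Hypothetical entries: (headword, POS, ID). 'n. 3' has the lowest ID.
entries = [
    ("Point", "n. 3", "dost00001"),
    ("Point", "ppl. adj.", "dost00002"),
    ("Point", "n. 1", "dost00003"),
    ("Point", "n. 2", "dost00004"),
]

ordered = sorted(entries, key=browse_key)
print([pos for _, pos, _ in ordered])
```

One caveat with a purely alphabetical tie-breaker is that a label such as ‘n. 10’ would sort before ‘n. 2’, so a headword with ten or more homographs would need the number compared numerically.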
I also regenerated the data for the Solr full-text search, but I’ll need Arts IT Support to update this, and they haven’t got back to me yet. I then migrated all of the new data to the online server and also created a table for the ‘about’ text that will get displayed based on the ‘typeA’ and ‘typeB’ number in the entry. I then created a new version of the API that uses the new data and pulls in the necessary ‘about’ data. When I did this I noticed that some slugs (the identifier that will be used to reference an entry in a URL) are still coming out as old IDs because this is what is found in the <url> elements. So for example the entry ‘snd00087693’ had the slug ‘snds165’. After discussion we agreed that in such cases the slug should be the new ID, and I tweaked the import script and regenerated the data to make this the case. I then updated one of our test front-ends to use the new API, updating the XSLT to ensure that the <meta> tag that now appears in the XML is not displayed and updating bibliographical references and cross references to use the new ‘refid’ attribute. I also set up the entry page to display the ‘about’ text, although the actual placement and formatting of this text still needs to be decided upon. I then moved on to the bibliographical data, but this is going to take a bit longer to sort out, as previous bib info was imported from a CSV.
Also this week I read through and gave feedback on a data management plan for a proposal Marc Alexander is involved with and created a new version of the DMP for the new metaphor proposal that Wendy Anderson is involved with. I also gave some advice to Gerry Carruthers about hosting some journal issues at Glasgow.
For the Books and Borrowing project I made some updates to the data of the 18th Century Borrowers pilot project, including fixing some issues with special characters, updating information relating to a few books and merging a couple of book records. I also continued to upload the page images of the Edinburgh registers, finishing the upload of 16 registers and then generating the page records for all of the pages in the content management system. I then started on the St Andrews registers.
I also participated in a Zoom call about GIS for the place-names of Iona project, where we discussed the sort of data and maps that would appear in the QGIS system and how this would relate to the online CMS. I also tweaked the Call for Papers page of the website.
Finally, I continued to make updates to the content management systems for the Comparative Kingship project, adding in Irish versions of the classifications and some of the labels, changing some parishes, adding in the languages that are needed for the Irish system and removing the unnecessary place-names that were imported from the GB1900 dataset. These are things like ‘F.P.’ for ‘footpath’. A total of 2,276 names, with their parish references, historical forms and links to the OS source were deleted by a little script I wrote for the purpose. I think I’m up to date with this project for the moment, so next week I intend to continue with the DSL bibliographical data import and to return to working on the Anglo-Norman Dictionary.
I’d taken Monday and Thursday off this week to cover some of the school Easter holidays, and I also lost some of Friday as I’d arranged to travel through to the University to pick up some equipment that had been ordered for me. So I probably only had about two and a half days of actual work this week, which I mostly spent continuing to develop the content management systems for the new Comparative Kingship place-names project. I created user accounts to enable members of the project team to access the Scottish CMS that I completed last week, and completed work on the 10,000 or so place-names I’d imported from the GB1900 data, setting up a ‘source’ for the map used by this project (OS 6 inch 2nd edition), generating a historical form for each of the names and associating each historical form with the source. This will mean that the team will be able to make changes to the head names and still have a record of the form that appeared in the GB1900 data.
I then began work on the Irish CMS, which required a number of changes to be made. This included importing more than 200 parishes across several counties from a spreadsheet, updating the fields previously marked as Scottish Gaelic to Irish and generating new fields for recording ‘Townland’ in English and Irish. ‘Townland’ also had to be added to the classification codes and a further multi-select option similar to parish needed to be added for ‘Barony’. OS map names ‘Landranger’ and ‘Explorer’ needed to be changed too, in both the main place-name record and in the sources.
The biggest change, however, was to the location system as Ireland has a different grid reference system to the UK. A feature of my CMS is that latitude, longitude and altitude are generated automatically from a supplied grid reference, and in order to retain this functionality for the Irish CMS I needed to figure out a method of working with Irish grid references. In addition, the project team also wanted to store another location coordinate system, the Irish Transverse Mercator (ITM) system, and wanted not only this to be automatically generated from the grid reference, but to be able to supply the ITM field and have all other location fields (including the grid reference) populate automatically. This required some research to see if there was a tool or online service that I could incorporate into my system.
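The first step in working with an Irish grid reference is turning it into a full numeric easting and northing. The Irish Grid uses a single letter for each 100km square, laid out in a 5x5 block with ‘A’ at the north-west corner and the letter ‘I’ omitted. The Python sketch below shows only that parsing step, with function and variable names chosen for the example. Converting the result to latitude and longitude or to ITM needs a proper projection and datum transformation, which is not attempted here.

```python
IRISH_GRID_LETTERS = "ABCDEFGHJKLMNOPQRSTUVWXYZ"  # 25 squares, 'I' is not used


def irish_grid_to_easting_northing(grid_ref):
    """Convert e.g. 'O 15 34' to the (easting, northing) in metres of the
    south-west corner of the referenced square."""
    ref = grid_ref.replace(" ", "").upper()
    if not ref or ref[0] not in IRISH_GRID_LETTERS:
        raise ValueError(f"Unknown 100km square in {grid_ref!r}")
    digits = ref[1:]
    if not digits.isdigit() or len(digits) % 2 != 0 or len(digits) > 10:
        raise ValueError(f"Expected an even number of digits in {grid_ref!r}")

    index = IRISH_GRID_LETTERS.index(ref[0])
    column = index % 5            # counted from the west
    row = 4 - index // 5          # letters run north to south, so flip

    half = len(digits) // 2
    scale = 10 ** (5 - half)      # 2 digits per axis = 1km, 5 digits = 1m
    easting = column * 100_000 + int(digits[:half]) * scale
    northing = row * 100_000 + int(digits[half:]) * scale
    return easting, northing


print(irish_grid_to_easting_northing("O 15 34"))
```

Going the other way (from a numeric easting and northing back to a lettered reference) is the same arithmetic in reverse, which is what makes it feasible to populate the grid reference field from another coordinate field.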
I also continued to work on the Books and Borrowing project this week. I’d been in discussion with the Stirling University IT people about setting up a IIIF server for the project, and I heard this week that they have agreed to this, which is really great news. Previously in order to allow page images to be zoomed and panned like a Google Map we had to generate and store tilesets of each page image at each zoom level. It was taking hours to generate the tilesets for each book and days to upload the images to the server, and was requiring a phenomenal amount of storage space on the server. For example, the tilesets for one of the Edinburgh volumes consisted of around 600,000 files and took up around 14GB of space. This was in addition to the actual full-size images of the pages (about 250 at around 12MB each).
An IIIF server means we only need to store the full-size images of each page and the server dynamically chops up and serves sections of the image at the desired zoom level whenever anyone uses the zoom and pan image viewer. It’s a much more efficient system. However, it does mean I needed to update the ‘Page image’ page of the CMS to use the IIIF server. I’d decided to use the OpenLayers library to access the images, as this is what I’d previously been using for the image tilesets, and it has the ability to work with a IIIF server (see https://openlayers.org/en/latest/examples/iiif.html). It did take some time to get this working, as the example and all of the documentation are fully dependent on the node.js environment, even though the library itself really doesn’t need to be. I didn’t want to convert my CMS to using node.js and have yet another library to maintain when all I needed was a simple image viewer, so I had to rework the code example linked to above to strip out all of the node dependencies, module syntax and import statements. For example ‘var options = new IIIFInfo(imageInfo).getTileSourceOptions()’ needed to be changed to ‘var options = new ol.format.IIIFInfo(imageInfo).getTileSourceOptions()’. As none of this is documented anywhere on the OpenLayers website it took some time to get right, but I got there in the end and the CMS now has an OpenLayers based IIIF image viewer working successfully.
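To show what ‘dynamically chops up and serves sections of the image’ means in practice, here is a small Python sketch of the kind of request a viewer sends to a IIIF server. The IIIF Image API puts the region, size, rotation, quality and format in the URL path after the image identifier. The server address and image identifier below are made up for the example, and this is not the OpenLayers code used in the CMS.

```python
from urllib.parse import quote


def iiif_region_url(base, identifier, x, y, width, height, out_width):
    """Build a IIIF Image API URL for one rectangular region of an image,
    scaled to out_width pixels wide, unrotated, default quality, JPEG."""
    region = f"{x},{y},{width},{height}"   # pixel coordinates in the full image
    size = f"{out_width},"                 # fix the width, keep the aspect ratio
    encoded_id = quote(identifier, safe="")  # slashes in the identifier are escaped
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/0/default.jpg"


def iiif_info_url(base, identifier):
    """URL of the info.json document describing the image's size and tiles."""
    return f"{base.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier
url = iiif_region_url("https://iiif.example.org/iiif/2",
                      "register1/page001.jpg", 0, 0, 1024, 1024, 256)
print(url)
```

Because every tile is just a URL computed from the full-size image, nothing needs to be pre-generated or stored beyond the original page images.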
This week began with Easter Monday, which was a holiday. I’d also taken Tuesday and Thursday off to cover some of the Easter school holidays so it was a two-day working week for me. I spent some of this time continuing to download and process images of library register books for the Books and Borrowing project, including 14 from St Andrews and several further books from Edinburgh. I was also in communication with one of the people responsible for the Dictionary of the Scots Language’s new editor interface regarding the export of new data from this interface and importing it into the DSL’s website. I was sent a ZIP file containing a sample of the data for SND and DOST, plus a sample of the bibliographical data, with some information on the structure of the files and some points for discussion.
I looked through all of the files and considered how I might be able to incorporate the data into the systems that I created for the DSL’s website. I should be able to run the new dictionary XML files through my upload script with only a few minor modifications required. It’s also really great that the bibliographies and cross references are getting sorted via the new Editor interface. One point of discussion is that the new editor interface has generated new IDs for the entries, and the old IDs are not included. I reckoned that it would be good if the old IDs were included in the XML as well, just in case we ever need to match up the current data with older datasets. I did notice that the old IDs already appeared to be included in the <url> fields, but after discussion we decided that it would be safer to include them as an attribute of the <entry> tag, e.g. <entry oldid=”snd848”> or something like that, which is what will happen when I receive the full dataset.
There are also new labels for entries, stating when and how the entry was prepared. The actual labels are stored in a spreadsheet and a numerical ID appears in the XML to reference a row in the spreadsheet. This method of dealing with labels seems fine with me – I can update my system to use the labels from the spreadsheet and display the relevant labels depending on the numerical codes in the entry XML. I reckon it’s probably better to not store the actual labels in the XML as this saves space and makes it easier to change the label text, if required, as it’s only then stored in a single place.
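The label approach amounts to a simple lookup: the XML carries only numerical codes and the text lives in one table. A minimal Python sketch, with placeholder label text and a made-up function name (the real labels are in the spreadsheet and are not reproduced here):

```python
# Placeholder rows standing in for the spreadsheet; the real text differs.
LABELS = {
    1: "Placeholder label one",
    2: "Placeholder label two",
}


def about_text(type_a, type_b, labels=LABELS):
    """Join the label text for whichever of the two codes are present and known."""
    parts = []
    for code in (type_a, type_b):
        if code is None or code == "":
            continue
        text = labels.get(int(code))
        if text is not None:
            parts.append(text)
    return " ".join(parts)


print(about_text("1", "2"))
```

Changing the wording of a label then means editing one row rather than touching every entry that uses it.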
The bibliographies are looking good in the sample data, but I pointed out that it might be handy to have a reference of the old bibliographical IDs in the XML, if that’s possible. There were also spurious xmlns=”” attributes in the new XML, but these shouldn’t pose any problems and I said that it’s ok to leave them in. Once I receive the full dataset with some tweaks (e.g. the inclusion of old IDs) then I will do some further work on this.
I spent most of the rest of my available time working on the new Comparative Kingship place-names systems. I completed work on the Scotland CMS, including adding in the required parishes and former parishes. This means my place-name system has now been fully modernised and uses the Bootstrap framework throughout, which looks a lot better and works more effectively on all screen dimensions.
I also imported the data from GB1900 for the relevant parishes. There are more than 10,000 names, although a lot of these could be trimmed out – lots of ‘F.P.’ for footpath etc. It’s likely that the parishes listed are rather broader than the study will be. All the names in and around St Andrews are in there, for example. In order to generate altitude for each of the names imported from GB1900 I had to run a script I’d written that passes the latitude and longitude for each name in turn to Google Maps, which then returns elevation data. I had to limit the frequency of submissions to one every few seconds otherwise Google blocks access, so it took rather a long time for the altitudes of more than 10,000 names to be gathered, but the process completed successfully.
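The throttling can be sketched as a loop that pauses between requests. In this Python sketch the lookup function is passed in as a parameter, so no particular elevation API is assumed, and the three-second delay is a guess at ‘one every few seconds’. At that assumed rate, 10,000 names would take a little over eight hours.

```python
import time


def gather_altitudes(places, lookup, delay_seconds=3.0, sleep=time.sleep):
    """Call lookup(lat, lon) for each (name, lat, lon) in turn, pausing
    between calls so the remote service is not hit too frequently."""
    altitudes = {}
    for i, (name, lat, lon) in enumerate(places):
        if i > 0:
            sleep(delay_seconds)  # no need to wait before the first request
        altitudes[name] = lookup(lat, lon)
    return altitudes


# Usage with a stand-in lookup; a real one would call an elevation service.
def fake_lookup(lat, lon):
    return round(lat + lon, 2)


places = [("Name A", 56.34, -2.8), ("Name B", 56.33, -2.79)]
print(gather_altitudes(places, fake_lookup, delay_seconds=0))
```

Passing the sleep function in as a parameter also makes the loop easy to test without actually waiting.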
Also this week I dealt with an issue with the SCOTS corpus, which had broken (the database had gone offline) and helped Raymond at Arts IT Support to investigate why the Anglo-Norman Dictionary server had been blocking uploads to the dictionary management system when thousands of files were added to the upload form. It turns out that while the Glasgow IP address range was added into the whitelist the VPN’s IP address range wasn’t, which is why uploads were being blocked.
Next week I’m also taking a couple of days off to cover the Easter School holidays, and will no doubt continue with the DSL and Comparative Kingship projects then.
This was a four-day week due to Good Friday. I spent a couple of these days working on a new place-names project called Comparative Kingship that involves Aberdeen University. I had several email exchanges with members of the project team about how the website and content management systems for the project should be structured and set up the subdomain where everything will reside. This is a slightly different project as it will involve place-name surveys in Scotland and Ireland that will be recorded in separate systems. This is because slightly different data needs to be recorded for each survey, and Ireland has a different grid reference system to Scotland. For these reasons I’ll need to adapt my existing CMS that I’ve used on several other place-name projects, which will take a little time. I decided to take the opportunity to modernise the CMS whilst redeveloping it. I created the original version of the CMS back in 2016, with elements of the interface based on older projects than this, and the interface now looks pretty dated and doesn’t work so well on touchscreens. I’m migrating the user interface to the Bootstrap user interface framework, which looks more modern and works a lot better on a variety of screen sizes. It is going to take some time to complete this migration, as I need to update all of the forms used in the CMS, but I made good progress this week and I’m probably about half-way through the process. After this I’ll still need to update the systems to reflect the differences in the Scottish and Irish data, which will probably take several more days, especially if I need to adapt the system of automatically generating latitude, longitude and altitude from a grid reference to work with Irish grid references.
I also continued with the development of the Dictionary Management System for the Anglo-Norman Dictionary, fixing some issues relating to how sense numbers are generated (but uncovering further issues that still need to be addressed) and fixing a bug whereby older ‘history’ entries were not getting associated with new versions of entries that were uploaded. I also created a simple XML preview facility, which allows the editor to paste their entry XML into a text area and for this to then be rendered as it would appear in the live site. I also made a large change to how the ‘upload XML entries’ feature works. Previously editors could attach any number of individual XML files to the form (even thousands) and these would then get uploaded. However, I encountered an issue with the server rejecting so many file uploads in such a short period of time and blocking access to the PC that sent the files. To get around this I investigated allowing a ZIP file containing XML files to be uploaded instead. Upon upload my script would then extract the ZIP and process all of the XML files contained therein. It turns out that this approach worked very well – no more issues with the server rejecting files and the processing is much speedier as it all happens in a batch rather than the script being called each time a single file is uploaded. I tested the ZIP approach by zipping up all 3,179 XML files from the recent R data update and the Zip file was uploaded and processed in a few seconds, with all entries making their way into the holding area. However, with this approach there is no feedback in the ‘Upload Log’ until the server-side script has finished processing all of the files in the ZIP, at which point all updates appear in the log at the same time, so there may be a wait of maybe 20-30 seconds (if it’s a big ZIP file) before it looks like anything has happened. Despite this I’d say that with this update the DMS should now be able to handle full letter updates.
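The ZIP approach boils down to one upload followed by a single batch loop on the server. The Python sketch below illustrates that idea only: the DMS itself is not written this way, and the function name and the choice to collect unparseable files in a separate list are assumptions for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def process_zip(zip_bytes):
    """Read every .xml file in an uploaded ZIP in one pass.
    Returns (processed, rejected): parsed files as (filename, root tag),
    and the names of files that were not well-formed XML."""
    processed, rejected = [], []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue  # skip folders and anything that is not an XML file
            try:
                root = ET.fromstring(archive.read(info))
            except ET.ParseError:
                rejected.append(info.filename)
                continue
            processed.append((info.filename, root.tag))
    return processed, rejected


# Build a small ZIP in memory to try it out
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("entry1.xml", "<entry/>")
    z.writestr("broken.xml", "<entry>")
    z.writestr("notes.txt", "not XML")

print(process_zip(buffer.getvalue()))
```

Since nothing is returned until the whole archive has been read, this also shows why the log only updates once at the end.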
Also this week I added a ‘name of the month’ feature to the homepage of the Iona place-names project (https://iona-placenames.glasgow.ac.uk/) and continued to process the register images for the Books and Borrowing project. I also spoke to Marc Alexander about Data Management Plans for a new project he’s involved with.