I spent a fair amount of time on the new ‘Speak for Yersel’ project this week, reading through materials produced by similar projects, looking into ArcGIS Online as a possible tool to use to create the map-based interface and thinking through some of the technical challenges the project will face. I also participated in a project Zoom call on Thursday where we discussed the approaches we might take and clarified the sorts of outputs the project intends to produce.
I also had further discussions with Sofia from the Iona place-names project about their upcoming conference in December and how the logistics for this might work, as it’s going to be an online-only conference. I had a Zoom call with Sofia on Thursday to go through these details, which really helped us to shape up a plan. I also dealt with a request from another project that wants to set up a top-level ‘ac.uk’ domain, which makes three over the past couple of weeks, and made a couple of tweaks to the text of the Decadence and Translation website.
I had a chat with Mike Black about the new server that Arts IT Support are currently setting up for the Anglo-Norman Dictionary and had a chat with Eleanor Lawson about adding around 100 or so Gaelic videos to the Seeing Speech resource on a new dedicated page.
For the Books and Borrowing project I was sent a batch of images of a register from Dumfries Presbytery Library and I needed to batch process them in order to fix the lighting levels and rename them prior to upload. It took me a little time to figure out how to run a batch process in the ancient version of Photoshop I have. After much hopeless Googling I found some pages from ‘Photoshop CS2 For Dummies’ on Google Books that discussed Photoshop Actions (see https://books.google.co.uk/books?id=RLOmw2omLwgC&lpg=PA374&dq=&pg=PA332#v=onepage&q&f=false) which made me realise that the ‘Actions’, which I’d failed to find in any of the menus, were available via the tabs on the right of the screen, and that I could ‘record’ an action via this. After running the images through the batch I uploaded them to the server and generated the page records for each corresponding page in the register.
I spent the rest of the week working on the Anglo-Norman Dictionary, considering how we might be able to automatically fix entries with erroneous citation dates caused by a varlist being present in the citation with a different date that should be used instead of the main citation date. I had been wondering whether we could use a Levenshtein test (https://en.wikipedia.org/wiki/Levenshtein_distance) to automatically ascertain which citations may need manual editing, or even as a means of automatically adding in the new tags after testing. I can already identify all entries that feature a varlist, so I can create a script that can iterate through all citations that have a varlist in each of these entries. If we can assume that the potential form in the main citation always appears as the word directly before the varlist then my script can extract this form and then each <ms_form> in the <varlist>. I can also extract all forms listed in the <head> of the XML.
So for example for https://anglo-norman.net/entry/babeder my script would extract the term ‘gabez’ from the citation as it is the last word before <varlist>. It would then extract ‘babedez’ and ‘bauboiez’ from the <varlist>. There is only one form for this entry: <lemma>babeder</lemma> so this would get extracted too. The script would then run a Levenshtein test on each possible option, comparing them to the form ‘babeder’, the results of which would be:
The script would then pick out ‘babedez’ as the form to use (only one character different to the form ‘babeder’) and would then update the XML to note that the date from this <ms_form> is the one that needs to be used.
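The extraction step described above might be sketched in Python as follows. This is only an illustration under assumptions: the <quotation> wrapper, the filler words and the exact nesting of <ms_form> inside <varlist> are made up for the example (only the forms ‘gabez’, ‘babedez’ and ‘bauboiez’ come from the entry), and the function name is hypothetical. It also assumes the <varlist> is a direct child of the quotation element and that the main form is the last word before it.

```python
import xml.etree.ElementTree as ET

def forms_from_quotation(quotation_xml):
    """Return (main_form, variant_forms) for the first <varlist> in a quotation."""
    root = ET.fromstring(quotation_xml)
    preceding = root.text or ""  # all text seen so far, before any <varlist>
    for child in root:
        if child.tag == "varlist":
            words = preceding.split()
            # Assumption: the main form is the last word before <varlist>.
            main_form = words[-1].strip(".,;:!?()[]") if words else None
            variants = [f.text.strip() for f in child.iter("ms_form") if f.text]
            return main_form, variants
        # Add this child's own text plus whatever follows it.
        preceding += "".join(child.itertext()) + (child.tail or "")
    return None, []

# Invented sample: only the three forms are taken from the babeder entry.
sample = (
    "<quotation>xxx <i>yyy</i> gabez <varlist>"
    "<ms_form>babedez</ms_form><ms_form>bauboiez</ms_form>"
    "</varlist> zzz</quotation>"
)
print(forms_from_quotation(sample))
```

A citation where the main form is not the word directly before the <varlist> would return the wrong word here, so the output would still need to be checked by hand.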
With a more complicated example such as https://anglo-norman.net/entry/bochet_1 that has multiple forms in <head> the test would be run against each and the lowest score for each variant would be used. So for example for the citation where ‘buchez’ is the last word before the <varlist> the two <ms_form> words would be extracted (‘huchez’ and ‘buistez’) and these plus ‘buchez’ would be compared against every form in <head>, with the overall lowest Levenshtein score getting logged. The overall calculations in this case would be:
Scores for ‘buchez’ (the main citation form):
bochet = 2
boket = 4
bouchet = 2
bouket = 4
bucet = 2
buchet = 1
buket = 3
bokés = 5
boketes = 5
bochésç = 6
buchees = 2
Scores for ‘huchez’ (first variant form):
bochet = 3
boket = 5
bouchet = 3
bouket = 5
bucet = 3
buchet = 2
buket = 4
bokés = 6
boketes = 6
bochésç = 7
buchees = 3
Scores for ‘buistez’ (second variant form):
bochet = 5
boket = 5
bouchet = 5
bouket = 5
bucet = 4
buchet = 4
buket = 4
bokés = 6
boketes = 4
bochésç = 8
buchees = 4
Meaning ‘buchez’ would win with a score of 1 and in this case no <varlist> form would therefore be marked. If the main citation form and a varlist form both have the same lowest score then I guess we’d treat the main citation form as ‘winning’, although in such cases the citation could be flagged for manual checking. However, this algorithm depends entirely on the main citation form being the word before the <varlist> tag, and the editor confirmed that this is not always the case. Despite this, I think the algorithm could correctly identify the majority of cases, and if the output was placed in a CSV it would then be possible for someone to quickly check through each citation, tick off those that should be automatically updated and manually fix the rest. I made a start on the script that would work through all of the entries and output the CSV during the remainder of the week, but didn’t have the time to finish it. I’m going to be on holiday next week but will continue with this when I return.
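The selection rule described above could be sketched roughly as below. It is a minimal Python illustration, not the unfinished script itself: the function names and the shape of the returned tuple are assumptions. Each candidate form gets its lowest Levenshtein score against any form in <head>, a variant only wins if it scores strictly lower than the main citation form, and a tie keeps the main form but sets a flag for manual checking.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def pick_form(main_form, variant_forms, head_forms):
    """Return (winning_form, score, needs_manual_check)."""
    def best(form):
        # Lowest score for this form against any form listed in <head>.
        return min(levenshtein(form, h) for h in head_forms)

    main_score = best(main_form)
    if not variant_forms:
        return main_form, main_score, False
    var_score, var_form = min((best(v), v) for v in variant_forms)
    if var_score < main_score:
        return var_form, var_score, False
    # The main form wins ties, but a tie is flagged for a human to check.
    return main_form, main_score, var_score == main_score

print(pick_form("gabez", ["babedez", "bauboiez"], ["babeder"]))
print(pick_form("buchez", ["huchez", "buistez"], ["bochet", "buchet"]))
```

Each returned tuple could then be written out as one row of the CSV. Note that this sketch counts Unicode characters, so scores for accented forms such as ‘bokés’ may differ from those listed above if the original test counted bytes instead.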
I had two Zoom calls on Monday this week. The first was with the Burns people to discuss the launch of the website for the ‘letters and poems’ part of ‘Editing Burns’, to complement the existing ‘Prose and song’ website (https://burnsc21.glasgow.ac.uk/). The new website will launch in January with some video content and blogs, plus I will be working on a content management system for managing the network of Burns’ letter correspondents, which I will put together some time in November, assuming the team can send me on some sample data by then. This system will eventually power the ‘Burns letter writing trail’ interactive maps that I’ll create for the new site sometime next year.
My second Zoom call was for the Books and Borrowing project to discuss adding data from a new source to the database. The call gave us an opportunity to discuss the issues with the data that I’d highlighted last week. It was good to catch up with the team again and to discuss the issues with the researcher who had originally prepared the spreadsheet containing the data. We managed to address all of the issues and the researcher is going to spend a bit of time adapting the spreadsheet before sending it to me to be batch uploaded into our system.
I spent some further time this week investigating the issue of some of the citation dates in the Anglo-Norman Dictionary being wrong, as discussed last week. The issue affects some 4309 entries where at least one citation features the form only in a variant text. This means that the citation date should not be the date of the manuscript in the citation, but the date when the variant of the manuscript was published. Unfortunately this situation was never flagged in the XML, and there was never any means of flagging the situation. The variant date should only ever be used when the form of the word in the main manuscript is not directly related to the entry in question but the form in the variant text is. The problem is it cannot be automatically ascertained when the form in the main manuscript is the relevant one and when the form in the variant text is as there is so much variation in forms.
For example, in the entry https://anglo-norman.net/entry/bochet_1 there is a form ‘buchez’ in a citation and then two variant texts for this where the forms are ‘huchez’ and ‘buistez’. None of these forms are listed in the entry’s XML as variants so it’s not possible for a script to automatically deduce which is the correct date to use (the closest is ‘buchet’). In this case the main citation form and its corresponding date should be used. Whereas in the entry https://anglo-norman.net/entry/babeder the main citation form is ‘gabez’ while the variant text has ‘babedez’ and so this is the form and corresponding date that needs to be used. It would be difficult for a script to automatically deduce this. In this case a Levenshtein test (which tests how many letters need to be changed to turn one string into another) could work, but this would still need to be manually checked.
The editor wanted me to focus on those entries where the date issue affects the earliest date for an entry, as these are the most important: the issue results in an incorrect date being displayed for the entry in the header and the browse feature. I wrote a script that finds all entries that feature ‘<varlist’ somewhere in the XML (the previously exported 4309 entries). It then goes through all attestations (in all sense, subsense and locution sense and subsense sections) to pick out the one with the earliest date, exactly as the code for publishing an entry does. It then checks the quotation XML for the attestation with the earliest date for the presence of ‘<varlist’ and if it finds this it outputs information for the entry, consisting of the slug, the earliest date as recorded in the database, the earliest date of the attestation as found by the script, the ID of the attestation and then the XML of the quotation. The script has identified 1549 entries that have a varlist in the earliest citation, all of which will need to be edited.
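The core check in the script described above might look something like this in Python. It is a simplified sketch: the Attestation class and its field names are hypothetical, dates are reduced to a single year, and the real publishing code presumably handles dates in its own way.

```python
from dataclasses import dataclass

@dataclass
class Attestation:
    att_id: int         # hypothetical attestation ID
    year: int           # simplified: one year per citation
    quotation_xml: str  # raw XML of the quotation

def earliest_with_varlist(attestations):
    """Return the earliest attestation if its quotation has a varlist, else None."""
    if not attestations:
        return None
    # min() keeps the first attestation when two share the same year.
    earliest = min(attestations, key=lambda a: a.year)
    return earliest if "<varlist" in earliest.quotation_xml else None

# Invented data for illustration only.
atts = [
    Attestation(1, 1250, "<quotation>aaa</quotation>"),
    Attestation(2, 1190, "<quotation>bbb <varlist>ccc</varlist></quotation>"),
]
hit = earliest_with_varlist(atts)
print(hit.att_id if hit else "earliest citation has no varlist")
```

An entry is only reported when the earliest citation itself contains a varlist. Later citations with varlists are ignored by this check.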
However, every citation has a date associated with it and this is used in the advanced search where users have the option to limit their search to years based on the citation date. Only updating citations that affect the entry’s earliest date won’t fix this, as there will still be many citations with varlists that haven’t been updated and will still therefore use the wrong date in the search. Plus any future reordering of citations would require all citations with varlists to be updated to get entries in the correct order. Fixing the earliest citations with varlists in entries based on the output of my script will fix the earliest date as used in the header of the entry and the ‘browse’ feature only, but I guess that’s a start.
Also this week I sorted out some access issues for the RNSN site, submitted the request for a new top-level ‘ac.uk’ domain for the STAR project and spent some time discussing the possibilities for managing access to videos of the conference sessions for the Iona place-names project. I also updated the page about the Scots Dictionary for Schools app on the DSL website (https://dsl.ac.uk/our-publications/scots-dictionary-for-schools-app/) after it won the award for ‘Scots project of the year’.
I also spent a bit of time this week learning about the statistical package R (https://www.r-project.org/). I downloaded and installed the package and the R Studio GUI and spent some time going through a number of tutorials and examples in the hope that this might help with the ‘Speak for Yersel’ project.
For a few years now I’ve been meaning to investigate using a spider / radar chart for the Historical Thesaurus, but I never found the time. I unexpectedly found myself with some free time this week due to ‘Speak for Yersel’ not needing anything from me yet so I thought I’d do some investigation. I found a nice looking d3.js template for spider / radar charts here: http://bl.ocks.org/nbremer/21746a9668ffdf6d8242 and set about reworking it with some HT data.
My idea was to use the chart to visualise the distribution of words in one or more HT categories across different parts of speech in order to quickly ascertain the relative distribution and frequency of words. I wanted to get an overall picture of the makeup of the categories initially, but to then break this down into different time periods to understand how categories changed over time.
As an initial test I chose the categories 02.04.13 Love and 02.04.14 Hatred, and in this initial version I looked only at the specific contents of the categories – no subcategories and no child categories. I manually extracted counts of the words across the various parts of speech and then manually split them up into words that were active in four broad time periods: OE (up to 1149), ME (1150-1449), EModE (1450-1799) and ModE (1800 onwards) and then plotted them on the spider / radar chart, as you can see in this screenshot:
You can quickly move through the different time periods plus the overall picture using the buttons above the visualisation, and I think the visualisation does a pretty good job of giving you a quick and easy to understand impression of how the two categories compare and evolve over time, allowing you to see, for example, how the number of nouns and adverbs for love and hate are pretty similar in OE:
but by ModE the number of nouns for Love have dropped dramatically, as have the number of adverbs for Hate:
We are of course dealing with small numbers of words here, but even so it’s much easier to use the visualisation to compare different categories and parts of speech than it is to use the HT’s browse interface. Plus if such a visualisation was set up to incorporate all words in child categories and / or subcategories it could give a very useful overview of the makeup of different sections of the HT and how they develop over time.
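The manual split of words into the four periods could be automated along these lines. This is a hedged Python sketch: the period boundaries are the ones given above, but the word tuples are invented, and treating a word as ‘active’ in a period whenever its date range overlaps that period is an assumption about how the counting was done.

```python
from collections import Counter

# Period boundaries as described above; None means open-ended.
PERIODS = {
    "OE": (None, 1149),
    "ME": (1150, 1449),
    "EModE": (1450, 1799),
    "ModE": (1800, None),
}

def active_in(first, last, period):
    """True if a word dated first..last overlaps the period (last=None: still current)."""
    start, end = PERIODS[period]
    if start is not None and last is not None and last < start:
        return False  # word died out before the period began
    if end is not None and first > end:
        return False  # word first appears after the period ended
    return True

def pos_counts(words, period):
    """Count words per part of speech that are active in the given period."""
    return Counter(pos for pos, first, last in words if active_in(first, last, period))

# Invented (pos, first_date, last_date) tuples, not real HT data.
words = [("n", 900, 1300), ("n", 1500, None), ("aj", 1000, 1100), ("av", 1820, 1900)]
for period in PERIODS:
    print(period, dict(pos_counts(words, period)))
```

The per-period counts for each category would then be the values plotted on the axes of the chart.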
There are some potential pitfalls to this visualisation approach, however. The scale used currently changes based on the largest word count in the chosen period, meaning unless you’re paying attention you might get the wrong impression of the number of words. I could change it so that the scale is always fixed as the largest, but that would then make it harder to make out details in periods that have far fewer words. Also, I suspect most categories are going to have many more nouns than other parts of speech, and a large spike of nouns can make it harder to see what’s going on with the other axes. Another thing to note is that the order of the axes is fairly arbitrary but can have a major impact on how someone may interpret the visualisation. If you look at the OE chart the ‘Hate’ area looks massive compared to the ‘Love’ area, but this is purely because there is only one ‘Love’ adjective compared to five for ‘Hate’. If the adverb axis had come after the noun one instead the shapes of ‘Love’ and ‘Hate’ would have been more similar. You don’t necessarily appreciate on first glance that ‘Love’ and ‘Hate’ have very similar numbers of nouns in OE, which is concerning. However, I think the visualisations have potential for the HT and I’ve emailed the other HT people to see what they think.
This was a four-day week for me as I’d taken Friday off. I went into my office at the University on Tuesday to have my Performance and Development Review with my line-manager Marc Alexander. It was the first time I’d been at the University since before the summer and it felt really different to the last time – much busier and more back to normal, with lots of people in the building and a real bustle to the West End. My PDR session was very positive and it was great to actually meet a colleague in person again – the first time I’d done so since the first lockdown began. I spent the rest of the day trying to get my office PC up to date after months of inaction. One of the STELLA apps (the Grammar one) had stopped working on iOS devices, seemingly because it was still a 32-bit app, and I wanted to generate a new version of it. This meant upgrading MacOS on my dual-boot PC, which I hadn’t used for years and was very out of date. I’m still not actually sure whether the Mac I’ve got will support a version of MacOS that will allow me to engage in app development, as I need to incrementally upgrade the MacOS version, which takes quite some time, and by the end of the day there were still further updates required. I’ll need to continue with this another time.
I spent quite a bit of the remainder of the week working on the new ‘Speak for Yersel’ project. We had a team meeting on Monday and a follow-up meeting on Wednesday with one of the researchers involved in the Manchester Voices project (https://www.manchestervoices.org/) who very helpfully showed us some of the data collection apps they use and some of the maps that they generate. It gave us a lot to think about, which was great. I spent some further time looking through other online map examples, such as the New York Times dialect quiz (https://www.nytimes.com/interactive/2014/upshot/dialect-quiz-map.html) and researching how we might generate the maps we’d like to see. It’s going to take quite a bit more research to figure out how all of this is going to work.
Also this week I spoke to the Iona place-names people about how their conference in December might be moved online and fixed a permissions issue with the Imprints of New Modernist Editing website and discussed the domain name for the STAR project with Eleanor Lawson. I also had a chat with Luca Guariento about the restrictions we have on using technologies on the servers in the College of Arts and how this might be addressed.
I also received a spreadsheet of borrowing records covering five registers for the Books and Borrowing project and went through it to figure out how the data might be integrated with our system. The biggest issue is figuring out which page each record is on. In the B&B system each borrowing record must ‘belong’ to a page, which in turn ‘belongs’ to a register. If a borrowing record has no page it can’t exist in the system. In this new data only three registers have a ‘Page No.’ column and not every record in these registers has a value in this column. We’ll need to figure out what can be done about this, because as I say, having a page is mandatory in the B&B system. We could use the ‘photo’ column as this is present across all registers and every row. However, I noticed that there are multiple photos per page, e.g. for SL137144 page 2 has 2 photos (4538 and 4539) so photo IDs don’t have a 1:1 relationship with pages. If we can think of a way to address the page issue then I should be able to import the data.
Finally, I continued to work on the Anglo-Norman Dictionary project, fixing some issues relating to yoghs in the entries and researching a potentially large issue relating to the extraction of earliest citation dates. Apparently there are a number of cases when the date for a citation that should be used is not the date as coded in the date section of the citation’s XML, but should instead be a date taken from a manuscript containing a variant form within the citation. The problem is that there is no flag to state when this situation occurs. Instead it occurs whenever the form of the word in the main citation text is markedly different from the headword but the form in the variant text is similar to it. It seems unlikely that an automated script would be able to ascertain when to use the variant date as there is just so much variation between the forms. This will need some further investigation, which I hope to be able to do next week.
I then spent some time investigating why part of speech in the <senseInfo> element of senses sometimes used underscores and other times used spaces. This discrepancy was messing up the numbering of senses, as this depends on the POS, with the number resetting to 1 when a new POS is encountered. If the POS is sometimes recorded as ‘p.p._as_a.’ (for example) and other times as ‘p.p. as a.’ then the code thinks these are different parts of speech and resets the counter to 1. I looked at the DTD, which sets the rules for creating or editing the XML files and it uses the underscore form of POS. However, this rule only applies to the ‘type’ attribute of the <pos> element and not to the ‘pos’ attribute of the <senseInfo> element. After investigating it turned out that these ‘pos’ attributes that the numbering system relies on are not manually added in by the editors, but are added in by my scripts at the point of upload. The reason I set up my script to add these in is because the old systems also added these in automatically during the conversion of the editors’ XML into the XML published via the old Dictionary Management System. However, this old system refactored the POS, replacing underscores with spaces and thus storing two different formats of POS within the XML. My upload scripts didn’t do this but instead kept things consistent, and this meant that when an entry was edited to add a new sense the new sense was added with the underscore form of POS, but the existing senses still had the space form of POS.
There were two possible ways I could fix this: I could either write a script that regenerates the <senseInfo> pos for every sense and subsense in every entry, replacing all existing ‘pos’ with the value of the preceding <pos type=””> (i.e. removing all old space forms of POS and ensuring all POS references were consistent); or I could adapt my upload script so that the assignment of <senseInfo> pos treats both ‘underscore’ and ‘space’ versions as the same. I decided on the former approach and wrote a script to first identify and then update all of the dictionary entries.
The script goes through each entry and finds all that have a <senseInfo> pos with a space in. There are 2,538 such entries. I then adapted the script so that for each <senseInfo> in an entry all spaces are changed to underscores and the result is then compared with the preceding <pos> type. I set the script to output content if there was a mismatch between the <senseInfo> pos and the <pos> type, because when I set the script to update it will use the value from <pos> type, so as to ensure consistency. The script identified 41 entries where there was a mismatch between <senseInfo> pos and the preceding <pos> type. These were often due to a question mark being added to the <senseInfo> pos, e.g. ‘a. as s. ?’ vs ‘a._as_s._’, but there were also some where the POS was completely different, e.g. ‘sbst. inf.’ and ‘v.n.’. I spoke to the editor Geert about this and it turned out that these were due to a locution being moved in the XML without having the pos value updated. Geert fixed these and I ran the update to bring all of the POS references into alignment.
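The mismatch check described above could be sketched like this in Python. It is an illustration only: the function names are hypothetical, and the input is assumed to be a list of (<senseInfo> pos, preceding <pos> type) pairs already pulled out of an entry’s XML.

```python
def normalise_pos(pos):
    """Convert the old 'space' form of a POS to the 'underscore' form."""
    return pos.replace(" ", "_")

def find_mismatches(pairs):
    """pairs: (senseinfo_pos, preceding_pos_type) tuples for one entry.

    Returns the pairs that still differ after normalisation, which are the
    ones an editor would need to look at before the update is run.
    """
    return [(s, p) for s, p in pairs if normalise_pos(s) != p]

pairs = [
    ("p.p. as a.", "p.p._as_a."),  # same POS, just the old space form
    ("a. as s. ?", "a._as_s._"),   # question mark added to the senseInfo pos
    ("sbst. inf.", "v.n."),        # completely different POS
]
for sense_pos, pos_type in find_mismatches(pairs):
    print(f"mismatch: {sense_pos!r} vs {pos_type!r}")
```

Only the last two pairs are reported. The first differs purely in spaces versus underscores, so overwriting it with the <pos> type value changes nothing of substance.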
My final AND task was to look into some issues regarding the variant and deviant section of the entry (where alternative forms of the headword are listed). Legiturs in this section were not getting displayed, plus there were several formatting issues that needed addressed, such as brackets not appearing in the right place and line breaks not working as they should. This was a very difficult task to tackle as there is so much variety to the structure of this section, and the XML is not laid out in the most logical of manners. For example, references are not added as part of a <variant> or <deviant> tag but are added after the corresponding tag as a sibling <varref> element. This really complicates navigating through the variants and deviants as there may be any number of varrefs at the same level. However, I managed to address the issues with this section, ensuring the legiturs appeared, repositioning semi-colons outside of the <deviant> brackets, ensuring line breaks always occur when a new POS is encountered and don’t occur anywhere else, ensuring multiple occurrences of the same POS label don’t get displayed and fixing the issue with double closing brackets sometimes appearing. It’s likely that there will be other issues with this section, as the content and formatting is so varied, but for now that’s all issues sorted.
The only other project I worked on this week was the Iona place-names project, for which I helped the RA Sofia with the formatting of this month’s ‘name of the month’ feature (https://iona-placenames.glasgow.ac.uk/names-of-the-month/). Next week I’ll continue with the outstanding AND tasks, of which there are still several.
I continued with the import of new data for the Dictionary of the Scots Language this week. Raymond at Arts IT Support had set up a new collection and imported the full-text search data into the Solr server, and I tested this out via the new front-end I’d configured to work with the new data source. I then began working on the import of the bibliographical data, but noticed that the file exported from the DSL’s new editing system didn’t feature an attribute denoting what source dictionary each record is from. We need this as the bibliography search allows users to limit their search to DOST or SND. The new IDs all start with ‘bib’ no matter what the source is. I had thought I could use the ‘oldid’ to extract the source (db = DOST, sb = SND) but I realised there are also composite records where the ‘oldid’ is something like ‘a200’. In such cases I don’t think I have any data that I can use to distinguish between DOST and SND records. The person in charge of exporting the data from the new editing system very helpfully agreed to add in a ‘source dictionary’ attribute to all bibliographical records and sent me an updated version of the XML file. Whilst working with the data I realised that all of the composite records are DOST records anyway, so I didn’t strictly need the ‘sourceDict’ attribute, but I think it’s better to have this explicitly as an attribute as differentiating between dictionaries is important.
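The ‘oldid’ approach considered above, and why it falls short, can be shown with a small Python sketch. It assumes that ‘db’ and ‘sb’ appear as prefixes of the ‘oldid’ value. The function name and the sample IDs other than ‘a200’ are invented.

```python
def source_from_oldid(oldid):
    """Guess the source dictionary from an 'oldid', or None if it can't be told."""
    prefix = oldid[:2].lower()  # assumption: the code is the first two characters
    if prefix == "db":
        return "DOST"
    if prefix == "sb":
        return "SND"
    return None  # e.g. composite records such as 'a200'

for oldid in ["db123", "sb45", "a200"]:  # first two IDs are made up
    print(oldid, source_from_oldid(oldid))
```

Composite records come back as None, which is the gap that an explicit source dictionary attribute on every record closes.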
I imported all of the bibliographical records into the online system, including the composite ones as these are linked to from dictionary entries and are therefore needed, even though their individual parts are also found separately in the data. However, I decided to exclude the composite records from the search facilities, otherwise we’d end up with duplicates in the search results. I updated the API to use the new bibliography tables and I updated the new front-end so that bibliographical searches use the new data. One thing that needs some further work is the display of individual bibliographies. These are now generated from the bibliography XML via an XSLT whereas previously they were generated from a variety of different fields in the database. The display doesn’t completely match up with the display on the live and Sienna versions of the bibliography pages and I’m not sure exactly how the editors would like entries to be displayed. I’ll need further input from them on this matter, but the import of data from the new editing system has now been completed successfully. I’d been documenting the process as I worked through it and I sent the documentation and all scripts I wrote to handle the workflow to the editors to be stored for future use.
I also worked on the Books and Borrowing project this week. I received the last of the digitised images of borrowing registers from Edinburgh (other than one register which needs conservation work), and I uploaded these to the project’s content management system, creating all of the necessary page records. We have a total of 9,992 page images as JPEG files from Edinburgh, totalling 105GB. Thank goodness we managed to set up an IIIF server for the image files rather than having to generate and store image tilesets for each of these page images. Also this week I uploaded the images for 14 borrowing registers from St Andrews and generated page records for each of these.
I had a further conversation with GIS expert Piet Gerrits for the Iona project and made a couple of tweaks to the Comparative Kingship content management systems, but other than that I spent the remainder of the week returning to the Anglo-Norman Dictionary, which I hadn’t worked on since before Easter. To start with I went back through old emails and documents and wrote a new ‘to do’ list containing all of the outstanding tasks for the project, some 20 items of varying degrees of size and intricacy. After some communication with the editors I began tackling some of the issues, beginning with the apparent disappearance of <note> tags from certain entries.
In the original editor’s XML (the XML as structured before being uploaded into the old DMS) there were ‘edGloss’ notes tagged as ‘<note type=”edgloss” place=”inline”>’ that were migrated to <edGloss> elements during whatever processing happened with the old DMS. However, there were also occasionally notes tagged as ‘<note place=”inline”>’ that didn’t get transformed and remained tagged as this.
I’m not entirely sure how or where, but at some point during my processing of the data these ‘<note place=”inline”>’ notes have been lost. It’s very strange as the new DMS import script is based entirely on the scripts I wrote to process the old DMS XML entries, but I tested the DMS import by uploading the old DMS XML version of ‘poer_1’ to the new DMS and the ‘<note place=”inline”>’ have been retained, yet in the live entry for ‘poer_1’ the <note> text is missing.
I searched the database for all entries where the DMS XML as exported from the old DMS system contains the text ‘<note place=”inline”>’ and there are 323 entries, which I added to a spreadsheet and sent to the editors. It’s likely that the new XML for these entries will need to be manually corrected to reinstate the missing <note> elements. Some entries (as with ‘poer_1’) have several of these. I still have the old DMS XML for these so it is at least possible to recover the missing tags. I wish I could identify exactly when and how the tags were removed, but that would quite likely require many hours of investigation, as I already spent a couple of hours trying to get to the bottom of the issue without success.
Moving on to a different issue, I changed the upload scripts so that the ‘n’ numbers are always fully regenerated automatically when a file is uploaded, as previously there were issues when a mixture of senses with and without ‘n’ numbers were included in an entry. This means that any existing ‘n’ values are replaced, so it’s no longer possible to manually set the ‘n’ value. Instead ‘n’ values for senses within a POS will always increment from 1 depending on the order they appear in the file, with ‘n’ being reset to 1 whenever a new POS is encountered.
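The renumbering rule described above amounts to a counter that resets whenever the POS changes. A minimal Python sketch, assuming the senses of an entry have been reduced to a list of POS strings in file order (the function name and sample POS values are illustrative):

```python
def assign_n(pos_sequence):
    """Return the 'n' value for each sense, given each sense's POS in file order."""
    numbers = []
    current_pos, n = None, 0
    for pos in pos_sequence:
        if pos != current_pos:  # a new POS resets the counter
            current_pos, n = pos, 0
        n += 1
        numbers.append(n)
    return numbers

print(assign_n(["v.a.", "v.a.", "v.n.", "v.n.", "v.n."]))
# The comparison is an exact string match, so mixed POS formats also reset it:
print(assign_n(["sbst. inf.", "sbst. inf.", "sbst._inf."]))
```

Because the comparison is a plain string match, a POS recorded in both its ‘space’ and ‘underscore’ forms is treated as two different parts of speech and the numbering restarts at 1.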
Main senses in locutions were not being assigned an ‘n’ on upload, and I changed this so that they are assigned an ‘n’ in exactly the same way as regular main senses. I tested this with the ‘descendre’ entry and it worked, although I encountered an issue. The final locution main sense (to descend to (by way of inheritance)) had a POS of ‘sbst._inf.’ in its <senseInfo> whereas it should have been (based on the POS of the previous two senses) ‘sbst. inf.’. The script was therefore considering this to be a new POS and gave the sense an ‘n’ of 1. In my test file I updated the POS and re-uploaded the file and the sense was assigned the correct value of 3 to its ‘n’, but we’ll need to investigate why a different form of POS was recorded for this sense.
I also updated the front-end so that locution main senses with an ‘n’ now have the ‘n’ displayed (e.g. https://anglo-norman.net/entry/descendre) and wrote a script that will automatically add missing ‘n’ attributes to all locution main senses in the system. I haven’t run this on the live database yet as I need further feedback from the editors before I do. As the week drew to a close I worked on a method to hide sense numbers in the front-end in cases where there is only one sense in a part of speech, but I didn’t manage to get this completed and will continue with it next week.
It was a return to a full five-day week this week, after taking some days off to cover the Easter school holidays for the previous two weeks. The biggest task I tackled this week was to import the data from the Dictionary of the Scots Language’s new editing system into my online system. I’d received a sample of the data from the company responsible for the new editing system a couple of weeks ago, and we had agreed on a slightly updated structure after that. Last week I was sent the full dataset and I spent some time working with it this week. I set up a local version of the online system on my PC and tweaked the existing scripts I’d previously written to import the XML dataset generated by the old editing system. Thankfully the new XML was not massively different in structure to the old set, differing mostly in the addition of a few new attributes, such as ‘oldid’, which references the old ID of each entry, and ‘typeA’ and ‘typeB’, which contain numerical codes that denote which text should be displayed to note when the entry was published. With changes made to the database to store these attributes and updates to the import script to process them I was ready to go, and all 80,432 DOST and SND entries were successfully imported, including extracting all forms and URLs for use in the system.
I had a conversation with the DSL team about whether my ‘browse order’ would still be required, as the entries now appear to be ordered nicely by their new IDs. Previously I ran a script to generate the dictionary order based on the alphanumeric characters in the headword and the ‘posnum’ that I generated based on the classification of parts of speech taken from a document written by Thomas Widmann when he worked for the DSL (e.g. all POS beginning ‘n.’ have a ‘posnum’ of 1, all POS beginning ‘ppl. adj.’ have a ‘posnum’ of 8). Although the new data is now nicely ordered by the new ID field I wanted to check whether I should still be generating and using my browse order columns or whether I should just order things by ID. I suggested that going forward it will not be possible to use the ID field as browse order, as whenever the editors add a new entry its ID will position it in the wrong place (unless the ID field is not static and is regenerated whenever a new entry is added). My assumption was correct and we agreed to continue using my generated browse order.
In a related matter my script extracts the headword of each entry from the XML and this is used in my system and also to generate the browse order. The headword is always taken to be the first <f> of type “form” within <meta> in the <entry>. However, I noticed that there are five entries that have no <f> of type “form” and are therefore missing a headword, and are appearing first in the ‘browseorder’ because of this. This is something that still needs to be addressed.
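The headword rule described above can be sketched in a few lines of Python. This is only an illustration, not the import script itself: the attribute name `type`, the function name and the sample XML are assumptions, and real entries will contain far more markup.

```python
import xml.etree.ElementTree as ET


def extract_headword(entry_xml):
    """Return the text of the first <f type="form"> inside <meta>, or None."""
    entry = ET.fromstring(entry_xml)
    meta = entry.find("meta")
    if meta is None:
        return None
    # iter() walks descendants in document order, so the first match wins.
    for f in meta.iter("f"):
        if f.get("type") == "form":
            return "".join(f.itertext()).strip()
    return None


# Invented sample entries, for illustration only.
with_form = '<entry><meta><f type="form">Point</f><f type="form">Poynt</f></meta></entry>'
without_form = '<entry><meta><f type="other">Point</f></meta></entry>'

print(extract_headword(with_form))
print(extract_headword(without_form))
```

Returning `None` rather than an empty string makes it easy to list the entries that lack a headword so they can be reported instead of silently sorting to the top of the browse order.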
In our conversations, Ann Ferguson mentioned that my browse system wasn’t always getting the correct order where there were multiple identical headwords all within the same general part of speech. For example there are multiple noun ‘point’ entries in DOST – n. 1, n. 2 and n. 3. These were appearing in the ‘browse’ feature with n. 3 first. This is because (as per Thomas’s document) all entries with a POS starting with ‘n.’ are given a ‘posorder’ of 1. In cases such as ‘point’ where the headword is the same and there are several entries with a POS beginning ‘n.’ the order is then set to depend on the ID, and ‘Point n.3’ has the lowest ID, so appears first. I therefore updated the script that generates the browse order so that in such cases entries are ordered alphabetically by POS instead.
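The change in tie-breaking can be shown with a small Python sketch. It is a simplified stand-in for the real browse-order script: only the two POS prefixes mentioned earlier are included, the fallback value of 99 is a placeholder, and the entry IDs are invented so that ‘n. 3’ has the lowest one.

```python
import re

# Only the two prefixes quoted from the classification document are listed.
POS_PREFIXES = [("ppl. adj.", 8), ("n.", 1)]


def pos_number(pos):
    for prefix, number in POS_PREFIXES:
        if pos.startswith(prefix):
            return number
    return 99  # placeholder for POS values not covered in this sketch


def strip_headword(headword):
    # Keep only the alphanumeric characters, lower-cased.
    return re.sub(r"[^a-z0-9]", "", headword.lower())


def old_key(entry):
    # Previous behaviour: ties within a POS group fall back to the ID.
    return (strip_headword(entry["headword"]), pos_number(entry["pos"]), entry["id"])


def new_key(entry):
    # Updated behaviour: ties are broken alphabetically by the full POS.
    return (strip_headword(entry["headword"]), pos_number(entry["pos"]), entry["pos"])


entries = [
    {"id": 10, "headword": "Point", "pos": "n. 3"},
    {"id": 20, "headword": "Point", "pos": "n. 1"},
    {"id": 30, "headword": "Point", "pos": "n. 2"},
]

print([e["pos"] for e in sorted(entries, key=old_key)])
print([e["pos"] for e in sorted(entries, key=new_key)])
```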
I also regenerated the data for the Solr full-text search, but I’ll need Arts IT Support to update this, and they haven’t got back to me yet. I then migrated all of the new data to the online server and also created a table for the ‘about’ text that will get displayed based on the ‘typeA’ and ‘typeB’ number in the entry. I then created a new version of the API that uses the new data and pulls in the necessary ‘about’ data. When I did this I noticed that some slugs (the identifier that will be used to reference an entry in a URL) are still coming out as old IDs because this is what is found in the <url> elements. So for example the entry ‘snd00087693’ had the slug ‘snds165’. After discussion we agreed that in such cases the slug should be the new ID, and I tweaked the import script and regenerated the data to make this the case. I then updated one of our test front-ends to use the new API, updating the XSLT to ensure that the <meta> tag that now appears in the XML is not displayed and updating bibliographical references and cross references to use the new ‘refid’ attribute. I also set up the entry page to display the ‘about’ text, although the actual placement and formatting of this text still needs to be decided upon. I then moved on to the bibliographical data, but this is going to take a bit longer to sort out, as previous bib info was imported from a CSV.
Also this week I read through and gave feedback on a data management plan for a proposal Marc Alexander is involved with and created a new version of the DMP for the new metaphor proposal that Wendy Anderson is involved with. I also gave some advice to Gerry Carruthers about hosting some journal issues at Glasgow.
For the Books and Borrowing project I made some updates to the data of the 18th Century Borrowers pilot project, including fixing some issues with special characters, updating information relating to a few books and merging a couple of book records. I also continued to upload the page images of the Edinburgh registers, finishing the upload of 16 registers and then generating the page records for all of the pages in the content management system. I then started on the St Andrews registers.
I also participated in a Zoom call about GIS for the place-names of Iona project, where we discussed the sort of data and maps that would appear in the QGIS system and how this would relate to the online CMS, and also tweaked the Call for Papers page of the website.
Finally, I continued to make updates to the content management systems for the Comparative Kingship project, adding in Irish versions of the classifications and some of the labels, changing some parishes, adding in the languages that are needed for the Irish system and removing the unnecessary place-names that were imported from the GB1900 dataset. These are things like ‘F.P.’ for ‘footpath’. A total of 2,276 names, with their parish references, historical forms and links to the OS source were deleted by a little script I wrote for the purpose. I think I’m up to date with this project for the moment, so next week I intend to continue with the DSL bibliographical data import and to return to working on the Anglo-Norman Dictionary.
This was a four-day week due to Good Friday. I spent a couple of these days working on a new place-names project called Comparative Kingship that involves Aberdeen University. I had several email exchanges with members of the project team about how the website and content management systems for the project should be structured and set up the subdomain where everything will reside. This is a slightly different project as it will involve place-name surveys in Scotland and Ireland that will be recorded in separate systems. This is because slightly different data needs to be recorded for each survey, and Ireland has a different grid reference system to Scotland. For these reasons I’ll need to adapt my existing CMS that I’ve used on several other place-name projects, which will take a little time. I decided to take the opportunity to modernise the CMS whilst redeveloping it. I created the original version of the CMS back in 2016, with elements of the interface based on older projects than this, and the interface now looks pretty dated and doesn’t work so well on touchscreens. I’m migrating the user interface to the Bootstrap user interface framework, which looks more modern and works a lot better on a variety of screen sizes. It is going to take some time to complete this migration, as I need to update all of the forms used in the CMS, but I made good progress this week and I’m probably about half-way through the process. After this I’ll still need to update the systems to reflect the differences in the Scottish and Irish data, which will probably take several more days, especially if I need to adapt the system of automatically generating latitude, longitude and altitude from a grid reference to work with Irish grid references.
I also continued with the development of the Dictionary Management System for the Anglo-Norman Dictionary, fixing some issues relating to how sense numbers are generated (but uncovering further issues that still need to be addressed) and fixing a bug whereby older ‘history’ entries were not getting associated with new versions of entries that were uploaded. I also created a simple XML preview facility, which allows the editor to paste their entry XML into a text area and for this to then be rendered as it would appear in the live site. I also made a large change to how the ‘upload XML entries’ feature works. Previously editors could attach any number of individual XML files to the form (even thousands) and these would then get uploaded. However, I encountered an issue with the server rejecting so many file uploads in such a short period of time and blocking access to the PC that sent the files. To get around this I investigated allowing a ZIP file containing XML files to be uploaded instead. Upon upload my script would then extract the ZIP and process all of the XML files contained therein. It turns out that this approach worked very well – no more issues with the server rejecting files and the processing is much speedier as it all happens in a batch rather than the script being called each time a single file is uploaded. I tested the ZIP approach by zipping up all 3,179 XML files from the recent R data update and the ZIP file was uploaded and processed in a few seconds, with all entries making their way into the holding area. However, with this approach there is no feedback in the ‘Upload Log’ until the server-side script has finished processing all of the files in the ZIP, at which point all updates appear in the log at the same time, so there may be a wait of 20-30 seconds (if it’s a big ZIP file) before it looks like anything has happened. Despite this I’d say that with this update the DMS should now be able to handle full letter updates.
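For anyone curious about the general shape of the ZIP approach, here is a minimal Python sketch. It is not the DMS upload script: the function names and the result format are assumptions, and the ‘processing’ step is reduced to checking that each file parses as XML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_zip(files):
    """Build an in-memory ZIP from a {filename: text} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def process_zip(zip_bytes):
    """Handle every .xml file in one uploaded ZIP as a single batch."""
    log = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".xml"):
                continue  # skip folders and any non-XML files
            try:
                root = ET.fromstring(archive.read(name))
                log.append((name, "ok", root.tag))
            except ET.ParseError as error:
                log.append((name, "error", str(error)))
    # The log is only returned once every file has been handled, which is
    # why nothing appears until the whole batch is finished.
    return log


upload = make_zip({
    "a.xml": "<main_entry/>",
    "b.xml": "<main_entry>",  # deliberately malformed
    "notes.txt": "not an entry",
})
for line in process_zip(upload):
    print(line[0], line[1])
```

One upload request replaces thousands of individual ones, which is what avoids tripping the server’s rate limiting.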
Also this week I added a ‘name of the month’ feature to the homepage of the Iona place-names project (https://iona-placenames.glasgow.ac.uk/) and continued to process the register images for the Books and Borrowing project. I also spoke to Marc Alexander about Data Management Plans for a new project he’s involved with.
I continued to develop the ‘Dictionary Management System’ for the Anglo-Norman Dictionary this week, following on with the work I began last week to allow the editors to drag and drop sets of entry XML files into the system. I updated the form to add in another option underneath the selection of phase statement called ‘Phase Statements for existing records’. Here the editor can choose whether to retain existing statements or replace them. If ‘retain’ is selected then any XML entries attached to the form that either have an existing entry ID in their filename or have a slug that matches an existing entry in the system will retain whatever phase statement the existing entry has, no matter what phase statement is selected in the form. The phase statement selected in the form will still be applied to any XML entries attached to the form that don’t have an existing entry in the system. Selecting ‘replace existing statements’ will ignore all phase statements of existing entries and will overwrite them with whatever phase statement is selected in the form. I also updated the system so that it extracts the earliest date for an entry at the point of upload. I added two new columns to the holding area (for earliest date and the date that is displayed for this) and have ensured that the display date appears on the ‘review’ page too. In addition, I added in an option to download the XML of an entry in the holding area, if it needs further work.
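The retain / replace rule boils down to a small decision, sketched here in Python. The function name, the mode strings and the example phase labels are placeholders rather than the values used in the DMS.

```python
def choose_phase_statement(form_phase, existing_phase, mode):
    """Decide which phase statement an uploaded entry should receive.

    existing_phase is None when the upload matches no entry already in
    the system (neither by entry ID in the filename nor by slug).
    """
    if mode not in ("retain", "replace"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "retain" and existing_phase is not None:
        # Keep whatever the existing entry already has.
        return existing_phase
    # New entries, or 'replace' mode: use the statement chosen in the form.
    return form_phase


print(choose_phase_statement("phase B", "phase A", "retain"))
print(choose_phase_statement("phase B", "phase A", "replace"))
print(choose_phase_statement("phase B", None, "retain"))
```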
I ran a large-scale upload test, comprising around 3,200 XML files from the ‘R’ data to see how the system would cope with this, but unfortunately I ran into difficulties with the server rejecting too many requests in a short space of time and only about 600 of the files made it through. I asked Arts IT Support to see whether the server limits can be removed for this script, but haven’t heard anything back yet. I ran into a similar issue when processing files for the Mull and Ulva place-names project in January last year and Raymond was able to update the whitelist for the Apache module mod_evasive that was blocking such uploads and I’m hoping he’ll be able to do something similar this time. Alternatively, I’ll need to try and throttle the speed of uploads in the browser.
In the meantime, I continued with the scripts for publishing entries that had been uploaded to the holding area, using a test version of the site that I set up on my local PC to avoid messing up the live database. I updated the ‘holding area’ page quite significantly. At the top of the page is a box for publishing selected items, and beneath this is the table containing the holding items. Each row now features a checkbox, and there is an option above the table to select / deselect all rows on the page (so currently up to 200 entries can be published in one batch as 200 is the page limit). The ‘preview’ button has been replaced with an ‘eye’ icon but the preview page works in the same way as before. I was intending to add the ‘publish’ options to this page but I’ve moved this to the holding area page instead to allow multiple entries to be selected for publication at any one time.
Once all of the selected items are published there is one final task that the page performs, which is to completely regenerate the cross references data. This is something that unfortunately needs to be done after each batch (even if it’s only one record) because cross references rely on database IDs and when a new version of an existing entry is published it receives a new ID. This means any existing cross references to that item will no longer work. The publication log will state that the regeneration is taking place and then after about 30 seconds another statement will say it is complete. I tested this process on my local PC, publishing single items, a few items and entire pages (200 items) at a time and all seemed to be working fine so I then copied the new scripts to the server.
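A small Python sketch may help show why the cross references go stale. It assumes, purely for illustration, that each entry stores the XML ‘id’ values it refers to and that the cross-reference table holds pairs of database IDs; the field names are invented and the real table will differ.

```python
def regenerate_cross_references(entries):
    """Rebuild (from_db_id, to_db_id) pairs from the current set of entries."""
    db_id_by_xml_id = {entry["xml_id"]: entry["db_id"] for entry in entries}
    pairs = []
    for entry in entries:
        for target in entry["xrefs"]:
            if target in db_id_by_xml_id:  # ignore references to missing entries
                pairs.append((entry["db_id"], db_id_by_xml_id[target]))
    return pairs


before = [
    {"db_id": 1, "xml_id": "A", "xrefs": ["B"]},
    {"db_id": 2, "xml_id": "B", "xrefs": []},
]
# Publishing a new version of B gives it a new database ID.
after = [
    {"db_id": 1, "xml_id": "A", "xrefs": ["B"]},
    {"db_id": 3, "xml_id": "B", "xrefs": []},
]

print(regenerate_cross_references(before))
print(regenerate_cross_references(after))
```

The pair built before publication still points at database ID 2, which no longer belongs to the live version of the entry, so the whole table has to be rebuilt after each batch.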
Also this week I continued with the processing of library registers for the Books and Borrowing project. These are coming in rather quickly now and I’m getting a bit of a backlog. This is because I have to download the image files, then process them to generate tilesets, and then upload all of the images and their tilesets to the server. It’s the tilesets that are the real sticking point, as these consist of thousands of small files. I’m only getting an upload speed of about 70KB/s and I’m having to upload many gigabytes of data. I did a test where I zipped up some of the images and uploaded this zip file instead and was getting a speed of around 900KB/s and as it looks like I can get command-line access to the server I’m going to investigate whether zipping up the files, then uploading them then unzipping them will be a quicker process. I also had to spend some time sorting out connection issues to the server as the Stirling VPN wasn’t letting me connect. It turned out that they had switched to multi-factor authentication and I needed to set this up before I could continue.
Also this week I wrote a summary of the work I’ve done so far for the Place-names of Iona project for a newsletter they’re putting together, spoke to people about the new ‘Comparative Kingship’ place-names project I’m going to be involved with, spoke to the Scots Language Policy people about setting up a mailing list for the project (it turns out that the University has software to handle this, available here: https://www.gla.ac.uk/myglasgow/it/emaillists/) and fixed an issue relating to the display of citations that have multiple dates for the DSL.
It was another Data Management Plan heavy week this week. I created an initial version of a DMP for Kirsteen McCue’s project at the start of the week and then participated in a Zoom call with Kirsteen and other members of the proposed team on Thursday where the plan was discussed. I also continued to think through the technical aspects of the metaphor-related proposal involving Wendy and colleagues at Duncan of Jordanstone College of Art and Design at Dundee and reviewed another DMP that Katherine Forsyth in Celtic had asked me to look at.
Also this week I spent a bit of time working on the Books and Borrowing project, generating more page image tilesets and their corresponding pages for two more of the Edinburgh ledgers and adding an ‘Events’ page to the project website and giving more members of the project team permission to edit the site. I also had an email chat with Thomas Clancy about the Iona project and created a ‘Call for Papers’ page including submission form on the project website (it’s not live yet, though).
I spent the rest of my week continuing to work on the Anglo-Norman Dictionary. We received the excellent news this week that our AHRC application for funding to complete the remaining letters of the dictionary (and carry out more development work) was successful. This week I made some further tweaks to the new blog pages, adding in the first image in the blog post to the right of the blog snippet on the blog summary page. I also made the new blog pages live, and you can now access them here: https://anglo-norman.net/blog/.
I also made some updates to the bibliography system based on requests from the editors to separate out the display of links to the DEAF website from the actual URLs (previously just the URLs were displayed). I updated the database, the DMS and the new bibliography page to add in a new ‘DEAF link text’ field for both main source text records and items within source text records. I copied the contents of the DEAF field into this new field for all records, I updated the DMS to add in the new fields when adding / editing sources and I updated the new bibliography page so that the text that gets displayed for the DEAF link uses the new field, whereas the actual link through to the DEAF website uses the original field.
The scripts I had written when uploading the new ‘R’ dataset needed to make changes to the data to bring it into line with the data already in the system as the ‘R’ data didn’t include some attributes that were necessary for the system to work with the XML files, namely:
In the <main_entry> tag the attribute ‘lead’, which is used to display the editor’s initials in the front end (e.g. “gdw”) and the ‘id’ attribute, which although not used to uniquely identify the entries in my new system is still used in the XML for things like cross-references and therefore is required and must be unique.
In the <sense> tag the attribute ‘n’, which increments from 1 within each part of speech and is used to identify senses in the front-end.
In the <senseInfo> tag the ID attribute, which is used in the citation and translation searches and the POS attribute which is used to generate the summary information at the top of each entry page.
In the <attestation> tag the ID attribute, which is used in the citation search.
We needed to decide how these will be handled in future – whether they will be manually added to the XML as the editors work on them or whether the upload script needs to add them in at the point of upload. We also needed to consider updates to existing entries. If an editor downloads an entry and then works on it (e.g. adding in a new sense or attestation) then the exported file will already include all of the above attributes, except for any new sections that are added. In such cases should the new sections have the attributes added manually, or do I need to ensure my script checks for the existence of the attributes and only adds the missing ones as required?
We decided that I’d set up the systems to automatically check for the existence of the attributes and add them in if they’re not already present. It will take more time to develop such a system but it will make it more robust and hopefully will result in fewer errors. I’ll also add an option to specify the ‘lead’ initials for the batch of files that are being uploaded, but this will not overwrite the ‘lead’ attribute for any XML files in the batch that already have the attribute specified.
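As a rough idea of what ‘check and only add what’s missing’ might look like, here is a minimal Python sketch. It is not the planned script: it covers only the ‘lead’ attribute and attestation IDs, it assumes the ID attribute is written as `id`, and the `att-N` naming scheme is invented for the example.

```python
import xml.etree.ElementTree as ET


def add_missing_attributes(entry_xml, batch_lead=None):
    """Add 'lead' and attestation ids only where they are absent."""
    root = ET.fromstring(entry_xml)

    # The batch-level initials never overwrite a 'lead' already in the file.
    if batch_lead is not None and root.get("lead") is None:
        root.set("lead", batch_lead)

    # Collect ids already in use so newly generated ones stay unique.
    used = {a.get("id") for a in root.iter("attestation") if a.get("id")}
    counter = 0
    for attestation in root.iter("attestation"):
        if attestation.get("id") is None:
            counter += 1
            while f"att-{counter}" in used:  # hypothetical id scheme
                counter += 1
            attestation.set("id", f"att-{counter}")
            used.add(f"att-{counter}")

    return ET.tostring(root, encoding="unicode")


edited = '<main_entry lead="gdw"><attestation id="att-1"/><attestation/></main_entry>'
fresh = "<main_entry><attestation/></main_entry>"

print(add_missing_attributes(edited, batch_lead="xyz"))
print(add_missing_attributes(fresh, batch_lead="xyz"))
```

Because existing values are left alone, running the same file through twice should change nothing the second time, which is a useful property for entries that are downloaded, edited and re-uploaded.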
I’ll hopefully get a chance to work on this next week. Thankfully this is the last week of home-schooling for us so I should have a bit more time from next week onwards.
I had a couple of Zoom meetings this week. The first, on Monday, was with the Historical Thesaurus team and members of the Oxford English Dictionary’s team to discuss how our two datasets will be aligned and updated in future. It was an interesting meeting, but there’s still a lot of uncertainty regarding how the datasets can be tracked and connected as future updates are made, at least some of which will probably only become apparent when we get new data to integrate.
My second Zoom meeting was on Tuesday with the Place-Names of Iona project to discuss how we will be working with the QGIS package that team members will be using to access some of the archaeological data and Lidar maps, and also to discuss the issue of 10 digit grid references and the potential change from the old OSGB-36 means of generating latitude and longitude from grid references to the new WGS84 method. It was a productive meeting and we decided that we would switch over to WGS84 and I would update the CMS to incorporate the new library for generating latitude and longitude from grid references.
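The first step in any such conversion is turning the grid reference into numeric eastings and northings, and that part can be sketched in Python as below. This is not the library the CMS will use, and it stops short of the datum conversion to latitude and longitude, which needs a proper projection library. The function name and the example references are made up for illustration.

```python
def parse_grid_reference(grid_ref):
    """Convert a British National Grid reference to (easting, northing) in metres."""
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not letters.isalpha() or "I" in letters:
        raise ValueError(f"bad grid letters in {grid_ref!r}")
    if not digits.isdigit() or len(digits) % 2 != 0 or len(digits) > 10:
        raise ValueError(f"bad digits in {grid_ref!r}")

    # Letters index a 5x5 lettering scheme that skips 'I'.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1
    east_100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    north_100km = (19 - (l1 // 5) * 5) - (l2 // 5)
    if not (0 <= east_100km <= 6 and 0 <= north_100km <= 12):
        raise ValueError(f"{grid_ref!r} is outside the national grid")

    # Pad shorter references so a 6-digit ref means hundreds of metres, etc.
    half = len(digits) // 2
    easting = east_100km * 100000 + int(digits[:half].ljust(5, "0"))
    northing = north_100km * 100000 + int(digits[half:].ljust(5, "0"))
    return easting, northing


print(parse_grid_reference("NM 28650 24550"))  # a 10-digit reference
print(parse_grid_reference("NM 286 245"))      # a 6-digit reference
```

A 10-digit reference resolves to the nearest metre, whereas a 6-digit one only identifies a 100 metre square, so the two calls return slightly different points.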
Also this week I continued to work on the Books and Borrowing project, generating image tilesets for the scans of several volumes of ledgers from Edinburgh University Library and writing scripts to generate pages in the Content Management System, creating ‘next’ and ‘previous’ links as required and associating the relevant images. I also had an email correspondence about some of the querying methods we will develop for the data, such as collocation information.
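The page-generation step is essentially building a chain of records, which the following Python sketch illustrates. The field names, the sequential IDs and the filenames are invented for the example; the CMS scripts work against a database rather than in-memory dictionaries.

```python
def build_page_records(image_filenames):
    """Create one page record per image, linked to its neighbours."""
    pages = []
    last_index = len(image_filenames) - 1
    for index, filename in enumerate(image_filenames):
        pages.append({
            "page_id": index + 1,
            "image": filename,
            # The first page has no 'previous' and the last has no 'next'.
            "previous": index if index > 0 else None,
            "next": index + 2 if index < last_index else None,
        })
    return pages


for page in build_page_records(["0001.jpg", "0002.jpg", "0003.jpg"]):
    print(page)
```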
I also gave some feedback on a data management plan for a project I’m involved with, had a chat with Wendy Anderson about a possible future project she’s trying to set up and spent some time making updates to the underlying data of the Interactive Map of Burns Suppers that launched last month. I didn’t have the time to do a huge amount of work on the Anglo-Norman Dictionary this week, but I still managed to migrate some of the project’s old blog posts to our new site over the course of the week.
Finally, I made some updates to the bibliography system for the Dictionary of the Scots Language, updating the new system so it works in a similar manner to the live site. I added ‘Author’ and ‘Title’ to the drop-down items when searching for both to help differentiate them and a search for an item when the user ignores the drop-down options and manually submits the search now works as it does in the live site. I also fixed the issue with selecting ‘Montgomerie, Norah & William’ resulting in a 404 error. This was caused by the ampersand. There were some issues with other non-alphanumeric characters that I’ve fixed too, including slashes and apostrophes.
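One common way to make characters such as ampersands, slashes and apostrophes safe in a URL is to percent-encode the value before it goes into the path. The Python sketch below shows the idea; it is not the fix applied to the DSL site, and the base path is a made-up placeholder.

```python
from urllib.parse import quote, unquote


def author_url(base_path, author):
    """Build a URL path with the author name percent-encoded."""
    # safe="" means slashes are encoded too, so they cannot be mistaken
    # for path separators.
    return f"{base_path}/{quote(author, safe='')}"


url = author_url("/bibliography/author", "Montgomerie, Norah & William")
print(url)
print(unquote(url.rsplit("/", 1)[1]))
```

Decoding the final path segment on the receiving side gives back the original name exactly, ampersand included.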