
Month: September 2020
Week Beginning 21st September 2020
I’d taken Thursday and Friday off this week as it was the Glasgow September Weekend holiday, meaning this was a three-day week for me. It was a week where focussing on any development tasks was rather tricky, as I had four Zoom calls and a dentist’s appointment on the other side of the city during my three working days.
On Monday I had a call with the Historical Thesaurus people to discuss the ongoing task of integrating content from the OED for the second edition. There’s still rather a lot to be done for this, and we’re needing to get it all complete during October, so things are a little stressful. After the meeting I made some further updates to the display of icons signifying a second edition update. I updated the database and front-end to allow categories / subcats to have a changelog (in addition to words). These appear in a transparent circle with a white border and a white number, right aligned. I also updated the display of the icon for words. These also appear as a transparent circle, right aligned, but have the teal colour for a border and the number. I also realised I hadn’t added in the icons for words in subcats, so put these in place too.
After that I set about updating the dates of HT lexemes based on some rules that Fraser had developed. I created and ran scripts that updated the start dates of 91,364 lexemes based on OED dates and then ran a further script that updated the end dates of 157,156 lexemes. These took quite a while to run (the latter I was dealing with during my time off) but it’s good that progress is being made.
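As a purely illustrative sketch – the real updates were one-off database scripts and the actual rules are Fraser’s, which I’m not reproducing here – an antedating rule of the form ‘use the OED date when it is earlier than the current HT start date’ might look something like this, with hypothetical field names:

def antedate_start(ht_lexeme, oed_lexeme):
    """Hypothetical rule: take the OED first-citation year when it is
    earlier than the HT start year. Returns True if a change was made."""
    oed_start = oed_lexeme.get("startdate")
    ht_start = ht_lexeme.get("startdate")
    if oed_start is not None and ht_start is not None and oed_start < ht_start:
        ht_lexeme["startdate"] = oed_start
        return True
    return False

# Dummy rows purely for illustration
ht = {"htid": 1, "startdate": 1500}
oed = {"htid": 1, "startdate": 1470}
print(antedate_start(ht, oed))  # True; ht["startdate"] is now 1470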
My second Zoom call of the week was for the Books and Borrowing project, and was with the project PI and Co-I and someone who is transcribing library records from a private library that we’re now intending to incorporate into the project’s system. We discussed the data and the library and made a plan for how we’re going to work with the data in future. My third and fourth Zoom calls were for the new Place-names of Iona project that is just starting up. It was a good opportunity to meet the rest of the project team (other than the RA who has yet to be appointed) and discuss how and when tasks will be completed. We’ve decided that we’ll use the same content management system as the one I already set up for the Mull and Ulva project, as this already includes Iona data from the GB1900 project. I’ll need to update the system so that we can differentiate place-names that should only appear on the Iona front-end, the Mull and Ulva front-end or both. This is because for Iona we are going to be going into much more detail, down to individual monuments and other ‘microtoponyms’, whereas the names in the Mull and Ulva project are much more high-level.
For the rest of my available time this week I made some further updates to the script I wrote last week for Fraser’s Scots Thesaurus project, ordering the results by part of speech and ensuring that hyphenated words are properly searched for (as opposed to being split into separate words joined by an ‘or’). I also spent some time working for the DSL people, firstly updating the text on the search results page and secondly tracking down the certificate for the Android version of the School Dictionary app. This was located on my PC at work, so I had arranged to get access to my office whilst I was already in the West End for my dentist’s appointment. Unfortunately what I thought was the right file turned out to be the certificate for an earlier version of the app, meaning I had to travel all the way back to my office again later in the week (when I was on holiday) to find the correct file.
I also managed to find a little time to continue working on the new Anglo-Norman Dictionary site, focusing again on the display of the ‘entry’ page. I updated my XSLT to ensure that ‘parglosses’ are visible and that cross reference links now appear. Explanatory labels are also now in place. These currently appear with a grey background but eventually they will be links to the label search results page. Semantic labels are also now in place; they too currently have a grey background but will become links through to search results. However, the System XML notes whether certain semantic labels should be shown or not. So, for example <label type="sem" show="no">med.</label> doesn’t get shown. Unfortunately there is nothing comparable in the Editors’ XML (it’s just <semantic value="med."/>) so I can’t hide such labels. Finally, the initials of the editor who made the last update now appear in square brackets to the right of the end of the entry.
Also, my new PC was delivered on Thursday and I spent a lot of time over the weekend transferring all of my data and programs across from my old PC.
Week Beginning 14th September 2020
This was another busy week involving lots of projects. For the Books and Borrowing project I wrote an import script to process the Glasgow Professors borrowing records, comprising more than 7,000 rows in a spreadsheet. It was tricky to integrate this with the rest of the project’s data and it took about a day to write the necessary processing scripts. I can only run the scripts on the real data in the evening as I need to take the CMS offline to do so, otherwise changes made to the database whilst I’m integrating the data would be lost, and unfortunately it took three attempts to get the import to work properly. There are a few reasons why this data has been particularly tricky. Firstly, it needed to be integrated with existing Glasgow data, rather than being a ‘fresh’ upload to a new library. This caused some problems as my scripts that match up borrowing records and borrowers were getting confused with the existing Student borrowers. Secondly, the spreadsheet was not in page order for each register – the order appears to have been ‘10r’, ‘10v’, then ‘11r’ etc., with ‘1r’ only coming after ‘19v’. This is presumably to do with Excel ordering the numbers as text. I tried reordering on the ‘sort order’ column but this also ordered things weirdly (all the numbers beginning with 1, then all the numbers beginning with 2 etc.). I tried changing the data type of this field to a number rather than text but that just resulted in Excel giving errors in all of the fields. What this meant was that I needed to sort the data in my own script before I could use it (otherwise the ‘next’ and ‘previous’ page links would all have been wrong), and it took time to implement this. However, I got there in the end.
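The sorting problem itself is straightforward enough once each folio reference is split into its numeric and recto/verso parts. Below is a little Python sketch of the idea – the real import script is part of the project’s own processing code, so this is just an illustration:

import re

# Folio references in the order Excel's text sort produced them (hypothetical)
pages = ["10r", "10v", "11r", "19v", "1r", "1v", "2r"]

def folio_key(folio):
    """Split a reference like '10v' into (10, 'v') so the number is compared
    numerically and 'r' (recto) sorts before 'v' (verso)."""
    match = re.match(r"(\d+)([rv]?)", folio)
    return (int(match.group(1)), match.group(2))

pages.sort(key=folio_key)
print(pages)  # ['1r', '1v', '2r', '10r', '10v', '11r', '19v']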
I also continued working on the Historical Thesaurus database and front-end to allow us to use the new date fields and to enable us to keep track of which lexemes and categories have been updated in the new Second Edition. I have now fully migrated my second edition test site to using the new date system, including the advanced search for labels and both the ‘simple’ and ‘advanced’ date searches. I have also now created the database structure for dealing with second edition updates. As we agreed at the last team meeting, the lexeme and category tables have been updated to each have two new fields – ‘last_updated’, which holds a human-readable date (YYYY-MM-DD) that will be automatically populated when rows are updated, and ‘changelogcode’, which holds the ID of the row in the new ‘changelog’ table that applies to the lexeme or category. This new table consists of an ID, a ‘type’ (lexeme or category) and the text of the changelog. I’ve created two changelogs for test purposes: ‘This word was antedated in the second edition’ and ‘This word was postdated in the second edition’. I’ve realised that this structure means only one changelog can be associated with a lexeme, with a new one overwriting the old one. A more robust system would record all of the changelogs that have been applied to a lexeme or category and the dates these were applied, and depending on what Marc and Fraser think I may update the system with an extra joining table that would allow this paper trail to be recorded.
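As a sketch of the structure described above (using SQLite purely for illustration – the real tables live in the HT’s own database and have many more columns; only ‘last_updated’, ‘changelogcode’ and the general shape of the changelog table follow the text, everything else is hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE changelog (
    id         INTEGER PRIMARY KEY,
    type       TEXT,   -- 'lexeme' or 'category'
    changetext TEXT    -- the changelog text displayed in the tooltip
);

CREATE TABLE lexeme (
    htid          INTEGER PRIMARY KEY,
    word          TEXT,
    last_updated  TEXT,    -- YYYY-MM-DD, populated whenever the row changes
    changelogcode INTEGER REFERENCES changelog(id)
);
""")

conn.execute("INSERT INTO changelog VALUES (1, 'lexeme', 'This word was antedated in the second edition')")
conn.execute("INSERT INTO changelog VALUES (2, 'lexeme', 'This word was postdated in the second edition')")

# The more robust 'paper trail' option would add a joining table so a lexeme
# or category can accumulate multiple changelog entries over time:
conn.executescript("""
CREATE TABLE lexeme_changelog (
    htid         INTEGER,
    changelog_id INTEGER,
    applied      TEXT    -- date the change was applied
);
""")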
For now I’ve updated two lexemes in category 1 to use the two changelogs for test purposes. I’ve updated the category browser in the front end to add in a ‘2’ in a circle where ‘second edition’ changelog IDs are present. These have tooltips that display the changelog text when hovered over, as the following screenshot demonstrates:
I haven’t added these circles to the search results yet or the full timeline visualisations, but it is likely that they will need to appear there too.
I also spent some time working on a new script for Fraser’s Scots Thesaurus project. This script allows a user to select an HT category to bring back all of the words contained in it. It then queries the DSL for each of these words and returns a list of those entries that contain at least two of the category’s words somewhere in the entry text. The script outputs the name of the category that was searched for, a list of returned HT words so you can see exactly what is being searched for, and the DSL entries that feature at least two of the words in a table that contains fields such as source dictionary, parts of speech, a link through to the DSL entry, headword etc. I may have to tweak this further next week, but it seems to be working pretty well.
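The core of the matching logic is simple enough to sketch. The snippet below is only an illustration – the real script queries the DSL data itself and returns source dictionary, part of speech, headword and a link for each match – and the entry structure used here is a made-up stand-in:

def entries_with_two_matches(category_words, dsl_entries):
    """Return the DSL entries whose text contains at least two distinct
    words from the chosen HT category. Each entry is a dict with (at least)
    'headword' and 'text' keys, standing in for whatever the real DSL query
    returns. Matching on the whole word keeps hyphenated words intact."""
    results = []
    for entry in dsl_entries:
        text = entry["text"].lower()
        matched = {w for w in category_words if w.lower() in text}
        if len(matched) >= 2:
            results.append({**entry, "matched": sorted(matched)})
    return results

# Hypothetical example
words = ["gloaming", "twilight", "dusk"]
entries = [
    {"headword": "gloamin", "text": "The gloaming; twilight, dusk."},
    {"headword": "daw", "text": "To dawn."},
]
print([e["headword"] for e in entries_with_two_matches(words, entries)])  # ['gloamin']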
I spent most of the rest of the week working on the redevelopment of the Anglo-Norman Dictionary. We had a bit of a shock at the start of the week because the entire old site was offline and inaccessible. It turned out that the domain name subscription had expired, and thankfully it was possible to renew it and the site became available again. I spent a lot of time this week continuing to work on the entry page, trying to untangle the existing XSLT script and work out how to apply the necessary rules to the editors’ version of the XML, which differs from the system version of the XML that was generated by an incomprehensible and undocumented series of processes in the old system.
I started off with the references located within the variant forms. In the existing site these link through to source texts, with information appearing in a pop-up when the reference is clicked on. To get these working I needed to figure out where the list of source texts was being stored and also how to make the references appear properly. The Editors’ XML and the System XML differ in structure, and only the latter actually contains the text that appears as the link. So, for example, while the latter has:
<cit> <bibl siglum="Secr_waterford1" loc="94.787"><i>Secr</i> <sc>waterford</sc><sup>1</sup> 94.787</bibl> </cit>
The former only has:
<varref> <reference><source siglum="Secr_waterford1" target=""><loc>94.787</loc></source></reference> </varref>
This meant that the text to display and its formatting (<i>Secr</i> <sc>waterford</sc><sup>1</sup>) are not available to me. Thankfully I managed to track down an XML file that contains the list of texts, which includes this formatting and also all of the information that should appear in the pop-up that is opened when the link is clicked on, e.g.
<item id="Secr_waterford1" cits="552">
<siglum><i>Secr</i> <span class="sc">WATERFORD</span><sup>1</sup></siglum>
<bibl>Yela Schauwecker, <i>Die Diätetik nach dem ‘Secretum secretorum’ in der Version von Jofroi de Waterford: Teiledition und lexikalische Untersuchung</i>, Würzburger medizinhistorische Forschungen 92, Würzburg, 2007</bibl>
<date>c.1300 (text and MS)</date>
<deaf><a href="/cgi-bin/deaf?siglum=SecrSecrPr2S">SecrSecrPr<sup>2</sup>H</a></deaf>
<and><a href="/cgi-bin-s/and-lookup-siglum?term='Secr_waterford1'">citations</a></and>
</item>
I imported all of the data from this XML file into the new system and have created new endpoints in the API to access the data. There is one that brings back the slug (the text used in the URL) and the ‘siglum’ text of all sigla and another that brings back all of the information about a given siglum. I then updated the ‘Entry’ page to add in the references in the forms and used a bit of JavaScript to grab the ‘all sigla’ output of the API and then add the full siglum text in for each reference as required. You can also now click on a reference to open a pop-up that contains the full information.
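To give a flavour of the import, the kind of extraction involved for each siglum can be sketched as follows. The element names come from the snippet above, but the trimmed-down bibliographical text is just for illustration and the real import handles many more fields and edge cases:

import xml.etree.ElementTree as ET

item_xml = """<item id="Secr_waterford1" cits="552">
  <siglum><i>Secr</i> <span class="sc">WATERFORD</span><sup>1</sup></siglum>
  <bibl>Yela Schauwecker, <i>Die Diätetik nach dem Secretum secretorum</i>, 2007</bibl>
  <date>c.1300 (text and MS)</date>
</item>"""

item = ET.fromstring(item_xml)
siglum_el = item.find("siglum")

record = {
    "slug": item.get("id"),  # the text used in the URL, e.g. Secr_waterford1
    # keep the inline formatting (<i>, <sup> etc.) so it can be rendered as-is
    "siglum_html": ET.tostring(siglum_el, encoding="unicode").strip(),
    "bibl": "".join(item.find("bibl").itertext()),
    "date": item.findtext("date"),
}
print(record["slug"], record["date"])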
I then turned my attention to the cognate references section, and there were some issues here too with the Editors’ XML not including information that is in the System XML. The structure of the cognate references in the System XML is like this:
<xr_group type="cognate" linkable="yes"> <xr><ref siglum="FEW" target="90/page/231" loc="9,231b">posse</ref></xr> </xr_group>
Note that there is a ‘target’ attribute that provides a link. The Editors’ XML does not include this information – here’s the same reference:
<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>
There’s nothing in there that I can use to ascertain the correct link to add in. However, I have found a ‘hash’ file called ‘cognate_hash’ which, when extracted, contains a list of cognate references and targets. These don’t include entry identifiers so I’m not sure how they were connected to entries, but by combining the ‘siglum’ and the ‘loc’ it looks like it might be possible to find the target, e.g.:
<xr_group type="cognate" linkable="yes">
<xr>
<ref siglum="FEW" target="90/page/231" loc="*9,231b">posse</ref>
</xr>
</xr_group>
I’m not sure why there’s an asterisk, though. I also found another hash file called ‘commentary_hash’ that I guess contains the commentaries that appear in some entries but not in their XML. We’ll probably need to figure out whether we want to properly integrate these with the editor’s XML as well.
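Returning to the cognate references, the lookup I have in mind would work along these lines – purely a sketch, assuming that the combination of siglum and location really is unique enough to identify the target:

import xml.etree.ElementTree as ET

# Illustrative only: the cognate_hash file appears to hold fragments like the
# one above. A lookup table keyed on (siglum, loc) could supply the missing
# 'target' attribute for the Editors' XML references.
hash_fragment = """<xr_group type="cognate" linkable="yes">
  <xr><ref siglum="FEW" target="90/page/231" loc="*9,231b">posse</ref></xr>
</xr_group>"""

targets = {}
group = ET.fromstring(hash_fragment)
for ref in group.iter("ref"):
    key = (ref.get("siglum"), ref.get("loc").lstrip("*"))  # strip the stray asterisk
    targets[key] = ref.get("target")

# An Editors' XML reference (siglum attribute plus <link_loc>) could then be
# resolved like this:
editors_ref = ET.fromstring(
    '<FEW_refs siglum="FEW" linkable="yes">'
    '<link_form>posse</link_form><link_loc>9,231b</link_loc></FEW_refs>')
key = (editors_ref.get("siglum"), editors_ref.findtext("link_loc"))
print(targets.get(key))  # 90/page/231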
I completed work on the ‘cognate references’ section, omitting the links out for now (I’ll add these in later) and then moved on to the ‘summary’ box that contains links through to lower sections of the entry. Unfortunately the ‘sense’ numbers are something else that is not present in any form in the Editors’ XML. In the System XML each sense has a number, e.g. ‘<sense n="1">’, but in the Editors’ XML there is no such number. I spent quite a bit of time trying to increment a number in XSLT and apply it to each sense, but it turns out that XSLT variables are immutable, so you can’t simply increment a counter inside a loop the way you would in other languages.
For now the numbers all just say ‘n’. Again, I’m not sure how best to tackle this. I could update the Editors’ XML to add in numbers, meaning I would then have to edit the XML the editors are working on whenever they’re ready to send it to me. Or I could dynamically add in the numbers to the XML when the API outputs it. Or I could dynamically add in the numbers in the client’s browser using JavaScript. I’m also still in the process of working on the senses, subsenses and locutions. A lot of the information is now being displayed, for example the part of speech, the number (again just ‘n’ for now), the translations, attestations, references, variants, legiturs, edglosses, locutions and xrefs.
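As an illustration of the second option (adding the numbers when the API outputs the XML), the logic is just a sequential counter per entry. This sketch uses Python and simplified element names rather than the actual API code:

import xml.etree.ElementTree as ET

entry = ET.fromstring("<entry><sense/><sense/><sense/></entry>")

# Number the senses in document order, which is what the summary box needs
for n, sense in enumerate(entry.iter("sense"), start=1):
    sense.set("n", str(n))

print(ET.tostring(entry, encoding="unicode"))
# e.g. <entry><sense n="1" /><sense n="2" /><sense n="3" /></entry>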
I still need to add in the non-locution xrefs, labels and some other things, but overall I’m very happy with the progress I’ve made this week. Below is an example of an entry in the old site, followed by the same entry as it currently looks in the new test site I’m working on (be aware that the new interface is only a placeholder). Before:
After:
Week Beginning 7th September 2020
This was a pretty busy week, involving lots of different projects. I set up the systems for a new place-name project focusing on Ayrshire this week, based on the system that I initially developed for the Berwickshire project and that has subsequently been used for Kirkcudbrightshire and Mull. It didn’t take too long to port the system over, but the PI also wanted the system to be populated with data from the GB1900 crowdsourcing project. This project has transcribed every place-name on the GB1900 Ordnance Survey maps across the whole of the UK and is an amazing collection of data totalling some 2.5 million names. I had previously extracted a subset of names for the Mull and Ulva project so thankfully had all of the scripts needed to get the information for Ayrshire. Unfortunately what I didn’t have was the data in a database, as I’d previously extracted it to my PC at work. This meant that I had to run the extraction script again on my home PC, which took about three days to work through all of the rows in the monstrous CSV file. Once this was complete I could then extract the names found in the Ayrshire parishes that the project will be dealing with, resulting in almost 4,000 place-names. However, this wasn’t the end of the process, as while the extracted place-names had latitude and longitude they didn’t have grid references or altitude. My place-names system is set up to automatically generate these values and I could customise the scripts to automatically apply the generated data to each of the 4,000 places. Generating the grid reference was pretty straightforward but grabbing the altitude was less so, as it involved submitting a query to Google Maps and then inserting the returned value into my system using an AJAX call. I ran into difficulties with my script exceeding the allowed number of Google Maps queries and also the maximum number of page requests on our server, resulting in my PC getting blocked by the server and a ‘Forbidden’ error being displayed instead, but with some tweaking I managed to get everything working within the allowed limits.
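For reference, the key to staying within the limits was simply throttling the altitude lookups. A sketch of the approach – the API key, coordinates and pause length below are placeholders, and the real system feeds the values back into the CMS via AJAX calls rather than running as a standalone script:

import time
import requests

API_KEY = "YOUR_KEY"
ELEVATION_URL = "https://maps.googleapis.com/maps/api/elevation/json"

places = [{"id": 1, "lat": 55.458, "lng": -4.629}]  # hypothetical Ayrshire point

for place in places:
    resp = requests.get(ELEVATION_URL, params={
        "locations": f"{place['lat']},{place['lng']}",
        "key": API_KEY,
    })
    data = resp.json()
    if data.get("status") == "OK":
        place["altitude"] = data["results"][0]["elevation"]
    time.sleep(0.5)  # pause between requests so neither Google nor our own server blocks us

print(places)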
I also continued to work on the Second Edition of the Historical Thesaurus. I set up a new version of the website that we will work on for the Second Edition, and created new versions of the database tables that this new site connects to. I also spent some time thinking about how we will implement some kind of changelog or ‘history’ feature to track changes to the lexemes, their dates and corresponding categories. I had a Zoom call with Marc and Fraser on Wednesday to discuss the developments and we realised that the date matching spreadsheets I’d generated last week could do with some additional columns from the OED data, namely links through to the entries on the OED website and also a note to say whether the definition contains ‘(a)’ or ‘(also’ as these would suggest the entry has multiple senses that may need a closer analysis of the dates.
I then started to update the new front-end to use the new date structure that we will use for the Second Edition (with dates stored in a separate date table rather than split across almost 20 different date fields in the lexeme table). I updated the timeline visualisations (mini and full) to use this new date table, and although it took quite some time to get my head around, the resulting code is MUCH less complicated than the horrible code I had to write to deal with the old 20-odd date columns. For example, the code to generate the data for the mini timelines is about 70 lines long now as opposed to over 400 previously.
The timelines now use the new date table in both the category browse and the search results. I also spotted that some dates weren’t working properly with the old system but are working properly now. I then updated the ‘label’ autocomplete in the advanced search to use the labels in the new date table. What I still need to do is update the search to actually search for the new labels and also to search the new date tables for both ‘simple’ and ‘complex’ year searches. This might be a little tricky, and I will continue with this next week.
Also this week I gave Gerry McKeever some advice about preserving the data of his Regional Romanticism project, spoke to the DSL people about the wording of the search results page, gave feedback on and wrote some sections for Matthew Creasy’s Chancellor’s Fund proposal, gave feedback to Craig Lamont regarding the structure of a spreadsheet for holding data about the correspondence of Robert Burns and gave some advice to Rob Maslen about the stats for his ‘City of Lost Books’ blog. I also made a couple of tweaks to the content management system for the Books and Borrowing project based on feedback from the team.
I spent the remainder of the week working on the redevelopment of the Anglo-Norman dictionary. I updated the search results page to style the parts of speech to make it clearer where one ends and the next begins. I also reworked the ‘forms’ section to add in a cut-off point for entries that have a huge number of forms. In such cases the long list is cut off and an ellipsis is added, together with an ‘expand’ button. Pressing this reveals the full list of forms and the button is replaced with a ‘collapse’ button. I also updated the search so that it no longer includes cross references (these are to be used for the ‘Browse’ list only) and the quick search now defaults to an exact match search whether you select an item from the auto-complete or not. Previously it performed an exact match if you selected an item but defaulted to a partial match if you didn’t. Now if you search for ‘mes’ (for example) and press enter or the search button your results are for “mes” (exactly). I suspect most people will select ‘mes’ from the list of options anyway, which already behaved this way. It is also still possible to use the question mark wildcard with an ‘exact’ search, e.g. “m?s” will find 14 entries that have three-letter forms beginning with ‘m’ and ending in ‘s’.
I also updated the display of the parts of speech so that they are in order of appearance in the XML rather than alphabetically and I’ve updated the ‘v.a.’ and ‘v.n.’ labels as the editor requested. I also updated the ‘entry’ page to make the ‘results’ tab load by default when reaching an entry from the search results page or when choosing a different entry in the search results tab. In addition, the search result navigation buttons no longer appear in the search tab if all the results fit on the page and the ‘clear search’ button now works properly. Also, on the search results page the pagination options now only appear if there is more than one page of results.
On Friday I began to process the entry XML for display on the entry page, which was pretty slow going, wading through the XSLT file that is used to transform the XML to HTML for display. Unfortunately I can’t just use the existing XSLT file from the old site because we’re using the editor’s version of the XML and not the system version, and the two are structurally very different in places.
So far I’ve been dealing with forms and have managed to get the forms listed, with grammatical labels displayed where available, commas separating forms and semi-colons separating groups of forms. Deviant forms are surrounded by brackets. Where there are lots of forms the area is cut off, as with the search results. I still need to add in references where these appear, which is what I’ll tackle next week. Hopefully, now that I’ve started to get my head around the XML a bit, progress with the rest of the page will be a little speedier, but there will undoubtedly be many more complexities that will need to be dealt with.
Week Beginning 31st August 2020
I worked on many different projects this week, and the largest amount of my time went into the redevelopment of the Anglo-Norman Dictionary. I processed a lot of the data this week and have created database tables and written extraction scripts to export labels, parts of speech, forms and cross references from the XML. The data extracted will be used for search purposes, for display on the website in places such as the search results or will be used to navigate between entries. The scripts will also be used when updating data in the new content management system for the dictionary when I write it. I have extracted 85,397 parts of speech, 31,213 cross references, 150,077 forms and their types (lemma / variant / deviant) and 86,269 labels which correspond to one of 157 unique labels (usage or semantic), which I also extracted.
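To give a flavour of what these extraction scripts do, here is a rough sketch with heavily simplified, made-up element names; the real Editors’ XML is far more complex and the production scripts store the results in database tables:

import xml.etree.ElementTree as ET

entry_xml = """<entry id="example">
  <lemma>poer</lemma>
  <variant>pouer</variant>
  <deviant>poair</deviant>
  <pos>sbst.</pos>
  <semantic value="law"/>
  <xref target="pooir_1"/>
</entry>"""

entry = ET.fromstring(entry_xml)
extracted = {
    # forms and their types (lemma / variant / deviant)
    "forms": [(el.tag, el.text) for el in entry
              if el.tag in ("lemma", "variant", "deviant")],
    "pos": [el.text for el in entry.findall("pos")],
    "labels": [el.get("value") for el in entry.findall("semantic")],
    "xrefs": [el.get("target") for el in entry.findall("xref")],
}
print(extracted)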
I have also finished work on the quick search feature, which is now fully operational. This involved creating a new endpoint in the API for processing the search. This includes the query for the predictive search (i.e. the drop-down list of possible options that appears as you type), which returns any forms that match what you’re typing in, and the query for the full quick search, which allows you to use ‘?’ and ‘*’ wildcards (and also “” for an exact match) and returns all of the data about each entry that is needed for the search results page. For example, if you type in ‘from’ in the ‘Quick Search’ box a drop-down list containing all matching forms will appear. Note that these are forms rather than just headwords, so they include not only lemmas but also variants and deviants. If you select a form that is associated with one single entry then the entry’s page will load. If you select a form that is associated with more than one entry then the search results page will load. You can also choose not to select an item from the drop-down list and search for whatever you’re interested in. For example, enter ‘*ment’ and press enter or the search button to view all of the forms ending in ‘ment’, as the following screenshot demonstrates (note that this is not the final user interface but one purely for test purposes):
With this example you’ll see that the results are paginated, with 100 results per page. You can browse through the pages using the next and previous buttons or select one of the pages to jump directly to it. You can bookmark specific results pages too. Currently the search results display the lemma and homonym number (if applicable) and display whether the entry is an xref or not. Associated parts of speech appear after the lemma. Each one currently has a tooltip and we can add in descriptions of what each POS abbreviation means, although these might not be needed. All of the variant / deviant forms are also displayed as otherwise it can be quite confusing for users if the lemma does not match the term the user entered but a form does. All associated semantic / usage labels are also displayed. I’m also intending to add in earliest citation date and possibly translations to the results as well, but I haven’t extracted them yet.
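The wildcard handling described above essentially boils down to translating the quick-search syntax into a SQL ‘LIKE’ pattern. A sketch of the idea (not the actual API code):

def like_pattern(term):
    """Translate the quick-search syntax into a SQL LIKE pattern: '?' matches
    a single character, '*' matches any run of characters, and surrounding
    double quotes mean 'exact match' (no extra wildcards added). A sketch only."""
    term = term.strip('"')
    return term.replace("*", "%").replace("?", "_")

print(like_pattern('*ment'))   # %ment -> all forms ending in 'ment'
print(like_pattern('"m?s"'))   # m_s   -> three-letter forms starting 'm', ending 's'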
When you click on an entry from the search results this loads the corresponding entry page. I have updated this to add in tabs to the left-hand column. In addition to the ‘Browse’ tab there is a ‘Results’ tab and a ‘Log’ tab. The latter doesn’t contain anything yet, but the former contains the search results. This allows you to browse up and down the search results in the same way as the regular ‘browse’ feature, selecting another entry. You can also return to the full results page. I still need to do some tweaking to this feature, such as ensuring the ‘Results’ tab loads by default if coming from a search result. The ‘clear’ option also doesn’t currently work properly. I’ll continue with this next week.
For the Books and Borrowing project I spent a bit of time getting the page images for the Westerkirk library uploaded to the server and the page records created for each corresponding page image. I also made some final tweaks to the Glasgow Students pilot website that Matthew Sangster and I worked on and this is now live and available here: https://18c-borrowing.glasgow.ac.uk/.
There are three new place-name related projects starting up at the moment and I spent some time creating initial websites for all of these. I still need to add in the place-name content management systems for two of them, and I’m hoping to find some time to work on this next week. I also spoke to Joanna Kopaczyk about a website for an RSE proposal she’s currently putting together and gave some advice to some people in Special Collections about a project that they are planning.
On Tuesday I had a Zoom call with the ‘Editing Robert Burns’ people to discuss developing the website for phase two of the Editing Robert Burns project. We discussed how the website would integrate with the existing website (https://burnsc21.glasgow.ac.uk/) and discussed some of the features that would be present on the new site, such as an interactive map of Burns’ correspondence and a database of forged items.
I also had a meeting with the Historical Thesaurus people on Tuesday and spent some time this week continuing to work on the extraction of dates from the OED data, which will feed into a new second edition of the HT. I fixed all of the ‘dot’ dates in the HT data. This is where there isn’t a specific date and dots are used instead (e.g. ‘14..’); sometimes a specific year is given in the year attribute (e.g. 1432), while at other times a more general year is given (e.g. 1400). We worked out a set of rules for dealing with these and I created a script to process them. I then reworked my script that extracts dates for all lexemes that match a specific date pattern (YYYY-YYYY, where the first year might be Old English and the last year might be ‘Current’) and sent this to Fraser so that the team can decide which of these dates should be used in the new version of the HT. Next week I’ll begin work on a new version of the HT website that uses an updated dataset so we can compare the original dates with the newly updated ones.
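For illustration, the simple date pattern mentioned above can be matched with a regular expression along these lines – the ‘OE’ and ‘Current’ tokens and the sample strings are just stand-ins for however the real data represents them, and the actual date strings need a good deal more special handling than this:

import re

pattern = re.compile(r"^(OE|\d{4})-(\d{4}|Current)$")

for fulldate in ["1432-1890", "OE-Current", "a1400-1650", "14..-1500"]:
    match = pattern.match(fulldate)
    print(fulldate, "matches" if match else "needs special handling")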