This was the last week before Christmas and it’s a four-day week as the University has generously given us all an extra day’s holiday on Christmas Eve. I also lost a bit of time due to getting my Covid booster vaccine on Wednesday. I was booked in for 9:50 and got there at 9:30 to find a massive queue snaking round the car park. It took an hour to queue outside, plus about 15 minutes inside, but I finally got my booster just before 11. The after-effects kicked in during Wednesday night and I wasn’t feeling great on Thursday, but I managed to work.
My major task of the week was to deal with the new Innerpeffray data for the Books and Borrowing project. I’d previously uploaded data from an existing spreadsheet in the early days of the project, but it turns out that there were quite a lot of issues with the data and therefore one of the RAs has been creating a new spreadsheet containing reworked data. The RA Kit got back to me this week after I’d checked some issues with her last week and I therefore began the process of deleting the existing data and importing the new data.
It was a pretty tortuous process, but I managed to finish deleting the existing Innerpeffray data and import the new data. This required a fair amount of complex processing and checking via a script I wrote this week. I managed to retain superscript characters in the transcriptions, something that proved to be very tricky as there is no way to find and replace superscript characters in Excel. Eventually I ended up copying the transcription column into Word, then saving the table as HTML, stripping out all of the rubbish Word adds in when it generates an HTML file, and then using this resulting file alongside the main spreadsheet file that I saved as a CSV. After several attempts at running the script on my local PC, then fixing issues, then rerunning, I eventually reckoned the script was working as it should – adding page, borrowing, borrower, borrower occupation, book holding and book item records as required. I then ran the script on the server and the data is now available via the CMS.
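The superscript-preserving step can be sketched like this – a minimal Python sketch, assuming a caret notation ‘^{…}’ as an arbitrary marker for superscript runs (the real script may record them differently):

```python
import re

def extract_superscripts(html_fragment):
    """Convert Word-generated HTML to plain text, keeping superscript
    runs as caret-wrapped markers (e.g. 'M<sup>r</sup>' -> 'M^{r}')."""
    # Mark superscript spans before stripping the remaining tags.
    text = re.sub(r'<sup[^>]*>(.*?)</sup>', r'^{\1}', html_fragment,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop every other tag Word has added.
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse the whitespace Word scatters through its output.
    return ' '.join(text.split())

print(extract_superscripts('<p class=MsoNormal>M<sup>r</sup> Drummond</p>'))
```

The marked-up text can then be matched row by row against the CSV export of the main spreadsheet.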
There were a few normalised occupations that weren’t right and I updated these. There were also 287 standardised titles that didn’t match any existing book holding records in Innerpeffray. For these I created a new holding record and (if there’s an ESTC number) linked to a corresponding edition.
Also this week I completed work on the ‘Guess the Category’ quizzes for the Historical Thesaurus. Fraser had got back to me about the spreadsheets of categories and lexemes that might cause offence and should therefore never appear in the quiz. I added a new ‘inquiz’ column to both the category and lexeme table which has been set to ‘N’ for each matching category and lexeme. I also updated the code behind the quiz so that only categories and lexemes with ‘inquiz’ set to ‘Y’ are picked up.
The category exclusions are pretty major – a total of 17,111 categories are now excluded. This is due to including child categories where noted, and 8,340 of these are within ’03.08 Faith’. For lexemes there are a total of 2,174 that are specifically noted as excluded based on both tabs of the spreadsheet (but note that all lexemes in excluded categories are excluded by default – a total of 69,099). The quiz picks a category first and then a lexeme within it, so there should never be a case where a lexeme in an excluded category is displayed. I also ensured that when a non-noun category is returned and there isn’t a full trail of categories (because there isn’t a parent in the same part of speech), the trail is populated from the noun categories instead.
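The filtering boils down to checking the new ‘inquiz’ flag at both levels. A toy sketch using SQLite – the table and column names other than ‘inquiz’ are invented for illustration:

```python
import sqlite3
import random

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE category (catid INTEGER PRIMARY KEY, heading TEXT, inquiz TEXT);
    CREATE TABLE lexeme (lexid INTEGER PRIMARY KEY, catid INTEGER, word TEXT, inquiz TEXT);
    INSERT INTO category VALUES (1, 'Weather', 'Y'), (2, 'Faith', 'N');
    INSERT INTO lexeme VALUES (1, 1, 'drizzle', 'Y'), (2, 1, 'smirr', 'N'),
                              (3, 2, 'psalm', 'Y');
""")

def pick_question(conn):
    """Pick a random includable category, then a random includable lexeme
    within it. Lexemes in excluded categories can never be reached
    because the category is filtered first."""
    cats = conn.execute(
        "SELECT catid, heading FROM category WHERE inquiz = 'Y'").fetchall()
    catid, heading = random.choice(cats)
    words = conn.execute(
        "SELECT word FROM lexeme WHERE catid = ? AND inquiz = 'Y'",
        (catid,)).fetchall()
    return heading, random.choice(words)[0]

print(pick_question(conn))  # always ('Weather', 'drizzle') with this toy data
```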
The two quizzes (a main one and an Old English one) are now live and can be viewed here:
Also this week I made a couple of tweaks to the Comparative Kingship place-names systems, adding in Pictish as a language and tweaking how ‘codes’ appear in the map. I also helped Raymond migrate the Anglo-Norman Dictionary to the new server that was purchased earlier this year. We had to make a few tweaks to get the site to work at a temporary URL but it’s looking good now. We’ll update the DNS and make the URL point to the new server in the New Year.
That’s all for this year. If there is anyone reading this (doubtful, I know) I wish you a merry Christmas and all the best for 2022!
My big task of the week was to return to working on the Speak For Yersel project after a couple of weeks during which my services weren’t required. I had a meeting with PI Jennifer Smith and RA Mary Robinson on Monday where we discussed the current status of the project and the tasks I should focus on next. Mary had finished work on the geographical areas we are going to use. These are based on postcode areas but a number of areas have been amalgamated. We’ll use these to register where a participant is from and also to generate a map marker representing their responses at a random location within their selected area, based on the research I did a few weeks ago about randomly positioning a marker in a polygon.
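That random-positioning approach boils down to rejection sampling: draw a point in the polygon’s bounding box and keep it only if it falls inside the polygon. A minimal pure-Python sketch (the ray-casting test and the toy ‘area’ square are illustrative – in the project itself this is handled by turf.js):

```python
import random

def point_in_polygon(lng, lat, ring):
    """Ray-casting test: is (lng, lat) inside the polygon ring
    (a list of [lng, lat] pairs)?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        if (yi > lat) != (yj > lat) and \
           lng < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def random_point_in_polygon(ring):
    """Keep drawing random points in the bounding box until one
    lands inside the polygon."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    while True:
        lng = random.uniform(min(xs), max(xs))
        lat = random.uniform(min(ys), max(ys))
        if point_in_polygon(lng, lat, ring):
            return lng, lat

# A toy square 'postcode area' roughly around Glasgow
square = [[-4.4, 55.8], [-4.2, 55.8], [-4.2, 55.9], [-4.4, 55.9]]
lng, lat = random_point_in_polygon(square)
print(lng, lat)
```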
The files that Mary sent me were two exports from ArcGIS, one as JSON and one as GeoJSON. Unfortunately both files used a coordinate system other than latitude and longitude; the GeoJSON file didn’t include any identifiers for the areas, so couldn’t really be used; and while the JSON file looked promising, when I tried to use it in Leaflet it gave me an ‘invalid GeoJSON object’ error. Mary then sent me the original ArcGIS file to work with and I spent some time in ArcGIS figuring out how to export the shapefile data as GeoJSON with latitude and longitude.
Using ArcGIS I exported the data by typing ‘to json’ in the ‘Geoprocessing’ pane on the right of the map and then selecting ‘Features to JSON’. I selected ‘output to GeoJSON’ and also checked ‘Project to WGS_1984’, which converts the ArcGIS coordinates to latitude and longitude. When not using the ‘formatted JSON’ option (which adds in line breaks and tabs) this gave me a file size of 115MB. As a starting point I created a Leaflet map that uses this GeoJSON file, but I ran into a bit of a problem: the data takes a long time to load into the map – about 30-60 seconds for me – and the map feels a bit sluggish to navigate around even after it’s loaded in. And this is without there being any actual data. The map is going to be used by school children, potentially on low-spec mobile devices connecting to slow internet services (or even worse, mobile data that they may have to pay for per MB). We may have to think about whether using these areas is going to be feasible. An option might be to reduce the detail in the polygons, which would reduce the size of the JSON file. The boundaries in the current file are extremely detailed, and each twist and turn in a polygon requires a latitude / longitude pair in the data – and there are a lot of twists and turns. The polygons we used in SCOSYA are much more simplified (see for example https://scotssyntaxatlas.ac.uk/atlas/?j=y#9.75/57.6107/-7.1367/d3/all/areas) but would still suit our needs well enough. However, manually simplifying each and every polygon would be a monumental and tedious task, but perhaps there’s a method in ArcGIS that could do this for us. There’s a tool called ‘Simplify Polygon’ (https://desktop.arcgis.com/en/arcmap/latest/tools/cartography-toolbox/simplify-polygon.htm) which might work.
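Polygon simplification tools like this typically include the classic Douglas–Peucker algorithm, which recursively drops points that deviate from a chord by less than a tolerance. A minimal pure-Python sketch, with a made-up toy line and tolerance for illustration:

```python
def perpendicular_distance(pt, a, b):
    """Distance from pt to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == dy == 0:
        return ((x - x1) ** 2 + (y - y1) ** 2) ** 0.5
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / (dx * dx + dy * dy) ** 0.5

def douglas_peucker(points, tolerance):
    """Keep the endpoints; recurse on the point furthest from the chord
    only if it deviates by more than the tolerance."""
    if len(points) < 3:
        return points
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= tolerance:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right

# A slightly wiggly line: every middle point deviates by under 0.1
line = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.03), (4, 0)]
print(douglas_peucker(line, 0.1))  # -> [(0, 0), (4, 0)]
```

The same idea applied to a coastline explains the trade-off: a larger tolerance means fewer coordinate pairs and a smaller file, at the cost of losing small lochs and inlets.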
I spoke to Mary about this and she agreed to experiment with the tool. Whilst she worked on this I continued to work with the data. I extracted all of the 411 areas and stored these in a database, together with all 954 postcode components that are related to these areas. This will allow us to generate a drop-down list of options as the user types – e.g. type in ‘G43’ and options ‘G43 2’ and ‘G43 3’ will appear, and both of these are associated with ‘Glasgow South’.
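The autocomplete lookup is essentially a prefix match against the stored postcode components. A toy Python sketch – the dictionary stands in for the database table, and the data is made up apart from the ‘G43’ example above:

```python
# Stand-in for the table of postcode components -> amalgamated area
postcodes = {
    'G43 2': 'Glasgow South',
    'G43 3': 'Glasgow South',
    'G12 8': 'Glasgow West',
}

def suggest(prefix):
    """Return (postcode, area) pairs whose postcode starts with
    whatever the user has typed so far."""
    prefix = prefix.upper().strip()
    return sorted((pc, area) for pc, area in postcodes.items()
                  if pc.startswith(prefix))

print(suggest('g43'))
# -> [('G43 2', 'Glasgow South'), ('G43 3', 'Glasgow South')]
```

In the real system this would be a `LIKE 'G43%'` query against the 954 postcode components, returned as the user types.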
I also wrote a script to generate sample data for each of the 411 areas using the ‘turf.js’ library I’d previously used. For each of the 411 areas a random number of markers between 0 and 100 is generated and stored in the database, each with a random rating of between 1 and 4. This has resulted in 19,946 sample ratings, which I then added to the map along with the polygonal area data, as you can see here:
Currently these are given the colours red=1, orange=2, light blue=3 and dark blue=4, purely for test purposes. As you can see, including almost 20,000 markers swamps the map when it’s zoomed out, but when you zoom in things look better. I also realised that we might not even need to display the area boundaries to users. They can be used in the background to work out where a marker should be positioned (as is the case with the map above), but perhaps they’re not needed for any other reason? It might be sufficient to include details of the area in a popup or sidebar, and if so we might not need to rework the areas at all.
However, whilst I was working on this Mary created four different versions of the area polygons using four different algorithms. These differ in how they simplify the polygons and therefore result in different boundaries – some missing out details such as lochs and inlets. All four versions were considerably smaller in file size than the original, ranging from 4MB to 20MB. I created new maps for each of the four simplified polygon outputs, and for each of these I regenerated new random marker data. For the ‘DP’ and ‘VW’ algorithms I limited the number of markers to between 0 and 20 per area, giving around 4,000 markers in each map. For ‘WM’ and ‘ZJ’ I limited the number to between 0 and 50 per area, giving around 10,000 markers per map.
All four new maps look pretty decent to me, with even the smaller JSON files (‘DP’ and ‘VW’) containing a remarkable level of detail. I think the ‘DP’ one might be the one to go for. It’s the smallest (just under 4MB compared to 115MB for the original) yet also seems to have more detail than the others. For example for the smaller lochs to the east of Loch Ness the original and ‘DP’ include the outline of four lochs while the other three only include two. ‘DP’ also includes more of the smaller islands around the Outer Hebrides.
We decided that we don’t need to display the postcode areas on the map to users but instead we’ll just use these to position the map markers. However, we decided that we do want to display the local authority area so people have a general idea of where the markers are positioned. My next task was to add these in. I downloaded the administrative boundaries for Scotland from here: https://raw.githubusercontent.com/martinjc/UK-GeoJSON/master/json/administrative/sco/lad.json as referenced on this website: https://martinjc.github.io/UK-GeoJSON/ and added them into my ‘DP’ sample map, giving the boundaries a dashed light green that turns a darker green when you hover over the area, as you can see from the screenshot below:
Also this week I added in a missing text to the Anglo-Norman Dictionary’s Textbase. To do this I needed to pass the XML text through several scripts to generate page records and all of the search words and ‘keyword in context’ data for search purposes. I also began to investigate replacing the Innerpeffray data for Books and Borrowing with a new dataset that Kit has worked on. This is going to be quite a large and complicated undertaking and after working through the data I had a set of questions to ask Kit before I proceeded to delete any of the existing data. Unfortunately she is currently on jury duty so I’ll need to wait until she’s available again before I can do anything further. Also this week a huge batch of images became available to us from the NLS and I spent some time downloading these and moving them to an external hard drive as they’d completely filled up the hard drive of my PC.
I also spoke to Fraser about the new radar diagrams I had been working on for the Historical Thesaurus and also about the ‘guess the category’ quiz that we’re hoping to launch soon. Fraser sent on a list of categories and words that we want to exclude from the quiz (anything that might cause offence) but I had some questions about this that will need clarification before I take things further. I’d suggested to Fraser that I could update the radar diagrams to include not only the selected category but also all child categories and he thought this would be worth investigating so I spent some time updating the visualisations.
I was a little worried about the amount of processing that would be required to include child categories but thankfully things seem pretty speedy, even when multiple top-level categories are chosen. See for example the visualisation of everything within ‘Food and drink’, ‘Faith’ and ‘Leisure’:
This brings back many tens of thousands of lexemes but doesn’t take too long to generate. I think including child categories will really help make the visualisations more useful as we’re now visualising data at a scale that’s very difficult to get a grasp on simply by looking at the underlying words. It’s interesting to note in the above visualisation how ‘Leisure’ increases in size dramatically throughout the time periods while ‘Faith’ shrinks in comparison (but still grows overall). With this visualisation the ‘totals’ rather than the ‘percents’ view is much more revealing.
I spent a bit of time this week writing a second draft of a paper for DH2022 after receiving feedback from Marc. This one targets ‘short papers’ (500-750 words) and I managed to get it submitted before the deadline on Friday. Now I’ll just need to see if it gets accepted – I should find out one way or the other in February. I also made some further tweaks to the locution search for the Anglo-Norman Dictionary, ensuring that when a term appears more than once the result is repeated for each occurrence, appearing in the results grouped by each word that matches the term. So for example ‘quatre tempres, tens’ now appears twice, once amongst the ‘tempres’ results and once amongst the ‘tens’ results.
I also had a chat with Heather Pagan about the Irish Dictionary eDIL (http://www.dil.ie/), who are hoping to rework the way they handle dates in a similar way to the AND. I said that it would be difficult to estimate how much time it would take without seeing their current data structure and getting more of an idea of how they intend to update it. I’d also need to know what updates would be required to their online resource to incorporate the new date structure (such as enhanced search facilities), whether further updates to the resource would be part of the process, and whether any back-end systems would also need to be updated to manage the new data (e.g. if they have a DMS like the AND).
Also this week I helped out with some issues with the Iona place-names website just before their conference started on Thursday. Someone had reported that the videos of the sessions were only playing briefly and then cutting out, but they all seemed to work for me, having tried them on my PC in Firefox and Edge and on my iPad in Safari. Eventually I managed to replicate the issue in Chrome on my desktop and in Chrome on my phone, so it seemed to be an issue specific to Chrome – although strangely it didn’t affect Edge, which is based on Chrome. The video file plays and then cuts out due to the file being blocked on the server. I can only assume that the way Chrome accesses the file is different to other browsers, and that it sends multiple requests to the server, which then blocks access due to too many requests being sent (the console in the browser shows a 403 Forbidden error). Thankfully Raymond at Arts IT Support was able to increase the number of connections allowed per browser and this fixed the issue. It’s still a bit of a strange one, though.
I also had a chat with the DSL people about when we might be able to replace the current live DSL site with the ‘new’ site, as the server the live site is on will need to be decommissioned soon. I also had a bit of a catch-up with Stevie Barrett, the developer in Celtic and Gaelic, and had a video call with Luca and his line-manager Kirstie Wild to discuss the current state of Digital Humanities across the College of Arts. Luca does a similar job to me at college-level and it was good to meet him and Kirstie to see what’s been going on outside of Critical Studies. I also spoke to Jennifer Smith about the Speak For Yersel project, as I’d not heard anything about it for a couple of weeks. We’re going to meet on Monday to take things further.
I spent the rest of the week working on the radar diagram visualisations for the Historical Thesaurus, completing an initial version. I’d previously created a tree browser for the thematic headings, as I discussed last week. This week I completed work on the processing of data for categories that are selected via the tree browser. After the data is returned the script works out which lexemes have dates that fall into the four periods (e.g. a word with dates 650-9999 needs to appear in all four periods). Words are split by part of speech, and I’ve arranged the axes so that N, V, Aj and Av appear first (if present), with any others following on. All verb categories have also been merged.
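The period-assignment logic amounts to an interval-overlap test. A small Python sketch – note the period boundaries here are illustrative, not necessarily the exact cut-offs the script uses:

```python
# Illustrative period boundaries -- the real cut-offs may differ.
PERIODS = {
    'OE':    (600, 1149),
    'ME':    (1150, 1499),
    'EModE': (1500, 1699),
    'ModE':  (1700, 9999),   # 9999 marks a word still in use
}

def periods_for(start, end):
    """Return every period that a word's date range overlaps."""
    return [name for name, (pstart, pend) in PERIODS.items()
            if start <= pend and end >= pstart]

print(periods_for(650, 9999))   # -> ['OE', 'ME', 'EModE', 'ModE']
print(periods_for(1500, 1650))  # -> ['EModE']
```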
I’m still not sure how widely useful these visualisations will be as they only really work for categories that have several parts of speech. But there are some nice ones. See for example a visualisation of ‘Badness/evil’, ‘Goodness, acceptability’ and ‘Mediocrity’, which shows words for ‘Badness/evil’ being much more prevalent in OE and ME while ‘Mediocrity’ barely registers, only for it and ‘Goodness, acceptability’ to grow in relative size in EModE and ModE:
I also added in an option to switch between visualisations which use total counts of words in each selected category’s parts of speech and visualisations that use percentages. With the latter the scale is fixed at a maximum of 100% across all periods and the points on the axes represent the percentage of the total words in a category that are in a part of speech in your chosen period. This means categories of different sizes are easier to compare, but it does of course mean that the relative sizes of categories are not visualised. I could also add a further option that fixes the scale at the maximum number of words in the largest POS, so the visualisation still represents the relative sizes of categories but the scale doesn’t fluctuate between periods (e.g. if there are 363 nouns for a category across all periods then the maximum on the scale would stay fixed at 363 across all periods, even if the maximum number of nouns in OE, for example, is 128). Here’s the above visualisation using the percentage scale:
The other thing I did was to add in a facility to select a specific category and turn off the others. So for example if you’ve selected three categories you can press on a category to make it appear bold in the visualisation and to hide the other categories. Pressing on a category a second time reverts back to displaying all. Your selection is remembered if you change the scale type or navigate through the periods. I may not have much more time to work on this before Christmas, but the next thing I’ll do is to add in access to the lexeme data behind the visualisation. I also need to fix a bug that is causing the ModE period to be missing a word in its counts sometimes.
I participated in the UCU strike action on Wednesday to Friday this week, so it was a two-day working week for me. During this time I gave some help to the students who are migrating the International Journal of Scottish Theatre and Screen and talked to Gerry Carruthers about another project he’s hoping to put together. I also passed on information about the DNS update to the DSL’s IT people, added a link to the DSL’s new YouTube site to the footer of the DSL site and dealt with a query regarding accessing the DSL’s Google Analytics data. I also spoke with Luca about arranging a meeting with him and his line manager to discuss digital humanities across the college and updated the listings for several Android apps that I created a few years ago that had been taken down due to their information being out of date. As central IT services now manages the University Android account I hadn’t received notifications that this was going to take place. Hopefully the updates have done the trick now.
Other than this I made some further updates to the Anglo-Norman Dictionary’s locution search that I created last week. This included changing the ordering to list results by the word that was searched for rather than by headword, changing the way the search works so that a wildcard search such as ‘te*’ now matches the start of any word in the locution phrase rather than just the first word, and fixing a number of bugs that had been spotted.
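The new wildcard behaviour can be sketched as follows – a hypothetical Python illustration of the matching logic (the real search runs against the database, not against in-memory strings):

```python
import re

def locution_matches(term, locution):
    """Return every word in the locution phrase that the search term
    matches, treating '*' as a wildcard, so 'te*' matches the start
    of ANY word in the phrase rather than just the first."""
    pattern = re.escape(term).replace(r'\*', r'\w*')
    return [w for w in locution.split() if re.fullmatch(pattern, w)]

print(locution_matches('te*', 'quatre tempres tens'))
# -> ['tempres', 'tens']
```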
I spent the rest of my available time starting to work on an interactive version of the radar diagram for the Historical Thesaurus. I’d made a static version of this a couple of months ago which looks at the words in an HT category by part of speech and visualises how the numbers of words in each POS change over time. What I needed to do was find a way to allow users to select their own categories to visualise. We had decided to use the broader Thematic Categories for the feature rather than regular HT categories, so my first task was to create a Thematic Category browser, from ‘AA The World’ to ‘BK Leisure’. It took a bit of time to rework the existing HT category browser to work with thematic categories, and also to then enable the selection of multiple categories by pressing on the category name. Selected categories appear to the right of the browser, and I added in an option to remove a selected category if required. With this in place I began work on the code to actually grab and process the data for the selected categories. This finds all lexemes and their associated dates in each HT category in each of the selected thematic categories. For now the data is just returned and I’m still in the middle of processing the dates to work out which period each word needs to appear in. I’ll hopefully find some time to continue with this next week. Here’s a screenshot of the browser: