Week 19 of Lockdown, and it was a short week for me as the Monday was the Glasgow Fair holiday. I spent a couple of days this week continuing to add features to the content management system for the Books and Borrowing project. I have now implemented the ‘normalised occupations’ part of the CMS. Originally occupations were just going to be a set of keywords, allowing one or more keywords to be associated with a borrower. However, we have been liaising with another project that has already produced a list of occupations and we have agreed to share their list. This is slightly different as it is hierarchical, with a top-level ‘parent’ containing multiple main occupations, e.g. ‘Religion and Clergy’ features ‘Bishop’. However, for our project we needed a third hierarchical level to differentiate types of minister/priest, so I’ve had to add this in too. I’ve achieved this by means of a parent occupation ID in the database, which is ‘null’ for top-level occupations and contains the ID of the parent occupation for all other occupations.
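To illustrate the parent ID approach, here is a minimal Python sketch of how rows with a nullable parent ID can be turned into a nested structure. This is not the actual CMS code: the row layout, the function name and the sample occupations other than ‘Religion and Clergy’ and ‘Bishop’ are all made up for the example.

```python
from collections import defaultdict

# Hypothetical rows: (id, name, parent_id). parent_id is None for top-level occupations.
ROWS = [
    (1, "Religion and Clergy", None),
    (2, "Minister/Priest", 1),               # illustrative second-level entry
    (3, "Bishop", 1),
    (4, "Example type of minister", 2),      # illustrative third-level entry
]

def build_tree(rows):
    """Group rows by parent ID and return nested dicts, starting at the top level."""
    children = defaultdict(list)
    for occ_id, name, parent_id in rows:
        children[parent_id].append((occ_id, name))

    def subtree(parent_id):
        return [
            {"id": occ_id, "name": name, "children": subtree(occ_id)}
            for occ_id, name in children.get(parent_id, [])
        ]

    return subtree(None)

tree = build_tree(ROWS)
```

Because each row only stores its own parent, adding a third (or fourth) level needs no change to the table structure, only to how the tree is displayed.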
I completed work on the page to browse occupations, arranging the hierarchical occupations in a nested structure that features a count of the number of borrowers associated with the occupation to the right of the occupation name. These are all currently zero, but once some associations are made the numbers will go up and you’ll be able to click on the count to bring up a list of all associated borrowers, with links through to each borrower. If an occupation has any child occupations a ‘+’ icon appears beside it. Press on this to view the child occupations, which also have counts. The counts for ‘parent’ occupations tally up all of the totals for the child occupations, and clicking on one of these counts will display all borrowers assigned to all child occupations. If an occupation is empty there is a ‘delete’ button beside it. As the list of occupations is going to be fairly fixed I didn’t add in an ‘edit’ facility – if an occupation needs editing I can do it directly through the database, or it can be deleted and a new version created. Here’s a screenshot showing some of the occupations in the ‘browse’ page:
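The way parent counts tally up the totals of their children can be sketched as follows. Again this is only an illustration in Python with invented IDs and counts, and it assumes each occupation’s direct count is the number of borrower associations stored against it.

```python
def rolled_up_counts(parent_of, direct_counts):
    """Return per-occupation totals: direct associations plus those of all descendants."""
    totals = dict.fromkeys(parent_of, 0)
    for occ_id, count in direct_counts.items():
        node = occ_id
        # Walk up the hierarchy, adding this occupation's count to every ancestor.
        while node is not None:
            totals[node] += count
            node = parent_of[node]
    return totals

# Hypothetical hierarchy: 1 is top-level, 2 and 4 are its children, 3 is a child of 2.
parent_of = {1: None, 2: 1, 3: 2, 4: 1}
direct = {2: 1, 3: 2, 4: 1}
totals = rolled_up_counts(parent_of, direct)
```

Note that a simple tally like this counts associations rather than distinct borrowers, so a borrower assigned to two child occupations would contribute twice to the parent’s figure. An occupation whose total is zero is one that would show the ‘delete’ button.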
I also created facilities to add new occupations. You can enter an occupation name and optionally specify a parent occupation from a drop-down list. Doing so will add the new occupation as a child of the selected category, either at the second level if a top level parent is selected (e.g. ‘Agriculture’) or at the third level if a second level parent is selected (e.g. ‘Farmer’). If you don’t include a parent the occupation will become a new top-level grouping. I used this feature to upload all of the occupations, and it worked very well.
I then updated the ‘Borrowers’ tab in the ‘Browse Libraries’ page to add ‘Normalised Occupation’ to the list of columns in the table. The ‘Add’ and ‘Edit’ borrower facilities also now feature ‘Normalised Occupation’, which replicates the nested structure from the ‘browse occupations’ page, but with checkboxes beside each main occupation. You can select any number of occupations for a borrower and when you press the ‘Upload’ or ‘Edit’ button your choice will be saved. Deselecting all ticked checkboxes will clear all occupations for the borrower. If you edit a borrower who has one or more occupations selected, in addition to the relevant checkboxes being ticked, the occupations with their full hierarchies also appear above the list of occupations, so you can easily see what is already selected. I also updated the ‘Add’ and ‘Edit’ borrowing record pages so that whenever a borrower appears in the forms the normalised occupations feature also appears.
I also added in the option to view page images. Currently the only ledgers that have page images are the three Glasgow ones, but more will be added in due course. When viewing a page in a ledger that includes a page image you will see the ‘Page Image’ button above the table of records. Press on this and a new browser tab will open. It includes a link through to the full-size image of the page if you want to open this in your browser or download it to open in a graphics package. It also features the ‘zoom and pan’ interface that allows you to look at the image in the same manner as you’d look at a Google Map. You can also view this full screen by pressing on the button in the top right of the image.
Also this week I made further tweaks to the script I’d written to update lexeme start and end dates in the Historical Thesaurus based on citation dates in the OED. I’d sent a sample output of 10,000 rows to Fraser last week and he got back to me with some suggestions and observations. I’m going to have to rerun the script I wrote to extract the more than 3 million citation dates from the OED as some of the data needs to be processed differently, but as this script will take several days to run and I’m on holiday next week this isn’t something I can do right now. However, I managed to change the way the date matching script runs to fix some bugs and make the various processes easier to track. I also generated a list of all of the distinct labels in the OED data, with counts of the number of times these appear. Labels are associated with specific citation dates, thankfully. Only a handful are actually used lots of times, and many of the others appear to be used as a ‘notes’ field rather than as a more general label.
In addition to the above I also had a further conversation with Heather Pagan about the data management plan for the AND’s new proposal, responded to a query from Kathryn Cooper about the website I set up for her at the end of last year, responded to a couple of separate requests from post-grad students in Scottish Literature, spoke to Thomas Clancy about the start date for his Place-Names of Iona project, which got funded recently, helped with some issues with Matthew Creasy’s Scottish Cosmopolitanism website and spoke to Carole Hough about making a few tweaks to the Berwickshire Place-names website for REF.
I’m going to be on holiday for the next two weeks, so there will be no further updates from me for a while.
During week 11 of Lockdown I continued to work on the Books and Borrowing project, but also spent a fair amount of time catching up with other projects that I’d had to put to one side due to the development of the Books and Borrowing content management system. This included reading through the proposal documentation for Jennifer Smith’s follow-on funding application for SCOSYA and writing a new version of the Data Management Plan based on this updated documentation, as well as making some changes to the ‘export data for print publication’ facility for Carole Hough’s REELS project. I also spent some time creating a new export facility to format the place-name elements and any associated place-names for print publication too.
During this week a number of SSL certificates expired for a bunch of websites, which meant browsers were displaying scary warning messages when people visited the sites. I had to spend a bit of time tracking these down and passing the details over to Arts IT Support for them to fix as it is not something I have access rights to do myself. I also liaised with Mike Black to migrate some websites over from the server that houses many project websites to a new server. This is because the old server is running out of space and is getting rather temperamental and freeing up some space should address the issue.
I also made some further tweaks to Paul Malgrati’s interactive map of Burns’ Suppers and created a new WordPress-powered project website for Matthew Creasy’s new ‘Scottish Cosmopolitanism at the Fin de Siècle’ project. This included the usual choosing a theme, colour schemes and fonts, adding in header images and footer logos and creating initial versions of the main pages of the site. I’d also received a query from Jane Stuart-Smith about the audio recordings in the SCOTS Corpus so I did a bit of investigation about that.
Fraser Dallachy had got back to me with some further tasks for me to carry out on the processing of dates for the Historical Thesaurus, and I had intended to spend some time on this towards the end of the week, but when I began to look into this I realised that the scripts I’d written to process the old HT dates (comprising 23 different fields) and to generate the new, streamlined date system that uses a related table with just 6 fields were sitting on my PC in my office at work. Usually all the scripts I work on are located on a server, meaning I can easily access them from anywhere by connecting to the server and downloading them. However, sometimes I can’t run the scripts on the server as they may need to be left running for hours (or sometimes days) if they’re processing large amounts of data or performing intensive tasks on the data. In these cases the scripts run directly on my office PC, and this was the situation with the dates script. I realised I would need to get into my office at work to retrieve the scripts, so I put in a request to be allowed into work. Staff are not currently allowed to just go into work – instead you need to get approval from your Head of School and then arrange a time that suits security. Thankfully it looks like I’ll be able to go in early next week.
Other than these issues, I spent my time continuing to work for the Books and Borrowing project. On Tuesday we had a Zoom call with all six members of the core project team, during which I demonstrated the CMS as it currently stands. This gave me an opportunity to demonstrate the new Author association facilities I had created last week. The demonstration all went very smoothly and I think the team are happy with how the system works, although no doubt once they actually begin to use it there will be bugs to fix and workflows to tweak. I also spent some time before the meeting testing the system again, and fixing some issues that were not quite right with the author system.
I spent the remainder of my time on the project completing work on the facility to add, edit and view book holding records directly via the library page, as opposed to doing so whilst adding / editing a borrowing record. I also implemented a similar facility for borrowers as well. Next week I will begin to import some of the sample data from various libraries into the system and will allow the team to access the system to test it out.
It was a four-day week due to Good Friday, and I spent the beginning of the week catching up on things relating to last week’s conference – writing up my notes from the sessions and submitting my expenses claims. Marc also dropped off a bunch of old STELLA materials, such as old CD-ROMS, floppy disks, photos and books, so I spent a bit of time sorting through these. I also took delivery of a new laptop this week, so spent some time installing things on it and getting it ready for work.
Apart from these tasks I completed a final version of the Data Management Plan for Ophira Gamliel’s project, which has now been submitted to college and I met with Luca Guariento to discuss his new role. Luca will be working across the College in a similar capacity to my role within Critical Studies, which is great news both for him and the College. I also made a number of tweaks to one of the song stories for the RNSN project and engaged in an email discussion about REF and digital outputs.
I had two meetings with the SCOSYA project this week to discuss the development of the front end features for the project. It’s getting to the stage where the public atlas will need to be developed and we met to discuss exactly what features it will need to include. There will actually be four public interfaces – an ‘experts interface’ which will be very similar to the atlas I developed for the project team, a simplified atlas that will only include a selection of features and search types, the ‘story maps’ about 15-25 particular features, and a ‘listening atlas’ that will present the questionnaire locations and samples of speech at each place. There’s a lot to develop and Jennifer would like as much as possible to be in place for a series of events the project is running in mid-June, so I’ll need to devote most of May to the project.
I spent about a day this week working for DSL. Rhona contacted me to say they were employing a designer who needed to know some details about the website (e.g. fonts and colours), so I got that information to her. The DSL’s Facebook page has also changed so I needed to update that on the website too. Last week Ann sent me a list of further updates that needed to be made to the WordPress version of the DSL website that we’ve been working on, so I implemented those. This included sorting out the colours of various boxes, and ensuring that these are relatively easy to update in future, sorting out the contents box that stays fixed on the page as the user scrolls on one of the background essay sections, creating some new versions of images and ensuring the citation pop-up worked on a new essay.
I spent a further half-day or so working on the REELS project, making updates to the ‘export for publication’ feature I’d created a few weeks ago. This feature grabs all of the information about place-names and outputs it in a format that reflects how it should look on the printed page. It is then possible to copy this output into Word and retain the formatting. Carole has been using the feature and had sent me a list of updates. This included enabling paragraph divisions in the analysis section. Previously each place-name entry was a single paragraph, so paragraph tags in any sections contained within it were removed. I have changed this now so that each place-name uses an HTML <div> element rather than a <p> element, meaning any paragraphs can be represented. However, this has potentially resulted in there being more vertical space between parts of the information than there was previously.
Carole had also noted that in some places on import into Word spaces were being interpreted as non-breaking spaces, meaning some phrases were moved down to a new line even though some of the words would fit on the line above. I investigated this, but it was a bit of a weird one. It would appear that Word is using a ‘non-breaking space’ in some places and not in others. Such characters (represented in Word if you turn markers on by what looks like a superscript ‘o’ as opposed to a mid-dot) link words together and prevent them being split over multiple lines. I couldn’t figure out why Word was using them in some places and not in others as they’re not used consistently. For this reason this is something that will need to be fixed after pasting into Word. The simplest way is to turn markers on, select one of the non-breaking space characters then choose ‘replace’ from the menu, paste this character into the ‘Find what’ box and then put a regular space in the ‘replace with’ box. There were a number of other smaller tweaks to make to the script, such as fixing the appearance of tildes for linear features, adjusting the number of spaces between the place-name and other text and changing the way parishes were displayed, which brings me to the end of this week’s report. It will be another four-day week next week due to Easter Monday, and I intend to focus on the new Glasgow Medical Humanities resource for Gavin Miller, and some further Historical Thesaurus work.
I spent about half of this week working on the SCOSYA project. On Monday I met with Jennifer and E to discuss a new aspect of the project that will be aimed primarily at school children. I can’t say much about it yet as we’re still just getting some ideas together, but it will allow users to submit their own questionnaire responses and see the results. I also started working with the location data that the project’s researchers had completed mapping out. As mentioned in previous posts, I had initially created Voronoi diagrams that extrapolate our point-based questionnaire data to geographic areas. The problem with this approach was that the areas were generated purely on the basis of the position of the points and did not take into consideration things like the varying coastline of Scotland or the fact that a location on one side of a body of water (e.g. the Firth of Forth) should not really extend into the other side, giving the impression that a feature is exhibited in places it quite clearly doesn’t. Having the areas extend over water also made it difficult to see the outline of Scotland and to get an impression of which cell corresponded to which area. So instead of this purely computational approach to generating geographical areas we decided to create them manually, using the Voronoi areas as a starting point, but tweaking them to take geographical features into consideration. I’d generated the Voronoi cells as GeoJSON files and the researchers then used this very useful online tool https://geoman.io/studio to import the shapes and tweak them, saving them in multiple files as their large size caused some issues with browsers.
Upon receiving these files I then had to extract the data for each individual shape and work out which of our questionnaire locations the shape corresponded to, before adding the data to the database. Although GeoJSON allows you to incorporate any data you like, in addition to the latitude / longitude pairings, I was not able to incorporate location names and IDs into the GeoJSON file I generated using the Voronoi library (it just didn’t work – see an earlier post for more information), meaning this ‘which shape corresponds to which location’ process needed to be done manually. This involved grabbing the data for an individual location from the GeoJSON files, saving this and importing it into the GeoMan website, comparing the shape to my initial Voronoi map to find the questionnaire location contained within the area, adding this information to the GeoJSON and then uploading it to the database. There were 147 areas to do, and the process took slightly over a day to complete.
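For anyone wondering where the extra data lives, a GeoJSON Feature has a ‘properties’ member that can hold any JSON object (or null). Below is a small Python sketch of attaching a location ID and name to a single shape. The property names, the ID, the location name and the coordinates are all invented for the example, and this is not the script I actually used.

```python
import json

def tag_feature(feature, location_id, location_name):
    """Return a copy of a GeoJSON Feature with location details added to 'properties'."""
    tagged = dict(feature)
    props = dict(feature.get("properties") or {})  # 'properties' may be null
    props.update({"location_id": location_id, "location_name": location_name})
    tagged["properties"] = props
    return tagged

# A made-up polygon; GeoJSON positions are [longitude, latitude] and rings must be closed.
feature = {
    "type": "Feature",
    "properties": None,
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-3.2, 55.9], [-3.1, 55.9], [-3.1, 56.0], [-3.2, 55.9]]],
    },
}

tagged = tag_feature(feature, 42, "Example location")
geojson_text = json.dumps(tagged)  # ready to store against the location in a database
```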
I then produced a ‘proof of concept’ that simply grabbed the location data, pulled all the GeoJSON for each location together and processed it via Leaflet to produce area overlays, as you can see in the following screenshot:
With this in place I then began looking at how to incorporate our intended ‘story’ interface with the Choropleth map – namely working with a number of ‘slides’ that a user can navigate between, with a different dataset potentially being loaded and displayed on each slide, and different position and zoom levels being set on each slide. This is actually proving to be quite a complicated task, as much of the code I’d written for my previous Voronoi version of the storymap was using older, obsolete libraries. Thankfully with the new approach I’m able to use the latest version of Leaflet, meaning features like the ‘full screen’ option and smoother panning and zooming will work.
By the end of the week I’d managed to get the interface to load in data for each slide and colour code the areas. I’d also managed to get the slide contents to display – both a ‘big’ version that contains things like video clips and a ‘compact’ version that sits to one side, as you can see in the following screenshot:
There is still a lot to do, though. One area is missing its data, which I need to fix. Also the ‘click on an area’ functionality is not yet working. Locations as map points still need to be added in too, and the formatting of the areas still needs some work. Also, the pan and zoom functionality isn’t there yet either. However, I hope to get all of this working next week.
Also this week I had a chat with Gavin Miller about the website for his new Medical Humanities project. We have been granted the top-level ‘.ac.uk’ domain we’d requested so we can now make a start on the website itself. I also made some further tweaks to the RNSN data, based on feedback. I also spent about a day this week working on the REELS project, creating a script that would output all of the data in the format that is required for printing. The tool allows you to select one or more parishes, or to leave the selection blank to export data for all parishes. It then formats this in the same way as the printed place-name surveys, such as the Place-Names of Fife. The resulting output can then be pasted into Word and all formatting will be retained, which will allow the team to finalise the material for publication.
I spent the rest of the week working on Historical Thesaurus tasks. I met with Marc and Fraser on Friday, and ahead of this meeting I spent some time starting to look at matching up lexemes in the HT and OED datasets. This involved adding seven new fields to the HT’s lexeme database to track the connection (which needs up to four fields) and to note the status of the connection (e.g. whether it was a manual or automatic match, which particular process was applied). I then ran a script that matched up all lexemes that are found in matched categories where every HT lexeme matches an OED lexeme (based on the ‘stripped’ word field plus first dates).
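The ‘every HT lexeme matches an OED lexeme’ test can be sketched roughly like this in Python. The field names and the sample dates are assumptions made for the example, and the real script works against the database rather than in-memory lists.

```python
def match_key(lexeme):
    """Lexemes are compared on their 'stripped' form plus first date."""
    return (lexeme["stripped"], lexeme["first_date"])

def fully_matched(ht_lexemes, oed_lexemes):
    """True when every HT lexeme in a category has an OED lexeme with the same key.

    Extra OED lexemes are allowed; an empty HT category is not treated as a match.
    """
    oed_keys = {match_key(lex) for lex in oed_lexemes}
    return bool(ht_lexemes) and all(match_key(lex) in oed_keys for lex in ht_lexemes)

# Made-up sample data for one matched category pair.
ht = [{"stripped": "chine", "first_date": 1300}]
oed = [
    {"stripped": "chine", "first_date": 1300},
    {"stripped": "newword", "first_date": 1900},  # an additional OED word
]
```

Here `fully_matched(ht, oed)` is true, while swapping the arguments gives false because the additional OED word has no HT counterpart.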
Whilst doing this I’m afraid I realised I got some stats wrong previously. When I calculated the percentage of total matched lexemes in matched categories, the figures of about 89% matched lexemes were actually the number of matched lexemes across all categories (whether they were fully matched or not). The number of matched lexemes in fully matched categories is unfortunately a lot lower. For ‘01’ there are 173,677 matched lexemes, for ‘02’ there are 45,943 matched lexemes and for ‘03’ there are 110,087 matched lexemes. This gives a total of 329,707 matched lexemes in categories where every HT word matches an OED word (including categories where there are additional OED words) out of 731,307 non-OE words in the HT, which is about 45% matched. I ticked these off in the database with check code 1 but these will need further checking, as there are some duplicate matches (where the HT lexeme has been joined to more than one OED lexeme). Where this happens the last occurrence currently overwrites any earlier occurrence. Some duplicates are caused by a word’s resulting ‘stripped’ form being the same – e.g. ‘chine’ and ‘to-chine’.
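The corrected figures can be checked with a few lines of Python, using only the numbers quoted above (the variable names are just for illustration):

```python
# Matched lexemes in fully matched categories, per top-level branch.
matched = {"01": 173_677, "02": 45_943, "03": 110_087}
non_oe_ht_words = 731_307

total_matched = sum(matched.values())
percentage = 100 * total_matched / non_oe_ht_words

print(f"{total_matched} of {non_oe_ht_words} = {percentage:.1f}%")
```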
When we met on Friday we figured out another big long list of updates and new experiments that I would carry out over the next few weeks, but Marc spotted a bit of a flaw in the way we are linking up HT and OED lexemes. In order to ensure the correct OED lexeme is uniquely identified we rely on the OED’s category ID field. However, this is likely to be volatile: during future revisions some words will be moved between categories. Therefore we can’t rely on the category ID field as a means of uniquely identifying an OED lexeme. This will be a major problem when dealing with future updates from the OED and we will need to try and find a solution – for example updating the OED data structure so that the current category ID is retained in a static field. This will need further investigation next week.
I met with Marc and Fraser on Monday to discuss the current situation with regards to the HT / OED linking task. As I mentioned last week, we had run into an issue with linking HT and OED lexemes up as there didn’t appear to be any means of uniquely identifying specific OED lexemes as on investigation the likely candidates (a combination of category ID, refentry and refid) could be applied to multiple lexemes, each with different forms and dates. James McCracken at the OED had helpfully found a way to include a further ID field (lemmaid) that should have differentiated these duplicates, and for the most part it did, but there were still more than a thousand rows where the combination of the four columns was not unique.
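Checking whether a combination of columns uniquely identifies a row is straightforward to sketch. The Python below counts how often each four-column combination occurs and keeps those that appear more than once. The name ‘catid’ for the category ID column and the sample rows are assumptions for the example; ‘refentry’, ‘refid’ and ‘lemmaid’ are the fields mentioned above.

```python
from collections import Counter

KEY_COLUMNS = ("catid", "refentry", "refid", "lemmaid")

def non_unique_keys(rows):
    """Return each key combination that occurs more than once, with its row count."""
    counts = Counter(tuple(row[col] for col in KEY_COLUMNS) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Made-up rows: the first two share all four key values but differ in form.
rows = [
    {"catid": 10, "refentry": 500, "refid": 7, "lemmaid": 1, "form": "example"},
    {"catid": 10, "refentry": 500, "refid": 7, "lemmaid": 1, "form": "exemple"},
    {"catid": 10, "refentry": 500, "refid": 7, "lemmaid": 2, "form": "other"},
]
duplicates = non_unique_keys(rows)
```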
At our meeting we decided that this number of duplicates was pretty small (we are after all dealing with more than 700,000 lexemes) and we’d just continue with our matching processes and ignore these duplicates until they can be sorted. Unexpectedly, James got back to me soon after the meeting and had managed to fix the issue. He sent me an updated dataset that after processing resulted in there being only 28 duplicate rows, which is going to be a great help.
As a result of our meeting I made a number of further changes to scripts I’d previously created, including fixing the layout of the gap matching script, to make it easier for Fraser to manually check the rows, and I also updated the ‘duplicate lexemes in categories’ script (these are different sorts of duplicates – word forms that appear more than once in a category, but with their own unique identifiers) so that HT words where the ‘wordoed’ field is the same but the ‘word’ field is different are not considered duplicates. This should filter out words of OE origin that shouldn’t be considered duplicates. So for example, ‘unsweet’ with ‘unsweet’ and ‘unsweet’ with ‘unsweet < unswete’ no longer appear as duplicates. This has reduced the number of rows listed from 567 to 456. Not as big a drop as I’d expected, but a bit less.
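As a rough sketch of the updated rule, the Python below only reports lexemes as duplicates within a category when both the ‘wordoed’ and the ‘word’ fields are the same. I’m assuming here that in the ‘unsweet’ example the shared value is the ‘wordoed’ field and the differing value is the ‘word’ field; the IDs, the category number and the ‘sour’ rows are invented.

```python
from collections import defaultdict

def duplicate_groups(lexemes):
    """Group lexemes by (category, wordoed, word) and keep groups with more than one row."""
    groups = defaultdict(list)
    for lex in lexemes:
        groups[(lex["catid"], lex["wordoed"], lex["word"])].append(lex["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

lexemes = [
    {"id": 1, "catid": 10, "wordoed": "unsweet", "word": "unsweet"},
    {"id": 2, "catid": 10, "wordoed": "unsweet", "word": "unsweet < unswete"},
    {"id": 3, "catid": 10, "wordoed": "sour", "word": "sour"},
    {"id": 4, "catid": 10, "wordoed": "sour", "word": "sour"},
]
dups = duplicate_groups(lexemes)
```

With this sample data the two ‘unsweet’ rows are no longer grouped together, while the two identical ‘sour’ rows still are.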
At the meeting I’d also pointed out that the new data from the OED has deleted some categories that were present in the version of the OED data we’d been working with up to this point. There are 256 OED categories that have been deleted, and these contain 751 words. I wanted to check what was going on with these categories so wrote a little script that lists the deleted categories and their words. I added a check to see which of these are ‘quarantined’ categories (categories that were duplicated in the existing data that we had previously marked as ‘quarantined’ to keep them separate from other categories) and I’m very glad to say that 202 such categories have been deleted (out of a total of 207 quarantined categories – we’ll need to see what’s going on with the remainder). I also added in a check to see whether any of the deleted OED categories are matched up to HT categories. There are 42 such categories, unfortunately, which appear in red. We’ll need to decide what to do about these, ideally before I switch to using the new OED data, otherwise we’re left with OED catids in the HT’s category table that point to nothing.
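The checks in this script boil down to a few set operations on category IDs. Here is a minimal Python sketch with tiny made-up ID sets; the function and key names are hypothetical and the real script queries the database tables instead.

```python
def classify_deleted(old_ids, new_ids, quarantined_ids, matched_ids):
    """Work out which old OED categories are gone, and which of those matter."""
    deleted = old_ids - new_ids
    return {
        "deleted": deleted,
        "deleted_quarantined": deleted & quarantined_ids,
        "quarantined_remaining": quarantined_ids - deleted,
        "deleted_but_matched": deleted & matched_ids,  # HT rows would point to nothing
    }

# Invented IDs purely for illustration.
result = classify_deleted(
    old_ids={1, 2, 3, 4, 5},
    new_ids={1, 2, 6},
    quarantined_ids={3, 4, 7},
    matched_ids={1, 5},
)
```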
In addition to the HT / OED task, I spent about half the week working on DSL related issues too, including a trip to the DSL offices in Edinburgh on Wednesday. The team have been making updates to the data on a locally hosted server for many years now, and none of these updates have yet made their way into the live site. I’m helping them to figure out how to get the data out of the systems they have been using and into the ‘live’ system. This is a fairly complicated task as the data is stored in two separate systems, which need to be amalgamated. Also, the ‘live’ data stored at Glasgow is made available via an API that I didn’t develop, for which there is very little documentation, and which appears to dynamically make changes to the data extracted from the underlying database and refactor it each time a request is made. As this API uses technologies that Arts IT Support are not especially happy to host on their servers (Django / Python and Solr) I am going to develop a new API using technologies that Arts IT Support are happy to deal with (PHP), and eventually replace the old API, and also the old data with the new, merged data that the DSL people have been working on. It’s going to be a pretty big task, but really needs to be tackled.
Last week Ann Ferguson from the DSL had sent me a list of changes she wanted me to make to the ‘Wordpressified’ version of the DSL website. These ranged from minor tweaks to text, to reworking the footer, to providing additional options for the ‘quick search’ on the homepage to allow a user to select whether their search looks in SND, DOST or both source dictionaries. It took quite some time to go through this document, and I’ve still not entirely finished everything, but the bulk of it is now addressed.
The request for updating the ‘save screenshot’ feature refers to the option to save an image of the atlas, complete with all icons and the legend, at a resolution that is much greater than the user’s monitor in order to use the image in print publications. Unfortunately getting the map position correct when using this feature is very difficult – small changes to position can result in massively different images.
I took another look at the screengrab plugin I’m using to see if there’s any way to make it work better. The plugin is leaflet.easyPrint (https://github.com/rowanwins/leaflet-easyPrint). I was hoping that perhaps there had been a new version released since I installed it, but unfortunately there hasn’t. The standard print sizes all seem to work fine (i.e. positioning the resulting image in the right place). The A3 size is something I added in, following the directions under ‘Custom print sizes’ on the page above. This is the only documentation there is, and by following it I got the feature working as it currently does. I’ve tried searching online for issues relating to the custom print size, but I haven’t found anything relating to map position. I’m afraid I can’t really attempt to update the plugin’s code as I don’t know enough about how it works and the code is pretty incomprehensible (see it here: https://raw.githubusercontent.com/rowanwins/leaflet-easyPrint/gh-pages/dist/bundle.js).
I’d previously tried several other ‘save map as image’ plugins but without success, mainly because they are unable to incorporate HTML map elements (which we use for icons and the legend). For example, the plugin https://github.com/mapbox/leaflet-image which rather bluntly says “This library does not rasterize HTML because browsers cannot rasterize HTML. Therefore, L.divIcon and other HTML-based features of a map, like zoom controls or legends, are not included in the output, because they are HTML.”
I think that with the custom print size in the plugin we’re using we’re really pushing the boundaries of what it’s possible to do with interactive maps. They’re not designed to be displayed bigger than a screen and they’re not really supposed to be converted to static images either. I’m afraid the options available are probably as good as it’s going to get.
Also this week I made some further changes to the RNSN timelines, had a chat with Simon Taylor about exporting the REELS data for print publication, undertook some App store admin duties and had a chat with Helen Kingstone about a research database she’s hoping to put together.
As with the past few weeks, I spent a fair amount of time this week on the HT / OED data linking issue. I updated the ‘duplicate lexemes’ tables to add in some additional information. For HT categories the catid now links through to the category in the HT website and each listed word has an [OED] link after it that performs a search for the word on the OED website, as currently happens with words on the HT website. For OED categories the [OED] link leads directly to the sense on the OED website, using a combination of ‘refentry’ and ‘refid’.
I then created a new script that lists HT / OED categories where all the words match (HT and OED stripped forms are the same and HT startdate matches OED GHT1 date) or where all HT words match and there are additional OED forms (hopefully ‘new’ words), with the latter appearing in red after the matched words. Quite a large percentage of categories either have all their words matching or have everything matching except a few additional OED words (note that ‘OE’ words are not included in the HT figures):
For 01: 82300 out of 114872 categories (72%) are ‘full’ matches. 335195 out of 388189 HT words match (86%). 335196 out of 375787 OED words match (89%).
For 02: 20295 out of 29062 categories (70%) are ‘full’ matches. 106845 out of 123694 HT words match (86%). 106842 out of 119877 OED words match (89%).
For 03: 57620 out of 79248 categories (73%) are ‘full’ matches. 193817 out of 223972 HT words match (87%). 193186 out of 217771 OED words match (89%).
It’s interesting how consistent the level of matching is across all three branches of the thesaurus.
I also received a new batch of XML data from the OED, which will need to replace the existing OED data that we’re working with. Thankfully I have set things up so that the linking of OED and HT data takes place in the HT tables. For example, the link between an HT and OED category is established by storing the primary key of the OED category as a foreign key in the corresponding row of the HT category table. This means that swapping out the OED data should (or at least I thought it should) be pretty straightforward.
I ran the new dataset through the script I’d previously created that goes through all of the OED XML, extracts category and lexeme data and inserts it into SQL tables. As was expected, the new data contains more categories than the old data. There are 238697 categories in the new data and 237734 categories in the old data, so it looks like 963 new categories. However, I think it’s likely to be more complicated than that. Thankfully the OED categories have a unique ID (called ‘CID’ in our database). In the old data this increments from 1 to 237734 with no gaps. In the new data there are lots of new categories that start with an ID greater than 900000. In fact, there are 1219 categories with such IDs. These are presumably new categories, but note that there are more categories with these new IDs than there are ‘new’ categories in the new data, meaning some existing categories must have been deleted. There are 237478 categories with an ID less than 900000, meaning 256 categories have been deleted. We’re going to have to work out what to do with these deleted categories and any lexemes contained within them (which presumably might have been moved to other categories).
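The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: `compare_cids` is a hypothetical helper, and the calculation assumes that every new-data category with a CID below 900000 also existed in the old data.

```python
def compare_cids(old_cids, new_cids):
    """Split two collections of category IDs into added, deleted and kept."""
    old_cids, new_cids = set(old_cids), set(new_cids)
    return {
        "added": new_cids - old_cids,
        "deleted": old_cids - new_cids,
        "kept": old_cids & new_cids,
    }


# Figures quoted in the text
old_total = 237734     # old data: CIDs 1 to 237734 with no gaps
new_total = 238697     # categories in the new data
new_high_ids = 1219    # new-data categories with a CID above 900000

kept = new_total - new_high_ids       # categories with a CID below 900000
deleted = old_total - kept            # old categories missing from the new data
net_change = new_total - old_total    # the apparent number of 'new' categories

print(kept, deleted, net_change)
```

The net change of 963 is simply the 1219 added categories minus the 256 deleted ones.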
Another complication is that the ‘Path’ field in the new OED data has been reordered to make way for changes to categories. For example, the OED category with the path ’02.03.02’ and POS ‘n’ in the old data is 139993 ‘Ancient Greek philosophy’. In the new OED data the category with the path ’02.03.02’ and POS ‘n’ is 911699 ‘badness or evil’, while ‘Ancient Greek philosophy’ now appears as ’02.01.15.02’. Thankfully the CID field does not appear to have been changed, for example, CID 139993 in the new data is still ‘Ancient Greek philosophy’ and still therefore links to the HT catid 231136 ‘Ancient Greek philosophy’, which has the ‘oedmainat’ of 02.03.02. I note that our current ‘t’ number for this category is actually ‘02.01.15.02’, so perhaps the updates to the OED’s ‘path’ field bring it into line with the HT’s current numbering. I’m guessing that the situation won’t be quite as simple as that in all cases, though.
Moving on to lexemes, there are 751156 lexemes in the new OED data and 715546 in the old OED data, meaning there are some 35,610 ‘new’ lexemes. As with categories I’m guessing it’s not quite as simple as that, as some old lexemes may have been deleted too. Unfortunately, the OED does not have a unique identifier for lexemes in its data. I generate an auto-incrementing ID when I import the data, but as the order of the lexemes has changed between datasets the ID for the ‘old’ set does not correspond to the ID in the ‘new’ set. For example, the last lexeme in the ‘old’ set has an ID of 715546 and is ‘line’ in the category 237601. In the new set the lexeme with the ID 715546 is ‘melodica’ in the category 226870.
The OED lexeme data has two fields which sort of look like unique identifiers: ‘refentry’ and ‘refid’. The former is the ID for a dictionary entry while the latter is the ID for the sense. So for example refentry 85205 is the dictionary entry for ‘Heaven’ and refid 1922174 is the second sense, allowing links to individual senses, as follows: http://www.oed.com/view/Entry/85205#eid1922174. Unfortunately in the OED lexeme table neither of these IDs is unique, either on its own or in combination. For example, the lexeme ‘abaca’ has a refentry of 37 and a refid of 8725393, but there are three lexemes with these IDs in the data, associated with categories 22927, 24826 and 215239.
I was hoping that the combination of refentry, refid and category ID would be unique and serve as a primary key, and I therefore wrote a script to check for this. Unfortunately this script demonstrated that these three fields are not sufficient to uniquely identify a lexeme in the OED data. There are 5586 times that refentry and refid appear more than once in a category. Even more strangely, these occurrences frequently have different lexemes and different dates associated with them. For example: ‘Ecliptic circle’ (1678-1712) and ‘ecliptic way’ (1712-1712) both have 59369 as refentry and 5963672 as refid.
While there are some other entries that are clearly erroneous duplicates (e.g. half-world (1615-2013) and 3472: half-world (1615-2013) have the same refentry (83400, 83400) and refid (1221624180, 1221624180)), the above example and others are (I guess) legitimate and would not be fixed by removing duplicates, so we can’t rely on a combination of cid, refentry and refid to uniquely identify a lexeme.
Based on the data we’d been given from the OED, in order to uniquely identify an OED lexeme we would need to include the actual ‘lemma’ field and/or date fields. We can’t introduce our own unique identifier as it will be redefined every time new OED data is inputted, so we will have to rely on a combination of OED fields to uniquely identify a row, in order to link up one OED lexeme and one HT lexeme. But if we rely on the ‘lemma’ or date fields the risk is these might change between OED versions, so the link would break.
To try and find a resolution to this issue I contacted James McCracken, who is the technical guy at the OED. I asked him whether there is some other field that the OED uses to uniquely identify a lexeme that was perhaps not represented in the dataset we had been given. James was extremely helpful and got back to me very quickly, stating that the combination of ‘refentry’ and ‘refid’ uniquely identifies the dictionary sense, but that a sense can contain several different lemmas, each of which may generate a distinct item in the thesaurus, and these distinct items may co-occur in the same thesaurus category. He did, however, note that in the source data there’s also a pointer to the lemma (‘lemmaid’), which wasn’t included in the data we had been given. James pointed out that this field is only included when a lemma appears more than once in a category, but that we should therefore be able to use CID, refentry, refid and (where present) lemmaid to uniquely identify a lexeme. James very helpfully regenerated the data so that it included this field.
Once I received the updated data I updated my database structure to add in a new ‘lemmaid’ field and ran the new data through a slightly updated version of my migration script. The new data contains the same number of categories and lexemes as the dataset I’d been sent earlier in the week, so that all looks good. Of the lexemes there are 33283 that now have a lemmaid, and I also updated my script that looks for duplicate words in categories to check the combination of refentry, refid and lemmaid.
After adding in the new lemmaid field, the number of listed duplicates has decreased from 5586 to 1154. Rows such as ‘Ecliptic way’ and ‘Ecliptic circle’ have now been removed, which is great. There are still a number of duplicates listed that are presumably erroneous, for example ‘cock and hen (1785-2006)’ appears twice in CID 9178 and neither form has a lemmaid. Interestingly, the ‘half-world’ erroneous(?) duplicate example I gave previously has been removed as one of these has a ‘lemmaid’.
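The kind of duplicate check described here can be sketched in a few lines of Python. This is an illustration rather than the real script: the function name is made up, and in the sample rows the CID for the ‘ecliptic’ lexemes, all of the lemmaid values and the refentry / refid for ‘cock and hen’ are placeholders, not values from the OED data.

```python
from collections import Counter


def duplicate_keys(lexemes, fields):
    """Return each combination of `fields` that occurs more than once."""
    counts = Counter(tuple(lex.get(f) for f in fields) for lex in lexemes)
    return {key: n for key, n in counts.items() if n > 1}


# Placeholder rows: cid 1, the lemmaid values and refentry/refid 1 and 2 are invented
lexemes = [
    {"cid": 1, "refentry": 59369, "refid": 5963672, "lemmaid": 101, "lemma": "ecliptic circle"},
    {"cid": 1, "refentry": 59369, "refid": 5963672, "lemmaid": 102, "lemma": "ecliptic way"},
    {"cid": 9178, "refentry": 1, "refid": 2, "lemmaid": None, "lemma": "cock and hen"},
    {"cid": 9178, "refentry": 1, "refid": 2, "lemmaid": None, "lemma": "cock and hen"},
]

without_lemmaid = duplicate_keys(lexemes, ("cid", "refentry", "refid"))
with_lemmaid = duplicate_keys(lexemes, ("cid", "refentry", "refid", "lemmaid"))
```

Without lemmaid both pairs are flagged. With lemmaid included, only the ‘cock and hen’ pair remains, as neither row has a lemmaid to tell them apart.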
Unfortunately there are still rather a lot of what look like legitimate lemmas that have the same refentry and refid but no lemmaid. Although these point to the same dictionary sense they generally have different word forms and in many cases different dates. E.g. in CID 24296: poor man’s treacle (1611-1866) [Lemmaid 0] and countryman’s treacle (1745-1866) [Lemmaid 0] have the same refentry (205337, 205337) and refid (17724000, 17724000). We will need to continue to think about what to do with these next week as we really need to be able to identify individual lexemes in order to match things up properly with the HT lexemes. So this is a ‘to be continued’.
Also this week I spent some time in communication with the DSL people about issues relating to extracting their work in progress dictionary data and updating the ‘live’ DSL data. I can’t really go into detail about this yet, but I’ve arranged to visit the DSL offices next week to explore this further. I also made some tweaks to the DSL website (including creating a new version of the homepage) and spoke to Ann about the still in development WordPress version of the website and a long list of changes that she had sent me to implement.
I also tracked down a bug in the REELS system that was resulting in place-name element descriptions being overwritten with blanks in some situations. It would appear to only occur when associating place-name elements with a place when the ‘description’ field had carriage returns in it. When you select an element by typing characters into the ‘element’ box to bring up a list of matching elements and then select an element from the list, a request is sent to the server to bring back all the information about the element in order to populate the various boxes in the form relating to the element. However, raw line-break characters (newline and carriage return) are not valid inside a JSON string unless they are escaped as \n and \r. When an element description contained such unescaped characters, the returned file couldn’t be read properly by the script. Form elements up to the description field were getting automatically filled in, but then the description field was being left blank. Then when the user pressed the ‘update’ button the script assumed the description field had been updated (to clear the contents) and deleted the text in the database. Once I identified this issue I updated the script that grabs the information about an element so that special characters that break JSON files are removed, so hopefully this will not happen again.
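To show the underlying problem in isolation, here is a small Python sketch (the REELS code itself isn't shown here, and the sample description is made up). It demonstrates a hand-built JSON string failing to parse when it contains raw line breaks, the ‘remove the characters’ fix described above, and, as an alternative, letting a JSON encoder escape the characters so that the line breaks survive.

```python
import json

# A made-up description containing a Windows-style line break
description = "First line\r\nSecond line"

# Building the JSON by hand leaves raw control characters inside the string
hand_built = '{"description": "' + description + '"}'
try:
    json.loads(hand_built)
    hand_built_ok = True
except json.JSONDecodeError:
    hand_built_ok = False


def strip_line_breaks(text):
    """The fix described in the text: remove the characters that break the JSON."""
    return text.replace("\r", " ").replace("\n", " ")


stripped = json.loads('{"description": "' + strip_line_breaks(description) + '"}')

# Alternative: a JSON encoder escapes the characters, so nothing is lost
encoded = json.dumps({"description": description})
round_tripped = json.loads(encoded)["description"]
```

Here `hand_built_ok` ends up `False`, while `round_tripped` is identical to the original description, line breaks included.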
Also this week I updated the transcription case study on the Decadence and Translation website to tweak a couple of things that were raised during a demonstration of the system and I created a further timeline for the RNSN project, which took most of Friday afternoon.
I continued to work on the outstanding ‘stories’ for the Romantic National Song Network this week, completing work on another story using the storymap.js library. I have now completed seven of these stories, which is more than half of the total the project intends to make.
On Monday I met with E Jamieson, the new RA on the SCOSYA project, to discuss the maps we are going to make available to the public. We discussed various ways in which the point-based data might be extrapolated, such as heat maps and Voronoi diagrams. I found a nice example of a Leaflet.js / D3.js based Voronoi diagram that I think could work very well for the project (see https://chriszetter.com/voronoi-map/examples/uk-supermarkets/) so I might start to investigate using such an approach. I think we’d want to be able to colour-code the cells, although other D3.js examples of Voronoi diagrams suggest that this is possible (see this one: ). We also discussed how the more general public views of the data (as opposed to expert view) might work. The project team like the interface offered by this site: https://ygdp.yale.edu/phenomena/done-my-homework, although we want something that presents more of the explanatory information (including maybe videos) via the map interface itself. It looks like the storymap.js library (https://storymap.knightlab.com/) I’m using for RNSN might actually work very well for this. For RNSN I’m using the library with images rather than maps, but it was primarily designed for use with maps, and could hopefully be adapted to work with a map showing data points or even Voronoi layers.
I spent a further couple of days this week working on the HT / OED data linking task. This included reading through and giving feedback on Fraser’s abstract for DH2019 and updating the v1 / v2 comparison script to add in an additional column to show whether the v1 match was handled automatically or manually. I also created a new script to look at the siblings of categories that contain monosemous forms to see whether any of these might have matches at the same level. This script takes all of the monosemous matches as listed in the monosemous QA script and for each OED and HT category finds their unmatched siblings that don’t otherwise also appear in the list. The script then iterates through the OED siblings and for each of these compares the contents to the contents of each of the HT siblings. If there is a match (matches for this script being anything that’s green, lime green, yellow or orange) the row is displayed on screen. Where there are multiple monosemous categories at the same level the siblings will be analysed for each of the categories, so there is some duplication. E.g. the first monosemous link is ‘OED category 2797 01.02.06.01|03 (n) deep place or part matches HT category 1017 01.02.06.01.01|03 (n) deep place/part’ and there are two unmatched OED siblings (‘shallow place’ and ‘accumulation of water behind barrier’), so these are analysed. But the next monosemous category (OED category 2803 01.02.06.01|07 (n) bed of matches HT category 1024 01.02.06.01.01|07 (n) bed of) is at the same level, so the two siblings are analysed again. This happens quite a lot, but even so there are still some matches that this script finds that wouldn’t otherwise have been found due to changes in category number. I’ve made a count of the total unique matches (all colours) and it’s 162. I fear we are getting to the point where the amount of time it takes to write scripts to identify matches is taking longer than the time it would take to manually identify matches, though. It took several hours to write this script for 162 potential matches.
I also created a script that lists all of the non-matched OED and HT categories, split into various smaller lists, such as main categories or sub-categories, and on Wednesday I attended a meeting with Marc and Fraser to discuss our next steps. I came out of the meeting with another long list of items to try and tackle, and I spent some of the rest of the week going through the list. I ticked off the outstanding green, lime green and yellow matches on the lexeme pattern matching, sibling matching and monosemous matching scripts.
I then updated the sibling matching script to look for matches at any subcat level, but unfortunately this didn’t really uncover much new, at least initially. It found just one extra green and three yellows, although the 86 oranges look like they would mostly be ok too, with manual checking. I went over my script and it was definitely doing what I expected it to do, namely: Get all of the unmatched OED cats (e.g. 03.05.01.02.01.02|05.04 (vt)); for subcats get all of the unmatched HT subcats of the maincat in the same POS (e.g. all the unmatched subcats of 03.05.01.02.01.02 that are vt); list all of the subcats; if one of the stripped headings matches or has a Levenshtein score of 1 then this is highlighted in green and its contents are compared.
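For anyone unfamiliar with it, the Levenshtein score is the number of single-character insertions, deletions or substitutions needed to turn one string into another. A minimal Python sketch of the heading test might look like this (a standard dynamic-programming implementation, not the script's own code; `headings_match` is a made-up name and how the headings are ‘stripped’ is not covered):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def headings_match(oed_heading, ht_heading, max_distance=1):
    """True if the headings are identical or at most one edit apart."""
    return levenshtein(oed_heading, ht_heading) <= max_distance
```

So `headings_match("colour", "color")` is true (one deletion), while two headings that differ by two or more characters would not be highlighted.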
I then updated the script so that it didn’t compare category headings at all, but instead only looked at the contents. In this script each possible match appears in its own row (e.g. cat 120031 appears 4 times, once as an orange, 3 times as purple). It has brought back 8 greens, 1 lime green, 4 yellows and 1617 oranges.
I then updated the monosemous QA script to identify categories where the monosemous form has dates that match and one further date matches, the idea being if these criteria are met the category match is likely to be legitimate. This was actually really difficult to implement and took most of a day to do. This is because the identification of monosemous forms was done at a completely different point (and actually by a completely different script) to the listing and comparing of the full category contents. I had to rewrite large parts of the function that gets and compares lexemes in order to integrate the monosemous forms. The script now makes all monosemous forms in the OED word list for each category bold and compares these forms and their dates to all of the HT words in the category. A count of all of the monosemous forms that match an HT form in terms of stripped / pattern matched content and start date is stored. If this count is 1 or more and the count of ‘Matched stripped lexemes (including dates)’ is 2 or more then the match is bumped up to yellow. This has identified 512 categories, which is about a sixth of the total OED unmatched categories with words, which is pretty good.
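The ‘bump up to yellow’ rule boils down to two counts. A simplified Python sketch follows; the data layout (dictionaries with `form`, `year` and `monosemous` keys) and the function name are assumptions for the example, and the real script's pattern matching is reduced here to an exact comparison of stripped forms.

```python
def should_bump_to_yellow(oed_words, ht_words):
    """Apply the rule: at least 1 monosemous match and at least 2 matches overall.

    oed_words: list of dicts with 'form', 'year' and 'monosemous' keys.
    ht_words:  list of dicts with 'form' and 'year' keys.
    """
    ht_keys = {(w["form"], w["year"]) for w in ht_words}
    # OED words whose stripped form and start date both match an HT word
    matched = [w for w in oed_words if (w["form"], w["year"]) in ht_keys]
    monosemous_matched = sum(1 for w in matched if w["monosemous"])
    return monosemous_matched >= 1 and len(matched) >= 2
```

Note that the monosemous form itself counts towards the total of two, which is what ‘the monosemous form has dates that match and one further date matches’ amounts to.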
Other tasks this week included creating a new (and possibly final) blog post for the REELS project, dealing with some App related questions from someone in MVLS, having a brief meeting with Clara Cohen from English Language to discuss the technical aspects of a proposal she’s putting together and making a few further tweaks to the Bilingual Thesaurus website.
On Friday I attended the Corpus Linguistics in Scotland event at Edinburgh University. There were 12 different talks over the course of the day on a broad selection of subjects. As I’m primarily interested in the technologies used rather than the actual subject matter, here are some technical details. One presenter used Juxta (https://www.juxtaeditions.com/) to identify variation in manuscripts. Another used TEI to mark up pre-modern manuscripts for lexicographical use (looking at abbreviations, scribes, parts of speech, gaps, occurrences of particular headwords). Another speaker had created a plugin for the text editor Emacs that allows you to look at things like word frequencies, n-grams and collocations. A further speaker handled OCR using Google Cloud Vision (https://cloud.google.com/vision/) that can take images and analyse them in lots of ways, including extracting the text. A couple of speakers used AntConc (http://www.laurenceanthony.net/software/antconc/) and another couple used the newspaper collections available through LexisNexis (https://www.lexisnexis.com/ap/academic/form_news_wires.asp) as source data. Other speakers used Wordsmith tools (https://www.lexically.net/wordsmith/), Sketch Engine (https://www.sketchengine.eu) and WMatrix (http://ucrel.lancs.ac.uk/wmatrix/). It was very interesting to learn about the approaches taken by the speakers.
I spent most of my time this week split between three projects: The HT / OED category linking, the REELS project and the Bilingual Thesaurus. For the HT I continued to work on scripts to try and match up the HT and OED categories. This week I updated all the currently in use scripts so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data. Doing this has significantly improved the matching on the first date lexeme matching script. Greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788. I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists and included two new columns. The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match while the second is this figure as a percentage of the total number of OED lexemes.
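The date normalisation can be sketched in Python as below. This is an illustration only: the function name is made up, the sample inputs are invented rather than taken from the ‘GHT_date1’ or ‘fulldate’ fields, and ‘first four numeric characters’ is interpreted literally (all non-digits are ignored and the first four digits are kept).

```python
import re


def first_year(raw_date):
    """Return the first four digits of a date string as an int, or None.

    'OE' is converted to 1000 before the digits are extracted.
    """
    text = raw_date.replace("OE", "1000")
    digits = re.findall(r"[0-9]", text)
    if len(digits) < 4:
        return None
    return int("".join(digits[:4]))
```

With this in place, a made-up value such as `"c1225-1500"` and a plain `"1225"` both reduce to 1225, so the two sources can be compared on an equal footing.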
However, taking dates in isolation currently isn’t working very well, as if a date appears multiple times it generates multiple matches. So, for example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date. But lots of these have the same dates, meaning a total count of matched dates is 99, or 152% of the number of OED words. Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times each one appears in a category as its ‘date fingerprint’.
I created a new script to look at ‘date fingerprints’. The script generates arrays of categories for HT and OED unmatched categories. The dates of each word (or each word with a GHT date in the case of OED) in every category are extracted and a count of these is created (e.g. if the OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]). I ran this against the HT database to see what matches.
The script takes about half an hour to process. It grabs each unmatched OED category that contains words, picks out those that have GHT dates, gets the first four numerical figures of each and counts how many times this appears in the category. It does the same for all unmatched HT categories and their ‘fulldate’ column too. The script then goes through each OED category and for each goes through every HT category to find any that have not just the same dates, but the same number of times each date appears too. If everything matches the information about the matched categories is displayed.
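The fingerprint idea translates quite directly into Python. The sketch below is not the script described above: the names are hypothetical, the category IDs in the usage note are invented, and instead of comparing every OED category with every HT category it indexes the HT fingerprints in a dictionary first, which gives the same matches with far fewer comparisons.

```python
from collections import Counter, defaultdict


def fingerprint(years):
    """A hashable 'date fingerprint': each year with the number of times it occurs."""
    return frozenset(Counter(years).items())


def match_by_fingerprint(oed_cats, ht_cats):
    """Map each OED category ID to the HT category IDs with an identical fingerprint.

    Both arguments map a category ID to the list of start years of its words.
    Categories without any dated words are ignored.
    """
    ht_index = defaultdict(list)
    for catid, years in ht_cats.items():
        if years:
            ht_index[fingerprint(years)].append(catid)

    matches = {}
    for cid, years in oed_cats.items():
        if not years:
            continue
        hits = ht_index.get(fingerprint(years), [])
        if hits:
            matches[cid] = hits
    return matches
```

For example, an OED category 5678 with years `[1000, 1000, 1000, 1234]` would only match an HT category containing exactly three words dated 1000 and one dated 1234, in any order. A category with the same years but different counts would not match.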
The output has the same layout as the other scripts but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories. This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique. For an example of this, search for our old favourite ‘extra-terrestrial’ and you’ll see that as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches. Nothing other than the dates is used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy. But I think this script is going to be very useful.
I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success. It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are being complicated by polysemous HT forms in different parts of speech.
With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off. Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off. There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off. That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones. I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words. I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong. 1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388. On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.
For the REELS project I continued going through my list of things to do before the project launch. This included reworking the Advanced Search layout, adding in tooltip text, updating the start date browse, which was including ‘inactive’ data in its count, creating some further icons for combinations of classification codes, adding in Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that allows you to specify if your element search looks at current forms, historical forms or both, adding in Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live. You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names).
I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a historical form from 1797. The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’. The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source. Similarly if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though the only historical form before 1500 is ‘(lands of) Westmaynis’. This is because again the search is ‘get me all of the place-names with a historical form before 1500 that have any historical form including the text ‘N. mains’’. We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
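The difference between the two readings of the search can be shown with a small Python sketch. This is not the REELS search code: the data structure and function names are made up, the year 1490 for ‘(lands of) Westmaynis’ is a placeholder (the text only says the form is before 1500), and the text of the 1797 form is left as a placeholder too.

```python
# Simplified, partly invented record for illustration
place = {
    "name": "Hassington West Mains",
    "forms": [
        {"text": "(lands of) Westmaynis", "year": 1490, "elements": []},  # year is a placeholder
        {"text": "[form text not given]", "year": 1797, "elements": ["pn"]},
    ],
}


def earliest_year(place):
    """Earliest recorded historical form for the place-name as a whole."""
    return min(form["year"] for form in place["forms"])


def place_level_search(place, element, before):
    """How the search behaves: the two conditions are tested independently."""
    has_element = any(element in form["elements"] for form in place["forms"])
    return has_element and earliest_year(place) < before


def form_level_search(place, element, before):
    """What a user might expect: both conditions must hold for the same form."""
    return any(
        element in form["elements"] and form["year"] < before
        for form in place["forms"]
    )
```

With this record `place_level_search(place, "pn", 1500)` is true while `form_level_search(place, "pn", 1500)` is false, which is exactly the mismatch Carole noticed.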
On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website. Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.
For the Bilingual Thesaurus I continued to implement the search facilities for the resource. This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work. After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things. I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working. I then made a start on the advanced search form.
Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.
After a rather hectic couple of weeks this was a return to a more regular sort of week, which was a relief. I still had more work to do than there was time to complete, but it feels like the backlog is getting smaller at least. As with previous weeks, I continued with the HT / OED linking of categories processes this week, following on from the meeting Marc, Fraser and I had the Friday before. For the lexeme / data matching script I separated out categories with zero matches that have words from the orange list into a new list with a purple background. So orange now only contains categories where at least one word and its start date match. The ones now listed in purple are almost certainly incorrect matches. I also changed the ordering of results so that categories are listed by the largest number of matches, to make it easier to spot matches that are likely ok.
I also updated the ‘monosemous’ script, so that the output only contains OED categories that feature a monosemous word and is split into three tables (with links to each at the top of the page). The first table features 4455 OED categories that include a monosemous word that has a comparable form in the HT data. Where there are multiple monosemous forms they each correspond to the same category in the HT data. The second table features 158 OED categories where the linked HT forms appear in more than one category. This might be because the word is not monosemous in the HT data and appears in two different categories (these are marked with the text ‘red|’ so they can be searched for in the page). An OED category can also appear in this table even if there are no red forms if (for example) one of the matched HT words is in a different category to all of the others (see OED catid 45524, where the word ‘Puncican’ is found in a different HT category to the other words). The final table contains those OED categories that feature monosemous words that have no match in the HT data. There are 1232 of these. I also created a QA script for the 4455 matched monosemous categories, which applies the same colour coding and lexeme matching as other QA scripts I’ve created. On Friday we had another meeting to discuss the findings and plan our next steps, which I will continue with next week.
Also this week I wrote an initial version of a Data Management Plan for Thomas Clancy’s Iona project, and commented on the DMP assessment guidelines that someone from the University’s Data Management people had put together. I can’t really say much more about these activities, but it took at least a day to get all of this done. I also did some app management duties, setting up an account for a new developer, and made the new Seeing Speech and Dynamic Dialects websites live. These can now be viewed here: https://www.seeingspeech.ac.uk/ and here: https://www.dynamicdialects.ac.uk/. I also had an email conversation with Rhona Alcorn about Google Analytics for the DSL site.
With the REELS project’s official launch approaching, I spent a bit of time this week going through the 23 point ‘to do’ list I’d created last week. In fact, I added another three items to it. I’m going to tackle the majority of the outstanding issues next week, but this week I investigated and fixed an issue with the ‘export’ script in the Content Management System. The script is very memory intensive and it was exceeding the server’s memory limits, so asking Chris to increase this limit sorted the issue. I also updated the ‘browse place-names’ feature of the CMS, adding a new column and ordering facility to make it clearer which place-names actually appear on the website. I also updated the front-end so that it ‘remembers’ whether you prefer the map or the text view of the data using HTML5 local storage and added in information about the Creative Commons license to the site and the API. I investigated the issue of parish boundary labels appearing on top of icons, but as of yet I’ve not found a way to address this. I might return to it before the launch if there’s time, but it’s not a massive issue. I moved all of the place-name information on the record page above the map, other than purely map-based data such as grid reference. I also removed the option to search the ‘analysis’ field from the advanced search and updated the element ‘auto-complete’ feature so that it only now matches the starting letters of an element rather than any letters. I also noticed that the combination of ‘relief’ and ‘water’ classifications didn’t have an icon on the map, so I created one for it.
I also continued to work on the Bilingual Thesaurus website this week. I updated the way in which source links work. Links to dictionary sources now appear as buttons in the page, rather than in a separate pop-up. They feature the abbreviation (AND / MED / OED) and the magnifying glass icon and if you hover over a button the non-abbreviated form appears. For OED links I’ve also added the text ‘subscription required’ to the hover-over text. I also updated the word record so that where language of origin is ‘unknown’ the language of origin no longer gets displayed, and I made the headword text a bit bigger so it stands out more. I also added the full hierarchy above the category heading in the category section of the browse page, to make it easier to see exactly where you are. This will be especially useful for people using the site on narrow screens as the tree appears beneath the category section so is not immediately visible. You can click on any of the parts of the hierarchy here to jump to that point.
I then began to work on the search facility, and realised I needed to implement a ‘search words’ list that features variants. I did this for the Historical Thesaurus and it’s really useful. What I’ve done so far is generate alternatives for words that have brackets and dashes. For example, the headword ‘Bond(e)-man’ has the following search terms: Bond(e)-man, Bond-man, Bonde-man, Bond(e) man, Bond man, Bonde man, Bond(e)man, Bondman, Bondeman. None of these variants will ever appear on the website, but instead will be used to find the word when people search. I’ll need some feedback as to whether these options will suffice, but for now I’ve uploaded variants to a table and begun to get the quick search working. It’s not entirely there yet, but I should get this working next week. I also need to know what should be done about accented characters for search purposes. The simplest way to handle them would be to just treat them as non-accented characters – e.g. searching for ‘alue’ will find ‘alué’. However, this does mean you won’t be able to specifically search for words that include accented characters – e.g. a search for all the words featuring an ‘é’ will just bring back all words with an ‘e’ in them.
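The variant generation and the accent handling described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the script used for the site: the function names are hypothetical, each dash is replaced by a dash, a space or nothing, and each bracketed part is kept as-is, dropped, or kept without its brackets.

```python
import re
import unicodedata


def bracket_forms(word):
    """Return the word as-is, with bracketed letters dropped, and with the brackets removed."""
    if "(" not in word:
        return [word]
    dropped = re.sub(r"\([^)]*\)", "", word)   # 'Bond(e)' -> 'Bond'
    expanded = re.sub(r"[()]", "", word)       # 'Bond(e)' -> 'Bonde'
    return [word, dropped, expanded]


def search_variants(headword):
    """Generate search terms for a headword containing brackets and/or dashes."""
    variants = []
    for separator in ("-", " ", ""):
        dashed = headword.replace("-", separator)
        for form in bracket_forms(dashed):
            if form not in variants:           # skip duplicates, keep order
                variants.append(form)
    return variants


def fold_accents(text):
    """Treat accented characters as their non-accented equivalents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(search_variants("Bond(e)-man"))
print(fold_accents("alué"))
```

For ‘Bond(e)-man’ this produces the nine search terms listed above. Folding accents on both the stored variants and the search string is what would let ‘alue’ find ‘alué’, at the cost described above of not being able to search specifically for ‘é’.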
I was intending to add a count of the number of words in each hierarchical level to the browse, or at least to make hierarchical levels that include words bold in the browse, so as to let users know whether it’s worthwhile clicking on a category to view the words at this level. However, I’ve realised that this will just confuse users as levels that have no words in them but include child categories that do have words in them would be listed with a zero or not in bold, giving the impression that there is no content lower down the hierarchy.
My last task for the week was to create a new timeline for the RNSN project based on data that had been given to me. I think this is looking pretty good, but unfortunately making these timelines and related storymaps is very time-intensive, as I need to extract and edit the images, upload them to WordPress, extract the text and convert it into HTML and fill out the template with all of the necessary fields. It took about 2 and a half hours to make this timeline. However, hopefully the end result will be worth it.
This was a slightly unusual week for me, as I don’t often speak at events but I had sessions at workshops on Tuesday and Wednesday. The first one was an ArtsLab event about AHRC Data Management Plans while the second one was a workshop organised by Bryony Randall about digital editions. I think both workshops went well, and my sessions went pretty smoothly. It does take time to prepare for these sorts of things, though, especially when the material needs to be written from scratch, so most of the start of the week was spent preparing for and attending these events.
I also had a REELS project meeting on Tuesday morning where we discussed the feedback we’d received about the online resource and made a plan for what still needs to be finalised before the resource goes live at an event on the 17th of November. There are 23 items on the plan I drew up, so there’s rather a lot to get sorted in the next couple of weeks. Also relating to place-name studies, I made the new, Leaflet-powered maps for Thomas Clancy’s Saints Places website live this week. I made the new maps for this legacy resource to replace older Google-based maps that were no longer working due to Google now requiring credit card details to use their mapping services. An example of one of the new maps can be found here: https://saintsplaces.gla.ac.uk/saint.php?id=64.
Also this week I updated the ‘support us’ page of the DSL to include new information and a new structure (http://dsl.ac.uk/support-us/), arranged to meet Matthew Creasy to discuss future work on his Decadence and Translation project, and responded to a few more requests from Jeremy Smith about the last-minute bid he was putting together, which he managed to submit on Tuesday. I also spoke to Scott Spurlock about his crowdsourcing project and spoke to Jane Stuart-Smith about the questionnaire for the new Seeing Speech / Dynamic Dialects websites which are nearing completion. I set up a Google Play / App Store account for someone in MVLS who wanted to keep track of the stats for one of their apps and I spoke to Kirsteen McCue about timelines for her RNSN project.
By Thursday I managed to get settled back into my more regular work routine, and returned to work on the Bilingual Thesaurus for the first time in a few weeks. Louise Sylvester had supplied me with some text for the homepage and the about page, so I added that in. I also fixed the date for ‘Galiot’, which was previously only recorded with an end date, and changed the ‘there are no words in this category’ text to ‘there are no words at this level of the hierarchy’, which is hopefully less confusing.
I also split the list of words for each category into two separate lists, one for Anglo Norman and one for Middle English. Originally I was thinking of having these as separate tabs, but as there are generally not very many words in a category it seemed a little unnecessary, and would have made it harder for a user to compare AN and ME words at the same time. So instead the words are split into two sections of one list. I also added in the language of origin and language of citation text. This information currently appears underneath the line containing the headword, POS and dates. Finally, I added in the links to the source dictionaries. To retain the look of the HT site and to reduce clutter these appear in a pop-up that’s opened when you click on a ‘search’ icon to the right of the word (tooltip text appears if you hover over the search icon too). These might be replaced with in-page links for each word instead, though. Here’s a screenshot of how things currently look, but note that the colour scheme is likely to change as Louise has specified a preference for blue and red. I’ll probably reuse the colours below for the main ‘Thesaurus’ portal page.
I spent the rest of the week working through the HT / OED category linking issues. This included ticking off 6621 matches that were identified by the lexeme / first date matching script, ticking off 78 further matches that Fraser had checked manually, and creating a script that matches up 1424 categories within the category ‘Thing heard’ that had things done to their category numbers that had prevented these from being paired up by previous scripts. I haven’t ticked these off yet as Marc wanted to QA them first, so I created a further script to help with this process. I also wrote a script to fix the category numbers of some of the HT categories where an erroneous zero appears in the number – e.g. ‘016’ is used rather than ‘16’. There were 1355 of these errors, which have now been fixed, which should mean the previous matching scripts should be able to match up at least some of these. Marc, Fraser and I met on Friday to discuss the process, and unfortunately one of the scripts we looked at still had its ‘update’ code active, meaning the newly fixed ‘erroneous zero’ categories were passed through it and ticked off. After the meeting I deactivated the ‘update’ code and identified which rows had been ticked off, creating a script to help to QA these, so no real damage was done.
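As an illustration of the ‘erroneous zero’ fix, here is a minimal Python sketch. It rests on assumptions that are not confirmed above: that category numbers are dot-separated, and that an erroneous part is exactly three digits beginning with a zero (as in ‘016’). The function name and the sample numbers are hypothetical.

```python
import re


def fix_erroneous_zero(category_number):
    """Drop the leading zero from any three-digit part such as '016' (assumed pattern)."""
    parts = category_number.split(".")          # assumption: parts are dot-separated
    fixed = [p[1:] if re.fullmatch(r"0\d{2}", p) else p for p in parts]
    return ".".join(fixed)


# Hypothetical category numbers, for illustration only.
print(fix_erroneous_zero("01.02.016"))
print(fix_erroneous_zero("01.02.16"))
```

Numbers that do not contain the assumed pattern are returned unchanged, so a script along these lines could be re-run safely. It is also worth running any such fix with the ‘update’ step switched off first, and only writing the changes once the output has been checked.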
I also realised that the page I’d created to list statistics about matched / unmatched categories was showing an incorrect figure for unmatched categories that are not empty. Rather than having 2,931 unmatched OED categories that have a POS and are not empty the figure is actually 10,594. The stats page was subtracting the total matched figure (currently 213,553) from the total number of categories that have a POS and are not empty (216,484). I’m afraid I hadn’t included a count of matched categories that have a POS and are not empty (currently 205,890), which is what should have been used rather than the total matched figure. So unfortunately we have more unmatched categories to deal with than we thought.
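The arithmetic behind the two figures can be checked with a few lines of Python. The variable names are just for illustration; the numbers are the ones quoted above.

```python
# Figures quoted above.
total_with_pos_not_empty = 216_484      # all categories that have a POS and are not empty
total_matched = 213_553                 # all matched categories, including empty ones
matched_with_pos_not_empty = 205_890    # matched categories that have a POS and are not empty

# What the stats page was doing: subtracting the wrong 'matched' figure.
incorrect_unmatched = total_with_pos_not_empty - total_matched

# What it should do: subtract like from like.
correct_unmatched = total_with_pos_not_empty - matched_with_pos_not_empty

print(incorrect_unmatched)
print(correct_unmatched)
```

The first subtraction mixes two different populations, which is why it understated the number of unmatched categories.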
I also made a tweak to the lexeme / first date matching script, removing ‘to ‘ from the start of lexemes in order to match them. This helped bump a number of categories up into our thresholds for potential matches. I also changed the thresholds and added in a new grouping. The criteria for potential matches have been reduced by one word to 5 matching words and a total of 80% matching words. I also created a new grouping for categories that don’t meet this threshold but still have 4 matching words. I’ll continue with this next week.
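The thresholds and the ‘to ’ tweak can be sketched in Python as below. This is a simplified illustration rather than the actual script: it compares lexemes only and ignores first dates, it assumes the 80% is measured against the number of HT lexemes in the category, and the function names, group labels and sample words are all hypothetical.

```python
def normalise_lexeme(lexeme):
    """Remove a leading 'to ' so that e.g. 'to run' can match 'run'."""
    return lexeme[3:] if lexeme.startswith("to ") else lexeme


def classify(ht_lexemes, oed_lexemes):
    """Place a pair of categories into a group based on their shared lexemes."""
    ht = {normalise_lexeme(lexeme) for lexeme in ht_lexemes}
    oed = {normalise_lexeme(lexeme) for lexeme in oed_lexemes}
    matching = len(ht & oed)
    # Assumption: the percentage is the share of HT lexemes that matched.
    share = matching / len(ht) if ht else 0.0
    if matching >= 5 and share >= 0.8:
        return "potential match"
    if matching >= 4:
        return "four matching words"
    return "below threshold"


# Hypothetical sample categories, for illustration only.
ht_words = ["to run", "walk", "jump", "hop", "skip"]
oed_words = ["run", "walk", "jump", "hop", "skip", "leap"]
print(classify(ht_words, oed_words))
```

Without the normalisation step, ‘to run’ and ‘run’ would not count as a match, which is the kind of case that was keeping some categories just below the threshold.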