I met with Marc and Fraser on Monday to discuss the current situation with the HT / OED linking task. As I mentioned last week, we had run into an issue linking HT and OED lexemes, as there didn’t appear to be any means of uniquely identifying specific OED lexemes: on investigation, the likely candidates (a combination of category ID, refentry and refid) could apply to multiple lexemes, each with different forms and dates. James McCracken at the OED had helpfully found a way to include a further ID field (lemmaid) that should have differentiated these duplicates, and for the most part it did, but there were still more than a thousand rows where the combination of the four columns was not unique.
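The uniqueness check described above can be sketched as follows. This is a minimal illustration only: the field names and sample values are hypothetical, not the actual OED column names.

```python
from collections import Counter

# Hypothetical sample rows: each OED lexeme row carries (among other
# fields) a category ID, refentry, refid and the newly added lemmaid.
# Field names and values here are illustrative.
lexemes = [
    {"catid": 101, "refentry": 5, "refid": 1, "lemmaid": 0, "form": "sweet"},
    {"catid": 101, "refentry": 5, "refid": 1, "lemmaid": 1, "form": "swete"},
    {"catid": 102, "refentry": 9, "refid": 2, "lemmaid": 0, "form": "sour"},
    {"catid": 102, "refentry": 9, "refid": 2, "lemmaid": 0, "form": "soure"},
]

def duplicate_keys(rows):
    """Return the four-column ID combinations that occur more than once."""
    counts = Counter(
        (r["catid"], r["refentry"], r["refid"], r["lemmaid"]) for r in rows
    )
    return {key for key, n in counts.items() if n > 1}

dupes = duplicate_keys(lexemes)
# The first two rows are now distinguished by lemmaid; only the third
# and fourth rows still share all four IDs, so they remain ambiguous.
```

In this toy data, adding lemmaid resolves one pair of duplicates but not the other, which mirrors the situation with the thousand-odd rows that remained non-unique.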
At our meeting we decided that this number of duplicates was pretty small (we are, after all, dealing with more than 700,000 lexemes) and that we’d just continue with our matching processes and ignore these duplicates until they can be sorted. Unexpectedly, James got back to me soon after the meeting, having managed to fix the issue. He sent me an updated dataset that, after processing, left only 28 duplicate rows, which is going to be a great help.
As a result of our meeting I made a number of further changes to scripts I’d previously created, including fixing the layout of the gap matching script to make it easier for Fraser to manually check the rows. I also updated the ‘duplicate lexemes in categories’ script (these are a different sort of duplicate: word forms that appear more than once in a category, but with their own unique identifiers) so that HT words where the ‘wordoed’ field is the same but the ‘word’ field is different are not considered duplicates. This should filter out words of OE origin that shouldn’t be considered duplicates. So, for example, ‘unsweet’ with ‘unsweet’ and ‘unsweet’ with ‘unsweet < unswete’ no longer appear as duplicates. This has reduced the number of rows listed from 567 to 456: not as big a drop as I’d expected, but an improvement nonetheless.
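The filtering rule amounts to treating rows as duplicates only when category, ‘wordoed’ and ‘word’ all match. A minimal sketch, again with made-up field names and rows:

```python
from collections import Counter

# Illustrative rows: real HT data holds 'word' and 'wordoed' per category.
words = [
    {"catid": 1, "word": "unsweet",           "wordoed": "unsweet"},
    {"catid": 1, "word": "unsweet < unswete", "wordoed": "unsweet"},
    {"catid": 2, "word": "bitter",            "wordoed": "bitter"},
    {"catid": 2, "word": "bitter",            "wordoed": "bitter"},
]

def duplicate_rows(rows):
    """Flag rows as duplicates only if catid, wordoed AND word all match.

    Rows sharing a wordoed but differing in word (e.g. OE variants such
    as 'unsweet' vs 'unsweet < unswete') are deliberately not flagged.
    """
    counts = Counter((r["catid"], r["wordoed"], r["word"]) for r in rows)
    return [
        r for r in rows
        if counts[(r["catid"], r["wordoed"], r["word"])] > 1
    ]
```

Here the ‘unsweet’ pair in category 1 is excluded because the ‘word’ fields differ, while the identical ‘bitter’ rows in category 2 are still reported.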
At the meeting I’d also pointed out that the new data from the OED omits some categories that were present in the version of the OED data we’d been working with up to this point. There are 256 OED categories that have been deleted, and these contain 751 words. I wanted to check what was going on with these categories, so I wrote a little script that lists the deleted categories and their words. I added a check to see which of these are ‘quarantined’ categories (categories that were duplicated in the existing data and that we had previously marked as ‘quarantined’ to keep them separate from other categories), and I’m very glad to say that 202 such categories have been deleted (out of a total of 207 quarantined categories; we’ll need to see what’s going on with the remainder). I also added in a check to see whether any of the deleted OED categories are matched up to HT categories. There are, unfortunately, 42 such categories, which appear in red. We’ll need to decide what to do about these, ideally before I switch to using the new OED data, otherwise we’re left with OED catids in the HT’s category table that point to nothing.
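The cross-checks the script performs are straightforward set operations. The sketch below uses hypothetical IDs purely to show the shape of the logic; the real script works against the HT/OED database tables.

```python
# Hypothetical category IDs, for illustration only.
old_oed_cats = {1, 2, 3, 4, 5}   # categories in the existing OED data
new_oed_cats = {1, 3}            # categories surviving in the new dataset
quarantined  = {2, 5}            # previously flagged duplicate categories
ht_matches   = {4: 900}          # OED catid -> matched HT catid

# Categories present before but gone from the new data.
deleted = old_oed_cats - new_oed_cats

# Deleted categories that were quarantined anyway (good news).
deleted_quarantined = deleted & quarantined

# Deleted categories still matched to an HT category: these would leave
# dangling OED catids in the HT category table, so they need resolving
# before switching over to the new data.
deleted_but_matched = {c for c in deleted if c in ht_matches}
```

In this toy data, one deleted category is still matched to an HT category, which is exactly the situation the 42 red rows flag up.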
In addition to the HT / OED task, I spent about half the week working on DSL related issues, including a trip to the DSL offices in Edinburgh on Wednesday. The team have been making updates to the data on a locally hosted server for many years now, and none of these updates have yet made their way into the live site. I’m helping them to figure out how to get the data out of the systems they have been using and into the ‘live’ system. This is a fairly complicated task, as the data is stored in two separate systems that need to be amalgamated. In addition, the ‘live’ data stored at Glasgow is made available via an API that I didn’t develop, for which there is very little documentation, and which appears to dynamically alter and refactor the data extracted from the underlying database each time a request is made. As this API uses technologies that Arts IT Support are not especially happy to host on their servers (Django / Python and Solr), I am going to develop a new API using technologies that Arts IT Support are happy to deal with (PHP). This will eventually replace the old API, and the old data will be replaced with the new, merged data that the DSL people have been working on. It’s going to be a pretty big task, but it really needs to be tackled.
Last week Ann Ferguson from the DSL had sent me a list of changes she wanted me to make to the ‘Wordpressified’ version of the DSL website. These ranged from minor tweaks to text, to reworking the footer, to providing additional options for the ‘quick search’ on the homepage to allow a user to select whether their search looks in SND, DOST or both source dictionaries. It took quite some time to go through this document, and I’ve still not entirely finished everything, but the bulk of it is now addressed.
The request for updating the ‘save screenshot’ feature refers to the option to save an image of the atlas, complete with all icons and the legend, at a resolution much greater than the user’s monitor can display, in order to use the image in print publications. Unfortunately, getting the map position correct when using this feature is very difficult: small changes to position can result in massively different images.
I took another look at the screengrab plugin I’m using to see if there’s any way to make it work better. The plugin is leaflet.easyPrint (https://github.com/rowanwins/leaflet-easyPrint). I was hoping that perhaps there had been a new version released since I installed it, but unfortunately there hasn’t. The standard print sizes all seem to work fine (i.e. positioning the resulting image in the right place). The A3 size is something I added in, following the directions under ‘Custom print sizes’ on the page above. This is the only documentation there is, and by following it I got the feature working as it currently does. I’ve tried searching online for issues relating to the custom print size, but I haven’t found anything relating to map position. I’m afraid I can’t really attempt to update the plugin’s code as I don’t know enough about how it works and the code is pretty incomprehensible (see it here: https://raw.githubusercontent.com/rowanwins/leaflet-easyPrint/gh-pages/dist/bundle.js).
I’d previously tried several other ‘save map as image’ plugins, but without success, mainly because they are unable to incorporate HTML map elements (which we use for icons and the legend). For example, the documentation for the plugin https://github.com/mapbox/leaflet-image rather bluntly states: “This library does not rasterize HTML because browsers cannot rasterize HTML. Therefore, L.divIcon and other HTML-based features of a map, like zoom controls or legends, are not included in the output, because they are HTML.”
I think that with the custom print size in the plugin we’re using we’re really pushing the boundaries of what it’s possible to do with interactive maps. They’re not designed to be displayed bigger than a screen and they’re not really supposed to be converted to static images either. I’m afraid the options available are probably as good as it’s going to get.
Also this week I made some further changes to the RNSN timelines, had a chat with Simon Taylor about exporting the REELS data for print publication, undertook some App store admin duties and had a chat with Helen Kingstone about a research database she’s hoping to put together.