
Month: February 2019
Week Beginning 18th February 2019
As with the past few weeks, I spent a fair amount of time this week on the HT / OED data linking issue. I updated the ‘duplicate lexemes’ tables to add in some additional information. For HT categories the catid now links through to the category in the HT website and each listed word has an [OED] link after it that performs a search for the word on the OED website, as currently happens with words on the HT website. For OED categories the [OED] link leads directly to the sense on the OED website, using a combination of ‘refentry’ and ‘refid’.
I then created a new script that lists HT / OED categories where all the words match (HT and OED stripped forms are the same and HT startdate matches OED GHT1 date) or where all HT words match and there are additional OED forms (hopefully ‘new’ words), with the latter appearing in red after the matched words. Quite a large percentage of categories either have all their words matching or have everything matching except a few additional OED words (note that ‘OE’ words are not included in the HT figures):
For 01: 82300 out of 114872 categories (72%) are ‘full’ matches; 335195 out of 388189 HT words match (86%) and 335196 out of 375787 OED words match (89%).
For 02: 20295 out of 29062 categories (70%) are ‘full’ matches; 106845 out of 123694 HT words match (86%) and 106842 out of 119877 OED words match (89%).
For 03: 57620 out of 79248 categories (73%) are ‘full’ matches; 193817 out of 223972 HT words match (87%) and 193186 out of 217771 OED words match (89%).
It’s interesting how consistent the level of matching is across all three branches of the thesaurus.
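To make the matching criteria concrete, here is a rough sketch of the category-level check in TypeScript. The data shapes are illustrative only: the real script works against the HT and OED database tables, uses the HT’s existing stripped forms, and excludes ‘OE’ words before this point.

```ts
// Hypothetical shapes; the real data lives in the HT/OED SQL tables.
interface HtWord { stripped: string; startDate: string }   // startDate = first part of the HT fulldate
interface OedWord { stripped: string; ght1: string }       // ght1 = OED GHT1 date

type MatchResult = 'full' | 'full-plus-extra-oed' | 'partial';

function classifyCategory(htWords: HtWord[], oedWords: OedWord[]): MatchResult {
  // A word pair matches when the stripped forms are identical and the
  // HT start date equals the OED GHT1 date.
  const matchesOed = (h: HtWord) =>
    oedWords.some(o => o.stripped === h.stripped && o.ght1 === h.startDate);

  if (!htWords.every(matchesOed)) return 'partial';

  const matchesHt = (o: OedWord) =>
    htWords.some(h => h.stripped === o.stripped && h.startDate === o.ght1);

  // Every HT word matched; any leftover OED forms are potential 'new' words.
  return oedWords.every(matchesHt) ? 'full' : 'full-plus-extra-oed';
}
```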
I also received a new batch of XML data from the OED, which will need to replace the existing OED data that we’re working with. Thankfully I have set things up so that the linking of OED and HT data takes place in the HT tables; for example, the link between an HT and OED category is established by storing the primary key of the OED category as a foreign key in the corresponding row of the HT category table. This means that swapping out the OED data should (or at least I thought it should) be pretty straightforward.
I ran the new dataset through the script I’d previously created that goes through all of the OED XML, extracts category and lexeme data and inserts it into SQL tables. As expected, the new data contains more categories than the old data: there are 238697 categories in the new data and 237734 in the old data, which looks like 963 new categories. However, it’s likely to be more complicated than that. Thankfully the OED categories have a unique ID (called ‘CID’ in our database). In the old data this increments from 1 to 237734 with no gaps. In the new data there are lots of new categories with IDs greater than 900000: 1219 of them, in fact. These are presumably new categories, but note that there are more categories with these new IDs than there are ‘new’ categories overall, meaning some existing categories must have been deleted. There are 237478 categories with an ID less than 900000, meaning 256 categories have been deleted. We’re going to have to work out what to do with these deleted categories and any lexemes contained within them (which may have been moved to other categories).
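The old-versus-new comparison itself is just a set difference on the CID values. A minimal sketch, assuming the CIDs have already been pulled out of the old and new category tables:

```ts
// Hypothetical: oldCids and newCids are the CID values extracted from the
// old and new OED category tables.
function compareCids(oldCids: number[], newCids: number[]) {
  const oldSet = new Set(oldCids);
  const newSet = new Set(newCids);

  const added = newCids.filter(cid => !oldSet.has(cid));   // categories only in the new data
  const deleted = oldCids.filter(cid => !newSet.has(cid)); // categories that have disappeared

  // In the February 2019 data this gives 1219 added categories (all with CIDs
  // above 900000) and 256 deleted ones.
  return { added, deleted };
}
```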
Another complication is that the ‘Path’ field in the new OED data has been reordered to make way for changes to categories. For example, the OED category with the path ‘02.03.02’ and POS ‘n’ in the old data is 139993 ‘Ancient Greek philosophy’. In the new OED data the category with the path ‘02.03.02’ and POS ‘n’ is 911699 ‘badness or evil’, while ‘Ancient Greek philosophy’ now appears as ‘02.01.15.02’. Thankfully the CID field does not appear to have been changed; for example, CID 139993 in the new data is still ‘Ancient Greek philosophy’ and still therefore links to the HT catid 231136 ‘Ancient Greek philosophy’, which has the ‘oedmainat’ of 02.03.02. I note that our current ‘t’ number for this category is actually ‘02.01.15.02’, so perhaps the updates to the OED’s ‘path’ field bring it into line with the HT’s current numbering. I’m guessing that the situation won’t be quite as simple as that in all cases, though.
Moving on to lexemes: there are 751156 lexemes in the new OED data and 715546 in the old OED data, meaning there are some 35,610 ‘new’ lexemes. As with categories, I’m guessing it’s not quite as simple as that, as some old lexemes may have been deleted too. Unfortunately, the OED does not have a unique identifier for lexemes in its data. I generate an auto-incrementing ID when I import the data, but as the order of the lexemes has changed between datasets the ID in the ‘old’ set does not correspond to the ID in the ‘new’ set. For example, the last lexeme in the ‘old’ set has an ID of 715546 and is ‘line’ in the category 237601, while in the new set the lexeme with the ID 715546 is ‘melodica’ in the category 226870.
The OED lexeme data has two fields which sort of look like unique identifiers: ‘refentry’ and ‘refid’. The former is the ID for a dictionary entry while the latter is the ID for the sense. So for example refentry 85205 is the dictionary entry for ‘Heaven’ and refid 1922174 is the second sense, allowing links to individual senses, as follows: http://www.oed.com/view/Entry/85205#eid1922174. Unfortunately in the OED lexeme table neither of these IDs is unique, either on its own or in combination. For example, the lexeme ‘abaca’ has a refentry of 37 and a refid of 8725393, but there are three lexemes with these IDs in the data, associated with categories 22927, 24826 and 215239.
I was hoping that the combination of refentry, refid and category ID would be unique and serve as a primary key, and I therefore wrote a script to check for this. Unfortunately this script demonstrated that these three fields are not sufficient to uniquely identify a lexeme in the OED data: there are 5586 cases where a refentry and refid combination appears more than once within a category. Even more strangely, these occurrences frequently have different lexemes and different dates associated with them. For example, ‘Ecliptic circle’ (1678-1712) and ‘ecliptic way’ (1712-1712) both have 59369 as refentry and 5963672 as refid.
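The check itself is a matter of grouping the imported lexemes by category, refentry and refid and seeing which keys occur more than once. A rough TypeScript equivalent of what the script does (field names are illustrative; the real script queries the SQL tables):

```ts
// Hypothetical row shape for the imported OED lexeme table.
interface OedLexeme { cid: number; refentry: number; refid: number; lemma: string }

// Group lexemes by category + refentry + refid and report any key used more than once.
function findDuplicateSenseKeys(lexemes: OedLexeme[]): Map<string, OedLexeme[]> {
  const groups = new Map<string, OedLexeme[]>();
  for (const lex of lexemes) {
    const key = `${lex.cid}|${lex.refentry}|${lex.refid}`;
    const group = groups.get(key) ?? [];
    group.push(lex);
    groups.set(key, group);
  }
  // Keep only the keys that identify more than one lexeme (5586 of them in our data).
  return new Map([...groups].filter(([, rows]) => rows.length > 1));
}
```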
While there are some other entries that are clearly erroneous duplicates (e.g. two ‘half-world’ (1615-2013) rows share the same refentry (83400) and refid (1221624180)), the above example and others like it are (I guess) legitimate and would not be fixed by removing duplicates, so we can’t rely on a combination of cid, refentry and refid to uniquely identify a lexeme.
Based on the data we’d been given by the OED, in order to uniquely identify an OED lexeme we would need to include the actual ‘lemma’ field and/or the date fields. We can’t introduce our own unique identifier, as it would be redefined every time new OED data is imported, so we will have to rely on a combination of OED fields to uniquely identify a row in order to link one OED lexeme to one HT lexeme. But if we rely on the ‘lemma’ or date fields, the risk is that these might change between OED versions, breaking the link.
To try and find a resolution to this issue I contacted James McCracken, who is the technical guy at the OED. I asked him whether there is some other field that the OED uses to uniquely identify a lexeme that was perhaps not represented in the dataset we had been given. James was extremely helpful and got back to me very quickly, stating that the combination of ‘refentry’ and ‘refid’ uniquely identifies the dictionary sense, but that a sense can contain several different lemmas, each of which may generate a distinct item in the thesaurus, and these distinct items may co-occur in the same thesaurus category. He did, however, note that the source data also contains a pointer to the lemma (‘lemmaid’), which wasn’t included in the data we had been given. James pointed out that this field is only included when a lemma appears more than once in a category, but that we should therefore be able to use CID, refentry, refid and (where present) lemmaid to uniquely identify a lexeme. James very helpfully regenerated the data so that it included this field.
Once I received the updated data I updated my database structure to add in the new ‘lemmaid’ field and ran the data through a slightly updated version of my migration script. The new data contains the same number of categories and lexemes as the dataset I’d been sent earlier in the week, so that all looks good. Of the lexemes, 33283 now have a lemmaid, and I also updated my script that looks for duplicate words in categories so that it checks the combination of refentry, refid and lemmaid.
After adding in the new lemmaid field, the number of listed duplicates has decreased from 5586 to 1154. Rows such as ‘Ecliptic way’ and ‘Ecliptic circle’ have now been removed, which is great. There are still a number of duplicates listed that are presumably erroneous, for example ‘cock and hen (1785-2006)’ appears twice in CID 9178 and neither form has a lemmaid. Interestingly, the ‘half-world’ erroneous(?) duplicate example I gave previously has been removed as one of these has a ‘lemmaid’.
Unfortunately there are still rather a lot of what look like legitimate lemmas that have the same refentry and refid but no lemmaid. Although these point to the same dictionary sense they generally have different word forms and in many cases different dates. E.g. in CID 24296, ‘poor man’s treacle’ (1611-1866) and ‘countryman’s treacle’ (1745-1866) share the same refentry (205337) and refid (17724000) and neither has a lemmaid. We will need to continue to think about what to do with these next week, as we really need to be able to identify individual lexemes in order to match things up properly with the HT lexemes. So this is a ‘to be continued’.
Also this week I spent some time in communication with the DSL people about issues relating to extracting their work-in-progress dictionary data and updating the ‘live’ DSL data. I can’t really go into detail about this yet, but I’ve arranged to visit the DSL offices next week to explore this further. I also made some tweaks to the DSL website (including creating a new version of the homepage) and spoke to Ann about the still-in-development WordPress version of the website and a long list of changes that she had sent me to implement.
I also tracked down a bug in the REELS system that was resulting in place-name element descriptions being overwritten with blanks in some situations. It would appear to only occur when associating place-name elements with a place when the ‘description’ field had carriage returns in it. When you select an element by typing characters into the ‘element’ box to bring up a list of matching elements and then select an element from the list, a request is sent to the server to bring back all the information about the element in order to populate the various boxes in the form relating to the element. However, unescaped control characters such as those representing carriage returns and line breaks (\r and \n) are not valid inside JSON strings. When an element description contained such characters, the returned file couldn’t be read properly by the script. Form elements up to the description field were getting automatically filled in, but the description field was being left blank. Then, when the user pressed the ‘update’ button, the script assumed the description field had been updated (to clear the contents) and deleted the text in the database. Once I identified this issue I updated the script that grabs the information about an element so that the special characters that break JSON files are removed, so hopefully this will not happen again.
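The element-lookup script itself isn’t reproduced here, but the underlying problem (and the fix of stripping the offending characters before output) can be illustrated in a few lines of TypeScript:

```ts
// A raw line break inside a JSON string value makes the whole document invalid.
const broken = '{"description": "First line\nSecond line"}'; // \n here is a real newline character
try {
  JSON.parse(broken);
} catch (err) {
  console.log('Parse failed:', (err as Error).message); // e.g. "Bad control character in string literal"
}

// Stripping (or properly escaping) the control characters before output fixes it.
const description = 'First line\r\nSecond line';
const safe = JSON.stringify({ description: description.replace(/[\r\n]+/g, ' ') });
console.log(JSON.parse(safe).description); // "First line Second line"
```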
Also this week I updated the transcription case study on the Decadence and Translation website to tweak a couple of things that were raised during a demonstration of the system and I created a further timeline for the RNSN project, which took most of Friday afternoon.
Week Beginning 11th February 2019
I continued with the HT / OED linking tasks for a lot of this week, dealing not only with categories but also the linking of lexemes within linked categories. We’d previously discovered that the OED had duplicated an entire branch of the HT: their 03.01.07.06 was structurally the same as their 03.01.04.06, but the lexemes contained in the two branches didn’t match up exactly due to subsequent revisions. We had decided to ‘quarantine’ the 03.01.07.06 branch so as to ensure no contents from it are accidentally matched up. I did so by adding a new ‘quarantined’ column to the ‘category_oed’ table. It’s ‘N’ by default and ‘Y’ for the 207 categories in this branch. All future lexeme matching scripts will be set to ignore this branch.
I also created a ‘gap matching’ script. This grabs every unmatched OED category that has a POS and contains words (not including the quarantined categories). There are 950 in total. For each of these the script grabs the OED categories with an ID one lower and one higher than the category ID and only returns them if they are both the same POS and contain words. So for example with OED 2560 ‘relating to dry land’ (aj) the previous category is 2559 ‘partially’ and the next category is 2561 ‘spec’. It then checks to see whether these are both matched up to HT categories. In this case they are, the former to 910 ‘partially’, the latter to 912 ‘specific’. The script then notes whether there is a gap in the HT numbering, which there is here. It also checks to make sure the category in the gap is of the same POS. So in this example, 911 is the gap and the category (‘pertaining to dry land’) is an Aj. So this category is returned in its own column, along with a count of the number of words and a list of the words.
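Sketched in TypeScript, the gap-matching check looks something like the following. The shapes and lookups are illustrative only; the real script runs against the database tables.

```ts
// Hypothetical category shapes; the real check runs against the HT/OED tables.
interface OedCat { cid: number; pos: string; wordCount: number; htCatId: number | null }
interface HtCat { catid: number; pos: string }

function findGapCandidate(
  target: OedCat,                       // an unmatched OED category with a POS and words
  oedByCid: Map<number, OedCat>,        // OED categories keyed by CID
  htByCatid: Map<number, HtCat>,        // HT categories keyed by catid
  htMatchedElsewhere: Set<number>       // HT catids already linked to other OED categories
): { htCat: HtCat; alreadyMatched: boolean; wrongPos: boolean } | null {
  const before = oedByCid.get(target.cid - 1);
  const after = oedByCid.get(target.cid + 1);

  // Both neighbours must exist, be matched to HT categories, share the POS and contain words.
  if (!before || !after || before.htCatId === null || after.htCatId === null) return null;
  if (before.pos !== target.pos || after.pos !== target.pos) return null;
  if (before.wordCount === 0 || after.wordCount === 0) return null;

  // Is there a gap in the HT numbering between the two matched HT categories?
  if (after.htCatId - before.htCatId < 2) return null;
  const htCat = htByCatid.get(before.htCatId + 1); // currently just the first category in the gap
  if (!htCat) return null;

  return {
    htCat,
    alreadyMatched: htMatchedElsewhere.has(htCat.catid), // flagged for manual checking
    wrongPos: htCat.pos !== target.pos,                  // also flagged
  };
}
```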
There are, however, some things to watch out for. There are a few occasions where there is more than one HT category in the gap. For example, for the OED category 165009 ‘enter upon command’ the ‘before’ category matches HT category 157423 and the ‘after’ category matches 157445, meaning there are several categories in the gap. Currently in such cases the script just grabs the first HT category in the gap. Linked to this (but not always due to this) some HT categories in the gap are already linked to other OED categories. I’ve put in a check for this so they can be manually checked.
There are 169 gaps to explore, and in 14 of these the HT category in the gap is already matched to something else. There are also two cases where the identified HT category in the gap is the wrong POS, and these are also flagged. Many of the potential matches are ones that have fallen through the cracks due to lexemes being too different to match up automatically, generally because there are only 1-3 matching words in the category. The matches look pretty promising, and will just need to be manually checked over before I tick a lot of them off.
Also this week, I updated the ‘match lexemes’ script output to ignore a final ‘s’ and an initial ‘to’. I also added in counts of matched and unmatched words. We were right to be concerned about duplicate words, as the ‘total matched’ figures for OED and HT lexemes are not the same, meaning a single OED word can match multiple HT words (or vice versa). After running the script here are some stats:
For ‘01’ there are 347312 matched HT words and 40877 unmatched HT words, and 347947 matched OED words and 27840 unmatched OED words.
For ‘02’ there are 110510 matched HT words and 13184 unmatched HT words, and 110651 matched OED words and 9226 unmatched OED words.
For ‘03’ there are 201653 matched HT words and 22319 unmatched HT words, and 201994 matched OED words and 15777 unmatched OED words.
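The extra normalisation mentioned above amounts to something like the following simplified sketch. The HT data already has its own stripped forms; this just layers the final-‘s’ and initial-‘to’ rules on top.

```ts
// Simplified normalisation along the lines described above: lower-case,
// drop a leading "to " and a trailing "s" before comparing forms.
function stripForm(form: string): string {
  let s = form.toLowerCase().trim();
  if (s.startsWith('to ')) s = s.slice(3); // e.g. "to shine" -> "shine"
  if (s.endsWith('s')) s = s.slice(0, -1); // e.g. "heavens" -> "heaven"
  return s;
}

// With this, "heavens" and "heaven" compare as equal, which is exactly why a
// single OED form can now match more than one HT form (and vice versa).
```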
I then created a script that lists all duplicate lexemes in HT and OED categories. There shouldn’t really be any duplicate lexemes in categories, as each word should only appear once in each sense. However, my script uncovered rather a lot of duplicates. This is going to have an impact on our lexeme matching scripts, as our plans were based on the assumption that each lexeme form would be unique within a category. My script gives four different lists for both HT and OED categories: all categories comparing citation form, all categories comparing stripped form, matched categories comparing citation form and matched categories comparing stripped form. The output lists the lexeme ID and either the fulldate (for HT) or GHT dates 1 and 2 (for OED) so that it’s easier to compare forms.
For all HT categories there are 576 duplicates using citation form and 3316 duplicates using the stripped form. The majority of these are in matched categories (550 and 3264 respectively). In the OED data things get much, much worse. For all OED categories there are 5662 duplicates using citation form and 6896 duplicates using the stripped form. Again, the majority of these are in matched categories (5634 and 6868 respectively). This is going to need some work in the coming weeks.
As we can’t currently rely on the word form in a category being unique, I decided to make a new script that matches lexemes in matched categories using both their word form and their date. It matches on the stripped word form and the start date (the first bit of the HT fulldate against the GHT1 date) and is looking pretty promising, with matched figures not too far off those found when comparing the stripped word form on its own. The script lists the HT word / ID and date and its corresponding OED word / ID and date in both the HT and OED word columns. Any unmatched HT or OED words are then listed in red underneath.
Here are some stats (with those for the ‘only matching by stripped form’ approach in brackets for comparison):
01: There are 335195 (347312) matched HT words and 52994 (40877) unmatched HT words, and 335196 (347947) matched OED words and 40591 (27840) unmatched OED words.
02: There are 106845 (110510) matched HT words and 16849 (13184) unmatched HT words, and 106842 (110651) matched OED words and 13035 (9226) unmatched OED words.
03: There are 193187 (201653) matched HT words and 30785 (22319) unmatched HT words, and 193186 (201994) matched OED words and 24585 (15777) unmatched OED words.
I’m guessing that the reason the numbers of HT and OED matches aren’t exactly the same is that there are duplicates with identical dates somewhere. But still, the matches are much more reliable. However, there would still appear to be several issues relating to duplicates. Some OED duplicates are carried over from HT duplicates – e.g. ‘stalagmite’ in HT 3142 ‘stalagmite/stalactite’. These duplicates appear in both HT and OED, and the forms in each set have matching dates, so they are matched up without issue. But sometimes the OED has changed a form, which has resulted in a duplicate being created. For example, for HT 5750 ‘as seat of planet’ there are two OED ‘term’ words; the second one (ID 252, date a1625) should actually match the HT word ‘termin’ (ID 19164, date a1625). In HT 6506 ‘Towards’ the OED has two ‘to the sun-ward’, but the latter (ID 1806, date a1711) seems to have been changed from the HT’s ‘sunward’ (ID 20940, date a1711), which is a bit weird. There are also some cases where the wrong duplicate is being matched, often due to OE dates. For example, in HT category 5810 (Sky, heavens (n)), ‘heaven’ (HT 19331, dates OE-1860) is set to match OED 399 ‘heaven’ (dates OE-). But HT ‘heavens’ (19332, dates OE-) is also set to match OED 399 ‘heaven’, as the stripped form is ‘heaven’ and the start date matches. The OED duplicate ‘heaven’ (ID 433, dates OE-1860) doesn’t get matched because the script finds the 399 ‘heaven’ first and goes no further. Also in this case the OED duplicate ‘heaven’ appears to have been created by the OED removing the final ‘s’ from the second form.
On Friday I met with Marc to discuss all of the above, and we made a plan about what to focus on next. I’ll be continuing with this next week.
Also this week I did some more work for the DSL people. I reviewed some documents Ann had sent me relating to IT infrastructure, spoke to Rhona about some future work I’m going to be doing for the DSL that I can’t really go into any detail about at this stage, created a couple of new pages for the website that will go live next week and updated the way the DSL’s entry page works to allow a dictionary ID (e.g. ‘dost24821’) to be passed to the page in addition to the current method of passing the dictionary (e.g. ‘dost’) and entry href (e.g. ‘milnare’).
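For illustration only, handling a combined ID alongside the existing parameters might look something like this; the parameter names here are invented rather than the DSL site’s actual ones.

```ts
// Purely illustrative: split a combined ID like "dost24821" into the source
// dictionary and the numeric entry ID, falling back to the existing
// dictionary + href style of parameter if no combined ID is supplied.
function parseEntryParams(params: URLSearchParams): { dict: string; entry: string } | null {
  const combined = params.get('id');                 // e.g. "dost24821" (hypothetical parameter name)
  const m = combined?.match(/^([a-z]+)(\d+)$/);
  if (m) return { dict: m[1], entry: m[2] };

  const dict = params.get('dict');                   // e.g. "dost"
  const href = params.get('href');                   // e.g. "milnare"
  return dict && href ? { dict, entry: href } : null;
}
```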
I also gave some advice to the RA of the SCOSYA project who is working on reshaping the Voronoi cells to more closely fit the coastline of Scotland, gave some advice to a member of staff in History who is wanting to rework an existing database, spoke to Gavin Miller about his new Glasgow-wide Medical Humanities project and completed the migration of the RNSN timeline data from Google Docs to locally hosted JSON files.
Week Beginning 4th February 2019
Everyone in the College of Arts had their emails migrated to a new system this week, so I had to spend a little bit of time getting all of my various devices working properly. Rather worryingly, the default Android mail client told me I couldn’t access my emails until I allowed outlook.office365.com to remotely control my device, which included giving permissions to erase all data from my phone, control screen locks and control cameras. It seemed like a lot of control to be giving a third party when this is my own personal device and all I want to do is read and send emails. After some investigation it would appear that the Outlook app for Android doesn’t require permission to erase all data or control the camera, just less horrible permissions involving setting password types and storage encryption. It’s only the default Android mail app that asks for the more horrible permissions. I therefore switched to using the Outlook app. I also realised the default Android calendar app was asking for the same permissions, so I’ve had to switch to using the calendar in the Outlook app as well.
With that issue out of the way, I divided my time this week primarily between three projects, the first being SCOSYA. On Wednesday I met with E and Jennifer to discuss the ‘story atlas’ interface I’d created previously. Jennifer found the Voronoi cells rather hard to read due to the fact that the cells are overlaid on the map, meaning the cell colour obscures features such as place-names and rivers, and the cells extend beyond the edges of the coastline, which makes it hard to see exactly what part of the country each cell corresponds to. Unfortunately the map and all its features (e.g. placenames, rivers) are served up together as tiles. It’s not possible to (for example) have the base map, then our own polygons, then place-names, rivers etc. on top. Coloured polygons are always going to obscure the map underneath as they are always added on top of the base tiles. Voronoi diagrams automatically generate cells based on the proximity of points, and this doesn’t necessarily work so well with a coastline such as Scotland’s, with its countless islands and other such features. Some cells extend across bodies of water and give the impression that features are found in areas where they wouldn’t necessarily be found. For example, North Berwick appears in the cell generated by Anstruther, over the other side of the Firth of Forth. We decided, therefore, to abandon Voronoi diagrams and instead make our own cells that would more accurately reflect our questionnaire locations. This does mean ‘hard coding’ the areas, but we decided this wasn’t too much of a problem as our questionnaire locations are all now in place and are fixed. It will mean that someone will have to manually trace out the coordinates for each cell, following the coastline and islands, which will take some time, but we reckoned the end result will be much easier to understand. I found a very handy online tool that can be used to trace polygons on a map and then download the shapes as GeoJSON files (https://geoman.io/studio), and I also investigated whether it might be possible to export the polygons generated by my existing Voronoi diagram to use as a starting point, rather than having to generate the shapes manually from scratch.
I spent some time trying to extract the shapes, but I was unable to do so using the technologies used to generate the map, as the polygons are not geolocational shapes (i.e. with latitude / longitude pairs) but are instead SVG shapes with coordinates that relate to the screen, which then get recalculated and moved every time the underlying map moves. However, I then investigated alternative libraries and came across one called turf.js (http://turfjs.org/) that can generate Voronoi cells that are actual geolocational shapes. The Voronoi bit of the library can be found here: https://github.com/Turfjs/turf-voronoi and, although it rather worryingly is plastered with messages from 4 years ago saying ‘Under development’, ‘not ready for use!’ and ‘build failing’, I’ve managed to get it to work. By passing it our questionnaire locations as lat/lng coordinates I’ve managed to get it to spit out Voronoi polygons as a series of lat/lng coordinates. These can be uploaded to the mapping tool linked to above and displayed as polygons over the map.
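For reference, here is a minimal sketch of that kind of call using the current @turf/voronoi and @turf/helpers packages rather than the older turf-voronoi build mentioned above; the locations and bounding box are placeholders.

```ts
import { featureCollection, point } from '@turf/helpers';
import voronoi from '@turf/voronoi';

// Placeholder questionnaire locations (lng/lat order, as GeoJSON expects).
const locations = [
  { name: 'Anstruther', lng: -2.70, lat: 56.22 },
  { name: 'North Berwick', lng: -2.72, lat: 56.06 },
];

const points = featureCollection(
  locations.map(loc => point([loc.lng, loc.lat], { name: loc.name }))
);

// A rough bounding box around Scotland: [west, south, east, north].
const cells = voronoi(points, { bbox: [-8.5, 54.5, 0.5, 61.5] });

// The result is a GeoJSON FeatureCollection of polygons with lng/lat
// coordinates, which can be saved and uploaded to a map editor. As noted
// below, the place-name properties are not carried over to the output
// polygons, so they need to be added back in afterwards.
console.log(JSON.stringify(cells));
```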
However, the Voronoi shapes generated by this library are not the same dimensions as those generated by the other library (see an earlier post for an image of this). They are a lot spikier somehow; I guess the turf.js Voronoi algorithm is rather different to the d3.js one. Also, the boundaries between cells consist of separate lines for each polygon, meaning that when dragging a boundary you’ll have to drag two or possibly three or more lines to fully update the positions of the adjoining cells. Finally, despite the names of each location being included in the data passed to the turf.js Voronoi processor, this information is ignored, meaning the polygon shapes have no place-name associated with them. There doesn’t seem to be a way of getting these added back in automatically, so at some point I’m going to have to manually add place-names (and unique IDs) to the data. This is going to be pretty horrible, but I would have had to do that with any manually created shapes too. It’s now over to other members of the team to tweak the polygons to get them to fit the coastline better.
Also for SCOSYA this week, the project’s previous RA, Gary Thoms, got in touch to ask about generating views of the atlas for publication. He was concerned about issues relating to copyright, issues relating to the resolution of the images and also the fact that the publication would prefer images to be in greyscale rather than colour. I investigated each of these issues:
Regarding copyright: The map imagery we use is generated using the MapBox service. According to their terms of service (see the ‘static images for print’ section here: https://docs.mapbox.com/help/how-mapbox-works/static-maps/) we are allowed to use them in academic publications: “You may make static exports and prints for non-commercial purposes such as flyers, posters, or other short publications for academic, non-profit, or personal use.” I’m not sure what their definition of ‘short’ is, though. Attribution needs to be supplied (see https://docs.mapbox.com/help/how-mapbox-works/attribution/). Map data (roads, place-names etc) comes from OpenStreetMap and is released via a Creative Commons license. This should also appear in the attribution.
Regarding resolution: The SCOSYA atlas maps are raster images rather than scalable vector images, so generating images that are higher than screen resolution is going to be tricky. There’s not much we can do about it without generating maps in a desktop GIS package, or some other such software. All online maps packages I’ve used (Google Maps, Leaflet, MapBox) use raster image tiles (e.g. PNG, JPEG) rather than vector images (e.g. SVG). The page linked to above states “With the Mapbox Static API, image exports can be up to 1,280 px x 1,280 px in size. While enabling retina may improve the quality of the image, you cannot export at a higher resolution using the Static API, and we do not support vector image formats.” And later on: “The following formats are not supported as a map export option and are not currently on our road map for integration: SVG, EPS, PDF”. The technologies we’re using were chosen to make an online, interactive atlas and I’m afraid they’re not ideally suited for producing static printed images. However, the ‘print map to A3 Portrait image’ option I added to the CMS version of the atlas several months ago does allow you to grab a map image that is larger than your screen. Positioning the map to get what you want is a bit hit and miss, and it can take a minute or so to process once you press the button, but it does then generate an image that is around 2440×3310 pixels, which might be good enough quality.
Regarding greyscale images: I created an alternative version of the CMS atlas that uses a greyscale basemap and icons. It is somewhat tricky to differentiate the shades of grey in the icons, though, so perhaps we’ll need to use different icon shapes as well. I haven’t heard back from Gary yet, so will just need to see whether this is going to be good enough.
The next project I focussed on this week was the Historical Thesaurus, and the continuing task of linking up the HT and OED categories and lexemes. I updated one of the scripts I wrote last week so that the length of the subcat is compared rather than the actual subcat (so 01 and 02 now match, but 01 and 01.02 don’t). This has increased the matches from 110 to 209. I also needed to rewrite the script that outputted all of the matching lexemes in every matched category in the HT and OED datasets as I’d realised that my previous script had silently failed to finish due to its size – it just cut off somewhere with no error having been given by Firefox. The same thing happened in Firefox again when I tried to generate a new output, and when trying in Chrome it spent about half an hour processing things then crashed. I’m not sure which browser comes out worse in this, but I’d have to say Firefox silently failing is probably worse, which pains me to say as Firefox is my browser of choice.
Anyway, I have since split the output into three separate files – one each for ‘01’, ‘02’ and ‘03’ categories, and thankfully this has worked. There are a total of 223,182 categories in the three files, up from the 222,433 categories in the previous half-finished file. I have also changed the output so that OED lexemes that are marked as ‘revised’ in the database have a yellow [R] after them. This applies to both matched and unmatched lexemes, as I thought it might be useful to see both. I’ve also added a count of the number of revised forms that are matched and unmatched; these appear underneath the tables. It was adding this info underneath the tables that led me to realise the data had failed to fully display, as although Firefox said the page was loaded there was nothing displaying underneath the table. So, for example, in the 114,872 ‘01’ matched categories there are 122,196 words that match and are revised and 15,822 words that don’t match and are revised.
On Friday I met with Marc and Fraser to discuss the next steps for the linking process and I’ll be focussing on this for much of the next few weeks, all being well. Also this week I finally managed to get my travel and accommodation for Bergamo booked.
The third main project I worked on this week was RNSN. For this project I updated our over-arching timeline to incorporate the new timeline I created last week and the major changes to an existing timeline. I also made a number of other edits to existing timelines. One of the project partners had been unable to access the timelines from her work: the timeline page was loading, but the Google Doc containing the data failed to load. It turned out that the person’s work WiFi was blocking access to Google Docs, as when she checked via the mobile network the full timeline loaded without an issue. This got me thinking that hosting data for the timelines via Google Docs is probably a bad idea. The ‘storymap’ data is already held in JSON files hosted on our own servers, but for the timelines I used the Google Docs approach as it was so easy to add and edit data. However, it does mean that we’re relying on a third-party service to publish our timelines (all other code for the timelines is hosted at Glasgow). If providers block access to spreadsheets hosted in Google Docs, or Google decides to remove free access to this data (as it recently did for Google Maps), then all our timelines break. In addition, the data is currently tied to my Google account, meaning no-one else can edit or access it.
After a bit of investigation I discovered that you can just store timeline data in locally hosted JSON files, and read these into the timeline script in a very similar way to a Google Doc. I therefore created a test timeline in the JSON format and everything worked perfectly. I migrated two timelines to this format and will need to migrate the remainder in the coming weeks. It will be slightly time consuming and may introduce errors, but I think it will be worth it.
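Assuming the timelines use Knight Lab’s TimelineJS (the library isn’t named above), swapping a Google Doc for a locally hosted JSON source looks roughly like this; the file contents and names are illustrative.

```ts
// A locally hosted file (e.g. /timelines/song-story-1.json) would contain
// data in the TimelineJS JSON format, something like:
const timelineData = {
  title: { text: { headline: 'Example song story', text: 'Placeholder description' } },
  events: [
    {
      start_date: { year: 1914, month: 8 },
      text: { headline: 'Example event', text: 'Placeholder event text' },
    },
  ],
};

// TimelineJS is loaded globally as TL by its script tag, so declare it for TypeScript.
declare const TL: { Timeline: new (containerId: string, data: unknown) => unknown };

// The JSON object (or a URL to the JSON file) goes where the Google Doc URL used to:
new TL.Timeline('timeline-embed', timelineData);
```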
Also this week I made a couple of small tweaks to the Decadence and Translation transcription pages, including reordering the pages and updating notes and explanatory texts, upgraded WordPress to the latest version for all the sites I manage and fixed the footer for the DSL WordPress site.
Week Beginning 28th January 2019
Last Friday afternoon I met with Charlotte Methuen to discuss a proposal she’s putting together. It’s an AHRC proposal, but not a typical one as it’s in collaboration with a German funding body and it has its own template. I had agreed to write the technical aspects of the proposal, which I had assumed would involve a typical AHRC Data Management Plan, but the template didn’t include such a thing. It did however include other sections where technical matters could be added, so I wrote some material for these sections. As Charlotte wanted to submit the proposal for internal review by the end of the week I needed to focus on my text at the start of the week, and spent most of Monday and Tuesday working on it. I sent my text to Charlotte on Tuesday afternoon, and made a few minor tweaks on Wednesday and everything was finalised soon after that. Now we’ll just need to wait and see whether the project gets funded.
I also continued with the HT / OED linking process this week. Fraser had clarified which manual connections he wanted me to tick off, so I ran these through a little script, which resulted in another 100 or so matched categories. Fraser had also alerted me to an issue with some OED categories. Apparently the OED people had duplicated an entire branch of the thesaurus (03.01.07.06 and 03.01.04.06) but had subsequently made changes to each of these branches independently of the other. This means that for a number of HT categories there are two potential OED category matches, and the words (and information relating to words, such as dates) found in each of these may differ. It’s going to be a messy issue to fix. I spent some time this week writing scripts that will help us to compare the contents of the two branches to work out where the differences lie. First of all I wrote a script that displays the full contents (categories and words) of an OED category in tabular format. For example, passing the category 03.01.07.06 lists the 207 categories found therein, and all of the words contained in these categories. For comparison, 03.01.04.06 contains 299 categories.
I then created another script that compares the contents of any two OED categories. By default, it compares the two categories mentioned above, but any two can be passed, for example to compare things lower down the hierarchy. The script extracts the contents of each chosen category and looks for exact matches between the two sets. The script looks for an exact match of the following in combination (i.e. all must be true):
- length of path (so xx.xx and yy.yy match but xx.xx and yy.yy.yy don’t)
- length of sub (so a sub of xx matches yy but a sub of xx doesn’t match xx.yyy)
- POS
- Stripped heading
In such cases the categories are listed in a table together with their lexemes, and the lexemes are also then compared. If a lexeme from cat1 appears in cat2 (or vice-versa) it is given a green background. If a lexeme from one cat is not present in the other it is given a red background, and all lexemes are listed with their dates. Unmatched categories are listed in their own tables below the main table, with links at the top of the page to each. 03.01.04.06 has 299 categories and 03.01.07.06 has 207 categories. Of these there would appear to be 209 matches, although some of these are evidently duplicates. Some further investigation is required, but it does at least look like the majority of categories in each branch can be matched.
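The matching test itself boils down to a simple predicate, sketched here in TypeScript (field names are illustrative; the real comparison runs over the database rows):

```ts
// Hypothetical category shape for the branch-comparison check described above.
interface BranchCat { path: string; sub: string; pos: string; strippedHeading: string }

// Two categories are treated as a match when all four checks hold.
function categoriesMatch(a: BranchCat, b: BranchCat): boolean {
  return (
    a.path.length === b.path.length &&        // same depth/length of path
    a.sub.length === b.sub.length &&          // same length of subcat
    a.pos === b.pos &&                        // same part of speech
    a.strippedHeading === b.strippedHeading   // same stripped heading
  );
}
```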
I also updated the lists of unmatched categories to incorporate the number of senses for each word. The overview page now gives a list of the number of times words appear in the unmatched category data. Of the 2155 OED words that are currently in unmatched OED categories there are 1763 words with 1 unmatched sense, 232 words with 2 unmatched senses, 75 words with 3 unmatched senses, 36 words with 4 unmatched senses, 15 words with 5 unmatched senses, 18 words with 6 unmatched senses and 16 words with 8 unmatched senses. I also updated the full category lists linked to from this summary information to include the count of unmatched senses for each individual OED word, so for example for ‘extra-terrestrial’ the following information is now displayed: extra-terrestrial (1868-1969 [1963-]) [1 unmatched sense].
Also this week I tweaked some settings relating to Rob Maslen’s ‘Fantasy’ blog, investigated some categories that had been renumbered erroneously in the Thesaurus of Old English and did a bit more investigation into travel and accommodation for the Bergamo conference.
I split the remainder of my time between RNSN and SCOSYA. For RNSN I had been sent a sizable list of updates that needed to be made to the content of a number of song stories, so I made the necessary changes. I had also been sent an entirely new timeline-based song story, and I spent a couple of hours extracting the images, text and audio from the PowerPoint presentation and formatting everything for display in the timeline.
For SCOSYA I spent some time further researching Voronoi diagrams and began trying to update my code to work with the current version of D3.js. It turns out that there have been many changes to the way in which D3 implements Voronoi diagrams since the code I based my visualisations on was released. For one thing, ‘d3-voronoi’ is going to be deprecated and replaced by a new module called d3-delaunay. Information about this can be found here: https://github.com/d3/d3-voronoi/blob/master/README.md. There is also now a specific module for applying Voronoi diagrams to spheres using coordinates, called d3-geo-voronoi (https://github.com/Fil/d3-geo-voronoi). I’m now wondering whether I should start again from scratch with the visualisation. However, I also received an email from Jennifer raising some issues with Voronoi diagrams in general so we might need an entirely different approach anyway. We’re going to meet next week to discuss this.