Week Beginning 18th March 2019

This week I spent a lot of time continuing with the HT/OED linking task, tackling the outstanding items on my ‘to do’ list before I met with Marc and Fraser on Friday.  This included the following:

Re-running category pattern matching scripts on the new OED categories: The bulk of the category matching scripts rely on matching the HT’s oedmaincat field against the OED’s path field (and then doing other things like comparing category contents). However, these scripts aren’t much help with the new OED category table, as the path has changed for a lot of the categories. The script that seemed the most promising was number 17 in our workflow document, which compares the first dates of all lexemes in all unmatched OED and HT categories and doesn’t check anything else. I’ve created an updated version of this that uses the new OED data. The script only brings back unmatched categories that have at least one word with a GHT date, and interestingly the new data has fewer unmatched categories featuring GHT dates than the old data (591 as opposed to 794). I’m not really sure why this is, or what might have happened to the GHT dates. The script brings back five 100% matches (only 3 more than the old data, all but one containing just one word) and 52 matches that don’t meet our criteria (down from 56 with the old data), so it was not massively successful.
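At its core the check just compares the set of first dates in an unmatched OED category against those in a candidate unmatched HT category. Here is a rough sketch of that comparison; the table and column names (and the category IDs) are illustrative stand-ins rather than the real schema:

```php
<?php
// Rough sketch of the 'first dates only' comparison used in script 17.
// Table and column names are illustrative, not the real schema.
$db = new PDO('mysql:host=localhost;dbname=ht', 'user', 'pass');

// Pull the relevant first dates for the words in a single category.
function firstDates(PDO $db, string $sql, int $catId): array {
    $stmt = $db->prepare($sql);
    $stmt->execute([$catId]);
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}

// GHT first dates for an unmatched OED category (hypothetical ID)...
$oedDates = firstDates($db,
    'SELECT ght_firstdate FROM oed_lexemes WHERE catid = ? AND ght_firstdate IS NOT NULL',
    12345);
// ...and first dates for a candidate unmatched HT category (hypothetical ID).
$htDates = firstDates($db,
    'SELECT firstd FROM ht_lexemes WHERE catid = ? AND firstd IS NOT NULL',
    67890);

// Score the pair by how many of the OED first dates also occur in the HT category.
$matched = count(array_intersect($oedDates, $htDates));
$score = $oedDates ? round($matched / count($oedDates) * 100) : 0;
echo "$matched of " . count($oedDates) . " first dates match ($score%)\n";
```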

Ticking off all matching HT/OED lexemes rather than just those within completely matched categories: 627863 lexemes are now matched. There are 731307 non-OE words in the HT, so about 86% of these are ticked off. There are 751156 lexemes in the new OED data, so about 84% of these are ticked off. Whilst doing this task I noticed another unexpected thing about the new OED data: the number of categories in ‘01’ and ‘02’ has decreased while the number in ‘03’ has increased. In the old OED data we have the following number of matched categories:

01: 114968

02: 29077

03: 79282

In the new OED data we have the following number of matched categories:

01: 109956

02: 29069

03: 84260

The totals match up, other than the 42 matched categories that have been deleted in the new data, so (presumably) some categories have changed their top level.  Matching up the HT and OED lexemes has introduced a few additional duplicates, caused when a ‘stripped’ form means multiple words within a category match.  There aren’t too many, but they will need to be fixed manually.

Identifying all words in matched categories that have no GHT dates and seeing which of these can be matched on stripped form alone: I created a script to do this, which lists every unmatched OED word that doesn’t have a GHT date in every matched OED category and then tries to find a matching HT word from the remaining unmatched words within the matched HT category. Perhaps I misunderstood what was being requested, because no matches are returned in any of the top-level categories. But then maybe OED words that don’t have a GHT date are likely to be new words that aren’t in the HT data anyway?

Creating a monosemous script that finds all unmatched HT words that are monosemous and sees whether there are any matching OED words that are also monosemous: Again, I think the script I created will need more work. It is currently set to only look at lexemes within matched categories. It finds all the unmatched HT words that are in matched categories, then checks how many times each word appears amongst the unmatched HT words in matched categories of the same POS. If the word only appears once then the script looks within the matched OED category to find a currently unmatched word that matches. At the moment the script does not check whether the OED word is itself monosemous, as I figured that if the word matches and is in a matched category it’s probably a correct match. Of the 108212 unmatched HT words in matched categories, 70916 are monosemous within their POS and of these 14474 can be matched to an OED lexeme in the corresponding OED category.
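In outline, the logic is something like the following sketch (the table and column names here are illustrative stand-ins rather than the real schema):

```php
<?php
// Sketch of the 'monosemous within POS' matching logic described above.
// Table and column names are illustrative rather than the real schema.
$db = new PDO('mysql:host=localhost;dbname=ht', 'user', 'pass');

// All unmatched HT words in matched categories, with a count of how often each
// stripped form + POS combination occurs amongst those unmatched words.
$sql = "
  SELECT l.id, l.stripped, c.pos, c.oedcatid,
         (SELECT COUNT(*) FROM ht_lexemes l2
            JOIN ht_categories c2 ON l2.catid = c2.id
           WHERE l2.stripped = l.stripped AND c2.pos = c.pos
             AND l2.oed_match IS NULL AND c2.oedcatid IS NOT NULL) AS occurrences
    FROM ht_lexemes l
    JOIN ht_categories c ON l.catid = c.id
   WHERE l.oed_match IS NULL AND c.oedcatid IS NOT NULL";

foreach ($db->query($sql) as $row) {
    if ((int)$row['occurrences'] !== 1) {
        continue; // not monosemous within its POS amongst the unmatched words
    }
    // Look for a currently unmatched OED word with the same stripped form
    // in the OED category that this HT category is matched to.
    $stmt = $db->prepare(
        'SELECT id FROM oed_lexemes
          WHERE catid = ? AND stripped = ? AND ht_match IS NULL LIMIT 1');
    $stmt->execute([$row['oedcatid'], $row['stripped']]);
    if ($oedId = $stmt->fetchColumn()) {
        echo "HT {$row['id']} -> OED $oedId ({$row['stripped']})\n";
    }
}
```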

Deciding which OED dates to use: I created a script that gets all of the matched HT and OED lexemes in one of the top-level categories (e.g. 01) and then for each matched lexeme works out the largest difference between OED sortdate and HT firstd (if sortdate is later then sortdate-firstd, otherwise firstd-sortdate), works out the largest difference between OED enddate and HT lastd in the same way, and adds these two differences together to work out the largest overall difference. It then sorts the data on the largest difference and displays all lexemes in a table ordered by largest difference, with additional fields containing the start difference, end difference and total difference for info. I did, however, encounter a potential issue: not all HT lexemes have a firstd and lastd. For example, words that are ‘OE-‘ have nothing in firstd and lastd but instead have ‘OE’ in the ‘oe’ column and ‘_’ in the ‘current’ column. In such cases the difference between the HT and OED dates is massive, but not accurate. I wonder whether using the HT’s apps and appe columns might work better.
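The calculation itself is simple enough. Here is a rough sketch of it; the field names mirror those mentioned above, the sample data is invented, and the ‘OE-’ issue just described isn’t handled:

```php
<?php
// Sketch of the date-difference calculation for matched HT/OED lexeme pairs.
// Field names mirror those mentioned above; the sample data is invented.
$pairs = [
    ['word' => 'example', 'firstd' => 1500, 'lastd' => 1800, 'sortdate' => 1510, 'enddate' => 1795],
    ['word' => 'another', 'firstd' => 1700, 'lastd' => 0,    'sortdate' => 1400, 'enddate' => 2005],
];

foreach ($pairs as &$p) {
    // Largest difference at each end, regardless of which date is later.
    $p['startDiff'] = abs($p['sortdate'] - $p['firstd']);
    $p['endDiff']   = abs($p['enddate'] - $p['lastd']);
    $p['totalDiff'] = $p['startDiff'] + $p['endDiff'];
}
unset($p);

// Order by the largest overall difference, largest first, for display.
usort($pairs, fn($a, $b) => $b['totalDiff'] <=> $a['totalDiff']);
print_r($pairs);
```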

Looking at lexemes that have an OED citation after 1945, which should be marked as ‘current’:  I created a script that goes through all of the matched lexemes and lists all of the ones that either have an OED sortdate greater than 1945 or an OED enddate greater than 1945 where the matched HT lexeme does not have the ‘current’ flag set to ‘_’.  There are 73919 such lexemes.
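This check is essentially a single query. Here is a sketch of the sort of thing involved; the table and column names are illustrative, and the join is simplified (the real lexeme link uses several fields):

```php
<?php
// Sketch of the post-1945 'current' check. Table and column names are
// illustrative, and the join is simplified (the real link uses several fields).
$db = new PDO('mysql:host=localhost;dbname=ht', 'user', 'pass');
$sql = "
  SELECT h.id, h.word, h.current, o.sortdate, o.enddate
    FROM ht_lexemes h
    JOIN oed_lexemes o ON o.id = h.oed_match
   WHERE (o.sortdate > 1945 OR o.enddate > 1945)
     AND h.current != '_'";
$rows = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);
echo count($rows) . " matched lexemes look like they should be flagged as current\n";
```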

On Friday afternoon I had a meeting with Marc and Fraser where we discussed the above and our next steps.  I now have a further long ‘to do’ list, which I will no doubt give more information about next week.

Other than HT duties I helped out with some research proposals this week.  Jane Stuart-Smith and Eleanor Lawson are currently putting a new proposal together and I helped to write the data management plan for this.  I also met with Ophira Gamliel in Theology to discuss a proposal she’s putting together.  This involved reading through a lot of materials and considering all the various aspects of the project and the data requirements of each, as it is a highly multifaceted project.  I’ll need to spend some further time next week writing a plan for the project.

I also had a chat with Wendy Anderson about updating the Mapping Metaphor database and the possibility of moving the site to a different domain. I also met with Gavin Miller to discuss the new website I’ll be setting up for his new Glasgow-wide Medical Humanities Network, and I ran some queries on the DSL database in order to extract entries that reference the OED for some work Fraser is doing.

Finally, I had to make some changes to the links from the Bilingual Thesaurus to the Middle English Dictionary (MED) website. The site has had a makeover, and is looking great, but unfortunately when they redeveloped the site they didn’t put redirects in place from the old URLs to the new ones. This is pretty bad, as it means anyone who has cited or bookmarked a page, not just BTh, will end up with broken links. I would imagine entries have been cited in countless academic papers and all these citations will now be broken, which is not good. Anyway, I’ve fixed the MED links in BTh now. Unfortunately there are two forms of link in the database, for example: http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466 and http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact. I’m not sure why this is the case and I’ve no idea what the ‘byte’ number refers to in the second link type. The first type includes the entry ID, which is still used in the new MED URLs. This means I can get my script to extract the ID from the URL in the database and replace the rest with the new URL structure, so the above becomes https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466 as the target for our MED button, linking directly through to the relevant entry page on their new site.
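The rewriting for this first type of link just extracts the MED ID and drops it into the new URL pattern. A minimal sketch of the approach (not the actual script) looks like this:

```php
<?php
// Rewrite an old-style MED link (type=id) to the new MED URL structure.
// Returns null for links that don't contain an entry ID (e.g. the 'byte' type).
function rewriteMedLink(string $oldUrl): ?string {
    if (preg_match('/[?&]id=(MED\d+)/', $oldUrl, $m)) {
        return 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/' . $m[1];
    }
    return null;
}

echo rewriteMedLink('http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466');
// https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466
```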

Unfortunately there doesn’t seem to be any way to identify an individual entry page for the second type of link.  This means there is no way to link directly to the relevant entry page.  However, I can link to the search results page by passing the headword, and this works pretty well.  So, for example the three words on this page: https://thesaurus.ac.uk/bth/category/?type=search&hw=2&qsearch=catourer&page=1#id=1393 have the second type of link, but if you press on one of the buttons you’ll find yourself at the search results page for that word on the MED website, e.g. https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer.
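For the ‘byte’ type the fallback is simply to build a search URL from the headword, along these lines (a minimal sketch, with the search_field value taken from the example URL above):

```php
<?php
// Fall back to a headword search on the new MED site for 'byte'-type links.
function medSearchLink(string $headword): string {
    return 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary'
         . '?utf8=%E2%9C%93&search_field=hnf&q=' . urlencode($headword);
}

echo medSearchLink('Catourer');
```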


Week Beginning 11th March 2019

I mainly worked on three projects this week: SCOSYA, the Historical Thesaurus and the DSL. For SCOSYA I continued with the new version of my interactive ‘story map’ using Leaflet’s choropleth example and the geographical areas that had been created by the project’s RAs. Last week I’d managed to get the areas working and colour coded based on the results from the project’s questionnaires. This week I needed to make the interactive aspects, such as being able to navigate through slides, load in new data, click on areas to view details etc. After a couple of days I had managed to get a version of the interface that did everything my earlier, more crudely laid out Voronoi diagram did, but using the much more pleasing (and useful) geographical areas and more up to date underlying technologies. Here’s a screenshot of how things currently look:

If you look at the screenshot from last week’s post you’ll notice that one location (Whithorn) wasn’t getting displayed. This was because the script iterates through the locations with data first, then the other locations, and the code to take the ratings from the locations with data and add them to the map was only triggering when the next location also had data. After figuring this out I fixed it. I also figured out why ‘Airdrie’ had been given a darker colour. This was because of a typo in the data. We had both ‘Airdrie’ and ‘Ardrie’, so two layers were being generated, one on top of the other. I’ve fixed ‘Ardrie’ now. I also updated the styles of the layers to make the colours more transparent and the borders less thick, and added in circles representing the actual questionnaire locations. Areas now get a black border when you move the mouse over them, reverting to the dotted white border on mouse out. When you click on an area (as with Harthill in the above screenshot) it is given a thicker black border and the area becomes more opaque. The name of the location and its rating level appears in the box in the bottom left. Clicking on another area, or clicking a second time on the selected area, deselects the current area. Also, the pan and zoom between slides is now working, and this uses Leaflet’s ‘flyTo’ method, which is a bit smoother than the older method used in the Voronoi storymap. Similarly, switching from one dataset to another is also smoother. Finally, the ‘Full screen’ option in the bottom right of the map works, although I might need to work on the positioning of the ‘slide’ box when in this view. I haven’t implemented the ‘transparency slider’ feature that was present in the Voronoi version, as I’m not sure it’s really necessary any more.

The underlying data is exactly the same as for the Voronoi example, and is contained in a pretty simple JSON file. So long as the project RAs stick to the same format they should be able to make new stories for different features, and I should be able to just plug a new file into the atlas and display it without any further work. I think this new story map interface is working really well now and I’m very glad we took the time to manually plot out the geographical areas.

Also for SCOSYA this week, E contacted me to say that the team hadn’t been keeping a consistent record of all of the submitted questionnaires over the years, and wondered whether I might be able to write an export script that generated questionnaires in the same format as they were initially uploaded.  I spent a few hours creating such a feature, which at the click of a button iterates through the questionnaires in the database, formats all of the data, generates CSV files for each, adds them to a ZIP file and presents this for download.  I also added a facility to download an individual CSV when looking at a questionnaire’s page.
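The export is a straightforward loop that writes one CSV per questionnaire into a ZIP file using PHP’s ZipArchive. Here is a rough sketch of the approach; the questionnaire structure and field names are invented stand-ins rather than the real ones:

```php
<?php
// Sketch of the questionnaire export: one CSV per questionnaire, zipped up.
// $questionnaires would come from the database; the structure here is invented.
$questionnaires = [
    ['id' => 1, 'location' => 'Harthill', 'rows' => [['code' => 'A1', 'rating' => 4]]],
];

$zipPath = tempnam(sys_get_temp_dir(), 'scosya') . '.zip';
$zip = new ZipArchive();
$zip->open($zipPath, ZipArchive::CREATE | ZipArchive::OVERWRITE);

foreach ($questionnaires as $q) {
    // Build the CSV for this questionnaire in memory.
    $fh = fopen('php://temp', 'r+');
    fputcsv($fh, ['code', 'rating']); // header row
    foreach ($q['rows'] as $row) {
        fputcsv($fh, [$row['code'], $row['rating']]);
    }
    rewind($fh);
    $zip->addFromString("questionnaire-{$q['id']}-{$q['location']}.csv", stream_get_contents($fh));
    fclose($fh);
}
$zip->close();

// The ZIP can then be sent to the browser for download.
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="questionnaires.zip"');
readfile($zipPath);
```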

For the HT I continued with the seemingly endless task of matching up the HT and OED data. Last week Fraser had sent me some category matches he’d manually approved that had been outputted by my gap matching script. I ran these through a further script that ticked these matches off. There were 154 matches, bringing the number of unmatched OED categories that have a POS and are not empty down to 995. It feels like something of a milestone to get this figure under a thousand.

Last week we’d realised that using category ID to uniquely identify OED lexemes (as they don’t have a primary key) is not going to work in the long term as during the editing process the OED people can move lexemes between categories.  I’d agreed to write a script that identifies all of the OED lexemes that cannot be uniquely identified when disregarding category ID (i.e. all those OED lexemes that appear in more than one category).  Figuring this out proved to be rather tricky as the script I wrote takes up more memory than the server will allow me to use.  I had to run things on my desktop PC instead, but to do this I needed to export tables from the online database, and these were bigger than the server would let me export too.  So I had to process the XML on my desktop and generate fresh copies of the table that way. Ugh.

Anyway, the script I wrote goes through the new OED lexeme data and counts all the times a specific combination of refid, refentry and lemmaid appears (disregarding the catid). As I expected, the figure is rather large. There are 115,550 times when a combination of refid, refentry and lemmaid appears more than once. Generally the number of times is 2, but looking through the data I’ve seen one combination that appears 7 times. The total number of words with a non-unique combination is 261,028, which is about 35% of the entire dataset. We clearly need some other way of uniquely identifying OED lexemes. Marc suggested last week that we ask the OED to create a legacy ‘catid’ field that is retained in the data as it is now and never updated in future; this would be sufficient to uniquely identify everything in a (hopefully) persistent way. However, we would still need to deal with new lexemes added in future, which might be an issue.
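The counting itself is essentially a GROUP BY / HAVING query over the lexeme table, run locally because of the memory issue mentioned above. A sketch of the sort of thing involved (table and column names are illustrative):

```php
<?php
// Sketch: count combinations of refentry, refid and lemmaid that occur more
// than once across categories. Table and column names are illustrative.
$db = new PDO('mysql:host=localhost;dbname=oed', 'user', 'pass');
$sql = "
  SELECT refentry, refid, lemmaid, COUNT(*) AS occurrences
    FROM oed_lexemes
GROUP BY refentry, refid, lemmaid
  HAVING COUNT(*) > 1
ORDER BY occurrences DESC";

$nonUnique = 0;  // combinations appearing more than once
$words = 0;      // total lexemes covered by those combinations
foreach ($db->query($sql) as $row) {
    $nonUnique++;
    $words += $row['occurrences'];
}
echo "$nonUnique non-unique combinations covering $words lexemes\n";
```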

I then decided to generate a list of all of the OED words where the refentry, refid and lemmaid are the same. Most of the time the word has the same date in each category, but not always. For example, see:

Absenteeism:

654         5154310                0              180932  absenteeism      1957       2005

654         5154310                0              210756  absenteeism      1850       2005

Affectuous:

3366       9275424                0              92850    affectuous          1441       1888

3366       9275424                0              136581  affectuous          1441       1888

3366       9275424                0              136701  affectuous          1566       1888

Aquiline:

10058    40557440             0              25985    aquiline                 1646       1855

10058    40557440             0              39861    aquiline                 1745       1855

10058    40557440             0              65014    aquiline                 1791       1855

I then updated the script to output data only when refentry, refid, lemmaid, lemma, sortdate and enddate are all the same. There are 97927 times when a combination of all of these fields appears more than once, and the total number of words where this happens is 213,692 (about 28% of all of the OED lemmas). Note that the output here will include the first two ‘affectuous’ lines listed above while omitting the third. After that I created a script that brings back all HT lexemes that appear in multiple categories but have the same word form (the ‘word’ column), ‘startd’ and ‘endd’ (non-OE words only). There are 71,934 times when a combination of these fields is not unique, and the total number of words where this happens is 154,335. We have 746,658 non-OE lexemes, so this is about 21% of all the HT’s non-OE words. Again, most of these appear in two categories, but not all of them. See for example:

529752  138922  abridge of/from/in          1303       1839

532961  139700  abridge of/from/in          1303       1839

613480  164006  abridge of/from/in          1303       1839

328700  91512    abridged              1370       0

401949  111637  abridged              1370       0

779122  220350  abridged              1370       0

542289  142249  abridgedly           1801       0

774041  218654  abridgedly           1801       0

779129  220352  abridgedly           1801       0

I also created a script that attempted to identify whether the OED categories that had been deleted in the new version of the data, but which we had connected up to one of the HT’s categories, had possibly been moved elsewhere rather than being deleted outright. There were 42 such categories and I created two checks to try and find whether the categories had just been moved. The first looks for a category in the new data that has the same path, sub and pos, while the second looks for a category with the same heading and pos and the highest number of words (looking at the stripped form) that are identical to the deleted category. Unfortunately neither approach has been very successful. Check number 1 has identified a few categories, but all are clearly wrong. It looks very much like where a category has been deleted, things lower down the hierarchy have been shifted up. Check number 2 has identified two possible matches but nothing more. And unfortunately both of these OED categories are already matched to HT categories and are present in the new OED data too, so perhaps these are simply duplicate categories that have been removed from the new data.
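The two checks boil down to a direct lookup and a word-overlap count. Here is a rough sketch of them; the table and column names are illustrative stand-ins for the real schema:

```php
<?php
// Sketch of the two 'has this deleted category just moved?' checks.
// $deleted is a deleted OED category row; table/column names are illustrative.

function findByPath(PDO $db, array $deleted): ?array {
    // Check 1: a category in the new data with the same path, sub and pos.
    $stmt = $db->prepare(
        'SELECT * FROM oed_categories_new WHERE path = ? AND sub = ? AND pos = ?');
    $stmt->execute([$deleted['path'], $deleted['sub'], $deleted['pos']]);
    return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
}

function findByWordOverlap(PDO $db, array $deleted, array $strippedWords): ?array {
    // Check 2: same heading and pos, ranked by how many stripped word forms
    // it shares with the deleted category.
    if (!$strippedWords) {
        return null;
    }
    $placeholders = implode(',', array_fill(0, count($strippedWords), '?'));
    $stmt = $db->prepare(
        "SELECT c.id, COUNT(*) AS shared
           FROM oed_categories_new c
           JOIN oed_lexemes_new l ON l.catid = c.id
          WHERE c.heading = ? AND c.pos = ? AND l.stripped IN ($placeholders)
       GROUP BY c.id
       ORDER BY shared DESC
          LIMIT 1");
    $stmt->execute(array_merge([$deleted['heading'], $deleted['pos']], $strippedWords));
    return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
}
```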

I then began to use the new OED category table rather than the old one. As expected, when using the new data the number of unmatched, not empty OED categories with a POS has increased, from 995 to 1952. In order to check how the new OED category data compares to the old data I wrote a script that brings back 100 random matched categories and their words for spot checking. This displays the category and word details for the new OED data, the old OED data and the HT data. I’ve looked through a few output screens and haven’t spotted any issues with the matching yet. However, it’s interesting to note how the path field in the new OED data differs from the old, and from the HT. In many cases the new path is completely different to the old one. In the HT data we use the ‘oedmaincat’ field, which (generally) matches the path in the old data. I added in a new field ‘HT current Tnum’ that displays the current HT catnum and sub, just to see if this matches up with the new OED path. It is generally pretty similar but frequently slightly different. Here are some examples:

OED catid 47373 (HT 42865) ‘Turtle-soup’ is ‘01.03.09.01.15.05|03 (n)’ in the old data and in the HT’s ‘oedmaincat’ field.  In the new OED data it’s ‘01.08.01.15.05|03 (n)’ while the HT’s current catnum is ‘01.07.01.15.05|03’.

OED catid 98467 (HT 91922) ‘place off centre’ is ‘01.06.07.03.02|04.01 (vt)’ in the old data and oedmaincat.  In the new OED data it’s ‘01.13.03.02|04.01 (vt)’ and the HT catnum is ‘01.12.03.02|04.01’.

OED catid 202508 (HT 192468) ‘Miniature vehicle for use in experiments’ is ‘03.09.03.02.02.12|13 (n)’ in the old data and oedmaincat.  In the new data it’s ‘03.11.03.02.02.12|13 (n)’ and the HT catnum (as you probably guessed) is ‘03.10.03.02.02.12|13’.

As we’re linking categories on the catid it doesn’t really have any bearing on the matching process, but it’s possibly less than ideal that we have three different hierarchical structures on the go.

For the DSL I spent some time this week analysing the DSL API in order to try and figure out why the XML outputted by the API is different to the XML stored in the underlying database that the API apparently uses. I wasn’t sure whether there was another database on the server that I was unaware of, or whether Peter’s API code was dynamically changing the XML each time it was requested. It turns out it’s the latter. As far as I can tell, every time a request for an entry is sent to the API, it grabs the XML in the database, plus some other information stored in other tables relating to citations and bibliographical entries, and then it dynamically updates sections of the XML (e.g. <cit>), adding in IDs, quotes and other such things. It’s a bit of an odd system, but presumably there was a reason why Peter set it up like this.

Anyway, after figuring out that the API behaves this way I could then work out a method to grab all of the fully formed XML that the API generates. Basically I’ve written a little script that requests the full details for every word in the dictionary and then saves this information in a new version of the database. It took several hours for the script to complete, but it has now done so. I would appear to have the fully formed XML details for 89,574 entries, and with access to this data I should be able to start working on a new version of the API that will hopefully give us something identical in functionality and content to the old one.
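The harvesting script is very simple: loop over every entry, request its fully formed output from the existing API and store it. In outline it looks something like this; the API endpoint, table names and the pause between requests are all invented for the sketch:

```php
<?php
// Sketch of the harvesting approach: request each entry's fully formed XML
// from the existing API and store it locally. URL and tables are invented.
$db = new PDO('mysql:host=localhost;dbname=dsl_new', 'user', 'pass');
$insert = $db->prepare('INSERT INTO entries_fullxml (entry_id, xml) VALUES (?, ?)');

$ids = $db->query('SELECT entry_id FROM entries')->fetchAll(PDO::FETCH_COLUMN);
foreach ($ids as $id) {
    // Ask the old API for the fully formed entry (endpoint name is hypothetical).
    $xml = file_get_contents('http://old-dsl-api.example/entry/' . urlencode($id));
    if ($xml !== false) {
        $insert->execute([$id, $xml]);
    }
    usleep(100000); // be gentle with the old API
}
```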

Also this week I moved offices, which took most of Friday morning to sort out.  I also helped Bryony Randall to get some stats for the New Modernist Editing websites, created a further ‘song story’ for the RNSN project and updated all of the WordPress sites I manage to the latest version of WordPress.

Week Beginning 4th March 2019

I spent about half of this week working on the SCOSYA project.  On Monday I met with Jennifer and E to discuss a new aspect of the project that will be aimed primarily at school children.  I can’t say much about it yet as we’re still just getting some ideas together, but it will allow users to submit their own questionnaire responses and see the results.  I also started working with the location data that the project’s researchers had completed mapping out.  As mentioned in previous posts, I had initially created Voronoi diagrams that extrapolate our point-based questionnaire data to geographic areas.  The problem with this approach was that the areas were generated purely on the basis of the position of the points and did not take into consideration things like the varying coastline of Scotland or the fact that a location on one side of a body of water (e.g. the Firth of Forth) should not really extend into the other side, giving the impression that a feature is exhibited in places where it quite clearly isn’t.  Having the areas extend over water also made it difficult to see the outline of Scotland and to get an impression of which cell corresponded to which area.  So instead of this purely computational approach to generating geographical areas we decided to create them manually, using the Voronoi areas as a starting point but tweaking them to take geographical features into consideration.  I’d generated the Voronoi cells as GeoJSON files and the researchers then used this very useful online tool https://geoman.io/studio to import the shapes and tweak them, saving them in multiple files as their large size caused some issues with browsers.

Upon receiving these files I then had to extract the data for each individual shape and work out which of our questionnaire locations the shape corresponded to, before adding the data to the database.  Although GeoJSON allows you to incorporate any data you like in addition to the latitude / longitude pairings, I was not able to incorporate location names and IDs into the GeoJSON file I generated using the Voronoi library (it just didn’t work – see an earlier post for more information), meaning this ‘which shape corresponds to which location’ process needed to be done manually.  This involved grabbing the data for an individual location from the GeoJSON files, saving this and importing it into the GeoMan website, comparing the shape to my initial Voronoi map to find the questionnaire location contained within the area, adding this information to the GeoJSON and then uploading it to the database.  There were 147 areas to do, and the process took slightly over a day to complete.

With all of the area data associated with questionnaire locations in the database I could then begin to work on an updated ‘storymap’ interface that would use this data.  I’m basing this new interface on Leaflet’s choropleth example: https://leafletjs.com/examples/choropleth/ which is a really nice interface and is very similar to what we require.  My initial task was to try and get the data out of the database and formatted in such a way that it could appear on the map.  This involved updating the SCOSYA API to incorporate the GeoJSON output for each location, which turned out to be slightly tricky, as my API automatically converts the data exported from the database (e.g. arrays and such things) into JSON using PHP’s json_encode function.  However, applying this to data that is already encoded as JSON (i.e. the new GeoJSON data) results in that data being treated as a string rather than as a JSON object, so the output was garbled.  Instead I had to ensure that the json_encode function was applied to every bit of data except the GeoJSON data, and once I’d done this the API outputted the GeoJSON data in such a way as to ensure any JavaScript could work with it.
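The fix described above essentially encodes everything except the GeoJSON and then splices the raw GeoJSON string back into the output. A minimal sketch of that idea (the record structure and values are invented):

```php
<?php
// Sketch of avoiding double-encoding of stored GeoJSON in an API response.
// $row stands in for a location record pulled from the database.
$row = [
    'location' => 'Harthill',
    'rating'   => 4,
    // The GeoJSON is stored in the database as a JSON string already.
    'geojson'  => '{"type":"Polygon","coordinates":[[[-3.84,55.85],[-3.80,55.85],[-3.80,55.88],[-3.84,55.85]]]}',
];

$geojson = $row['geojson'];
unset($row['geojson']);

// Encode the rest of the record, then re-attach the raw GeoJSON so it comes
// out as a JSON object rather than an escaped string.
$output = json_encode($row);
$output = substr($output, 0, -1) . ',"geojson":' . $geojson . '}';

header('Content-Type: application/json');
echo $output;
```

Decoding the stored GeoJSON with json_decode before encoding the whole record would be another way to get the same result.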

I then produced a ‘proof of concept’ that simply grabbed the location data, pulled all the GeoJSON for each location together and processed it via Leaflet to produce area overlays, as you can see in the following screenshot:

With this in place I then began looking at how to incorporate our intended ‘story’ interface with the Choropleth map – namely working with a number of ‘slides’ that a user can navigate between, with a different dataset potentially being loaded and displayed on each slide, and different position and zoom levels being set on each slide.  This is actually proving to be quite a complicated task, as much of the code I’d written for my previous Voronoi version of the storymap was using older, obsolete libraries.  Thankfully with the new approach I’m able to use the latest version of Leaflet, meaning features like the ‘full screen’ option and smoother panning and zooming will work.

By the end of the week I’d managed to get the interface to load in data for each slide and colour code the areas.  I’d also managed to get the slide contents to display – both a ‘big’ version that contains things like video clips and a ‘compact’ version that sits to one side, as you can see in the following screenshot:

There is still a lot to do, though.  One area is missing its data, which I need to fix, and the ‘click on an area’ functionality is not yet working.  Locations as map points still need to be added in, the formatting of the areas still needs some work, and the pan and zoom functionality isn’t there yet either.  However, I hope to get all of this working next week.

Also this week I had a chat with Gavin Miller about the website for his new Medical Humanities project.  We have been granted the top-level ‘.ac.uk’ domain we’d requested so we can now make a start on the website itself.  I also made some further tweaks to the RNSN data, based on feedback.  I also spent about a day this week working on the REELS project, creating a script that would output all of the data in the format that is required for printing.  The tool allows you to select one or more parishes, or to leave the selection blank to export data for all parishes.  It then formats this in the same way as the printed place-name surveys, such as the Place-Names of Fife.  The resulting output can then be pasted into Word and all formatting will be retained, which will allow the team to finalise the material for publication.

I spent the rest of the week working on Historical Thesaurus tasks.  I met with Marc and Fraser on Friday, and ahead of this meeting I spent some time starting to look at matching up lexemes in the HT and OED datasets.  This involved adding seven new fields to the HT’s lexeme database to track the connection (which needs up to four fields) and to note the status of the connection (e.g. whether it was a manual or automatic match, and which particular process was applied).  I then ran a script that matched up all lexemes that are found in matched categories where every HT lexeme matches an OED lexeme (based on the ‘stripped’ word field plus first dates).
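The matching pass works category pair by category pair, and only ticks words off when every HT word in the category finds an OED counterpart. Here is a rough sketch of the idea; the table and column names are illustrative, and the actual update is left as a comment:

```php
<?php
// Sketch of the 'fully matched category' lexeme matching pass.
// Table and column names are illustrative rather than the real schema.
$db = new PDO('mysql:host=localhost;dbname=ht', 'user', 'pass');

$pairs = $db->query(
    'SELECT id AS htcat, oedcatid FROM ht_categories WHERE oedcatid IS NOT NULL');

foreach ($pairs as $pair) {
    // Non-OE HT words and OED words for this matched category pair.
    $ht  = $db->query("SELECT id, stripped, firstd FROM ht_lexemes
                        WHERE catid = {$pair['htcat']} AND oe = ''")->fetchAll(PDO::FETCH_ASSOC);
    $oed = $db->query("SELECT id, stripped, sortdate FROM oed_lexemes
                        WHERE catid = {$pair['oedcatid']}")->fetchAll(PDO::FETCH_ASSOC);

    // Try to find an OED counterpart (stripped form + first date) for every HT word.
    $matches = [];
    foreach ($ht as $h) {
        foreach ($oed as $o) {
            if ($o['stripped'] === $h['stripped'] && (int)$o['sortdate'] === (int)$h['firstd']) {
                $matches[$h['id']] = $o['id'];
                break;
            }
        }
    }

    // Only tick the words off if every HT word in the category found a match.
    if (count($matches) === count($ht)) {
        foreach ($matches as $htId => $oedId) {
            // ...update the new HT lexeme fields with the OED link and check code 1...
        }
    }
}
```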

Whilst doing this I’m afraid I realised I got some stats wrong previously.  The calculation that gave figures of about 89% matched lexemes in matched categories was actually counting matched lexemes across all categories (whether they were fully matched or not).  The number of matched lexemes in fully matched categories is unfortunately a lot lower.  For ‘01’ there are 173,677 matched lexemes, for ‘02’ there are 45,943 matched lexemes and for ‘03’ there are 110,087 matched lexemes.  This gives a total of 329,707 matched lexemes in categories where every HT word matches an OED word (including categories where there are additional OED words) out of 731307 non-OE words in the HT, which is about 45% matched.  I ticked these off in the database with check code 1 but these will need further checking, as there are some duplicate matches (where the HT lexeme has been joined to more than one OED lexeme).  Where this happens the last occurrence currently overwrites any earlier occurrence.  Some duplicates are caused by a word’s resulting ‘stripped’ form being the same – e.g. ‘chine’ and ‘to-chine’.

When we met on Friday we figured out another big long list of updates and new experiments that I will carry out over the next few weeks, but Marc spotted a bit of a flaw in the way we are linking up HT and OED lexemes.  In order to ensure the correct OED lexeme is uniquely identified we rely on the OED’s category ID field.  However, this is likely to be volatile: during future revisions some words will be moved between categories.  Therefore we can’t rely on the category ID field as a means of uniquely identifying an OED lexeme.  This will be a major problem when dealing with future updates from the OED and we will need to try and find a solution – for example updating the OED data structure so that the current category ID is retained in a static field.  This will need further investigation next week.


Week Beginning 25th February 2019

I met with Marc and Fraser on Monday to discuss the current situation with regards to the HT / OED linking task.  As I mentioned last week, we had run into an issue with linking HT and OED lexemes up, as there didn’t appear to be any means of uniquely identifying specific OED lexemes: on investigation the likely candidates (a combination of category ID, refentry and refid) could be applied to multiple lexemes, each with different forms and dates.  James McCracken at the OED had helpfully found a way to include a further ID field (lemmaid) that should have differentiated these duplicates, and for the most part it did, but there were still more than a thousand rows where the combination of the four columns was not unique.

At our meeting we decided that this number of duplicates was pretty small (we are after all dealing with more than 700,000 lexemes) and we’d just continue with our matching processes and ignore these duplicates until they can be sorted.  Unexpectedly, James got back to me soon after the meeting and had managed to fix the issue.  He sent me an updated dataset that after processing resulted in there being only 28 duplicate rows, which is going to be a great help.

As a result of our meeting I made a number of further changes to scripts I’d previously created, including fixing the layout of the gap matching script to make it easier for Fraser to manually check the rows.  I also updated the ‘duplicate lexemes in categories’ script (these are different sorts of duplicates – word forms that appear more than once in a category, but with their own unique identifiers) so that HT words where the ‘wordoed’ field is the same but the ‘word’ field is different are not considered duplicates.  This should filter out words of OE origin that shouldn’t be considered duplicates.  So for example, ‘unsweet’ with ‘unsweet’ and ‘unsweet’ with ‘unsweet < unswete’ no longer appear as duplicates.  This has reduced the number of rows listed from 567 to 456 – not as big a drop as I’d expected, but a reduction nonetheless.

At the meeting I’d also pointed out that the new data from the OED has deleted some categories that were present in the version of the OED data we’d been working with up to this point.  There are 256 OED categories that have been deleted, and these contain 751 words.  I wanted to check what was going on with these categories, so I wrote a little script that lists the deleted categories and their words.   I added a check to see which of these are ‘quarantined’ categories (categories that were duplicated in the existing data and that we had previously marked as ‘quarantined’ to keep them separate from other categories) and I’m very glad to say that 202 such categories have been deleted (out of a total of 207 quarantined categories – we’ll need to see what’s going on with the remainder).  I also added in a check to see whether any of the deleted OED categories are matched up to HT categories.  There are 42 such categories, unfortunately, which appear in red.  We’ll need to decide what to do about these, ideally before I switch to using the new OED data, otherwise we’re left with OED catids in the HT’s category table that point to nothing.

In addition to the HT / OED task, I spent about half the week working on DSL-related issues too, including a trip to the DSL offices in Edinburgh on Wednesday.  The team have been making updates to the data on a locally hosted server for many years now, and none of these updates have yet made their way into the live site.  I’m helping them to figure out how to get the data out of the systems they have been using and into the ‘live’ system.  This is a fairly complicated task as the data is stored in two separate systems, which need to be amalgamated.  Also, the ‘live’ data stored at Glasgow is made available via an API that I didn’t develop, for which there is very little documentation, and which appears to dynamically make changes to the data extracted from the underlying database and refactor it each time a request is made.  As this API uses technologies that Arts IT Support are not especially happy to host on their servers (Django / Python and Solr) I am going to develop a new API using technologies that Arts IT Support are happy to deal with (PHP), eventually replacing the old API, and also replacing the old data with the new, merged data that the DSL people have been working on.  It’s going to be a pretty big task, but it really needs to be tackled.

Last week Ann Ferguson from the DSL had sent me a list of changes she wanted me to make to the ‘Wordpressified’ version of the DSL website.  These ranged from minor tweaks to text, to reworking the footer, to providing additional options for the ‘quick search’ on the homepage to allow a user to select whether their search looks in SND, DOST or both source dictionaries.  It took quite some time to go through this document, and I’ve still not entirely finished everything, but the bulk of it is now addressed.

Also this week I responded to some requests from the SCOSYA team, including making changes to the website theme’s menu structure and investigating the ‘save screenshot’ feature of the atlas.  Unfortunately I wasn’t very successful with either request.  The WordPress theme the website currently uses only supports two levels of menu and a third level had been requested (i.e. a drop-down menu, and then a slide-out menu from the drop-down).  I thought I could possibly update the theme to include this with a few tweaks to the CSS and JavaScript, but after some investigation it looks like it would take a lot of work to implement, and it’s really not worth doing so when plenty of other themes provide this functionality by default.  I had suggested we switch to a different theme, but instead the menu contents are just going to be rearranged.

The request for updating the ‘save screenshot’ feature refers to the option to save an image of the atlas, complete with all icons and the legend, at a resolution that is much greater than the user’s monitor in order to use the image in print publications.  Unfortunately getting the map position correct when using this feature is very difficult – small changes to position can result in massively different images.

I took another look at the screengrab plugin I’m using to see if there’s any way to make it work better.  The plugin is leaflet.easyPrint (https://github.com/rowanwins/leaflet-easyPrint).  I was hoping that perhaps there had been a new version released since I installed it, but unfortunately there hasn’t.  The standard print sizes all seem to work fine (i.e. positioning the resulting image in the right place).  The A3 size is something I added in, following the directions under ‘Custom print sizes’ on the page above.  This is the only documentation there is, and by following it I got the feature working as it currently does.  I’ve tried searching online for issues relating to the custom print size, but I haven’t found anything relating to map position.  I’m afraid I can’t really attempt to update the plugin’s code as I don’t know enough about how it works and the code is pretty incomprehensible (see it here: https://raw.githubusercontent.com/rowanwins/leaflet-easyPrint/gh-pages/dist/bundle.js).

I’d previously tried several other ‘save map as image’ plugins but without success, mainly because they are unable to incorporate HTML map elements (which we use for icons and the legend).  For example, the plugin https://github.com/mapbox/leaflet-image which rather bluntly says “This library does not rasterize HTML because browsers cannot rasterize HTML. Therefore, L.divIcon and other HTML-based features of a map, like zoom controls or legends, are not included in the output, because they are HTML.”

I think that with the custom print size in the plugin we’re using we’re really pushing the boundaries of what it’s possible to do with interactive maps.  They’re not designed to be displayed bigger than a screen and they’re not really supposed to be converted to static images either.  I’m afraid the options available are probably as good as it’s going to get.

Also this week I made some further changes to the RNSN timelines, had a chat with Simon Taylor about exporting the REELS data for print publication, undertook some App store admin duties and had a chat with Helen Kingstone about a research database she’s hoping to put together.