Week Beginning 23rd September 2019

On Monday this week we had another Arts Developer coffee meeting, which as always was a good opportunity to catch up with my fellow developers in the College of Arts and talk about our work.  On Tuesday I attended a team meeting for the SCOSYA project, where we discussed some of the final things that needed to be done before the online resource would be ready for the user testing sessions that will take place in the next few weeks.  I spent quite a bit of time implementing these final tweaks during the week.  This included adding in the full map attribution and copyright information in a pop-up that’s linked to from the bottom of the atlas.  I also added it to the API.  After this I changed a number of colours that were used for markers and menu items on both the public and experts atlases and added in some links to help pages and some actual text to the atlas menus to replace the placeholder text.

I also realised that highlighting wasn’t working on the experts ‘Home’ map, which was probably a bit confusing.  Implementing this turned out to be rather tricky as highlighting depended on grabbing the location name from the pop-up and then comparing this with the location names in a group.  The ‘Home’ map has no pop-ups so highlighting wouldn’t work.  Instead I had to change things so that the location is grabbed from the tooltip text.  Also, the markers on the ‘Home’ map were actually different types of markers (HTML elements styled by CSS as opposed to SVG shapes) so even though they look the same the highlighting code wasn’t working for them.  I’ve now switched them to SVG shapes and highlighting seems to be working.  It’s even possible to create a group on the ‘Home’ page too.

I also added in a new ‘cite’ menu item to the experts atlas, which allows users to grab a link to their specific map view, formatted in a variety of citation styles.  This updates every time the ‘cite’ menu is opened, so if the user has changed the zoom level or map centre the citation link always reflects this.  Finally, I created new versions of the atlases (now called ‘atlas’ and ‘linguists atlas’) that will be used for beta testing.

I also spent some time working for the DSL, fixing the ‘sienna’ test version of the website and changing how the quick search works on both test versions of the website.  If the user selects an item from the autocomplete list, the search now performs an exact search for this word, whereas previously it was just matching the characters anywhere in the headword, which didn’t really make much sense.  I also spent quite a bit of time looking through the old DSL editor server to try and track down some files for Rhona.

Also this week I had a chat with Gavin Miller about publicising his new Glasgow Medical Humanities site, set up a researcher in Psychology with an account to create an iOS app, fixed a couple of broken links on the Seeing Speech website and had a lengthy email chat with Heather Pagan about the Anglo-Norman Dictionary data.  We have now managed to access the server and begin to analyse the contents to try and track down the data, and by the end of the week it looked like we might actually have found the full dataset, which is encouraging.  I finished off the week by creating a final ‘Song Story’ for the RNSN project, which took a few hours to implement but is looking pretty good.

I’m going to be out of the office for the next three weeks on a holiday in Australia so there will be no further updates from me for a while.

Week Beginning 26th August 2019

I focussed on the SCOSYA project for the first few days of this week.  I need to get everything ready to launch by the end of September and there is an awful lot still left to do, so this is really my priority at the moment.  I’d noticed over the weekend that the story pane wasn’t scrolling properly on my iPad when the length of the slide was longer than the height of the atlas.  In such cases the content was just getting cut off and you couldn’t scroll down to view the rest or press the navigation buttons.  This was weird as I thought I’d fixed this issue before.  I spent quite a bit of time on Monday investigating the issue, which has resulted in me having to rewrite a lot of the slide code.  After much investigation I reckoned that this was an intermittent fault caused by the code returning a negative value for the height of the story pane instead of its real height.  When the user presses the button to load a new slide the code pulls the HTML content of the slide in and immediately displays it.  After that another part of the code then checks the height of the slide to see if the new contents make the area taller than the atlas, and if so the story area is then resized.  The loading of the HTML using jQuery’s html() function should be ‘synchronous’ – i.e. the following parts of code should not execute before the loading of the HTML is completed.  But sometimes this wasn’t the case – the new slide contents weren’t being displayed before the check for the new slide height was being run, meaning the slide height check was giving a negative value (no contents minus the padding round the slide).  The slide contents then displayed but as the code thought the slide height was less than the atlas it was not resizing the slide, even when it needed to.  It is a bit of a weird situation as according to the documentation it shouldn’t ever happen.  
I’ve had to put a short ‘timeout’ into the script as a work-around – after the slide loads the code waits for half a second before checking for the slide height and resizing, if necessary.  This seems to be working but it’s still annoying to have to do this.  I tested this out on my Android phone and on my desktop Windows PC with the browser set to a narrow height and all seemed to be working.  However, when I got home I tested the updated site out on my iPad and it still wasn’t working, which was infuriating as it was working perfectly on other touchscreens.

In order to fix the issue I needed to entirely change how the story pane works.  Previously the story pane was just an HTML area that I’d added to the page and then styled to position within the map, but there were clearly some conflicts with the mapping library Leaflet when using this approach.  The story pane was positioned within the map area and mouse actions that Leaflet picks up (scrolling and clicking for zoom and pan) were interfering with regular mouse actions in the HTML story area (clicking on links, scrolling HTML areas).  I realised that scrolling within the menu on the left of the map was working fine on the iPad so I investigated how this differed from the story pane on the right.  It turned out that the menu wasn’t just a plain HTML area but was instead created by a plugin for Leaflet that extends Leaflet’s ‘Control’ options (used for buttons like ‘+/-’ and the legend).  Leaflet automatically prevents the map’s mouse actions from working within its control areas, which is why scrolling in the left-hand menu worked.  I therefore created my own Leaflet plugin for the story pane, based on the menu plugin.  Using this method to create the story area thankfully worked on my iPad, but it did unfortunately take several hours to get things working, which was time I should ideally have been spending on the Experts interface.  It needed to be done, though, as we could hardly launch an interface that didn’t work on iPads.

I also had to spend some further time this week making some more tweaks to the story interface that the team had suggested, such as changing the marker colour for the ‘Home’ maps, updating some of the explanatory text and changing the pop-up text on the ‘Home’ map to add in buttons linking through to the stories.  The team also wanted to be able to have blank maps in the stories, to make users focus on the text in the story pane rather than getting confused by all of the markers.  Having blank maps for a story slide wasn’t something the script was set up to expect, and although it was sort of working, if you navigated from a map with markers to a blank map and then back again the script would break, so I spent some time fixing this.  I also managed to find a bit of time to start on the experts interface, although less time than I had hoped.  For this I’ve needed to take elements from the atlas I’d created for staff use, but adapt them to incorporate changes that I’d introduced for the public atlas.  This has basically meant starting from scratch and introducing new features one by one.  So far I have the basic ‘Home’ map showing locations and the menu working.  There is still a lot left to do.

I spent the best part of two days this week working on the front-end for the 18th Century Borrowing pilot project for Matthew Sangster.  I wrote a little document that detailed all of the features I was intending to develop and sent this to Matt so he could check to see if what I’m doing met his expectations.  I spent the rest of the time working on the interface, and made some pretty good progress.  So far I’ve made an initial interface for the website (which is just temporary and any aspect of which can be changed as required), I’ve written scripts to generate the student forename / surname and professor title / surname columns to enable searching by surname, and I’ve created thumbnails of the images.  The latter was a bit of a nightmare as previously I’d batch rotated the images 90 degrees clockwise as the manuscripts (as far as I could tell) were written in landscape format but the digitised images were portrait, meaning everything was on its side.

However, I did this using the Windows image viewer, which gives the option of applying the rotation to all images in a folder.  What I didn’t realise is that the image viewer doesn’t update the metadata embedded in the images, and this information is used by browsers to decide which way round to display the images.  I ended up in a rather strange situation where the images looked perfect on my Windows PC, and also when opened directly within the browser, but when embedded in an HTML page they appeared on their side.  It took a while to figure out why this was happening, but once I did I regenerated the thumbnails using the command-line ImageMagick tool instead, which I set to wipe the image metadata as well as rotating the images, which seemed to work.  That is until I realised that Manuscript 6 was written in portrait not landscape so I had to repeat the process again but miss out Manuscript 6.  I have since realised that all the batch processing of images I did to generate tiles for the zooming and panning interface is also now going to be wrong for all landscape images and I’m going to have to redo all of this too.

Anyway, I also made the facility where a user can browse the pages of the manuscripts, enabling them to select a register, view the thumbnails of each page contained therein and then click through to view all of the records on the page.  This ‘view records’ page has both a text and image view.  The former displays all of the information about each record on the page in a tabular manner, including links through to the GUL catalogue and the ESTC.  The latter presents the image in a zoomable / pannable manner, but as mentioned earlier, the bloody image is on its side for any manuscript written in a landscape way and I still need to fix this, as the following screenshot demonstrates:

Also this week I spent a further bit of time preparing for my PDR session that I will be having next week, spoke to Wendy Anderson about updates to the SCOTS Corpus advanced search map that I need to fix, fixed an issue with the Medical Humanities Network website, made some further tweaks to the RNSN song stories and spoke to Ann Ferguson at the DSL about the bibliographical data that needs to be incorporated into the new APIs.  Another pretty busy week, all things considered.


Week Beginning 19th August 2019

After meeting with Fraser to discuss his Scots Thesaurus project last Friday I spent some time on Monday this week writing a script that returns some random SND or DOST entries that met certain criteria, so as to allow him to figure out how these might be placed into HT categories.  The script brings back main entries (as opposed to supplements) that are nouns, are monosemous (i.e. no other noun entries with the same headword), have only one sense (i.e. not multiple meanings within the entry), have fewer than 5 variant spellings, have single word headwords and have definitions that are relatively short (100 characters or less).  Whilst writing the script I realised that database queries are somewhat limited on the server and if I try to extract the full SND or DOST dataset to then select rows that meet the criteria in my script these limits are reached and the script just displays a blank page.  So what I had to do is to set the script up to bring back a random sample of 5000 main entry nouns that don’t have multiple words in their headword in the selected dictionary.  I then have to apply the other checks on this set of 5000 random entries.  This can mean that the number of outputted entries ends up being less than the 200 that Fraser was hoping for, but still provides a good selection of data.  The output is currently an HTML table, with IDs linking through to the DSL website and I’ve given the option of setting the desired number of returned rows (up to 1000) and the number of characters that should be considered a ‘short’ definition (up to 5000).  Fraser seemed pretty happy with how the script is working.
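The two-stage approach described above (a cheap filter and random sample first, then the remaining checks on that sample) could be sketched roughly as follows.  This is only an illustration in Python rather than the script itself: the Entry fields, the pick_candidates name and the in-memory list standing in for the database are all assumptions, and the monosemy check here simply counts noun headwords within the filtered pool.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical fields; the real SND / DOST tables are not shown in the post.
    entry_id: str
    headword: str
    pos: str
    is_supplement: bool
    sense_count: int
    variant_spellings: int
    definition: str


def pick_candidates(entries, sample_size=5000, max_rows=200, short_def=100, seed=None):
    rng = random.Random(seed)
    # Stage 1: main-entry nouns with single-word headwords, then a random sample.
    pool = [e for e in entries
            if not e.is_supplement and e.pos == "n" and " " not in e.headword]
    sample = rng.sample(pool, min(sample_size, len(pool)))
    # Stage 2: apply the remaining checks to the sample only.
    noun_counts = Counter(e.headword for e in pool)
    kept = [e for e in sample
            if noun_counts[e.headword] == 1        # monosemous headword
            and e.sense_count == 1                 # a single sense
            and e.variant_spellings < 5            # fewer than 5 variants
            and len(e.definition) <= short_def]    # 'short' definition
    # May return fewer than max_rows if the sample does not contain enough matches.
    return kept[:max_rows]
```

Because the second stage only ever sees the sample, the number of rows returned depends on how many of the sampled entries happen to pass, which is why the output can fall short of the requested number.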

Also this week I made some further updates to the new song story for RNSN and I spent a large amount of time on Friday preparing for my upcoming PDR session.  On Tuesday I met with Luca to have a bit of a catch-up, which was great.  I also fixed a few issues with the Thesaurus of Old English data for Jane Roberts and responded to a request for developer effort from a member of staff who is not in the College of Arts.  I also returned to working on the Books and Borrowing pilot system for Matthew Sangster, going through the data I’d uploaded in June, exporting rows with errors and sending these to Matthew for further checking.  Although there are still quite a lot of issues with the data, in terms of its structure things are pretty fixed, so I’m going to begin work on the front-end for the data next week, the plan being that I will work with the sample data as it currently stands and then replace it with a cleaner version once Matthew has finished working with it.

I divided the rest of my time this week between DSL and SCOSYA.  For the DSL I integrated the new APIs that I was working on last week with the ‘advanced search’ facilities on both the ‘new’ (v2 data) and ‘sienna’ (v3 data) test sites.  As previously discussed, the ‘headword match type’ from the live site has been removed in favour of just using wildcard characters (*?”).  Full-text searches, quotation searches and snippets should all be working, in addition to headword searches.  I’ve increased the maximum number of full-text / quotation results from 400 to 500 and I’ve updated the warning messages so they tell you how many results your query would have returned if the total number is greater than this.  I’ve tested both new versions out quite a bit and things are looking good to me, and I’ve contacted Ann and Rhona to let them know about my progress.  I think that’s all the DSL work I can do for now, until the bibliography data is made available.

For SCOSYA I engaged in an email conversation with Jennifer and others about how to cover the costs of MapBox in the event of users getting through the free provision of 200,000 map loads a month after the site launches next month.  I also continued to work on the public atlas interface based on discussions we had at a team meeting last Wednesday.  The main thing was replacing the ‘Home’ map, which previously just displayed the questionnaire locations, with a new map that highlights certain locations that have sound clips that demonstrate an interesting feature.  The plan is that this will then lead users on to finding out more about these features in the stories, whilst also showing people where some of the locations the project visited are.  This meant creating facilities in the CMS to manage this data, updating the database, updating the API and updating the front-end, so a fairly major thing.

I updated the CMS to include a page to manage the markers that appear on the new ‘Home’ map.  Once logged into the CMS click on the ‘Browse Home Map Clips’ menu item to load the page.  From here staff can see all of the locations and add / edit the information for a location (adding an MP3 file and the text for the popup).  I added the data for a couple of sample locations that E had sent me.  I then added a new endpoint to the API that brings back the information about the Home clips and updated the public atlas to replace the old ‘Home’ map with the new one.  Markers are still the bright blue colour and drop into the map.  I haven’t included the markers for locations that don’t have clips.  We did talk at the meeting about including these, but I think they might just clutter the map up and confuse people.

Getting the links to the stories to work turned out to be unexpectedly tricky, as the only information that changes in the link is after the hash sign, and browsers treat such links as referring to a different point on the same page rather than doing a full page reload.  So I’ve had to handle all of the loading of the story and updating the menu in JavaScript rather than it all reloading and just working, as would happen on a full page reload.  Below is a screenshot of how the ‘Home’ map currently looks, with a pop-up open:

I also reordered and relabelled the menu, and have changed things so that you can now click on an open section to close it.  Currently doing so still triggers the map reload for certain menu items (e.g. Home).  I’ll try to stop it doing so, but I haven’t managed to yet.

I also implemented the ‘Full screen’ slide type, although I think we might need to change the style of this.  Currently it takes up about 80% of the map width, pinned to the right hand edge (which it needs to be for the animated transitions between slides to work).  It’s only as tall as the content of the slide needs it to be, though, so the map is not really being obscured, which is what Jennifer was wanting.  Although I could set it so that the slide is taller, this would then shift the navigation buttons down to the bottom of the map and if people haven’t scrolled the map fully into view they might not notice the buttons.  I’m not sure what the best approach here might be, and this needs further discussion.

I also changed the way location data is returned from the API this week, to ensure that the GeoJSON area data is only returned from the API when it is specifically asked for, rather than by default.  This means such data is only requested and used in the front-end when a user selects the ‘area’ map in the ‘Explore’ menu.  The reason for doing this is to make things load quicker and to reduce the amount of data that was being downloaded unnecessarily.  The GeoJSON data was rather large (several megabytes) and requesting this each time a map loaded meant the maps took some time to load on slower connections.  With the areas removed the stories and ‘explore’ maps that are point based are much quicker to load.  I did have to update a lot of code so that things still work without the area data being present, and I also needed to update all API URLs contained in the stories to specifically exclude GeoJSON data, but I think it’s been worth spending the time doing this.
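The idea of only returning the heavy area data when it is asked for can be shown with a small sketch.  This is Python rather than the actual API code, and the function name, the include_areas flag and the field names are all made up for illustration.

```python
import json


def serialise_locations(locations, include_areas=False):
    """Build the JSON for a list of locations (hypothetical field names).

    The GeoJSON 'area' value is only added when the caller explicitly asks
    for it, so point-based maps never download it.
    """
    rows = []
    for loc in locations:
        row = {"id": loc["id"], "name": loc["name"],
               "lat": loc["lat"], "lng": loc["lng"]}
        if include_areas:
            row["area"] = loc["area"]
        rows.append(row)
    return json.dumps(rows)
```

With the flag off by default, any existing caller that doesn’t need areas gets the smaller response without having to change anything.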

Week Beginning 12th August 2019

I’d taken Tuesday off this week to cover the last day of the school holidays so it was a four-day week for me.  It was a pretty busy four days, though, involving many projects.  I had some app related duties to attend to, including setting up a Google Play developer account for people in Sports and Recreation and meeting with Adam Majumdar from Research and Innovation about plans for commercialising apps in future.  I also did some further investigation into locating the Anglo-Norman Dictionary data, created a new song story for RNSN and read over Thomas Clancy’s Iona proposal materials one last time before the documents are submitted.  I also met with Fraser Dallachy to discuss his Scots Thesaurus plans and will spend a bit of time next week preparing some data for him.

Other than these tasks I split my remaining time between SCOSYA and DSL.  For SCOSYA we had a team meeting on Wednesday to discuss the public atlas.  There is only about a month left to complete all development work on the project and I was hoping that the public atlas that I’d been working on recently was more or less complete, which would then enable me to move on to the other tasks that still need to be completed, such as the experts interface and the facilities to manage access to the full dataset.  However, the team have once again changed their minds about how they want the public atlas to function and I’m therefore going to have to devote more time to this task than I had anticipated, which is rather frustrating at this late stage.  I made a start on some of the updates towards the end of the week, but there is still a lot to be done.

For DSL we finally managed to sort out the @dsl.ac.uk email addresses, meaning the DSL people can now use their email accounts again.  I also investigated and fixed an issue with the ‘v3’ version of the API which Ann Ferguson had spotted.  This version was not working with exact searches, which use speech marks.  After some investigation I discovered that the problem was being caused by the ‘v3’ API code missing a line that was present in the ‘v2’ API code.  The server automatically escapes quotes in URLs by adding a preceding backslash (\).  The ‘v2’ code was stripping this backslash before processing the query, meaning it correctly identified exact searches.  As the ‘v3’ code didn’t get rid of the backslashes it wasn’t finding the quotation mark at the start of the query and was not treating it as an exact search.
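To make the problem concrete, here is a minimal sketch of the check in Python (the real API is not written in Python, and the function name is invented).  It assumes an exact search is signalled by the whole query being wrapped in double quotes.

```python
def parse_headword_query(raw, strip_escapes=True):
    """Return (term, is_exact) for a headword query string.

    strip_escapes=False mimics the 'v3' behaviour before the fix, where the
    escaping backslashes were left in place.
    """
    query = raw.replace('\\"', '"') if strip_escapes else raw
    is_exact = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    term = query[1:-1] if is_exact else query
    return term, is_exact
```

With the escapes left in, the query starts with a backslash rather than a quotation mark, so the exact-search test fails and the whole string is treated as an ordinary search term.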

I also investigated why some DSL entries were missing from the output of my script that prepared data for Solr.  I’d previously run the script on my laptop, but running it on my desktop instead seemed to output the full dataset including the rows I’d identified as being missing from the previous execution of the script.  Once I’d outputted the new dataset I sent it on to Raymond for import into Solr and then I set about integrating full-text searching into both ‘v2’ and ‘v3’ versions of the API.  This involved learning how Solr uses wildcard characters and Boolean searches, running some sample queries via the Solr interface and then updating my API scripts to connect to the Solr interface, format queries in a way that Solr could work with, submit the query and then deal with the results that Solr outputs, integrating these with fields taken from the database as required.

Other than the bibliography side of things I think that’s the work on the API more or less complete now (I still need to reorder the ‘browse’ output).  What I haven’t done yet is to work on the advanced search pages of the ‘new’ and ‘sienna’ versions of the website to actually work with the new APIs, so as of yet you can’t perform any free-text searches through these interfaces but only directly through the APIs.  Working to connect the front-ends fully to the APIs is my next task, which I will try to start on next week.

Week Beginning 10th June 2019

On Monday this week I attended the ‘Data Hack’ event organised by the SCOSYA project.  This was a two-day event, with day one being primarily lectures while on the second day the participants could get their hands on some data and create things themselves.  I only attended the first day and enjoyed hearing the speakers talk.  It was especially useful to hear the geospatial visualisation speaker, and also to get a little bit of hands-on experience with R.  Unfortunately during a brief and unscheduled demonstration of the SCOSYA ‘expert atlas’ interface the search for multiple attributes failed to work.  I spent some time frantically trying to figure out why, as I hadn’t changed any of the code.  It turned out that (unbeknownst to me) the version of PHP on the server had recently been updated and one tiny and seemingly insignificant bit of code was no longer supported in the newer version and instead caused a fatal error.  The issue is that it’s no longer possible to set a variable as an empty string and then to use it as an array later on.  For example, $varname = ""; and then later on $varname[] = "value";.  Doing this in the newer version causes the script to stop running.  Once I figured this out it was very easy to fix, but going through the code to identify what was causing the problem took quite a while.

Once I’d discovered the issue I checked with Arts IT support and they confirmed that they had upgraded the server.  It would have been great if they’d let me know.  I then had to go through all of the other sites that are hosted on the server to check if the error appeared anywhere else, which unfortunately it did.  I think I’d managed to fix everything by the end of the week, though.

Also for SCOSYA I continued to work on the public atlas interface, this time focussing on the ‘stories’ (now called ‘Learn more’).  Previously these appeared in a select box, and once you selected a story from the drop-down list and pressed a ‘show’ button the story would load.  This was all a bit clunky, so I’ve now replaced it with a more nicely formatted list of stories, as with the ‘examples’.  Clicking on one of these automatically loads the relevant story, as the screenshot below demonstrates:

I had also been alerted to an issue with the stories, whereby moving from one story to another and then navigating through the pages was resulting in the page breaking.  The cause of this proved to be rather tricky to investigate, but I eventually tracked the issue down.  The click event function for ‘next slide’ is created when the JSON data is loaded from the server, but the click event remains when new data is loaded, which results in a subsequent click event being created.  So when the ‘next slide’ button is pressed both these events fire.  I managed to stop an error from occurring and stopping the JavaScript proceeding, but doing so resulted in multiple sets of map data being loaded in on top of one another each time the ‘next slide’ button is pressed, as multiple versions of the click event fire even though the button is only pressed once.

Due to the asynchronous nature of AJAX calls, it’s not possible to just set up the click event once as it needs to be set up as a result of the data finishing loading.  If it was set up independently the next slide would load before data was pulled in from the server and would therefore display nothing.  However, after further thought I realised that the issue wasn’t occurring when the initial slide was loaded, only when the user presses the buttons.  As this will only ever happen once the AJAX data has loaded (because otherwise the button isn’t there for the user to press) it should be ok to have the click event initiated outside of the ‘load data’ function.  Thankfully I managed to get this issue sorted by Friday, when the project team was demonstrating the feature at another event.  I also managed to sort the issue with the side panel not scrolling on mobile screens, which was being caused by ‘pointer events’ being set to none on the element that was to be scrolled.  On regular screens this worked fine, as the scrollbar gets added in, but on touchscreens this caused issues.
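The duplicate click event problem isn’t specific to JavaScript, so here is a small language-neutral simulation of it in Python.  The Button class and both story classes are invented purely for illustration; the first registers the ‘next slide’ handler every time data is loaded, the second registers it once, outside of the ‘load data’ step.

```python
class Button:
    """Stand-in for a DOM button that can have several click handlers."""

    def __init__(self):
        self._handlers = []

    def on_click(self, handler):
        self._handlers.append(handler)

    def click(self):
        for handler in list(self._handlers):
            handler()


class BuggyStory:
    def __init__(self, button):
        self.button = button
        self.slide = 0

    def load_story(self):
        self.slide = 0
        # Bound again on every load, so old handlers pile up.
        self.button.on_click(self.next_slide)

    def next_slide(self):
        self.slide += 1


class FixedStory:
    def __init__(self, button):
        self.slide = 0
        # Bound once, outside the 'load data' step.
        button.on_click(self.next_slide)

    def load_story(self):
        self.slide = 0

    def next_slide(self):
        self.slide += 1
```

Loading a second story in the buggy version and then pressing the button once moves on two slides instead of one, which matches the behaviour described above.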

For the rest of the week I worked on several different projects.  I continued with the endless task of linking up the HT and OED datasets.  This involved ticking off lexemes in matched categories based on comparing the stripped forms of the lexeme on their own.  This resulted in around 32,000 lexeme matches being ticked off.  It also uncovered an instance where an OED category has been linked to two different HT categories, which is clearly an error.  I wrote a script to look into the issue of duplicate matches for both categories and lexemes, which shows 28 (so 14 incorrect) category matches and 136 (so 68 incorrect) lexeme matches.  I also created a new stats page that displays statistics about both matched and unmatched categories and lexemes.  I also ‘deticked’ a few category and lexeme matches that Fraser had sent to me in spreadsheets.

I continued to work with the new DSL data this week too.  This included checking through some of the supplemental entry data from the server that didn’t seem to do exactly what the DSL people were expecting. I also set up a new subdomain where I’m replicating the functionality of the main DSL website, but using the new data exported from the server.  This means it is now possible to compare the live data (using Peter’s original API) with the V2 data (extracted and saved fully assembled from Peter’s API, rather than having bits injected into it every time an entry is requested) and the V3 data (from the DSL people’s editor server), which should hopefully be helpful in checking the new data.  I also continued to work on the new API, for both V2 and V3 versions of the data, getting the search results working with the new API.  Next for me to do is add Boolean searching to the headword search, remove headword match type as discussed and then develop the full text searches (full / without quotes / quotes only).  After that comes the bibliography entries.

Also this week I made another few tweaks to the RNSN song stories, gave access to the web stats for Seeing Speech and Dynamic Dialects to Fraser Rowan, who is going to do some work on them and met with Matthew Sangster to discuss a pilot website I’m going to put together for him about books and borrowing records in the 18th century at Glasgow.  I also attended a meeting on Friday afternoon with the Anglo-Norman dictionary people, who were speaking to various people including Marc and Fraser, about redeveloping their online resource.


Week Beginning 29th April 2019

I worked on several different projects this week.  First of all I completed work on the new Medical Humanities Network website for Gavin Miller.  I spent most of last week working on this but didn’t quite manage to get everything finished off, but I did this week.  This involved completing the front-end pages for browsing through the teaching materials, collections and keywords.  I still need to add in a carousel showing images for the project, and a ‘spotlight on…’ feature, as are found on the homepage of the UoG Medical Humanities site, but I’ll do this later once we are getting ready to actually launch the site.  Gavin was hoping that the project administrator would be able to start work on the content of the website over the summer, so everything is in place and ready for them when they start.

With that out of the way I decided to return to some of the remaining tasks in the Historical Thesaurus / OED data linking.  It had been a while since I last worked on this, but thankfully the list of things to do I’d previously created was easy to follow and I could get back into the work, which is all about comparing dates for lexemes between the two datasets.  We really need to get further information from the OED before we can properly update the dates, but for now I can at least display some rows where the dates should be updated, based on the criteria we agreed on at our last HT meeting.

To begin with I completed a ‘post dating’ script.  This goes through each matched lexeme (split into different outputs for ‘01’, ‘02’ and ‘03’ due to the size of the output) and for each it firstly changes (temporarily) any OED dates that are less than 1100 to 1100 and any OED dates that are greater than 1999 to 2100.  This is so as to match things up with the HT’s newly updated Apps and Appe fields.  The script then compares the HT Appe and OED Enddate fields (the ‘Post’ dates).  It ignores any lexemes where these are the same.  If they’re not the same the script outputs data in colour-coded tables.

The green table contains lexemes where either Appe is greater than or equal to 1150, Appe is less than or equal to 1850, Enddate is greater than Appe and the difference between Appe and Enddate is no more than 100 years, OR Appe is greater than 1850 and Enddate is greater than Appe.  The yellow table contains lexemes (other than the above) where Enddate is greater than Appe and the difference between Appe and Enddate is between 101 and 200.  The orange table contains lexemes where Enddate is greater than Appe and the difference between Appe and Enddate is between 201 and 250, while the red table contains lexemes where Enddate is greater than Appe and the difference between Appe and Enddate is more than 250.  It’s a lot of data, and fairly evenly spread between tables, but hopefully it will help us to ‘tick off’ dates that should be updated with figures from the OED data.
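The colour-coding rules above can be sketched roughly as follows.  This is only an illustration in Python (the function and variable names are made up for the example, and it is not the script described in the post).  It assumes the red band starts above 250 years so that it does not overlap the orange band, and it returns nothing for rows that fall outside all four tables.

```python
from typing import Optional


def clamp_oed_date(year: int) -> int:
    """Temporarily adjust an OED date: below 1100 becomes 1100, above 1999 becomes 2100."""
    if year < 1100:
        return 1100
    if year > 1999:
        return 2100
    return year


def classify_post_date(appe: int, enddate: int) -> Optional[str]:
    """Return the table colour for an HT Appe / OED Enddate pair, or None."""
    enddate = clamp_oed_date(enddate)
    if enddate <= appe:
        # Identical dates are ignored; earlier OED end dates are not covered by these tables.
        return None
    diff = enddate - appe
    if appe > 1850:
        return "green"
    if 1150 <= appe <= 1850 and diff <= 100:
        return "green"
    if 101 <= diff <= 200:
        return "yellow"
    if 201 <= diff <= 250:
        return "orange"
    if diff > 250:
        # Assumption: red begins above 250 so the bands do not overlap.
        return "red"
    return None
```

For example, an Appe of 1500 with an Enddate of 1650 (a gap of 150 years) would land in the yellow table under these rules.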

I then created an ‘ante dating’ script that looks at the ‘before’ dates (based on OED Firstdate (or ‘Sortdate’ as they call it) and HT Apps).  This looks at rows where Firstdate is earlier than Apps and splits the data up into colour coded chunks in a similar manner to the above script.  I then created a further script that identifies lexemes where there is a later first date or an earlier end date in the OED data for manual checking, as such dates are likely to need investigation.

Finally, I created a script that brings back a list of all of the unique date forms in the HT.  This goes through each lexeme and replaces individual dates with ‘nnnn’, then strings all of the various (and there are a lot) date fields together to create a date ‘fingerprint’.  Individual date fields are separated with a bar (|) so it’s possible to extract specific parts.  The script also makes a count of the number of times each pattern is applied to a lexeme.  So we have things like ‘|||nnnn||||||||||||||_’ which is applied to 341,308 lexemes (this is a first date and still in current use) and ‘|||nnnn|nnnn||-|||nnnn|nnnn||+||nnnn|nnnn||’ which is only used for a single lexeme.  I’m not sure exactly what we’re going to use this information for, but it’s interesting to see the frequency of the patterns.
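The fingerprinting idea can be illustrated with a short Python sketch.  The field layout here is hypothetical (the real HT date fields are not listed in the post), and it assumes that any run of digits in a field counts as a date.

```python
import re
from collections import Counter


def date_fingerprint(fields):
    """Join a lexeme's date fields with bars, replacing each run of digits with 'nnnn'."""
    return "|".join(re.sub(r"\d+", "nnnn", field) for field in fields)


def count_fingerprints(lexemes):
    """Count how many lexemes share each fingerprint.

    `lexemes` is an iterable of lists of date-field strings (a made-up layout).
    """
    return Counter(date_fingerprint(fields) for fields in lexemes)


# Made-up rows with four date fields each, purely for illustration.
sample = [
    ["", "1450", "", "_"],
    ["", "1602", "", "_"],
    ["", "1300", "-", "1500"],
]
counts = count_fingerprints(sample)
```

Because the fields stay separated by bars, a specific part of a pattern can still be pulled out afterwards by splitting the fingerprint on ‘|’.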

I spent most of the rest of the week working on the DSL.  This included making some further tweaks to the WordPress version of the front-end, which is getting very close to being ready to launch.  This included updating the way the homepage boxes work to enable staff to more easily control the colours used and updating the wording for search results.  I also investigated an issue in the front end whereby slightly different data was being returned for entries depending on the way in which the data was requested.  Using dictionary ID (e.g. https://dsl.ac.uk/entry/dost44593) brings back some additional reference text that is not returned when using the dictionary and href method (e.g. https://dsl.ac.uk/entry/dost/proces_n).  It looks like the DSL API processes things differently depending on the type of call, which isn’t good.  I also checked the full dataset I’d previously exported from the API for future use and discovered it is the version that doesn’t contain the full reference text, so I will need to regenerate this data next week.

My main DSL task was to work on a new version of the API that just uses PHP and MySQL, rather than technologies that Arts IT Support are not so keen on having on their servers.  As I mentioned, I had previously run a script that got the existing API to spit out its fully generated data for every single dictionary entry and it’s this version of the data that I am currently building the new API around.  My initial aim is to replicate the functionality of the existing API and plug a version of the DSL website into it so we can compare the output and performance of the new API to that of the existing API.  Once I have the updated data I will create a further version of the API that uses this data, but that’s a little way off yet.

So far I have completed the parts of the API for getting data for a single entry and the data required by the ‘browse’ feature.  Information on how to access the data, and some examples that you can follow, are included in the API definition page.  Data is available as JSON (the default as used by the website) and CSV (which can be opened in Excel).  However, while the CSV data can be opened directly in Excel any Unicode characters will be garbled, and long fields (e.g. the XML content of long entries) will likely be longer than the maximum cell size in Excel and will break onto new lines.
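One common way to reduce the garbling is to write the CSV as UTF-8 with a byte order mark, which Excel generally uses to detect the encoding.  The sketch below is in Python rather than the PHP the API is built with, it is not how the API currently behaves, and the function names and sample row are invented.  It also flags values longer than Excel’s documented limit of 32,767 characters per cell.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # maximum number of characters Excel allows in one cell


def rows_to_excel_csv(header, rows) -> bytes:
    """Serialise rows as CSV bytes encoded as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # 'utf-8-sig' prepends the byte order mark to the encoded output.
    return buffer.getvalue().encode("utf-8-sig")


def oversized_cells(rows):
    """Return (row index, column index) pairs for values too long for an Excel cell."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if len(str(value)) > EXCEL_CELL_LIMIT
    ]


# Invented sample data, purely for illustration.
data = rows_to_excel_csv(["id", "headword"], [["example1", "café"]])
```

The BOM does nothing for the cell size problem, so very long XML fields would still need to be handled separately (or consumed as JSON instead).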

I also replicated the WordPress version of the DSL front-end here and set it up to work with my new API.  As of yet the searches don’t work as I haven’t developed the search parts of the API, but it is possible to view individual entries and use the ‘browse’ facility on the entry page.  These features use the new API and the new ‘fully generated’ data.  This will allow staff to compare the display of entries to see if anything looks different.

I still need to work on the search facilities of the API, and this might prove to be tricky.  The existing API uses Apache Solr for fulltext searching, which is a piece of indexing software that is very efficient for large volumes of text.  It also brings back nice snippets showing where results are located within texts.  Arts IT Support don’t really want Solr on their servers as it’s an extra thing for them to maintain.  I am hoping to be able to develop comparable full text searches just using the database, but it’s possible that this approach will not be fast enough, or pinpoint the results as well as Solr does.  I’ll just need to see how I get on in the coming weeks.

I also worked a little bit on the RNSN project this week, adding in some of the concert performances to the existing song stories.  Next week I’m intending to start on the development of the front end for the SCOSYA project, and hopefully find some time to continue with the DSL API development.

Week Beginning 16th April 2019

It was a four-day week due to Good Friday, and I spent the beginning of the week catching up on things relating to last week’s conference – writing up my notes from the sessions and submitting my expenses claims.  Marc also dropped off a bunch of old STELLA materials, such as old CD-ROMS, floppy disks, photos and books, so I spent a bit of time sorting through these.  I also took delivery of a new laptop this week, so spent some time installing things on it and getting it ready for work.

Apart from these tasks I completed a final version of the Data Management Plan for Ophira Gamliel’s project, which has now been submitted to college and I met with Luca Guariento to discuss his new role.  Luca will be working across the College in a similar capacity to my role within Critical Studies, which is great news both for him and the College.  I also made a number of tweaks to one of the song stories for the RNSN project and engaged in an email discussion about REF and digital outputs.

I had two meetings with the SCOSYA project this week to discuss the development of the front end features for the project.  It’s getting to the stage where the public atlas will need to be developed and we met to discuss exactly what features it will need to include.  There will actually be four public interfaces – an ‘experts interface’ which will be very similar to the atlas I developed for the project team, a simplified atlas that will only include a selection of features and search types, the ‘story maps’ about 15-25 particular features, and a ‘listening atlas’ that will present the questionnaire locations and samples of speech at each place.  There’s a lot to develop and Jennifer would like as much as possible to be in place for a series of events the project is running in mid-June, so I’ll need to devote most of May to the project.

I spent about a day this week working for DSL.  Rhona contacted me to say they were employing a designer who needed to know some details about the website (e.g. fonts and colours), so I got that information to her.  The DSL’s Facebook page has also changed so I needed to update that on the website too.  Last week Ann sent me a list of further updates that needed to be made to the WordPress version of the DSL website that we’ve been working on, so I implemented those.  This included sorting out the colours of various boxes, and ensuring that these are relatively easy to update in future, sorting out the contents box that stays fixed on the page as the user scrolls on one of the background essay sections, creating some new versions of images and ensuring the citation pop-up worked on a new essay.

I spent a further half-day or so working on the REELS project, making updates to the ‘export for publication’ feature I’d created a few weeks ago.  This feature grabs all of the information about place-names and outputs it in a format that reflects how it should look on the printed page.  It is then possible to copy this output into Word and retain the formatting.  Carole has been using the feature and had sent me a list of updates.  This included enabling paragraph divisions in the analysis section.  Previously each place-name entry was a single paragraph, therefore paragraph tags in any sections contained within were removed.  I have changed this now so that each place-name uses an HTML <div> element rather than a <p> element, meaning any paragraphs can be represented.  However, this has potentially resulted in there being more vertical space between parts of the information than there was previously.

Carole had also noted that in some places on import into Word spaces were being interpreted as non-breaking spaces, meaning some phrases were moved down to a new line even though some of the words would fit on the line above.  I investigated this, but it was a bit of a weird one.  It would appear that Word is using a ‘non-breaking space’ in some places and not in others.  Such characters (represented in Word if you turn markers on by what looks like a superscript ‘o’ as opposed to a mid-dot) link words together and prevent them being split over multiple lines.  I couldn’t figure out why Word was using them in some places and not in others as they’re not used consistently.  For this reason this is something that will need to be fixed after pasting into Word.  The simplest way is to turn markers on, select one of the non-breaking space characters then choose ‘replace’ from the menu, paste this character into the ‘Find what’ box and then put a regular space in the ‘replace with’ box.  There were a number of other smaller tweaks to make to the script, such as fixing the appearance of tildes for linear features, adjusting the number of spaces between the place-name and other text and changing the way parishes were displayed, which brings me to the end of this week’s report.  It will be another four-day week next week due to Easter Monday, and I intend to focus on the new Glasgow Medical Humanities resource for Gavin Miller, and some further Historical Thesaurus work.

Week Beginning 4th March 2019

I spent about half of this week working on the SCOSYA project.  On Monday I met with Jennifer and E to discuss a new aspect of the project that will be aimed primarily at school children.  I can’t say much about it yet as we’re still just getting some ideas together, but it will allow users to submit their own questionnaire responses and see the results.  I also started working with the location data that the project’s researchers had completed mapping out.  As mentioned in previous posts, I had initially created Voronoi diagrams that extrapolate our point-based questionnaire data to geographic areas.  The problem with this approach was that the areas were generated purely on the basis of the position of the points and did not take into consideration things like the varying coastline of Scotland or the fact that a location on one side of a body of water (e.g. the Firth of Forth) should not really extend into the other side, giving the impression that a feature is exhibited in places it quite clearly doesn’t.  Having the areas extend over water also made it difficult to see the outline of Scotland and to get an impression of which cell corresponded to which area.  So instead of this purely computational approach to generating geographical areas we decided to create them manually, using the Voronoi areas as a starting point, but tweaking them to take geographical features into consideration.   I’d generated the Voronoi cells as GeoJSON files and the researchers then used this very useful online tool https://geoman.io/studio to import the shapes and tweak them, saving them in multiple files as their large size caused some issues with browsers.

Upon receiving these files I then had to extract the data for each individual shape and work out which of our questionnaire locations the shape corresponded to, before adding the data to the database.  Although GeoJSON allows you to incorporate any data you like, in addition to the latitude / longitude pairings, I was not able to incorporate location names and IDs into the GeoJSON file I generated using the Voronoi library (it just didn’t work – see an earlier post for more information), meaning this ‘which shape corresponds to which location’ process needed to be done manually.  This involved grabbing the data for an individual location from the GeoJSON files, saving this and importing it into the GeoMan website, comparing the shape to my initial Voronoi map to find the questionnaire location contained within the area, adding this information to the GeoJSON and then uploading it to the database.  There were 147 areas to do, and the process took slightly over a day to complete.
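The matching was done by hand, but for comparison a ‘which location sits inside which shape’ check could in principle be automated with a point-in-polygon test.  The Python sketch below uses the standard ray-casting approach on the outer ring of each GeoJSON polygon.  The function names, the location IDs and the sample square are all made up, and it ignores holes and MultiPolygon shapes.

```python
def point_in_ring(lng, lat, ring):
    """Ray-casting test: is the point inside the ring of [lng, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Does the edge from vertex j to vertex i straddle the point's latitude?
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lng < x_cross:
                inside = not inside
        j = i
    return inside


def match_locations(features, locations):
    """Map each feature index to the IDs of locations that fall inside its outer ring.

    `locations` maps a (hypothetical) location ID to a (lng, lat) tuple.
    """
    matches = {}
    for index, feature in enumerate(features):
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # MultiPolygon and other types are not handled in this sketch
        outer = geometry["coordinates"][0]
        matches[index] = [
            loc_id
            for loc_id, (lng, lat) in locations.items()
            if point_in_ring(lng, lat, outer)
        ]
    return matches


# A made-up unit square and two made-up points.
square = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
}
result = match_locations([square], {"loc_a": (0.5, 0.5), "loc_b": (2.0, 2.0)})
```

Note that GeoJSON stores coordinates as longitude first, then latitude.  Any shape that ends up with zero or several matches would still need a manual look.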

With all of the area data associated with questionnaire locations in the database I could then begin to work on an updated ‘storymap’ interface that would use this data.  I’m basing this new interface on Leaflet’s choropleth example: https://leafletjs.com/examples/choropleth/ which is a really nice interface and is very similar to what we require.  My initial task was to try and get the data out of the database and formatted in such a way that it could appear on the map.  This involved updating the SCOSYA API to incorporate the GeoJSON output for each location, which turned out to be slightly tricky, as my API automatically converts the data exported from the database (e.g. arrays and such things) into JSON using PHP’s json_encode function.  However, applying this to data that is already encoded as JSON (i.e. the new GeoJSON data) results in that data being treated as a string rather than as a JSON object, so the output was garbled.  Instead I had to ensure that the json_encode function was applied to every bit of data except the GeoJSON data, and once I’d done this the API outputted the GeoJSON data in such a way as to ensure any JavaScript could work with it.
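The double-encoding problem is easy to reproduce.  The sketch below uses Python’s json module as a stand-in for PHP’s json_encode, with an invented row and field name.  It also takes a slightly different route to the fix from the one described above: rather than skipping the encoder for the GeoJSON field, it decodes the stored GeoJSON first so that everything can be encoded once.

```python
import json


def encode_row_naive(row):
    """Encode the whole row in one go; a stored GeoJSON string gets encoded a second time."""
    return json.dumps(row)


def encode_row(row, geojson_key="geojson"):
    """Decode the stored GeoJSON string first so it is emitted as a nested object."""
    decoded = dict(row)
    if isinstance(decoded.get(geojson_key), str):
        decoded[geojson_key] = json.loads(decoded[geojson_key])
    return json.dumps(decoded)


# Invented database row: the GeoJSON column already holds a JSON string.
row = {"id": 1, "geojson": '{"type": "Polygon", "coordinates": []}'}

naive = json.loads(encode_row_naive(row))
fixed = json.loads(encode_row(row))
```

With the naive version a client receives the GeoJSON as an escaped string and would have to parse it a second time, whereas the fixed version hands it over as a normal object.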

I then produced a ‘proof of concept’ that simply grabbed the location data, pulled all the GeoJSON for each location together and processed it via Leaflet to produce area overlays, as you can see in the following screenshot:

With this in place I then began looking at how to incorporate our intended ‘story’ interface with the Choropleth map – namely working with a number of ‘slides’ that a user can navigate between, with a different dataset potentially being loaded and displayed on each slide, and different position and zoom levels being set on each slide.  This is actually proving to be quite a complicated task, as much of the code I’d written for my previous Voronoi version of the storymap was using older, obsolete libraries.  Thankfully with the new approach I’m able to use the latest version of Leaflet, meaning features like the ‘full screen’ option and smoother panning and zooming will work.

By the end of the week I’d managed to get the interface to load in data for each slide and colour code the areas.  I’d also managed to get the slide contents to display – both a ‘big’ version that contains things like video clips and a ‘compact’ version that sits to one side, as you can see in the following screenshot:

There is still a lot to do, though.  One area is missing its data, which I need to fix.  Also the ‘click on an area’ functionality is not yet working.  Locations as map points still need to be added in too, and the formatting of the areas still needs some work.  Also, the pan and zoom functionality isn’t there yet either.  However, I hope to get all of this working next week.

Also this week I had a chat with Gavin Miller about the website for his new Medical Humanities project.  We have been granted the top-level ‘.ac.uk’ domain we’d requested so we can now make a start on the website itself.  I also made some further tweaks to the RNSN data, based on feedback.  I also spent about a day this week working on the REELS project, creating a script that would output all of the data in the format that is required for printing.  The tool allows you to select one or more parishes, or to leave the selection blank to export data for all parishes.  It then formats this in the same way as the printed place-name surveys, such as the Place-Names of Fife.  The resulting output can then be pasted into Word and all formatting will be retained, which will allow the team to finalise the material for publication.

I spent the rest of the week working on Historical Thesaurus tasks.  I met with Marc and Fraser on Friday, and ahead of this meeting I spent some time starting to look at matching up lexemes in the HT and OED datasets.  This involved adding seven new fields to the HT’s lexeme database to track the connection (which needs up to four fields) and to note the status of the connection (e.g. whether it was a manual or automatic match, which particular process was applied).  I then ran a script that matched up all lexemes that are found in matched categories where every HT lexeme matches an OED lexeme (based on the ‘stripped’ word field plus first dates).
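The ‘every HT lexeme matches’ rule can be sketched as follows.  This is an illustrative Python version with invented field names and IDs, not the actual script.  It assumes a match means an identical stripped form and first date, and it keeps the first OED lexeme seen for each key.

```python
def match_category(ht_lexemes, oed_lexemes):
    """Return {HT id: OED id} if every HT lexeme has an OED match, otherwise None.

    Each lexeme is a dict with hypothetical keys 'id', 'stripped' and 'firstdate'.
    Additional unmatched OED lexemes are allowed.
    """
    oed_by_key = {}
    for oed in oed_lexemes:
        # Keep the first OED lexeme seen for each (stripped form, first date) pair.
        oed_by_key.setdefault((oed["stripped"], oed["firstdate"]), oed["id"])

    matches = {}
    for ht in ht_lexemes:
        key = (ht["stripped"], ht["firstdate"])
        if key not in oed_by_key:
            return None  # one unmatched HT lexeme means the category is not fully matched
        matches[ht["id"]] = oed_by_key[key]
    return matches


# Invented sample category.
ht = [
    {"id": 1, "stripped": "chine", "firstdate": 1300},
    {"id": 2, "stripped": "cleft", "firstdate": 1400},
]
oed = [
    {"id": 10, "stripped": "chine", "firstdate": 1300},
    {"id": 11, "stripped": "cleft", "firstdate": 1400},
    {"id": 12, "stripped": "fissure", "firstdate": 1600},
]
full = match_category(ht, oed)
partial = match_category(ht, oed[:1])
```

Two HT lexemes that share a stripped form and first date would both be joined to the same OED lexeme here, so duplicate matches would still need checking afterwards.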

Whilst doing this I’m afraid I realised I got some stats wrong previously.  When I calculated the percentage of total matched lexemes in matched categories and it gave figures of about 89% matched lexemes this was actually the number of matched lexemes across all categories (whether they were fully matched or not).  The number of matched lexemes in fully matched categories is unfortunately a lot lower.  For ‘01’ there are 173,677 matched lexemes, for ‘02’ there are 45,943 matched lexemes and for ‘03’ there are 110,087 matched lexemes.  This gives a total of 329,707 matched lexemes in categories where every HT word matches an OED word (including categories where there are additional OED words) out of 731307 non-OE words in the HT, which is about 45% matched.  I ticked these off in the database with check code 1 but these will need further checking, as there are some duplicate matches (where the HT lexeme has been joined to more than one OED lexeme).  Where this happens the last occurrence currently overwrites any earlier occurrence.  Some duplicates are caused by a word’s resulting ‘stripped’ form being the same – e.g. ‘chine’ and ‘to-chine’.

When we met on Friday we figured out another big long list of updates and new experiments that I would carry out over the next few weeks, but Marc spotted a bit of a flaw in the way we are linking up HT and OED lexemes.  In order to ensure the correct OED lexeme is uniquely identified we rely on the OED’s category ID field.  However, this is likely to be volatile:  during future revisions some words will be moved between categories.  Therefore we can’t rely on the category ID field as a means of uniquely identifying an OED lexeme.  This will be a major problem when dealing with future updates from the OED and we will need to try and find a solution – for example updating the OED data structure so that the current category ID is retained in a static field.  This will need further investigation next week.


Week Beginning 25th February 2019

I met with Marc and Fraser on Monday to discuss the current situation with regards to the HT / OED linking task.  As I mentioned last week, we had run into an issue with linking HT and OED lexemes up as there didn’t appear to be any means of uniquely identifying specific OED lexemes as on investigation the likely candidates (a combination of category ID, refentry and refid) could be applied to multiple lexemes, each with different forms and dates.  James McCracken at the OED had helpfully found a way to include a further ID field (lemmaid) that should have differentiated these duplicates, and for the most part it did, but there were still more than a thousand rows where the combination of the four columns was not unique.

At our meeting we decided that this number of duplicates was pretty small (we are after all dealing with more than 700,000 lexemes) and we’d just continue with our matching processes and ignore these duplicates until they can be sorted.  Unexpectedly, James got back to me soon after the meeting and had managed to fix the issue.  He sent me an updated dataset that after processing resulted in there being only 28 duplicate rows, which is going to be a great help.

As a result of our meeting I made a number of further changes to scripts I’d previously created, including fixing the layout of the gap matching script, to make it easier for Fraser to manually check the rows, and I also updated the ‘duplicate lexemes in categories’ script (these are different sorts of duplicates – word forms that appear more than once in a category, but with their own unique identifiers) so that HT words where the ‘wordoed’ field is the same but the ‘word’ field is different are not considered duplicates.  This should filter out words of OE origin that shouldn’t be considered duplicates.  So for example, ‘unsweet’ with ‘unsweet’ and ‘unsweet’ with ‘unsweet < unswete’ no longer appear as duplicates.  This has reduced the number of rows listed from 567 to 456.  Not as big a drop as I’d expected, but a bit less.

At the meeting I’d also pointed out that the new data from the OED has deleted some categories that were present in the version of the OED data we’d been working with up to this point.  There are 256 OED categories that have been deleted, and these contain 751 words.  I wanted to check what was going on with these categories so wrote a little script that lists the deleted categories and their words.   I added a check to see which of these are ‘quarantined’ categories (categories that were duplicated in the existing data that we had previously marked as ‘quarantined’ to keep them separate from other categories) and I’m very glad to say that 202 such categories have been deleted (out of a total of 207 quarantined categories – we’ll need to see what’s going on with the remainder).  I also added in a check to see whether any of the deleted OED categories are matched up to HT categories.  There are 42 such categories, unfortunately, which appear in red.  We’ll need to decide what to do about these, ideally before I switch to using the new OED data, otherwise we’re left with OED catids in the HT’s category table that point to nothing.

In addition to the HT / OED task, I spent about half the week working on DSL related issues too, including a trip to the DSL offices in Edinburgh on Wednesday.  The team have been making updates to the data on a locally hosted server for many years now, and none of these updates have yet made their way into the live site.  I’m helping them to figure out how to get the data out of the systems they have been using and into the ‘live’ system.  This is a fairly complicated task as the data is stored in two separate systems, which need to be amalgamated.  Also, the ‘live’ data stored at Glasgow is made available via an API that I didn’t develop, for which there is very little documentation, and which appears to dynamically make changes to the data extracted from the underlying database and refactor it each time a request is made.  As this API uses technologies that Arts IT Support are not especially happy to host on their servers (Django / Python and Solr) I am going to develop a new API using technologies that Arts IT Support are happy to deal with (PHP), and eventually replace the old API, and also the old data with the new, merged data that the DSL people have been working on.  It’s going to be a pretty big task, but really needs to be tackled.

Last week Ann Ferguson from the DSL had sent me a list of changes she wanted me to make to the ‘Wordpressified’ version of the DSL website.  These ranged from minor tweaks to text, to reworking the footer, to providing additional options for the ‘quick search’ on the homepage to allow a user to select whether their search looks in SND, DOST or both source dictionaries.  It took quite some time to go through this document, and I’ve still not entirely finished everything, but the bulk of it is now addressed.

Also this week I responded to some requests from the SCOSYA team, including making changes to the website theme’s menu structure and investigating the ‘save screenshot’ feature of the atlas.  Unfortunately I wasn’t very successful with either request.  The WordPress theme the website currently uses only supports two levels of menu and a third level had been requested (i.e. a drop-down menu, and then a slide-out menu from the drop-down).  I thought I could possibly update the theme to include this with a few tweaks to the CSS and JavaScript, but after some investigation it looks like it would take a lot of work to implement, and it’s really not worth doing so when plenty of other themes provide this functionality by default.  I had suggested we switch to a different theme, but instead the menu contents are just going to be rearranged.

The request for updating the ‘save screenshot’ feature refers to the option to save an image of the atlas, complete with all icons and the legend, at a resolution that is much greater than the user’s monitor in order to use the image in print publications.  Unfortunately getting the map position correct when using this feature is very difficult – small changes to position can result in massively different images.

I took another look at the screengrab plugin I’m using to see if there’s any way to make it work better.  The plugin is leaflet.easyPrint (https://github.com/rowanwins/leaflet-easyPrint).  I was hoping that perhaps there had been a new version released since I installed it, but unfortunately there hasn’t.  The standard print sizes all seem to work fine (i.e. positioning the resulting image in the right place).  The A3 size is something I added in, following the directions under ‘Custom print sizes’ on the page above.  This is the only documentation there is, and by following it I got the feature working as it currently does.  I’ve tried searching online for issues relating to the custom print size, but I haven’t found anything relating to map position.  I’m afraid I can’t really attempt to update the plugin’s code as I don’t know enough about how it works and the code is pretty incomprehensible (see it here: https://raw.githubusercontent.com/rowanwins/leaflet-easyPrint/gh-pages/dist/bundle.js).

I’d previously tried several other ‘save map as image’ plugins but without success, mainly because they are unable to incorporate HTML map elements (which we use for icons and the legend).  For example, the plugin https://github.com/mapbox/leaflet-image which rather bluntly says “This library does not rasterize HTML because browsers cannot rasterize HTML. Therefore, L.divIcon and other HTML-based features of a map, like zoom controls or legends, are not included in the output, because they are HTML.”

I think that with the custom print size in the plugin we’re using we’re really pushing the boundaries of what it’s possible to do with interactive maps.  They’re not designed to be displayed bigger than a screen and they’re not really supposed to be converted to static images either.  I’m afraid the options available are probably as good as it’s going to get.

Also this week I made some further changes to the RNSN timelines, had a chat with Simon Taylor about exporting the REELS data for print publication, undertook some App store admin duties and had a chat with Helen Kingstone about a research database she’s hoping to put together.

Week Beginning 18th February 2019

As with the past few weeks, I spent a fair amount of time this week on the HT / OED data linking issue.  I updated the ‘duplicate lexemes’ tables to add in some additional information.  For HT categories the catid now links through to the category in the HT website and each listed word has an [OED] link after it that performs a search for the word on the OED website, as currently happens with words on the HT website.  For OED categories the [OED] link leads directly to the sense on the OED website, using a combination of ‘refentry’ and ‘refid’.

I then created a new script that lists HT / OED categories where all the words match (HT and OED stripped forms are the same and HT startdate matches OED GHT1 date) or where all HT words match and there are additional OED forms (hopefully ‘new’ words), with the latter appearing in red after the matched words.  Quite a large percentage of categories either have all their words matching or have everything matching except a few additional OED words (note that ‘OE’ words are not included in the HT figures):

For 01: 82300 out of 114872 categories (72%) are ‘full’ matches.  335195 out of 388189 HT words match (86%).  335196 out of 375787 OED words match (89%).  For 02: 20295 out of 29062 categories (70%) are ‘full’ matches.  106845 out of 123694 HT words match (86%).  106842 out of 119877 OED words match (89%). For 03: 57620 out of 79248 categories (73%) are ‘full’ matches.  193817 out of 223972 HT words match (87%).  193186 out of 217771 OED words match (89%).  It’s interesting how consistent the level of matching is across all three branches of the thesaurus.

I also received a new batch of XML data from the OED, which will need to replace the existing OED data that we’re working with.  Thankfully I have set things up so that the linking of OED and HT data takes place in the HT tables, for example the link between an HT and OED category is established by storing the primary key of the OED category as a foreign key in the corresponding row of HT category table.  This means that swapping out the OED data should (or at least I thought it should) be pretty straightforward.

I ran the new dataset through the script I’d previously created that goes through all of the OED XML, extracts category and lexeme data and inserts it into SQL tables.  As was expected, the new data contains more categories than the old data.  There are 238697 categories in the new data and 237734 categories in the old data, so it looks like 963 new categories. However, I think it’s likely to be more complicated than that.  Thankfully the OED categories have a unique ID (called ‘CID’ in our database).  In the old data this increments from 1 to 237734 with no gaps.  In the new data there are lots of new categories that start with an ID greater than 900000.  In fact, there are 1219 categories with such IDs.  These are presumably new categories, but note that there are more categories with these new IDs than there are ‘new’ categories in the new data, meaning some existing categories must have been deleted.  There are 237478 categories with an ID less than 900000, meaning 256 categories have been deleted.  We’re going to have to work out what to do with these deleted categories and any lexemes contained within them (which presumably might have been moved to other categories).
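The arithmetic above (238697 minus 1219 leaves 237478 retained categories, and 237734 minus 237478 gives 256 deletions) amounts to comparing two sets of IDs.  A minimal Python sketch, with an invented function name and tiny made-up ID lists standing in for the real tables:

```python
def compare_category_ids(old_ids, new_ids):
    """Split category IDs into those added, deleted and retained between two datasets."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        "added": new_set - old_set,      # IDs only in the new data
        "deleted": old_set - new_set,    # IDs only in the old data
        "retained": old_set & new_set,   # IDs present in both
    }


# Made-up example: old IDs run from 1 to 5 with no gaps; the new data drops ID 3
# and introduces two IDs in the 900000+ range.
diff = compare_category_ids(range(1, 6), [1, 2, 4, 5, 900001, 900002])

# The figures quoted in the post, checked the same way.
retained_count = 238697 - 1219
deleted_count = 237734 - retained_count
```

Listing the actual deleted IDs rather than just counting them is what makes it possible to check which of them are already linked to HT categories.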

Another complication is that the ‘Path’ field in the new OED data has been reordered to make way for changes to categories.  For example, the OED category with the path ’02.03.02’ and POS ‘n’ in the old data is 139993 ‘Ancient Greek philosophy’.  In the new OED data the category with the path ’02.03.02’ and POS ‘n’ is 911699 ‘badness or evil’, while ‘Ancient Greek philosophy’ now appears as ’’.  Thankfully the CID field does not appear to have been changed, for example, CID 139993 in the new data is still ‘Ancient Greek philosophy’ and still therefore links to the HT catid 231136 ‘Ancient Greek philosophy’, which has the ‘oedmainat’ of 02.03.02.  I note that our current ‘t’ number for this category is actually ‘’, so perhaps the updates to the OED’s ‘path’ field bring it into line with the HT’s current numbering.  I’m guessing that the situation won’t be quite as simple as that in all cases, though.

Moving on to lexemes, there are 751156 lexemes in the new OED data and 715546 in the old OED data, meaning there are some 35,610 ‘new’ lexemes.  As with categories I’m guessing it’s not quite as simple as that as some old lexemes may have been deleted too.  Unfortunately, the OED does not have a unique identifier for lexemes in its data.  I generate an auto-incrementing ID when I import the data, but as the order of the lexemes has changed between the two datasets the ID for a lexeme in the ‘old’ set does not correspond to the ID in the ‘new’ set.  For example, the last lexeme in the ‘old’ set has an ID of 715546 and is ‘line’ in the category 237601.  In the new set the lexeme with the ID 715546 is ‘melodica’ in the category 226870.

The OED lexeme data has two fields which sort of look like unique identifiers:  ‘refentry’ and ‘refid’.  The former is the ID for a dictionary entry while the latter is the ID for the sense.  So for example refentry 85205 is the dictionary entry for ‘Heaven’ and refid 1922174 is the second sense, allowing links to individual senses, as follows: http://www.oed.com/view/Entry/85205#eid1922174. Unfortunately in the OED lexeme table neither of these IDs is unique, either on its own or in combination.  For example, the lexeme ‘abaca’ has a refentry of 37 and a refid of 8725393, but there are three lexemes with these IDs in the data, associated with categories 22927, 24826 and 215239.
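As a small illustration of how the two fields combine into a sense link, here is a sketch in Python.  The function name is my own invention and the URL pattern is simply copied from the ‘Heaven’ example above, so it is an assumption that other entries follow the same pattern.

```python
def oed_sense_url(refentry, refid):
    """Build a link to one sense of an OED entry.

    Assumes the pattern seen in the 'Heaven' example:
    entry ID in the path, sense ID in the fragment.
    """
    return f"http://www.oed.com/view/Entry/{refentry}#eid{refid}"

# The example from the text: entry 85205 ('Heaven'), sense 1922174.
heaven_url = oed_sense_url(85205, 1922174)
```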

I was hoping that the combination of refentry, refid and category ID would be unique and serve as a primary key, and I therefore wrote a script to check for this.  Unfortunately this script demonstrated that these three fields are not sufficient to uniquely identify a lexeme in the OED data.  There are 5586 cases where the same refentry and refid pair appears more than once in a category.  Even more strangely, these occurrences frequently have different lexemes and different dates associated with them.  For example:  ‘Ecliptic circle’ (1678-1712) and ‘ecliptic way’ (1712-1712) both have 59369 as refentry and 5963672 as refid.

While there are some other entries that are clearly erroneous duplicates (e.g. half-world (1615-2013) and 3472: half-world (1615-2013) have the same refentry (83400, 83400) and refid (1221624180, 1221624180)), the above example and others are (I guess) legitimate and would not be fixed by removing duplicates, so we can’t rely on a combination of cid, refentry and refid to uniquely identify a lexeme.
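The uniqueness check itself is just a matter of grouping rows by the candidate key and listing any group with more than one member.  The following is a rough sketch in Python, not my actual script: the function name and the dictionary field names are assumptions, and the CID of 0 given to the two ‘ecliptic’ rows is a placeholder, since the real category is not quoted above.

```python
from collections import defaultdict

def find_duplicate_keys(lexemes, key_fields=("cid", "refentry", "refid")):
    """Return {key: rows} for every candidate key shared by two or more rows."""
    groups = defaultdict(list)
    for row in lexemes:
        groups[tuple(row[field] for field in key_fields)].append(row)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

lexemes = [
    # 'abaca' shares refentry and refid across three different categories,
    # so adding the CID to the key is enough to separate these rows.
    {"cid": 22927, "refentry": 37, "refid": 8725393, "lemma": "abaca"},
    {"cid": 24826, "refentry": 37, "refid": 8725393, "lemma": "abaca"},
    {"cid": 215239, "refentry": 37, "refid": 8725393, "lemma": "abaca"},
    # Two different lemmas in the same category (placeholder CID 0) that
    # share a sense: the three-field key cannot tell them apart.
    {"cid": 0, "refentry": 59369, "refid": 5963672, "lemma": "Ecliptic circle"},
    {"cid": 0, "refentry": 59369, "refid": 5963672, "lemma": "ecliptic way"},
]

duplicates = find_duplicate_keys(lexemes)
```

Here only the ‘ecliptic’ pair is reported, which is the kind of legitimate-looking clash that stops cid, refentry and refid from working as a primary key.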

Based on the data we’d been given from the OED, in order to uniquely identify an OED lexeme we would need to include the actual ‘lemma’ field and/or date fields.  We can’t introduce our own unique identifier as it will be redefined every time new OED data is inputted, so we will have to rely on a combination of OED fields to uniquely identify a row, in order to link up one OED lexeme and one HT lexeme.  But if we rely on the ‘lemma’ or date fields the risk is these might change between OED versions, so the link would break.

To try and find a resolution to this issue I contacted James McCracken, who is the technical guy at the OED.  I asked him whether there is some other field that the OED uses to uniquely identify a lexeme that was perhaps not represented in the dataset we had been given.  James was extremely helpful and got back to me very quickly, stating that the combination of ‘refentry’ and ‘refid’ uniquely identifies the dictionary sense, but that a sense can contain several different lemmas, each of which may generate a distinct item in the thesaurus, and these distinct items may co-occur in the same thesaurus category.  He did, however, note that in the source data there’s also a pointer to the lemma (‘lemmaid’), which wasn’t included in the data we had been given.  James pointed out that this field is only included when a lemma appears more than once in a category, but that we should therefore be able to use CID, refentry, refid and (where present) lemmaid to uniquely identify a lexeme.  James very helpfully regenerated the data so that it included this field.

Once I received the updated data I updated my database structure to add in a new ‘lemmaid’ field and ran the new data through a slightly updated version of my migration script.  The new data contains the same number of categories and lexemes as the dataset I’d been sent earlier in the week, so that all looks good.  Of the lexemes there are 33283 that now have a lemmaid, and I also updated my script that looks for duplicate words in categories to check the combination of refentry, refid and lemmaid.

After adding in the new lemmaid field, the number of listed duplicates has decreased from 5586 to 1154.  Rows such as ‘Ecliptic way’ and ‘Ecliptic circle’ have now been removed, which is great.  There are still a number of duplicates listed that are presumably erroneous, for example ‘cock and hen (1785-2006)’ appears twice in CID 9178 and neither form has a lemmaid.  Interestingly, the ‘half-world’ erroneous(?) duplicate example I gave previously has been removed as one of these has a ‘lemmaid’.

Unfortunately there are still rather a lot of what look like legitimate lemmas that have the same refentry and refid but no lemmaid.  Although these point to the same dictionary sense they generally have different word forms and in many cases different dates.  E.g. in CID 24296:  poor man’s treacle (1611-1866) [Lemmaid 0] and countryman’s treacle (1745-1866) [Lemmaid 0] have the same refentry (205337, 205337) and refid (17724000, 17724000).  We will need to continue to think about what to do with these next week as we really need to be able to identify individual lexemes in order to match things up properly with the HT lexemes.  So this is a ‘to be continued’.
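To make the remaining problem concrete, here is a sketch in Python of the four-field key (CID, refentry, refid and lemmaid where present).  The function and field names are assumptions, a missing lemmaid is treated as 0 as in the listing above, and the lemmaid values of 1 and 2 for the ‘ecliptic’ rows (along with their CID of 0) are invented placeholders.  The two ‘treacle’ rows use the values quoted above.

```python
def lexeme_key(row):
    """Candidate key: CID, refentry, refid and lemmaid (0 when absent)."""
    return (row["cid"], row["refentry"], row["refid"], row.get("lemmaid") or 0)

# Placeholder CID and lemmaid values: once each row carries its own lemmaid
# the two rows get different keys.
ecliptic = [
    {"cid": 0, "refentry": 59369, "refid": 5963672, "lemmaid": 1, "lemma": "Ecliptic circle"},
    {"cid": 0, "refentry": 59369, "refid": 5963672, "lemmaid": 2, "lemma": "ecliptic way"},
]

# Values from CID 24296: neither row has a lemmaid, so the keys still clash.
treacle = [
    {"cid": 24296, "refentry": 205337, "refid": 17724000, "lemma": "poor man's treacle"},
    {"cid": 24296, "refentry": 205337, "refid": 17724000, "lemma": "countryman's treacle"},
]

ecliptic_keys = {lexeme_key(row) for row in ecliptic}
treacle_keys = {lexeme_key(row) for row in treacle}
```

The ‘ecliptic’ rows end up with two distinct keys while the ‘treacle’ rows collapse to a single key, which is exactly the case we still need a way to handle.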

Also this week I spent some time in communication with the DSL people about issues relating to extracting their work-in-progress dictionary data and updating the ‘live’ DSL data.  I can’t really go into detail about this yet, but I’ve arranged to visit the DSL offices next week to explore this further.  I also made some tweaks to the DSL website (including creating a new version of the homepage) and spoke to Ann about the still-in-development WordPress version of the website and a long list of changes that she had sent me to implement.

I also tracked down a bug in the REELS system that was resulting in place-name element descriptions being overwritten with blanks in some situations.  It would appear to only occur when associating place-name elements with a place when the ‘description’ field had carriage returns in it.  When you select an element by typing characters into the ‘element’ box to bring up a list of matching elements and then select an element from the list, a request is sent to the server to bring back all the information about the element in order to populate the various boxes in the form relating to the element.  However, raw carriage return and line feed characters (\r and \n) are not valid inside a JSON string unless they are escaped.  When an element description contained such characters, the returned file couldn’t be read properly by the script.  Form elements up to the description field were getting automatically filled in, but then the description field was being left blank.  Then when the user pressed the ‘update’ button the script assumed the description field had been updated (to clear the contents) and deleted the text in the database.  Once I identified this issue I updated the script that grabs the information about an element so that special characters that break JSON files are removed, so hopefully this will not happen again.
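The JSON problem is easy to reproduce.  The sketch below is in Python purely for illustration (it is not the REELS code, and the sample description is made up): it shows a hand-built JSON string with raw line breaks being rejected by the parser, the fix described above of stripping the offending characters, and the alternative of letting a JSON encoder escape them.

```python
import json

description = "First line\r\nSecond line"  # made-up element description

# Hand-built JSON leaves raw control characters inside the string value,
# which a strict JSON parser refuses to read.
broken = '{"description": "' + description + '"}'
try:
    json.loads(broken)
    hand_built_ok = True
except json.JSONDecodeError:
    hand_built_ok = False

# The fix described above: remove the characters before building the JSON.
stripped = description.replace("\r", " ").replace("\n", " ")
stripped_value = json.loads('{"description": "' + stripped + '"}')["description"]

# An alternative: a JSON encoder escapes the characters, so the line breaks
# survive the round trip.
safe = json.dumps({"description": description})
round_tripped = json.loads(safe)["description"]
```

Stripping is the simplest change to make, while escaping keeps the original line breaks intact, which may matter if descriptions are meant to contain paragraphs.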

Also this week I updated the transcription case study on the Decadence and Translation website to tweak a couple of things that were raised during a demonstration of the system and I created a further timeline for the RNSN project, which took most of Friday afternoon.