Week Beginning 13th May 2019

I split my time this week between the Scots Syntax Atlas (SCOSYA), the Dictionary of the Scots Language (DSL) and the Historical Thesaurus (HT).  For SCOSYA I worked primarily on the stories, which will guide users through one or more linguistic features.  I made updates to the CMS to allow staff to add, list, edit and delete stories.  There are two new menu items: ‘Add Story’, which allows staff to upload the text of a JSON file and specify a title and description for the story, and ‘Browse Stories’, which lists the uploaded stories and notes whether each is visible on the website.  Through this page you can choose to delete a story or press on its title to view and edit the data.  Staff can use this to make updates to stories (e.g. changing slides or adding new ones), update the title and description, or set the story so that it doesn’t appear on the website.  I uploaded the test ‘Gonnae’ story into the system and edited it via these pages to set all slides to type ‘slide’ instead of ‘full’ and to remove the embedded YouTube clip.  I also updated the API to add endpoints for listing stories and for getting all data about a specified story.
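For anyone curious, a story file is essentially just an ordered set of slides in JSON.  Something along these lines, although the field names here are my own illustration rather than the project’s actual schema:

```javascript
// Hypothetical sketch of a story's structure – field names are illustrative only.
const story = {
  title: "Gonnae",
  description: "A short tour of one linguistic feature across Scotland.",
  slides: [
    {
      type: "slide",            // all slides now use 'slide' rather than 'full'
      text: "<p>Introductory text for the first slide.</p>",
      attribute: "A1",          // the feature the map should display for this slide
    },
    {
      type: "slide",
      text: "<p>Text for the second slide.</p>",
      attribute: "A2",
    },
  ],
};

console.log(`${story.title}: ${story.slides.length} slides`);
```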

In terms of the front end, the list of stories now appears as a drop-down list in the ‘Stories’ section of the accordion.  If you select one, its description is displayed, and pressing ‘Show’ loads the story.  It’s also possible to link directly to a story.  Currently it’s not possible to link to a specific slide within a story, but I’m thinking of adding that in as an option.
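As a rough illustration of how deep-linking to a slide might work, here’s a sketch that reads a story ID (and an optional slide number) from the query string.  The parameter names, endpoint path and helper functions are all placeholders rather than the real atlas code:

```javascript
// Hypothetical sketch of deep-linking: a URL such as ?story=5&slide=3 could
// load story 5 and jump to its third slide.
async function loadStory(id) {
  const response = await fetch(`/api/story/${id}`); // placeholder endpoint
  return response.json();
}

function goToSlide(story, n) {
  const slide = story.slides[n - 1];
  if (slide) {
    // '#slide-text' is a placeholder for wherever the slide content is rendered.
    document.querySelector("#slide-text").innerHTML = slide.text;
  }
}

const params = new URLSearchParams(window.location.search);
const storyId = params.get("story");
if (storyId !== null) {
  loadStory(storyId).then((story) => {
    goToSlide(story, parseInt(params.get("slide") ?? "1", 10));
  });
}
```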

It’s taken quite a long time to fully integrate the story with the existing Public Atlas interface.  I won’t go into details, but it has been rather tricky at times.  Anyway, I have slightly changed how the story interface works, based on previous discussions with the team.  All slides should now be ‘slide’ rather than ‘full’, meaning they all appear to the right of the map.  I’ve repositioned the box to the top right corner and moved the ‘next’ and ‘previous’ buttons inside the box rather than at the edges of the map.  The ‘info box’ that previously told you which feature is being displayed and gave you information about an area you clicked on has now gone, with the information about the feature now appearing as a separate section beneath the slide text and above the navigation buttons.  I’ve also made the ‘return to start’ link a button with an icon.

I have also now figured out how to get the pop-ups to work with the areas as well as the points, and have incorporated the same pop-ups as in the ‘Features’ section into the ‘Stories’ section.  I think this works pretty well; it’s always better to have things working consistently across a site.  It also means marker tooltips with location names work in the story interface too.  Oh, and markers are now exactly the same as in the ‘Features’ section, whereas before they may have looked similar but used different technologies (circleMarkers as opposed to markers with CSS-based styling).  This also means they bounce when new data is loaded into a story.  Here’s a screenshot of a story slide:

I’ve also made some updates to the ‘Features’ section.  Marker borders are now white rather than blue, which I think looks better and fits in with the white dotted edges of the areas.  Markers displayed together with areas are now much smaller, to cut down on clutter.  The legend now works when just viewing areas, whereas it was broken in this view before.  Getting this working required a fairly major rewrite of the code so that markers and areas are bundled together into layers for each rating level, whereas previously they were treated separately (a rough sketch of the approach is below).  Popups for areas now work in the ‘Features’ section too.  Hopefully next week I’ll add in sample sound files and the option to highlight groups.
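To give a rough idea of what the rewrite involves, here’s a sketch of the sort of bundling I mean, using Leaflet (which the atlas map is built with).  The data shape, class names and popup text are all invented for illustration rather than taken from the actual SCOSYA code:

```javascript
// Rough sketch of the new layering approach: markers and areas for each rating
// level are bundled into a single layer group so the legend can toggle them
// together, and both get the same popup.
const map = L.map("map").setView([56.5, -4.0], 7);

function buildRatingLayer(locations, rating) {
  const layer = L.layerGroup();
  locations
    .filter((loc) => loc.rating === rating)
    .forEach((loc) => {
      // CSS-styled divIcon markers, as now used in both 'Features' and 'Stories'.
      const marker = L.marker([loc.lat, loc.lng], {
        icon: L.divIcon({ className: `marker rating-${rating}` }),
      })
        .bindPopup(`<strong>${loc.name}</strong>: rating ${rating}`)
        .bindTooltip(loc.name);
      layer.addLayer(marker);

      // The area polygon for the location gets the same popup as its marker.
      if (loc.area) {
        layer.addLayer(
          L.polygon(loc.area, { className: `area rating-${rating}` })
            .bindPopup(`<strong>${loc.name}</strong>: rating ${rating}`)
        );
      }
    });
  return layer;
}
```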

For DSL I launched a new version of the DSL website (https://dsl.ac.uk/) that has a front end powered by WordPress, to make it easier for DSL staff to manage content directly.  It doesn’t look particularly different to the previous version, but the ancillary pages have been completely reorganised and the quick search now allows you to limit your search to DOST or SND.  The migration to the new system went pretty smoothly, with the only issue being some broken links and redirects that I needed to sort out.  I also managed to identify and fix the problem with citation texts being displayed differently depending on the type of URL used (e.g. https://dsl.ac.uk/entry/dost/proces_n and https://dsl.ac.uk/entry/dost44593).  This was being caused by some JavaScript on the page that was responsible for switching the order of titles and quotations round and relied on ‘/dost/’ appearing in the URL to work properly.  It should all be sorted now: the above pages should be identical, and cases where there are multiple links for one quote (e.g. 6 (5)) should all now display.
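To illustrate the sort of thing that was going wrong (this is a sketch rather than the actual DSL script): the reordering code only checked for ‘/dost/’ in the URL, so ID-style URLs slipped through.  Checking both URL patterns, or better still a value output by the server, avoids the mismatch:

```javascript
// Illustrative only – not the actual DSL code.  Recognise a DOST entry from
// either URL form: /entry/dost/proces_n or /entry/dost44593.
function isDostEntry() {
  const path = window.location.pathname;
  return path.includes("/dost/") || /\/entry\/dost\d+/.test(path);
}

if (isDostEntry()) {
  // ...reorder citation titles and quotations as before...
}
```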

I also continued to develop the new API for the DSL, although I’m running into some issues with how the existing API works, or rather some situations where it doesn’t seem to work as intended.  I’ve never been completely clear about how ‘Headword match type’ in the advanced search works, especially as wildcards can also be used.  Is ‘Headword match type’ really needed?  If it’s set to ‘exact match’ then presumably the use of wildcards shouldn’t be allowed anyway?  I find the use of wildcards in ‘part of compound’ and ‘part of any word’ searches utterly confusing.  For example, if I search for ‘hing*’ in DOST headwords with the type set to ‘partial match’, it doesn’t just find things starting with ‘hing’ but anything containing ‘hing’, meaning the wildcard isn’t doing anything.  Boolean searches combined with wildcards seem to be broken too: a ‘part of compound’ headword search for “hing* OR hang*” brings back no results, which can’t be right as “hing*” on its own brings back plenty.

It’s difficult to replicate the old API when I don’t understand how (or even if) certain combinations of search terms are supposed to work.  I’ve emailed the DSL staff to see whether we could remove the ‘Headword match type’ option and just use wildcards and Booleans to provide similar functionality.  E.g. the default would be an exact match, which matches the search term against the headword form and its variants (e.g. ‘Coutch’ also has ‘Coutcher’ and ‘Coutching’).  The user could then use wildcards to match forms beginning or ending with their term or containing it in the middle (‘hing*’, ‘*hing’, ‘*hing*’), or use a single-character wildcard (‘h?ng*’, ‘*h?ng’, ‘*h?ng*’), and combine this with Boolean operators (‘h?ng* NOT hang*’).  I just can’t see a way to get these searches working consistently when used in conjunction with ‘Headword match type’, so I’ll need to see what the DSL people think.
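My current thinking is that the new API would simply translate these user-facing wildcards into SQL ‘LIKE’ patterns behind the scenes.  Here’s a minimal sketch of that translation; the function name is my own, and the real API would also need an ESCAPE clause and bound parameters on the database side:

```javascript
// Minimal sketch: '*' becomes '%', '?' becomes '_', and characters that SQL
// would otherwise treat as wildcards are escaped first.
function wildcardToLike(term) {
  return term
    .replace(/[%_]/g, (c) => "\\" + c) // escape literal % and _ in the user's term
    .replace(/\*/g, "%")               // 'hing*' -> 'hing%'
    .replace(/\?/g, "_");              // 'h?ng*' -> 'h_ng%'
}

console.log(wildcardToLike("h?ng*")); // "h_ng%", i.e. headword LIKE 'h_ng%'
```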

I also investigated why DSL posts to Facebook suddenly started appearing with a University of Glasgow logo after the DSL website was migrated.  The UoG image that’s included isn’t even the official UoG logo, but just the blue icon I made to go in the footer of the DSL website.  It therefore looks like Facebook scans the page that’s linked to for an icon and somehow decides that this is the one to use, even though we link to several DSL icons in the page header to provide the icon in browser tabs, for example.

After a bit of digging I found this page: https://stackoverflow.com/questions/10864208/how-do-i-get-facebook-to-show-my-sites-logo-when-i-paste-the-link-in-a-post-sta, with an answer that suggests using Facebook’s developer debug tool to see what Facebook does when you pass a link to it (http://developers.facebook.com/tools/debug).  When entering the DSL URL into the page it didn’t display any associated logo (not even the wrong one), which wasn’t very helpful, but it did point out some missing Facebook-specific metadata fields that we can add to the DSL header information.  This included <meta property="og:image" content="[url to image]" />.  I added this to the DSL site with a link to one of the DSL’s icons, then got Facebook to ‘scrape’ the content again, but it didn’t work, and instead gave errors about content encoding.  Some further searching uncovered that the correct tag to use was “og:image:url”.  Pointing this at an icon image that was larger than 200px seems to have worked, as the DSL logo is now pulled into the Facebook debugger.

For the HT I ticked off a few tasks that had been agreed upon after last week’s meeting.  I spent a bit of time investigating exactly how HT and OED lexemes were matched and ticked off, as we’d rather lost track of what had happened.  HT lexemes were matched against the newest OED lexemes (the version of the data with the ‘lemmaid’ field).  The stripped form of each lexeme AND the dates need to be the same before a match is recorded.  The dates that are compared are the numeric characters before the dash of the HT’s ‘fulldate’ field and the numeric characters from the OED’s ght_date1 field.  The following fields are then populated in the HT’s lexeme table: V2_oedchecked (Y/N), v2_oedcatid, V2_refentry, V2_refid, V2_lemmaid (these four in combination uniquely identify the OED lexeme), V2_checktype (as with categories, used to note automatic or manual matches) and v2_matchprocess (for keeping track of matching processes, as with categories).  There are 627,872 matched lexemes, all of which have been given v2_checktype ‘A’ and v2_matchprocess ‘1’.
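Expressed as code, the matching rule boils down to something like the following sketch (the real process runs against the database tables rather than plain objects, so the field access here is just illustrative):

```javascript
// Sketch of the matching rule: an HT lexeme and an OED lexeme only match when
// their stripped forms are identical AND the leading digits of the HT
// 'fulldate' (before the dash) equal the digits of the OED 'ght_date1'.
function extractStartDate(fulldate) {
  // numeric characters before the dash, e.g. "1617-1820" -> "1617"
  const match = String(fulldate).match(/^(\d+)/);
  return match ? match[1] : null;
}

function lexemesMatch(htLexeme, oedLexeme) {
  return (
    htLexeme.stripped === oedLexeme.stripped &&
    extractStartDate(htLexeme.fulldate) ===
      String(oedLexeme.ght_date1).match(/\d+/)?.[0]
  );
}
```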

I also started to deal with categories that had been ticked off but had a part of speech mismatch.  There were 41 of these, and I realised that I would need to ‘detick’ not only the categories but also any lexemes contained in them.  I wrote a script that lists the mismatched categories and any lexemes in them that are currently matched, but I haven’t deticked anything yet as I’m wondering whether the OED has simply changed the POS of these categories, especially as the bulk of the mismatches are changes between sub-verb forms.  I also exported our current category table so Fraser can send it on to the OED people, who are interested to know about the matches we have made so far.  In addition I set up two new email addresses for the HT using the ht.ac.uk domain: ‘editors@ht.ac.uk’ and ‘web@ht.ac.uk’.  I also wrote a script that finds monosemous forms among the unmatched HT lexemes (monosemous within a specific POS and using the ‘stripped’ column), finds all the ones that have an identical monosemous match in the OED data, and then displays not only those forms and their category details but every lexeme in the HT / OED categories the forms belong to.  Finally, I made a start on a further script that finds unmatched words within matched categories and tries to match them with an increasing Levenshtein distance (a rough sketch of this approach is below).  I haven’t finished this yet and will continue with it next week.
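The general approach of that last script looks something like this sketch, with the distance threshold and helper function being my own illustration rather than the finished code:

```javascript
// Sketch of the in-progress approach: for each unmatched HT word in a matched
// category, look for an unmatched OED word in the corresponding category,
// first at Levenshtein distance 1, then 2, and so on up to a cap.
function levenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                      // deletion
        d[i][j - 1] + 1,                                      // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)     // substitution
      );
    }
  }
  return d[a.length][b.length];
}

function findCandidates(htWords, oedWords, maxDistance = 3) {
  const matches = [];
  for (let distance = 1; distance <= maxDistance; distance++) {
    for (const ht of htWords) {
      if (matches.some((m) => m.ht === ht)) continue; // already matched at a smaller distance
      const candidate = oedWords.find((oed) => levenshtein(ht, oed) === distance);
      if (candidate) matches.push({ ht, oed: candidate, distance });
    }
  }
  return matches;
}
```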