I continued to work on some Historical Thesaurus / OED data linking issues this week, working through my ‘to do’ list from the last meeting. The first thing I tackled was the creation of a script that tries to link up unmatched words in matched categories using a Levenshtein distance of 1. The script grabs the unmatched HT and OED words in every matched category and for each HT word it compares the stripped form to the stripped form of every unmatched OED word in the matched category. If any have a Levenshtein distance of 1 (or less) these then appear in bold after the word in the ‘HT unmatched lexemes’ column. The same process is repeated for unmatched OED lexemes, comparing each to every unmatched HT word.
There are quite a lot of occasions where there are multiple potential matches. E.g. in 01.02 there are two HT words (a world, world < woruld) that match one OED word (world). Some forms that look like they ought to match are not getting picked up, e.g. ‘palæogeography’ and ‘palaeogeography’ due to there being two characters difference. Some category contents seem rather odd, for example I’m not sure what’s going on with ‘North-east’ as it looks like the OED repeats the word ‘north-east’ four times. There are also some instances where comparing the ‘stripped’ form with a Levenshtein distance of 1 is giving a false positive – e.g. HT ‘chingle’ and ‘shingle’ are both matching OED ‘shingle’. However, I reckon there is a lot of potential for matching here, if we apply some additional limits.
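As a rough illustration of the comparison described above, here is a minimal Python sketch. It is not the actual script: the function names and the plain lists of stripped forms are assumptions made for the example, and the real script works against the HT and OED database tables.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_matches(ht_words, oed_words, max_distance=1):
    """For each unmatched HT stripped form, list the unmatched OED stripped
    forms in the same category that are within max_distance edits."""
    return {
        ht: [oed for oed in oed_words if levenshtein(ht, oed) <= max_distance]
        for ht in ht_words
    }


# Hypothetical category contents, using forms mentioned above.
print(candidate_matches(["chingle", "shingle"], ["shingle"]))
print(levenshtein("palæogeography", "palaeogeography"))
```

The first call shows the false positive (both HT forms are paired with OED ‘shingle’), and the second shows why the ‘æ’ / ‘ae’ pair is missed at a distance of 1.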
I then wrote a further script that attempts to identify new OED words. It brings back all matched categories and then counts the number of unmatched words in the HT and OED categories. If there are more unmatched OED words than HT words then the category information is displayed. This includes the HT words for reference, and also the OED words complete with GHT dates, OED dates and whether the record was revised. There are 23212 matched categories with more unmatched OED words than HT words, although we probably should try and tick off more lexeme matches before we do too much with this script as many of the listed unmatched words clearly match. I also updated the ‘monosemous’ script I created last week to colour code the output based on Fraser’s suggestions, which will help in deciding which candidate matches to tick off.
On Wednesday Marc, Fraser and I had a meeting to discuss the current situation and to decide on further steps. Following on from this I made some further tweaks to existing scripts. There appeared to be a bug in the ‘monosemous’ script whereby some already matched OED words were being picked out as candidate monosemous matches. It’s turned out to be a rather large bug in terms of impact. The part of the script that checks through the potential OED matches to pick out those that are monosemous amongst the ‘not yet ticked off’ OED words was correctly identifying that a word was monosemous amongst the unticked OED words within a part of speech. However, the script needed to check through all occurrences of a word, and unfortunately it was set to use the last occurrence it reached, rather than the one that was the actual unticked monosemous form. For example, ‘Terrestrious’ as an Aj appears 4 times in the OED data. 3 of them have already been ticked off, so the remaining form is monosemous within Aj. But when checking through the 4 forms, the one that hasn’t been ticked off yet was looked at second. The script noted that the form only appeared once in the unticked Aj set, but was then using the last form it checked through, one that was already ticked off in category ‘having earth-like qualities’ rather than the unticked form in ‘occurring on’. I’m not sure if that makes sense or not, but basically in many cases the wrong OED category and words were being displayed, leading to many words being classed as not matches when in actual fact they were. I’ve updated the script to fix the bug.
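To make the nature of the bug a bit more concrete, here is a small Python sketch of the pattern. It is only an illustration: the `Occurrence` class, the function names and the two placeholder categories are hypothetical, and the real script is structured differently.

```python
from dataclasses import dataclass


@dataclass
class Occurrence:
    word: str
    pos: str
    category: str
    ticked: bool


def monosemous_buggy(occurrences):
    """Counts unticked forms correctly but returns the wrong occurrence."""
    unticked_count = 0
    chosen = None
    for occ in occurrences:
        if not occ.ticked:
            unticked_count += 1
        chosen = occ  # bug: overwritten every time, so the last one wins
    return chosen if unticked_count == 1 else None


def monosemous_fixed(occurrences):
    """Returns the single unticked occurrence, if there is exactly one."""
    unticked = [occ for occ in occurrences if not occ.ticked]
    return unticked[0] if len(unticked) == 1 else None


# 'Terrestrious' as an Aj: four occurrences, the unticked one checked second.
# 'placeholder A' and 'placeholder B' are made-up category names.
forms = [
    Occurrence("terrestrious", "Aj", "placeholder A", True),
    Occurrence("terrestrious", "Aj", "occurring on", False),
    Occurrence("terrestrious", "Aj", "placeholder B", True),
    Occurrence("terrestrious", "Aj", "having earth-like qualities", True),
]
print(monosemous_buggy(forms).category)
print(monosemous_fixed(forms).category)
```

Both versions agree that the form is monosemous amongst the unticked words, but only the second one reports the category the unticked form actually belongs to.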
I also made some updates to an existing category matching script and created a further script to list all of the unmatched words in matched categories that appear to match up based purely on the ‘stripped’ word form and not including a date check.
On Monday afternoon I attended the ‘Technician Commitment Launch Event’, which is aimed at promoting the role of technicians across the University. It was a busy event, with at least a couple of hundred people attending, and talks from technicians and senior management from within and beyond the University. It’s a promising initiative and I hope it’s a success.
I was contacted this week by College research admin staff who asked if I would write a Data Management Plan for a researcher who is based in the School of Culture and Creative Arts. As I’m employed specifically by the School of Critical Studies this really should not be my responsibility, especially as College recently appointed someone in a similar role to me to do such things. Despite pointing this out, and them not even bothering to contact this person, I was somehow still landed with the job of writing the DMP, which I’m not best pleased about. I spent most of Friday working on the plan, and it’s still not complete. I should have it finished early next week, but it’s meant I have been unable to work on the SCOSYA project this week and will have less time to work on it next week too.
Other than these tasks, and speaking to Carole Hough about some conference pages of hers that have gone missing from T4 and speaking to Wendy Anderson about some issues with the advanced search maps in the SCOTS corpus I spent the remainder of the week on DSL duties. This included beginning to write a script that will generate sections of entries for different types of advanced search (e.g. with quotes, without quotes, only quotes) and fixing some layout issues with the new version of the DSL website when viewed in portrait mode on mobile phones. The biggest task I focussed on was writing a script to go through the DOST and SND data that had been exported from scripts on the DSL’s test server, split up the XML and pick out the relevant information to update entries in the online DSL database. I started with the DOST file and this mostly went pretty well, although I ended up with a number of questions that I needed to send on to the DSL team. I also attempted to migrate the SND data but unfortunately the file that was outputted by the script on the test server is not valid XML so something must have gone wrong with it. This means my script is unable to parse the file, so I’ll need to try and figure out what has gone wrong with it. Further jobs for next week.
I split my time this week between the Scots Syntax Atlas (SCOSYA), the Dictionary of the Scots Language (DSL) and the Historical Thesaurus (HT). For SCOSYA I worked primarily on the stories, which will guide users through one or more linguistic features. I made updates to the CMS to allow staff to add, list, edit and delete stories, and there are new menu items for ‘Add Story’, which allows staff to upload the text of a JSON file and specify a title and description for the story, and ‘Browse Stories’, which lists the uploaded stories and whether they are visible on the website. Through this page you can choose to delete a story or press on its title to view and edit the data. Staff can use this to make updates to stories (e.g. changing slides, adding new ones), update the title and description or set the story so that it doesn’t appear on the website. I uploaded the test ‘Gonnae’ story into the system and I edited it via these pages to set all slides to type ‘slide’ instead of ‘full’ and to remove the embedded YouTube clip. I also updated the API to add in endpoints for listing stories and getting all data about a specified story.
In terms of the front end, a list of stories now appears as a drop-down list in the ‘Stories’ section of the accordion. If you select one its description is displayed and pressing ‘Show’ will load the story. It’s also possible to link directly to a story. Currently it’s not possible to link to a specific slide within a story, but I’m thinking of adding that in as an option.
It’s taken quite a long time to fully integrate the story with the existing Public Atlas interface. I won’t go into details, but it has been rather tricky at times. Anyway, I have slightly changed how the story interface works, based on previous discussions with the team. All slides should now be ‘slide’ rather than ‘full’, meaning they all appear to the right of the map. I’ve repositioned the box to be in the top right corner and have repositioned the ‘next’ and ‘previous’ buttons to be inside the box rather than at the edges of the map. The ‘info box’ that previously told you what feature is being displayed and gave you information about an area you click on has now gone, with the information about the feature now appearing as a separate section beneath the slide text and above the navigation buttons. I’ve made the ‘return to start’ link a button with an icon now too.
I have also now figured out how to get the pop-ups to work with the areas as well as with the points, and have incorporated the same pop-ups as in the ‘Features’ section into the ‘Stories’ section. I think this works pretty well; it’s always better to have things working consistently across a site. It also means marker tooltips with location names work in the story interface too. Oh, and markers are now exactly the same as in the ‘Features’ section whereas before they may have looked similar but used different technologies (circleMarkers as opposed to markers with CSS based styling). This also means they bounce when new data is loaded into a story. Here’s a screenshot of a story slide:
I’ve also made some updates to the ‘Features’ section. Marker borders are now white rather than blue, which I think looks better and fits in with the white dotted edges to the areas. Markers when displayed together with areas are now much smaller to cut down on clutter. The legend now works when just viewing areas whereas it was broken with this view before. To get this working required a fairly major rewrite of the code so that markers and areas are bundled together into layers for each rating level where previously markers and areas were treated separately. Popups for areas now work through the ‘Features’ section too. Hopefully next week I’ll add in sample sound files and the option to highlight groups.
I also continued to develop the new API for the DSL, although I’m running into some issues with how the existing API works, or rather some situations where it doesn’t seem to work as intended. I’ve never been completely clear about how ‘Headword match type’ in the advanced search works, especially as wildcards can also be used. Is ‘Headword match type’ really needed? If it’s set to ‘exact match’ then presumably the use of wildcards shouldn’t be allowed anyway, surely? I find the use of wildcards in ‘part of compound’ and ‘part of any word’ searches to be utterly confusing. E.g. if I search for ‘hing*’ in DOST headwords set to ‘partial match’ it doesn’t just find things starting with ‘hing’ but anything containing ‘hing’, meaning the wildcard isn’t working. Boolean searches and wildcards seem to be broken too. E.g. a ‘part of compound’ headword search for “hing* OR hang*” brings back no results, which can’t be right as “hing*” on its own brings back plenty.
It’s difficult to replicate the old API when I don’t understand how (or even if) certain combinations of search terms work with the old API. I’ve emailed the DSL staff to see whether we could maybe remove the ‘headword match type’ limiting search option and just use wildcards and Booleans to provide similar functionality. E.g. the default match is ‘exact match’, which matches the search term to the headword form and its variants (e.g. ‘Coutch’ also has ‘Coutcher’ and ‘Coutching’). The user could then use wildcards to match forms beginning or ending with their term or containing their term in the middle (‘hing*’, ‘*hing’, ‘*hing*’), or use a single character wildcard too (‘h?ng*’, ‘*h?ng’, ‘*h?ng*’), and combine this with Boolean operators (‘h?ng* NOT hang*’). I just can’t see a way to get these searches working consistently when used in conjunction with ‘Headword match type’, so I’ll need to see what the DSL people think about this.
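To show the sort of behaviour I have in mind for wildcard-only searching, here is a minimal Python sketch that turns a search term into a regular expression. This is just an illustration of the proposal and not how either API works: the function names are made up, the sample headwords are invented, and the new API itself is being written in PHP and MySQL.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate '*' (any run of characters) and '?' (exactly one character)
    into a regular expression; everything else is matched literally."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(term: str, headword: str) -> bool:
    """With no wildcard the whole headword must equal the term (exact match)."""
    return wildcard_to_regex(term).fullmatch(headword) is not None


# Invented sample headwords, purely for demonstration.
headwords = ["hing", "hinging", "thing", "hang", "hanging", "hung"]

print([w for w in headwords if matches("hing*", w)])
print([w for w in headwords if matches("*hing*", w)])
# Boolean NOT, as in 'h?ng* NOT hang*'
print([w for w in headwords if matches("h?ng*", w) and not matches("hang*", w)])
```

With this approach ‘hing*’ only finds forms beginning with ‘hing’, and a term without wildcards behaves as an exact match, which is the consistency I’m after.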
I also investigated why DSL posts to Facebook suddenly started appearing with a University of Glasgow logo after the DSL website was migrated. The UoG image that’s included wasn’t even the official UoG image, but is just the blue icon I made to go in the footer of the DSL website. It therefore looks like Facebook is scanning the page that’s linked to for an icon and somehow deciding that this is the one to use, even though we link to several DSL icons in the page header to provide the icon in browser tabs, for example.
After a bit of digging I found this page: https://stackoverflow.com/questions/10864208/how-do-i-get-facebook-to-show-my-sites-logo-when-i-paste-the-link-in-a-post-sta with an answer that suggests using the Facebook developer’s debug tool to see what Facebook does when you pass a link to it (http://developers.facebook.com/tools/debug). When entering the DSL URL into the page it didn’t display any associated logo (not even the wrong one), which wasn’t very helpful, but it did point out some missing Facebook specific metadata fields that we can add to the DSL header information. This included <meta property="og:image" content="[url to image]" />. I added this to the DSL site with a link to one of the DSL’s icons, then got Facebook to ‘scrape’ the content again, but it didn’t work, and instead gave errors about content encoding. Some further searching uncovered that the correct tag to use was “og:image:url”. Pointing this at an icon image that was larger than 200px seems to have worked as the DSL logo is now pulled into the Facebook debugger.
For the HT I ticked off a few tasks that had been agreed upon after last week’s meeting. I spent a bit of time investigating exactly how HT and OED lexemes were matched and ticked off, as we’d rather lost track of what had happened. HT lexemes were matched against the newest OED lexemes (the version of the data with the ‘lemmaid’ field). The stripped form of each lexeme AND the dates need to be the same before a match is recorded. The dates that are compared are the numeric characters before the dash of the HT’s ‘fulldate’ field and the numeric characters from the OED’s ght_date1 field. The following fields are then populated in the HT’s lexeme table: V2_oedchecked (Y/N), v2_oedcatid, V2_refentry, V2_refid, V2_lemmaid (these four in combination uniquely identify the OED lexeme), V2_checktype (as with category used to note automatic or manual matches), v2_matchprocess (for keeping track of matching processes as with categories). There are 627872 matched lexemes, all of which have been given v2_checktype ‘A’ and v2_matchprocess ‘1’.
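The date comparison can be sketched in Python roughly as follows. This is a minimal illustration rather than the actual matching script: the function names and sample values are hypothetical, the records are shown as plain dictionaries, and I’m assuming the dash in ‘fulldate’ may be either a hyphen or an en dash.

```python
import re


def ht_first_date(fulldate: str) -> str:
    """Numeric characters before the first dash of the HT 'fulldate' field."""
    before_dash = re.split(r"[-–]", fulldate, maxsplit=1)[0]
    return re.sub(r"\D", "", before_dash)


def oed_first_date(ght_date1: str) -> str:
    """Numeric characters from the OED 'ght_date1' field."""
    return re.sub(r"\D", "", str(ght_date1))


def is_match(ht: dict, oed: dict) -> bool:
    """A match needs the same stripped form AND the same first date."""
    return (ht["stripped"] == oed["stripped"]
            and ht_first_date(ht["fulldate"]) == oed_first_date(oed["ght_date1"]))


# Hypothetical sample records.
ht_row = {"stripped": "shingle", "fulldate": "c1200-1650"}
oed_row = {"stripped": "shingle", "ght_date1": "1200"}
print(is_match(ht_row, oed_row))
```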
I also started to deal with categories that had been ticked off but had a part of speech mismatch. There were 41 of these and I realised that I would need to not only ‘detick’ the categories but any lexemes contained therein too. I wrote a script that lists the mismatched categories and any lexemes in them that are currently matched too, but I haven’t ticked anything off yet as I’m wondering whether it might be that the OED has changed the POS of a category, especially as the bulk of the mismatches are changes between sub-verb forms. I also exported our current category table so Fraser can send it on to the OED people, who are interested to know about the matches that we have currently made. I also set up two new email addresses to the HT using the ht.ac.uk domain. We have ‘firstname.lastname@example.org’ and ‘email@example.com’. I also wrote a script that finds monosemous forms in the unmatched HT lexemes (monosemous within a specific POS and using the ‘stripped’ column), finds all the ones that have an identical monosemous match in the OED data and then displays not only those forms and their category details, but every lexeme in the HT / OED categories the forms belong to. Finally I made a start on a further script that finds unmatched words within matched categories and tries to match them with an increasing Levenshtein distance. I haven’t finished this yet, though, and will continue with it next week.
It was another four-day week this week due to Monday being a bank holiday. On Tuesday I met my fellow College of Arts developers Luca Guariento and Stevie Barrett for coffee and an informal chat about our work. It was great to catch up with them again and find out what they’d been up to recently, and to talk about the things I’m working on, and we’re planning on having these coffees every month. I spent much of the week on the development of the front-end for the SCOSYA project. There are going to be several aspects to the front-end, including an ‘experts’ atlas, a more general ‘public’ atlas, the stories and a ‘listening atlas’. I started this week on the public atlas, which will present users with a selection of features that they can view the results for, without being swamped by features or too many options.
I’ve based the new public atlas on the existing test version of the atlas, which has a panel that slides out from the left and contains the options. I updated this panel so that it has no background colour at all, other than for the accordion content. This hopefully makes things feel a bit less claustrophobic. The left panel is now open by default, and I’ve updated the ‘open’ and ‘close’ icons so they sort of match each other, with a ‘<’ or a ‘>’ icon as appropriate. The accordion sections have been updated to use colours from the site logo, with white as the background of the open accordion section. There are currently four accordion sections: Home (the questionnaire locations), Features (the attributes that can be viewed), Stories (the stories) and How to use (help). Only Home and Features contain content at the moment.
Home displays the questionnaire locations, as the above screenshot demonstrates. As requested by Jennifer, the location markers drop onto the map and bounce, in a hopefully not too annoying way whenever the ‘Home’ accordion title is pressed on (you can keep pressing it to repeat the animation). Location markers are the lighter blue from the logo, there is no legend and no pop-ups, but you can hover over a marker to view the location name.
The Features accordion section displays the features that you can browse through the public interface, as the following screenshot demonstrates.
I have updated the project’s CMS to allow staff to control which codes appear in this list. Currently I have added all those that are in the ‘1B English Language 2019’ group. In the CMS the ‘Browse codes’ page has a new column added to the table called ‘In Public Atlas’. Anything set to ‘Yes’ appears in the ‘Features’ section. If you edit a code you can change a code’s ‘In Public Atlas’ value from ‘No’ to ‘Yes’ (or vice-versa).
In the Features section in the public atlas the available features are listed in a drop-down list. I initially had them in a scrollable section, but this took up too much vertical space that we will need for other things (I have yet to add in the sample recordings or the groups option). If you select an option its description then appears beneath the selected item. You can also choose to limit your search to all/young/old speakers and select whether locations should appear as points/areas/both.
When you press on the ‘Show’ button the data is loaded into the map. This can take some time to complete, as the full map areas currently get returned each time. I might have to think of a more efficient way of doing this. If point data is set to load the markers drop onto the map and bounce. The legend has been updated to remove the number ratings and ‘this’ from the text. If you turn a rating level off in the legend and then turn it on again the respective markers drop and bounce when they re-appear, which I rather like.
The pop-ups have also been updated. I have added a coloured border to the pop-ups that corresponds to the rating level and the rating text also uses this same colour. I’ve had to add an outline to the rating text as the very light ratings (e.g. wouldn’t say) are otherwise invisible against the white background. I’m not sure whether I like this or not, and an alternative would be to have the rating colour as a background to the text and the text in white, so it looks sort of like a button.
I had to add in new averages for young and old speakers in addition to the overall rating in order to get the pop-up text for young and old speakers. Where we have no data for young or old speakers for a feature, text such as ‘We have no data from older speakers for this feature in Hurlford.’ is displayed. When you limit the display to just young or old speakers only the sentence about your selected speakers is displayed.
I still haven’t quite finished with the area display yet. If you select to just display areas and not points then the areas work, but currently there is no legend and no pop-ups, because these are both tied into the points, as the following screenshot demonstrates.
Selecting both points and areas currently looks a bit cluttered and turning a rating level off currently only removes the points and not the areas. See the screenshot below for an example.
In addition to updating the area displays I still need to add in the sound samples and the groups, which will require updates to the API. Adding in groups might take some work as I’ll need to find a way of colour coding not just the markers but the areas too. I’ll be continuing with this next week.
In addition to working for SCOSYA I had a Historical Thesaurus meeting with Marc and Fraser, which lasted most of Wednesday morning. After quite a long period we are now returning to the seemingly never-ending task of linking up the HT and OED datasets. I had spent a couple of days working on scripts after our previous meeting several weeks ago, and created a summary of what the scripts did for our meeting, but unfortunately no-one could really remember why the requests for the scripts had been made so it looks very much like none of the work is going to be put to use. It seems to be a slightly disheartening pattern that I get asked to create lots of new scripts at these meetings, which I spend several days doing, but then no-one really does much with the scripts or their outputs and then at the next meeting I’m asked to create another series of scripts, and then the same thing happens again. After this week’s meeting I have more scripts to write again, but I’m rather busy with more pressing deadlines (e.g. for SCOSYA) so I’m not going to be able to spend much time writing them for a while.
Also this week I did some further work for the DSL. I managed to log into the editor server that the DSL uses and ran a script that Thomas Widmann created that merges all of the most recently edited data and formats it all as a big XML file. I generated such files for both DOST and SND and saved them to my desktop PC. I have a couple of questions about the data that need answered (e.g. there are new fields in the metadata section of the entries and I’m not sure how these need to be used) but once that’s done I’ll be able to write a script to chop up the XML files, extract individual entries and upload them to a new version of the database in order for DSL staff to compare the new versions with the existing online versions.
Next week we’re hoping to go live with a new version of the DSL website that has a front-end that uses WordPress. It has been in development for quite a while on a test server, and although the user interface is not much different the structure of the site and a lot of the ancillary content is very different.
I worked on several different projects this week. First of all I completed work on the new Medical Humanities Network website for Gavin Miller. I spent most of last week working on this but didn’t quite manage to get everything finished off, but I did this week. This involved completing the front-end pages for browsing through the teaching materials, collections and keywords. I still need to add in a carousel showing images for the project, and a ‘spotlight on…’ feature, as are found on the homepage of the UoG Medical Humanities site, but I’ll do this later once we are getting ready to actually launch the site. Gavin was hoping that the project administrator would be able to start work on the content of the website over the summer, so everything is in place and ready for them when they start.
With that out of the way I decided to return to some of the remaining tasks in the Historical Thesaurus / OED data linking. It had been a while since I last worked on this, but thankfully the list of things to do I’d previously created was easy to follow and I could get back into the work, which is all about comparing dates for lexemes between the two datasets. We really need to get further information from the OED before we can properly update the dates, but for now I can at least display some rows where the dates should be updated, based on the criteria we agreed on at our last HT meeting.
To begin with I completed a ‘post dating’ script. This goes through each matched lexeme (split into different outputs for ‘01’, ‘02’ and ‘03’ due to the size of the output) and for each it firstly changes (temporarily) any OED dates that are less than 1100 to 1100 and any OED dates that are greater than 1999 to 2100. This is so as to match things up with the HT’s newly updated Apps and Appe fields. The script then compares the HT Appe and OED Enddate fields (the ‘Post’ dates). It ignores any lexemes where these are the same. If they’re not the same the script outputs data in colour-coded tables.
The green table contains lexemes where Appe is greater than or equal to 1150, Appe is less than or equal to 1850, Enddate is greater than Appe and the difference between Appe and Enddate is no more than 100 years, OR where Appe is greater than 1850 and Enddate is greater than Appe. The yellow table contains lexemes (other than the above) where Enddate is greater than Appe and the difference between Appe and Enddate is between 101 and 200. The orange table contains lexemes where Enddate is greater than Appe and the difference between Appe and Enddate is between 201 and 250, while the red table contains lexemes where Enddate is greater than Appe and the difference between Appe and Enddate is more than 250. It’s a lot of data, and fairly evenly spread between tables, but hopefully it will help us to ‘tick off’ dates that should be updated with figures from the OED data.
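The banding rules are easier to follow as code, so here is a minimal Python sketch of them. It is an illustration only: the function names are made up, and it assumes the red band starts above 250, since the orange band covers 201 to 250. Rows that fit none of the bands (for example where Enddate is not later than Appe) are simply returned as None here.

```python
def clamp_oed_date(year: int) -> int:
    """Temporarily align OED dates with the HT's Apps / Appe conventions."""
    if year < 1100:
        return 1100
    if year > 1999:
        return 2100
    return year


def post_dating_table(appe: int, oed_enddate: int):
    """Return the colour of the table a lexeme belongs in, or None."""
    enddate = clamp_oed_date(oed_enddate)
    if enddate <= appe:
        return None  # same date, or OED end date is earlier: not listed here
    diff = enddate - appe
    if appe > 1850 or (1150 <= appe <= 1850 and diff <= 100):
        return "green"
    if 101 <= diff <= 200:
        return "yellow"
    if 201 <= diff <= 250:
        return "orange"
    if diff > 250:
        return "red"  # assumption: red begins above the orange band
    return None


# Hypothetical (Appe, Enddate) pairs.
for appe, enddate in [(1600, 1650), (1600, 1750), (1600, 1830), (1600, 1900), (1900, 2005)]:
    print(appe, enddate, post_dating_table(appe, enddate))
```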
I then created an ‘ante dating’ script that looks at the ‘before’ dates, based on OED Firstdate (or ‘Sortdate’ as they call it) and HT Apps. This looks at rows where Firstdate is earlier than Apps and splits the data up into colour coded chunks in a similar manner to the above script. I then created a further script that identifies lexemes where there is a later first date or an earlier end date in the OED data for manual checking, as such dates are likely to need investigation.
Finally, I created a script that brings back a list of all of the unique date forms in the HT. This goes through each lexeme and replaces individual dates with ‘nnnn’, then strings all of the various (and there are a lot) date fields together to create a date ‘fingerprint’. Individual date fields are separated with a bar (|) so it’s possible to extract specific parts. The script also counts the number of lexemes each pattern applies to. So we have things like ‘|||nnnn||||||||||||||_’ which is applied to 341,308 lexemes (this is a first date and still in current use) and ‘|||nnnn|nnnn||-|||nnnn|nnnn||+||nnnn|nnnn||’ which is only used for a single lexeme. I’m not sure exactly what we’re going to use this information for, but it’s interesting to see the frequency of the patterns.
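The fingerprinting idea can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sample rows use far fewer date fields than the real HT lexeme table, and it assumes any run of digits counts as a date.

```python
import re
from collections import Counter


def date_fingerprint(date_fields) -> str:
    """Replace every run of digits with 'nnnn' and join the fields with bars."""
    return "|".join(re.sub(r"\d+", "nnnn", field or "") for field in date_fields)


def fingerprint_counts(lexemes) -> Counter:
    """Count how many lexemes share each fingerprint."""
    return Counter(date_fingerprint(fields) for fields in lexemes)


# Hypothetical lexemes, each reduced to three date fields for brevity.
sample = [
    ["", "1450", "_"],
    ["", "1500", "_"],
    ["1200", "1300", "-"],
]
for pattern, count in fingerprint_counts(sample).most_common():
    print(pattern, count)
```

Because empty fields are kept as empty strings, the position of each ‘nnnn’ between the bars still shows which date field was populated.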
I spent most of the rest of the week working on the DSL. This included making some further tweaks to the WordPress version of the front-end, which is getting very close to being ready to launch. This included updating the way the homepage boxes work to enable staff to more easily control the colours used and updating the wording for search results. I also investigated an issue in the front end whereby slightly different data was being returned for entries depending on the way in which the data was requested. Using dictionary ID (e.g. https://dsl.ac.uk/entry/dost44593) brings back some additional reference text that is not returned when using the dictionary and href method (e.g. https://dsl.ac.uk/entry/dost/proces_n). It looks like the DSL API processes things differently depending on the type of call, which isn’t good. I also checked the full dataset I’d previously exported from the API for future use and discovered it is the version that doesn’t contain the full reference text, so I will need to regenerate this data next week.
My main DSL task was to work on a new version of the API that just uses PHP and MySQL, rather than technologies that Arts IT Support are not so keen on having on their servers. As I mentioned, I had previously run a script that got the existing API to spit out its fully generated data for every single dictionary entry and it’s this version of the data that I am currently building the new API around. My initial aim is to replicate the functionality of the existing API and plug a version of the DSL website into it so we can compare the output and performance of the new API to that of the existing API. Once I have the updated data I will create a further version of the API that uses this data, but that’s a little way off yet.
So far I have completed the parts of the API for getting data for a single entry and the data required by the ‘browse’ feature. Information on how to access the data, and some examples that you can follow, are included in the API definition page. Data is available as JSON (the default as used by the website) and CSV (which can be opened in Excel). However, while the CSV data can be opened directly in Excel any Unicode characters will be garbled, and long fields (e.g. the XML content of long entries) will likely be longer than the maximum cell size in Excel and will break onto new lines.
I also replicated the WordPress version of the DSL front-end here and set it up to work with my new API. As of yet the searches don’t work as I haven’t developed the search parts of the API, but it is possible to view individual entries and use the ‘browse’ facility on the entry page. These features use the new API and the new ‘fully generated’ data. This will allow staff to compare the display of entries to see if anything looks different.
I still need to work on the search facilities of the API, and this might prove to be tricky. The existing API uses Apache Solr for fulltext searching, which is a piece of indexing software that is very efficient for large volumes of text. It also brings back nice snippets showing where results are located within texts. Arts IT Support don’t really want Solr on their servers as it’s an extra thing for them to maintain. I am hoping to be able to develop comparable full text searches just using the database, but it’s possible that this approach will not be fast enough, or pinpoint the results as well as Solr does. I’ll just need to see how I get on in the coming weeks.
I also worked a little bit on the RNSN project this week, adding in some of the concert performances to the existing song stories. Next week I’m intending to start on the development of the front end for the SCOSYA project, and hopefully find some time to continue with the DSL API development.