Week Beginning 5th November 2018

After a rather hectic couple of weeks this was a return to a more regular sort of week, which was a relief.  I still had more work to do than there was time to complete, but it feels like the backlog is getting smaller at least.  As with previous weeks, I continued with the HT / OED category linking process this week, following on from the meeting Marc, Fraser and I had the Friday before.  For the lexeme / data matching script I separated out categories with zero matches that have words from the orange list into a new list with a purple background.  Orange now only contains categories where at least one word and its start date match, while the categories now listed in purple are almost certainly incorrect matches.  I also changed the ordering of results so that categories are listed by the largest number of matches first, to make it easier to spot matches that are likely to be ok.
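To make the triage logic concrete, here is a minimal sketch of it; the record shape, field names and function are my own illustration rather than the actual script's code:

```typescript
// Hypothetical shape of a category comparison result; the field names
// are illustrative, not the script's actual structures.
interface CategoryMatch {
  oedCatId: number;
  matchedWords: number;        // words whose form and start date both match
  hasOrangeListWords: boolean; // zero-match category containing orange-list words
}

// Split results into the orange and purple lists described above,
// ordering orange by largest number of matches first.
function triage(results: CategoryMatch[]) {
  const orange = results
    .filter(r => r.matchedWords > 0)
    .sort((a, b) => b.matchedWords - a.matchedWords);
  const purple = results
    .filter(r => r.matchedWords === 0 && r.hasOrangeListWords);
  return { orange, purple };
}
```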

I also updated the ‘monosemous’ script, so that the output only contains OED categories that feature a monosemous word, split into three tables (with links to each at the top of the page).  The first table features 4455 OED categories that include a monosemous word with a comparable form in the HT data; where there are multiple monosemous forms they each correspond to the same category in the HT data.  The second table features 158 OED categories where the linked HT forms appear in more than one category.  This might be because the word is not monosemous in the HT data and appears in two different categories (these are marked with the text ‘red|’ so they can be searched for in the page).  An OED category can also appear in this table even if there are no red forms if, for example, one of the matched HT words is in a different category to all of the others (see OED catid 45524, where the word ‘Puncican’ is found in a different HT category to the other words).  The final table contains the 1232 OED categories that feature monosemous words with no match in the HT data.  I also created a QA script for the 4455 matched monosemous categories, which applies the same colour coding and lexeme matching as the other QA scripts I’ve created.  On Friday we had another meeting to discuss the findings and plan our next steps, which I will continue with next week.
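As a rough illustration of how the three tables are derived, here is a sketch of the classification, assuming each OED category carries the HT category id(s) found for each of its monosemous words; the types and names are hypothetical, not the script's:

```typescript
// Hypothetical record: an OED category with, for each of its monosemous
// words, the HT category id(s) that word was found in (an empty array
// meaning no HT match for that word).
interface MonosemousCategory {
  oedCatId: number;
  htCatIdsPerWord: number[][];
}

// Assign each OED category to one of the three tables described above.
function classify(cat: MonosemousCategory): 'matched' | 'multiCategory' | 'unmatched' {
  const allIds = cat.htCatIdsPerWord.flat();
  if (allIds.length === 0) return 'unmatched';   // table 3: no HT match at all
  const distinct = new Set(allIds);
  // Table 1 if every linked form points at the same HT category,
  // table 2 if the linked forms are spread across more than one.
  return distinct.size === 1 ? 'matched' : 'multiCategory';
}
```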

Also this week I wrote an initial version of a Data Management Plan for Thomas Clancy’s Iona project, and commented on the DMP assessment guidelines that someone from the University’s data management team had put together.  I can’t really say much more about these activities, but it took at least a day to get all of this done.  I also did some app management duties, setting up an account for a new developer, and made the new Seeing Speech and Dynamic Dialects websites live.  These can now be viewed here: https://www.seeingspeech.ac.uk/ and here: https://www.dynamicdialects.ac.uk/.  I also had an email conversation with Rhona Alcorn about Google Analytics for the DSL site.

With the REELS project’s official launch approaching, I spent a bit of time this week going through the 23-point ‘to do’ list I’d created last week.  In fact, I added another three items to it.  I’m going to tackle the majority of the outstanding issues next week, but this week I investigated and fixed an issue with the ‘export’ script in the Content Management System: the script is very memory intensive and was exceeding the server’s memory limits, so asking Chris to increase this limit sorted the issue.  I also updated the ‘browse place-names’ feature of the CMS, adding a new column and ordering facility to make it clearer which place-names actually appear on the website.  I also updated the front-end so that it ‘remembers’ whether you prefer the map or the text view of the data, using HTML5 local storage (see the sketch below), and added in information about the Creative Commons license to the site and the API.  I investigated the issue of parish boundary labels appearing on top of icons, but as yet I’ve not found a way to address this.  I might return to it before the launch if there’s time, but it’s not a massive issue.  I moved all of the place-name information on the record page above the map, other than purely map-based data such as the grid reference.  I also removed the option to search the ‘analysis’ field from the advanced search and updated the element ‘auto-complete’ feature so that it now only matches the starting letters of an element rather than letters anywhere in the word.  I also noticed that the combination of ‘relief’ and ‘water’ classifications didn’t have an icon on the map, so I created one for it.
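For the view preference, a minimal sketch of the local storage approach; the storage key name here is an assumption, not the site's actual key:

```typescript
// Remember whether the user prefers the map or the text view using
// HTML5 local storage. The key name 'reels-view' is an assumption.
const VIEW_KEY = 'reels-view';

function saveViewPreference(view: 'map' | 'text'): void {
  localStorage.setItem(VIEW_KEY, view);
}

function loadViewPreference(): 'map' | 'text' {
  // Default to the map view when no preference has been stored yet.
  return localStorage.getItem(VIEW_KEY) === 'text' ? 'text' : 'map';
}
```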

I also continued to work on the Bilingual Thesaurus website this week.  I updated the way in which source links work: links to dictionary sources now appear as buttons in the page, rather than in a separate pop-up.  They feature the abbreviation (AND / MED / OED) and the magnifying glass icon, and if you hover over a button the non-abbreviated form appears.  For OED links I’ve also added the text ‘subscription required’ to the hover-over text.  I also updated the word record so that where the language of origin is ‘unknown’ it no longer gets displayed, and I made the headword text a bit bigger so it stands out more.  I also added the full hierarchy above the category heading in the category section of the browse page, to make it easier to see exactly where you are.  This will be especially useful for people using the site on narrow screens, as the tree appears beneath the category section and so is not immediately visible.  You can click on any part of the hierarchy here to jump to that point.
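As an illustration of how such a source-link button might be built, here is a hedged sketch; the class name and markup are placeholders rather than the site's actual code:

```typescript
// Build a source-link button with the abbreviation as its label and the
// unabbreviated name as the hover-over text. The class name is a
// placeholder, and the magnifying glass icon is assumed to come from CSS.
function sourceButton(abbrev: 'AND' | 'MED' | 'OED', fullName: string, url: string): HTMLAnchorElement {
  const btn = document.createElement('a');
  btn.href = url;
  btn.className = 'source-btn';
  btn.textContent = abbrev;
  // For OED links the hover text also notes that a subscription is required.
  btn.title = abbrev === 'OED' ? `${fullName} (subscription required)` : fullName;
  return btn;
}
```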

I then began to work on the search facility, and realised I needed to implement a ‘search words’ list that features variants.  I did this for the Historical Thesaurus and it’s really useful.  What I’ve done so far is generate alternatives for words that have brackets and dashes.  For example, the headword ‘Bond(e)-man’ has the following search terms: Bond(e)-man, Bond-man, Bonde-man, Bond(e) man, Bond man, Bonde man, Bond(e)man, Bondman, Bondeman.  None of these variants will ever appear on the website; instead they will be used to find the word when people search.  I’ll need some feedback as to whether these options will suffice, but for now I’ve uploaded the variants to a table and begun to get the quick search working.  It’s not entirely there yet, but I should get this working next week.  I also need to know what should be done about accented characters for search purposes.  The simplest way to handle them would be to just treat them as non-accented characters – e.g. searching for ‘alue’ will find ‘alué’.  However, this does mean you won’t be able to specifically search for words that include accented characters – e.g. a search for all the words featuring an ‘é’ will just bring back all the words with an ‘e’ in them.
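Here is a minimal sketch of how the variants and the accent folding could be generated, assuming at most simple bracketed groups and hyphens as in the ‘Bond(e)-man’ example; it is an illustration, not the actual script:

```typescript
// Generate search variants for a headword containing brackets and dashes.
function searchVariants(headword: string): string[] {
  // Bracketed letters: keep as-is, drop them, or expand them.
  const bracketForms = new Set<string>([
    headword,                                // Bond(e)-man
    headword.replace(/\(([^)]*)\)/g, ''),    // Bond-man
    headword.replace(/\(([^)]*)\)/g, '$1'),  // Bonde-man
  ]);
  // Dashes: keep, replace with a space, or remove entirely.
  const variants = new Set<string>();
  for (const form of bracketForms) {
    variants.add(form);
    variants.add(form.replace(/-/g, ' '));
    variants.add(form.replace(/-/g, ''));
  }
  return [...variants];
}
// searchVariants('Bond(e)-man') yields the nine forms listed above.

// Accent folding, if accented characters are treated as their base
// letters (the 'simplest way' discussed above): normalise to NFD and
// strip the combining marks.
function foldAccents(s: string): string {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
// foldAccents('alué') === 'alue'
```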

I was intending to add a count of the number of words in each hierarchical level to the browse, or at least to make hierarchical levels that include words bold in the browse, so as to let users know whether it’s worthwhile clicking on a category to view the words at that level.  However, I’ve realised that this would just confuse users: levels that have no words themselves but include child categories that do have words would be listed with a zero (or not in bold), giving the impression that there is no content lower down the hierarchy.

My last task for the week was to create a new timeline for the RNSN project based on data that had been given to me.  I think this is looking pretty good, but unfortunately making these timelines and related storymaps is very time-intensive, as I need to extract and edit the images, upload them to WordPress, extract the text and convert it into HTML, and fill out the template with all of the necessary fields.  It took about two and a half hours to make this timeline.  However, hopefully the end result will be worth it.