Week Beginning 17th June 2019

I seem to be heading through a somewhat busy patch at the moment, and had to focus my efforts on five major projects and several other smaller bits of work this week.  The major projects were SCOSYA, Books and Borrowing, DSL, HT and Bess of Hardwick’s Account books.  For SCOSYA I continued to implement the public atlas, this week focussing on the highlighting of groups.  I had hoped that this would be a relatively straightforward feature to implement, as I had already created facilities to create and view groups in the atlas I’d made for the content management system.  However, it proved to be much trickier than I’d anticipated, as I’d rewritten much of the atlas code in order to incorporate the GeoJSON areas as well as purely point-based data, plus I needed to integrate the selection of groups and the loading of group locations with the API.  My existing code for finding the markers for a specified group and adding a coloured border was just not working, and I spent a frustratingly long time debugging the code to find out what had changed to stop the selection from finding anything.  It turned out that in my new code I was reinstantiating the variable that holds all of the point data inside a function, which limited its scope to that function rather than making it available to the other functions that needed it.  Once I figured this out it was a simple fix to make the data available to the parts of the code that find and highlight the relevant markers, and I then managed to make groups of markers highlight or ‘unhighlight’ at the press of a button, as the following screenshot demonstrates:

You can now select one or more groups and the markers in the group are highlighted in green.  Press a group button a second time to remove the highlighting.  However, there is still a lot to be done.  For one thing, only the markers highlight, not the areas.  It’s proving to be rather complicated to get the areas highlighted as these GeoJSON shapes are handled quite differently to markers.  I spent a long time trying to get the areas to highlight without success and will need to return to this another week.  I also need to implement highlighting in different colours, so that each group you choose to highlight is given a different colour from the last.  I also need to find a way to ensure the selected groups are remembered as you change from points to areas to both, change speaker type, and possibly also as you change between examples.  Currently the group selection resets but the selected group buttons remain highlighted, which is not ideal.

I also spent time this week on the pilot project for Matthew Sangster’s Books and Borrowing project, which is looking at University student (and possibly staff) borrowing records from the 18th century.  Matthew has compiled a spreadsheet that he wants me to turn into a searchable / browsable online resource, and my first task was to extract the data from the spreadsheet, create an online database and write a script to migrate the data to this database.  I’ve done this sort of task many times before, but unfortunately things are rather more complicated this time because Matthew has included formatting within the spreadsheet that needs to be retained in the online version.  This includes superscript text throughout the more than 8000 records, and simply saving the spreadsheet as a CSV file and writing a script to go through each cell and upload the data won’t work as the superscript style will be lost in the conversion to CSV.  PHPMyAdmin also includes a facility to import a spreadsheet in the OpenDocument format, but unfortunately this removes not only the superscript formatting but also the text that is marked as superscript.

Therefore I had to investigate other ways of getting the data out of the spreadsheet while somehow retaining the superscript formatting.  The only means of doing so that I could think of was to save the spreadsheet as an HTML document, which converts Excel’s superscript formatting into HTML superscript tags, which is what we’d need for displaying the data on a website anyway.  Unfortunately the HTML generated by Excel is absolutely awful and filled with lots of unnecessary junk that I then needed to strip out.  I managed to write a script that extracted the data (including the superscript formatting) and imported it into the online database for about 8000 of the 8200 rows, but the remainder had problems that prevented the insertion from taking place.  I’ll need to think about creating multiple passes for the data when I return to it next week.
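
To give a sense of the approach (this is a minimal sketch rather than the actual script: the file name, the number of columns and the database details are all assumptions), the extraction boils down to loading the Excel-exported HTML with PHP’s DOMDocument, keeping the superscript tags while discarding Excel’s styling junk, and inserting each row into the database:

```php
<?php
// Minimal sketch: pull table rows out of an Excel 'Save as Web Page' export,
// preserving <sup> tags, and insert them into a database.
// File name, column count and table/column names are assumptions.
$doc = new DOMDocument();
libxml_use_internal_errors(true);            // Excel's HTML is far from valid
$doc->loadHTMLFile('borrowing-records.html');
libxml_clear_errors();

$pdo  = new PDO('mysql:host=localhost;dbname=borrowing;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO records (col1, col2, col3) VALUES (?, ?, ?)');

foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = [];
    foreach ($row->getElementsByTagName('td') as $cell) {
        // Rebuild the cell content, keeping only plain text and <sup>...</sup>
        $html = '';
        foreach ($cell->childNodes as $node) {
            if ($node->nodeType === XML_TEXT_NODE) {
                $html .= $node->nodeValue;
            } elseif (strtolower($node->nodeName) === 'sup') {
                $html .= '<sup>' . $node->textContent . '</sup>';
            } else {
                $html .= $node->textContent;  // drop Excel's span/font wrappers
            }
        }
        $cells[] = trim($html);
    }
    // Rows that don't have the expected shape are skipped here and would be
    // dealt with in the later passes mentioned above.
    if (count($cells) === 3) {
        $stmt->execute($cells);
    }
}
```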

For the DSL this week I spent rather a lot of time engaged in email conversations with Rhona Alcorn about the tasks required to sort out the data that the team have been working on for several years and which now needs to be extracted from older systems and migrated to a new system, plus the API that I am working on.  It looked like there would be a lot of work for me to do with this, but thankfully midway through the week it became apparent that the company who are supplying the new system for managing the DSL’s data have a member of staff who is expecting to do a lot of the tasks that had previously been assigned to me.  This is really good news as I was beginning to worry about the amount of work I would have to do for the DSL and how I would fit this in around other work commitments.  We’ll just need to see how this all pans out.

I also spent some time implementing a Boolean search for the new DSL API.  I now have this in place and working for headword searches, which can be performed via the ‘quick search’ box on the test sites I’ve created.  It’s possible to use Boolean AND, OR and NOT (all must be entered in upper case to be picked up), a search can be used in combination with wildcards, and speech-marks can now be used to specify an exact search.  So, for example, if you want to find all the headwords beginning with ‘chang’ but wish to exclude results for ‘change’ and ‘chang’ you can enter ‘chang* NOT “change” NOT “chang”’.

OR searches are likely to bring back lots of results and at the moment I’ve not put a limit on the results, but I will do so before things go live.  Also, while there are no limits on the number of Booleans that can be added to a query, results when using multiple Booleans are likely to get a little weird due to there being multiple ways a query could be interpreted.  E.g. ‘Ran* OR run* NOT rancet’ still brings back ‘rancet’ because the query is interpreted as ‘get all the ‘ran*’ results OR all the ‘run*’ results so long as they don’t include ‘rancet’ – so ran* OR (run* NOT rancet).  But without complicating things horribly with brackets or something similar there’s no way of preventing such ambiguity when multiple different Booleans are used.
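
To make the left-to-right interpretation concrete, here is a minimal sketch (not the actual API code: the ‘headword’ column name and the overall structure are assumptions) of turning a Boolean headword query into a SQL WHERE clause, converting asterisks to ‘%’ and treating quoted terms as exact matches:

```php
<?php
// Minimal sketch: turn a query like 'chang* NOT "change" NOT "chang"' or
// 'Ran* OR run* NOT rancet' into a WHERE clause, reading strictly left to right.
// The 'headword' column name is an assumption.
function buildHeadwordWhere(string $query, array &$params): string {
    $tokens   = preg_split('/\s+/', trim($query));
    $sql      = '';
    $operator = '';
    foreach ($tokens as $token) {
        if (in_array($token, ['AND', 'OR'], true)) {   // must be upper case
            $operator = $token;
            continue;
        }
        if ($token === 'NOT') {
            $operator = 'AND NOT';                     // NOT attaches to the preceding condition
            continue;
        }
        $exact = (substr($token, 0, 1) === '"');       // quoted terms are exact matches
        $term  = trim($token, '"');
        if ($exact || strpos($term, '*') === false) {
            $condition = 'headword = ?';
            $params[]  = $term;
        } else {
            $condition = 'headword LIKE ?';
            $params[]  = str_replace('*', '%', $term); // wildcard search
        }
        $sql .= ($sql === '' ? '' : ' ' . ($operator ?: 'AND') . ' ') . $condition;
        $operator = '';
    }
    return $sql;
}

$params = [];
echo buildHeadwordWhere('Ran* OR run* NOT rancet', $params);
// headword LIKE ? OR headword LIKE ? AND NOT headword = ?
// SQL's precedence (AND before OR) groups this as ran* OR (run* NOT rancet),
// which is why 'rancet' still comes back in the results described above.
```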

For the Historical Thesaurus I met with Marc and Fraser on Monday to discuss our progress with the HT / OED linking and afterwards continued with a number of tasks that were either ongoing or had been suggested at the meeting.  This included ticking off some matches from a monosemous script, creating a new script that brings back up to 1000 random unmatched lexemes at a time for spot-checking and creating an updated Levenshtein script for lexemes, which is potentially going to match a further 5000 lexemes.  I also wrote a document detailing how I think that full dates should be handled in the HT, to replace the rather messy way dates are currently recorded.  We will need to decide on a method in order to get the updated dates from the OED into a comparable format.

Also this week I returned to Alison Wiggins’s Account Books project, or rather a related output about the letters of Mary, Queen of Scots.  Alison had sent me a database containing a catalogue of letters and I need to create a content management system to allow her and other team members to work on this together.  I’ve requested a new subdomain for this system and have begun to look at the data and will get properly stuck into this next week, all being well.

Other than these main projects I also gave feedback on Thomas Clancy’s Iona project proposal, including making some changes to the Data Management Plan, helped sort out access to logo files for the Seeing Speech project, sorted out an issue with the Editing Burns blog that had been displaying no content since the server upgrade (it turned out it was using a very old plugin that was not compatible with the newer version of PHP on the server) and helped sort out some app issues.  All in all a very busy week.

Week Beginning 3rd June 2019

I continued to develop the public interface for the SCOSYA project this week, and also helped out with the preparations for next week’s Data Hack event that the project is organising, which involved sorting out hosting for a lot of sample data.  On Monday I had a meeting with Jennifer and E, at which we went through the interface I had so far created and discussed things that needed to be updated or changed in some way.  It was a useful meeting and I came away with a long list of things to do, which I then spent quite some time during the remainder of the week implementing.  This included changing the font used throughout the site and drastically changing the base layer we use for the maps.  I had previously created a very simple ‘green land, blue sea’ base map, which is what the team had requested, but they wanted to try something a bit simpler still – white sea and light grey land – in order to emphasise the data points more than anything else.  I also removed all place-names from the map and in fact everything other than borders and water.  I also updated the colour range used for ratings, from a yellow to red scheme to a more grey / purple scheme that had been suggested by E.  This is now used both for the markers and for the areas.  Regarding areas, I removed the white border from the areas to make areas with the same rating blend into one another and make the whole thing look more like a heatmap, as the following screenshot demonstrates:

I also completely changed the way the pop-ups look, as it was felt that the previous version was just a bit too garish and comic book like.  The screenshot below shows markers with a pop-up open:

I also figured out how to add sound clips to story slides and I’ve changed how the selection of ‘examples’ works.  Rather than having a drop-down list and then all of the information about a selected feature displayed underneath, I have split things up.  Now when you open the ‘Examples’ section you will see the examples listed as a series of buttons.  Pressing on one of these then loads the feature, automatically loading the data for it into the map.  There’s a button for returning to the list of examples, then the feature’s title and description are displayed, followed by sound clips if there are any.  Underneath this are the buttons for changing ‘speakers’ and ‘locations’.  Pressing on one of these options now automatically refreshes the map so there’s no longer any need for a ‘Show’ button. I think this works much better.  Note that your choice of speaker and location is remembered when using the map – e.g. if you have selected ‘Young’ and ‘Areas’ then go back and select a different example then the map will default to ‘Young’ and ‘Areas’ when this new feature is displayed.

I’ve also added a check for screen size that fires every time a side panel section is opened.  This ensures that if someone has resized their browser or changed the orientation of their screen the side panel should still fit. I still haven’t had time to get the ‘groups’ feature working yet, or to fix the display of stories on smaller screens.  I also need to update the ‘Learn more’ section so it uses a list rather than a drop-down box, all tasks I hope to continue with next week.

I also spent a bit of time on the Seeing Speech and Dynamic Dialects projects, helping to add in a new survey for each.  I also participated in the monthly College of Arts developers coffee catch-up, advised a couple of members of staff on blog-related issues and spoke to Kirsteen McCue about the proposal she’s putting together.

Other than these tasks I spent about a day working on DSL issues.  This included getting some data to Ann about which existing DSL entries were not present in the dataset that had been newly extracted from the server.  This appears to have been caused by some entries being merged with existing entries.  I also managed to get the new dataset uploaded to our temporary web-server and created a new API that outputs this new data.  I still need to create an alternative version of the DSL front-end that connects to this new version of the data, which I hope to be able to at least get started on next week.  I also did some investigation into scripts that Thomas Widmann had discussed in some hand-over documentation but which did not seem to be available anywhere, and discussed some issues relating to the server the DSL people host in their offices.

I also spent some time working on HT duties, making some tweaks to existing scripts based on feedback from Fraser and investigating why one of our categories was not accessible via the website (the answer being that it was a subcategory that didn’t have a main category in the same part of speech, so had no category to ‘hang’ off).  I also had a further meeting with Marc and Fraser on Friday to discuss our progress with the HT / OED linking.

Week Beginning 19th November 2018

This week I mainly worked on three projects:  The Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network.  For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets.  Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday.  By Wednesday I had ticked off most of the items, which I’ll summarise here.

Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only advanced searches for words, so I investigated and fixed this.  I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to’ and ‘the’).  This worked well and has bumped matches up to better colour levels, but has resulted in some words getting matched multiple times, as removing ‘to’, ‘of’ etc. can leave a stripped form that then appears more than once.  For example, in one category the OED has ‘bless’ twice (presumably an error?) and the HT has ‘bless’ and ‘bless to’.  With ‘to’ removed there then appear to be more matches than there should be.  However, this is not an issue when dates are also taken into consideration.  I also updated the script so that categories where there are 3 matches and at least 66% of words match have been promoted from orange to yellow.
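
As a rough sketch of the stoplist step (the stoplist below is illustrative rather than the project’s actual list, and any further normalisation is an assumption), each lexeme is reduced to a stripped form before the two categories’ word lists are compared, which is also why duplicate stripped forms can appear:

```php
<?php
// Minimal sketch: strip stoplist words from lexemes before comparing the
// HT and OED word lists. The stoplist here is illustrative only.
function stripLexeme(string $lexeme, array $stoplist): string {
    $words = preg_split('/\s+/', mb_strtolower(trim($lexeme)));
    return implode(' ', array_diff($words, $stoplist));
}

$stoplist = ['to', 'the', 'of', 'a'];

// 'bless' and 'bless to' both reduce to 'bless', so they now look like the
// same form - the duplicate-match issue noted above.
$htWords  = array_map(fn($w) => stripLexeme($w, $stoplist), ['bless', 'bless to']);
$oedWords = array_map(fn($w) => stripLexeme($w, $stoplist), ['bless', 'bless']);

// Dates are compared separately, which is what resolves the duplicates.
$matches = array_intersect($htWords, $oedWords);
echo count($matches);   // 2
```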

When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off and wondered whether category heading pattern matching had worked properly.  After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is they contain too few words to have reached the criteria for ticking off.

Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but different numbers.  I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria).  There are currently 19, 17 and 25 green, lime green and yellow matches that could be ticked off.  I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ I ticked off the few matches that there were.  Most were empty categories and there were less than 15 in total.

Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms.  We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories.  It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms.  This has helped to more accurately identify matched categories.  I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.

I also worked on the date fingerprinting script.  This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data.  I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified.  I ticked off all green (1556), lime green (22) and yellow (123) matches.

Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year.  The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process.  For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier).  There are 7148 OED categories that are currently unmatched but were matched in V1.  Almost 4000 of these are empty categories.  There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match.  But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words.  It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2.  For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02.  There are possibly patterns in the 1504 orange matches that could be exploited too.

Finally, I updated the stats page to include information about main and subcats.  Here are the current unmatched figures:

Unmatched (with POS): 8629

Unmatched (with POS and not empty): 3414

Unmatched Main Categories (with POS): 5036

Unmatched Main Categories (with POS and not empty): 1661

Unmatched Subcategories (with POS): 3573

Unmatched Subcategories (with POS and not empty): 1753

So we are getting there!

For the Bilingual Thesaurus I completed an initial version of the website this week.  I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise.  This might be changed again, but for now here is an example of how the resource looks:

The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters.  As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’, but ‘bread’ won’t match anything because there are no words or categories with this exact text.  You need to use an asterisk wildcard to match text within a word or category heading:  ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.

The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates and languages of origin and citation.  Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period.  E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.

As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned.  Where a search type allows multiple options to be selected (i.e. part of speech and languages of origin and citation), the selected options are joined by ‘OR’.  E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages of origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
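
A rough sketch of how those rules translate into a query (the table and column names are assumptions, and the real advanced search has more fields): options within a facet are ORed together, most easily via IN, the facets themselves are ANDed, and the date search keeps any word whose active range overlaps the requested period.

```php
<?php
// Minimal sketch of the advanced-search WHERE clause.
// Table and column names are assumptions, not the project's actual schema.
$pos       = ['noun', 'verb'];                 // ticked parts of speech
$languages = ['Dutch', 'Flemish', 'Italian'];  // selected languages of origin
$dateFrom  = 1330;
$dateTo    = 1360;

$clauses = [];
$params  = [];

if ($pos) {                                    // OR within a facet via IN (...)
    $clauses[] = 'pos IN (' . implode(',', array_fill(0, count($pos), '?')) . ')';
    $params    = array_merge($params, $pos);
}
if ($languages) {
    $clauses[] = 'language_of_origin IN (' . implode(',', array_fill(0, count($languages), '?')) . ')';
    $params    = array_merge($params, $languages);
}
// A word is 'active' in the chosen period if its date range overlaps it,
// so Edifier (1100-1350) is returned by a 1330-1360 search.
$clauses[] = 'startdate <= ? AND enddate >= ?';
$params[]  = $dateTo;
$params[]  = $dateFrom;

// Facets are joined with AND.
$sql = 'SELECT * FROM lexemes WHERE ' . implode(' AND ', $clauses);
```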

Search results display the full hierarchy leading to the category, the category name, plus the headword, section, POS and dates (if the result is a ‘word’ result rather than a ‘category’ result).  Clicking through to the category highlights the word. I also added in a ‘cite’ option to the category page and updated the ‘About’ page to add a sentence about the current website.  The footer still needs some work (e.g. maybe including logos for the University of Westminster and Leverhulme) and there’s a ‘terms of use’ page linked to from the homepage that currently doesn’t have any content, but other than that I think most of my work here is done.

For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on powerpoint presentations that had been sent to me.  This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations, formatting the text as HTML, reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site), uploading the audio files, adding in the HTML5 audio tags to get the audio files to play, creating the individual pages for each timeline entry / storymap entry.  It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 Powerpoint slides.  Still, the end result works really well, so I think it’s worth putting the effort in.

In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app.  I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create.  The app would be interactive and highly dependent on logging user interactions as accurately as possible.  I looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, that the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough.  In fact on touchscreens more often than not a ‘tap’ wasn’t even being registered.  D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are made in this way, so unfortunately it looks like I won’t be able to help out with this project.  Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in this sort of timescale.

Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server.  I think that’s just about everything to report.

Week Beginning 12th November 2018

I spent most of my time this week split between three projects:  The HT / OED category linking, the REELS project and the Bilingual Thesaurus.  For the HT I continued to work on scripts to try and match up the HT and OED categories.  This week I updated all the currently in-use scripts so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data.  Doing this has significantly improved the matching on the first date lexeme matching script.  Greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788.  I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists, and included two new columns.  The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match, while the second is this figure as a percentage of the total number of OED lexemes.
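
In case it helps to see what ‘first four numeric characters’ means in practice, here is a small sketch of the date-normalisation step (the field names are as described above; everything else is an assumption):

```php
<?php
// Minimal sketch: reduce a GHT_date1 / fulldate value to a comparable year.
// 'OE' (Old English) is converted to 1000 before the digits are extracted.
function normaliseDate(?string $date): ?int {
    if ($date === null || $date === '') {
        return null;
    }
    $date   = str_replace('OE', '1000', $date);
    $digits = preg_replace('/\D/', '', $date);        // keep numeric characters only
    return strlen($digits) >= 4 ? (int) substr($digits, 0, 4) : null;
}

echo normaliseDate('OE-1450');   // 1000
echo "\n";
echo normaliseDate('a1398');     // 1398
```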

However, taking dates in isolation currently isn’t working very well, as if a date appears multiple times it generates multiple matches.  So, for example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date.  But lots of these have the same dates, meaning a total count of matched dates is 99, or 152% of the number of OED words.  Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times each one appears in a category as its ‘date fingerprint’.

I created a new script to look at ‘date fingerprints’.  The script generates arrays of categories for HT and OED unmatched categories.  The dates of each word (or each word with a GHT date in the case of the OED) in every category are extracted and a count of these is created (e.g. if the OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]).  I ran this against the HT database to see what matches.

The script takes about half an hour to process.  It grabs each unmatched OED category that contains words, picks out those that have GHT dates, gets the first four numerical figures of each and counts how many times this appears in the category.  It does the same for all unmatched HT categories and their ‘fulldate’ column too.  The script then goes through each OED category and for each goes through every HT category to find any that have not just the same dates, but the same number of times each date appears too.  If everything matches the information about the matched categories is displayed.
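
A condensed sketch of the fingerprinting idea (the real script works over the whole database; the array shapes and function names here are assumptions): build a date => count map for each category and treat two categories as a potential match when the maps are identical.

```php
<?php
// Minimal sketch of the 'date fingerprint' comparison. In reality the
// category arrays are built from the database; they are hard-coded here
// purely for illustration.
function fingerprint(array $startDates): array {
    $counts = array_count_values($startDates);  // date => how many words start then
    ksort($counts);                             // keeps the output tidy
    return $counts;
}

$oedCategories = [5678 => [1000, 1000, 1000, 1234]];
$htCategories  = [1111 => [1234, 1000, 1000, 1000],
                  2222 => [1000, 1234]];

foreach ($oedCategories as $oedId => $oedDates) {
    $oedPrint = fingerprint($oedDates);
    foreach ($htCategories as $htId => $htDates) {
        // The same dates appearing the same number of times = potential match
        if (fingerprint($htDates) == $oedPrint) {
            echo "OED $oedId potentially matches HT $htId\n";
        }
    }
}
// Prints: OED 5678 potentially matches HT 1111
```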

The output has the same layout as the other scripts but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories.  This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique.  For an example of this, search for our old favourite ‘extra-terrestrial’ and you’ll see that as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches.  Nothing other than the dates is used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy.  But I think this script is going to be very useful.

I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success.  It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are being complicated by polysemous HT forms in different parts of speech.

With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off.  Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off.  There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off.  That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones.  I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words.  I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong.  1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388.  On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.

For the REELS project I continued going through my list of things to do before the project launch.  This included reworking the Advanced Search layout, adding in tooltip text, updating the start date browse (which was including ‘inactive’ data in its count), creating some further icons for combinations of classification codes, adding in Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that allows you to specify whether your element search looks at current forms, historical forms or both, adding in Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live.  You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names)

I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a historical form from 1797.  The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’.  The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source.  Similarly if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though the only historical form before 1500 is ‘(lands of) Westmaynis’.  This is because again the search is ‘get me all of the place-names with a historical form before 1500 that have any historical form including the text ‘N. mains’’.  We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
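
To make the current behaviour concrete, here is a sketch of the shape of the query (the table and column names are assumptions about the schema): the date restriction and the element restriction are independent checks against the place-name as a whole, never a single check against one historical form.

```php
<?php
// Minimal sketch of why 'pn' + '<1500' returns Hassington West Mains.
// Table and column names are assumptions, not the project's actual schema.
$sql = "
    SELECT p.*
      FROM placenames p
     WHERE
        -- the earliest recorded historical form for the place-name is before 1500
        (SELECT MIN(hf.startdate)
           FROM historical_forms hf
          WHERE hf.placename_id = p.id) < 1500
        -- and the place-name has a 'pn' element attached to *any* of its forms
        AND EXISTS (
            SELECT 1
              FROM placename_elements pe
             WHERE pe.placename_id = p.id
               AND pe.element_code = 'pn'
        )";
// The two conditions never look at the same historical form, which is why a
// place whose earliest form is pre-1500 but whose only 'pn' element belongs
// to a 1797 form still matches.
```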

On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website.  Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.

For the Bilingual Thesaurus I continued to implement the search facilities for the resource.  This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work.  After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things.  I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working.  I then made a start on the advanced search form.

Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.

 

Week Beginning 5th November 2018

After a rather hectic couple of weeks this was a return to a more regular sort of week, which was a relief.  I still had more work to do than there was time to complete, but it feels like the backlog is getting smaller at least.  As with previous weeks, I continued with the HT / OED category linking process this week, following on from the meeting Marc, Fraser and I had the Friday before.  For the lexeme / date matching script I separated out the categories from the orange list that have words but zero matches into a new list with a purple background.  So orange now only contains categories where at least one word and its start date match.  The ones now listed in purple are almost certainly incorrect matches.  I also changed the ordering of results so that categories are listed by the largest number of matches, to make it easier to spot matches that are likely ok.

I also updated the ‘monosemous’ script, so that the output only contains OED categories that feature a monosemous word and is split into three tables (with links to each at the top of the page).  The first table features 4455 OED categories that include a monosemous word that has a comparable form in the HT data.  Where there are multiple monosemous forms they each correspond to the same category in the HT data.  The second table features 158 OED categories where the linked HT forms appear in more than one category.  This might be because the word is not monosemous in the HT data and appears in two different categories (these are marked with the text ‘red|’ so they can be searched for in the page).  An OED category can also appear in this table even if there are no red forms if, for example, one of the matched HT words is in a different category to all of the others (see OED catid 45524, where the word ‘Puncican’ is found in a different HT category to the other words).  The final table contains those OED categories that feature monosemous words that have no match in the HT data.  There are 1232 of these.  I also created a QA script for the 4455 matched monosemous categories, which applies the same colour coding and lexeme matching as other QA scripts I’ve created.  On Friday we had another meeting to discuss the findings and plan our next steps, which I will continue with next week.

Also this week I wrote an initial version of a Data Management Plan for Thomas Clancy’s Iona project, and commented on the DMP assessment guidelines that someone from the University’s Data Management people had put together.  I can’t really say much more about these activities, but it took at least a day to get all of this done.  I also did some app management duties, setting up an account for a new developer, and made the new Seeing Speech and Dynamic Dialects websites live.  These can now be viewed here: https://www.seeingspeech.ac.uk/ and here: https://www.dynamicdialects.ac.uk/.  I also had an email conversation with Rhona Alcorn about Google Analytics for the DSL site.

With the REELS project’s official launch approaching, I spent a bit of time this week going through the 23 point ‘to do’ list I’d created last week.  In fact, I added another three items to it.  I’m going to tackle the majority of the outstanding issues next week, but this week I investigated and fixed an issue with the ‘export’ script in the Content Management System. The script is very memory intensive and it was exceeding the server’s memory limits, so asking Chris to increase this limit sorted the issue.  I also updated the ‘browse place-names’ feature of the CMS, adding a new column and ordering facility to make it clearer which place-names actually appear on the website.  I also updated the front-end so that it ‘remembers’ whether you prefer the map or the text view of the data using HTML5 local storage and added in information about the Creative Commons license to the site and the API.  I investigated the issue of parish boundary labels appearing on top of icons, but as of yet I’ve not found a way to address this.  I might return to it before the launch if there’s time, but it’s not a massive issue.  I moved all of the place-name information on the record page above the map, other than purely map-based data such as grid reference.  I also removed the option to search the ‘analysis’ field from the advanced search and updated the element ‘auto-complete’ feature so that it only now matches the starting letters of an element rather than any letters.  I also noticed that the combination of ‘relief’ and ‘water’ classifications didn’t have an icon on the map, so I created one for it.

I also continued to work on the Bilingual Thesaurus website this week.  I updated the way in which source links work.  Links to dictionary sources now appear as buttons in the page, rather than in a separate pop-up.  They feature the abbreviation (AND / MED / OED) and the magnifying glass icon and if you hover over a button the non-abbreviated form appears.  For OED links I’ve also added the text ‘subscription required’ to the hover-over text.  I also updated the word record so that where language of origin is ‘unknown’ the language of origin no longer gets displayed, and I made the headword text a bit bigger so it stands out more.  I also added the full hierarchy above the category heading in the category section of the browse page, to make it easier to see exactly where you are.  This will be especially useful for people using the site on narrow screens as the tree appears beneath the category section so is not immediately visible.  You can click on any of the parts of the hierarchy here to jump to that point.

I then began to work on the search facility, and realised I needed to implement a ‘search words’ list that features variants.  I did this for the Historical Thesaurus and it’s really useful.  What I’ve done so far is generate alternatives for words that have brackets and dashes.  For example, the headword ‘Bond(e)-man’ has the following search terms: Bond(e)-man, Bond-man, Bonde-man, Bond(e) man, Bond man, Bonde man, Bond(e)man, Bondman, Bondeman.  None of these variants will ever appear on the website, but they will instead be used to find the word when people search.  I’ll need some feedback as to whether these options will suffice, but for now I’ve uploaded the variants to a table and begun to get the quick search working.  It’s not entirely there yet, but I should get this working next week.  I also need to know what should be done about accented characters for search purposes.  The simplest way to handle them would be to just treat them as non-accented characters – e.g. searching for ‘alue’ will find ‘alué’.  However, this does mean you won’t be able to specifically search for words that include accented characters – e.g. a search for all the words featuring an ‘é’ will just bring back all words with an ‘e’ in them.
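
For what it’s worth, here is a rough sketch of the variant generation for brackets and dashes (the real list of treatments may differ; this just captures the combinatorial idea): each bracketed chunk is expanded three ways (as written, removed entirely, and with just the brackets dropped) and each dash three ways (kept, replaced with a space, removed), giving the nine forms above.

```php
<?php
// Minimal sketch: generate search-term variants for a headword containing
// brackets and dashes, e.g. 'Bond(e)-man' => 9 forms.
function searchVariants(string $headword): array {
    // Three treatments of bracketed chunks: as written, chunk removed, brackets removed
    $bracketForms = [
        $headword,
        preg_replace('/\([^)]*\)/', '', $headword),
        str_replace(['(', ')'], '', $headword),
    ];
    $variants = [];
    foreach ($bracketForms as $form) {
        // Three treatments of dashes: as written, as a space, removed
        $variants[] = $form;
        $variants[] = str_replace('-', ' ', $form);
        $variants[] = str_replace('-', '', $form);
    }
    return array_values(array_unique($variants));
}

print_r(searchVariants('Bond(e)-man'));
// Bond(e)-man, Bond(e) man, Bond(e)man, Bond-man, Bond man, Bondman,
// Bonde-man, Bonde man, Bondeman
```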

I was intending to add a count of the number of words in each hierarchical level to the browse, or at least to make hierarchical levels that include words bold in the browse, so as to let users know whether it’s worthwhile clicking on a category to view the words at this level.  However, I’ve realised that this will just confuse users as levels that have no words in them but include child categories that do have words in them would be listed with a zero or not in bold, giving the impression that there is no content lower down the hierarchy.

My last task for the week was to create a new timeline for the RNSN project based on data that had been given to me.  I think this is looking pretty good, but unfortunately making these timelines and related storymaps is very time-intensive, as I need to extract and edit the images, upload them to WordPress, extract the text and convert it into HTML and fill out the template with all of the necessary fields.  It took about 2 and a half hours to make this timeline.  However, hopefully the end result will be worth it.

Week Beginning 8th October 2018

I continued to work on the HT / OED data alignment for a lot of this week.  I updated the matching scripts I had previously created so that all matches based on last lexeme were removed and instead replaced by a ‘6 matches or more and 80% of words in total match’ check.  This was a lot more effective than purely comparing the last word in each category and helped match up a lot more categories.  I also created a QA script to check the manual matches that were made during our first phase of matching.  There are 1407 manual matches in the system.  The script also listed all the words in each potential matched category to make it easier to tell where any potential difficulties were.  I also updated the ‘pattern matching’ script I’d created last week to list all words and include the ‘6 matches and 80%’ check and changed the layout so that separate groupings now appear in different tables rather than being all mixed up in one table.  It took quite a long time to sort this out, but it’s going to be much more useful for manual checking.
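
For reference, the threshold check itself is tiny – something along these lines (whether the 80% is measured against the OED word list or against all words is my assumption here):

```php
<?php
// Minimal sketch of the '6 matches or more and 80% of words match' check.
// Measuring the percentage against the OED word list is an assumption.
function categoriesLookMatched(array $oedWords, array $htWords): bool {
    $matches = count(array_intersect($oedWords, $htWords));
    $total   = count($oedWords);
    return $total > 0 && $matches >= 6 && ($matches / $total) >= 0.8;
}
```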

I then moved on to writing a new ‘sibling matching’ script.  This script goes through all unmatched OED categories (this includes all that appear in other scripts such as the pattern matching one) and retrieves all sibling categories of the same POS.  E.g. if the category is ‘01.01.01|03 (n)’ then the script brings back all HT noun subcats of ’01.01.01’ that are ‘level 1’ subcats and compares their headings.  It then looks to see if there is a sibling category that has the same heading – i.e. looking for when a category has been renumbered within the same level of the thesaurus.  This has uncovered several hundred such potential matches, which will hopefully be very helpful. I also then created a further script that compares non-noun headings to noun headings at the same level, as it looked like a number of times the OED kept the noun heading for other parts of speech while the HT renamed them.  This identified a further 65 possible matches, which isn’t too bad.

I met with Marc and Fraser on Wednesday to discuss the recent updates I’d made, after which I managed to tick off 2614 matched categories, taking our total of unmatched OED categories that have a part of speech and are not empty down to 10,854.  I then made a start on a new script that looks at pattern matching for category contents (i.e. words), but I didn’t have enough time to make a huge amount of progress with this.

I also fixed an issue with the HT’s Google Analytics not working properly.  It looks like the code stopped working around the time we shifted domains in the summer, but it was a bit of a weird one as everything was apparently set up correctly – the ‘Send test traffic’ option on the .JS tracking code section successfully navigated through to the site and the tracking code was correct, but nothing was getting through.  However, I replaced the existing GA JavaScript that we had on our page with a new snippet from the JS section of Google Analytics and this seems to have done the trick.

However, we also have some calls to GA in our JavaScript so that loading parts of the tree, changing parts of speech, selecting subcats etc. are considered ‘page hits’ and reported to GA.  None of these worked after I changed the code.  I followed some guidelines here:

https://developers.google.com/analytics/devguides/collection/analyticsjs/sending-hits

to try and get things working but the callbacks were never being initiated – i.e. data wasn’t getting through to Google.  Thankfully Stack Overflow had an answer that worked (after trying several that didn’t):

https://stackoverflow.com/a/40761709

I’ve updated this so that pageviews rather than events are sent and now everything seems to be working again.

I spent a bit more time this week working on the Bilingual Thesaurus project, focussing on getting the front end for the thesaurus working.  I’ve reworked the code for the HT’s browse facility to work with the project’s data.  This required quite a lot of work as structurally the datasets are quite different – the HT relies on its ‘tier’ numbers for parent / child / sibling category relationships, and also has different categories for parts of speech and nested subcategories.  The BTH data is much simpler (which is great) as it just has parent and child categories, with things like part of speech handled at word level.  This meant I had to strip a lot of stuff out of the code and rework things.  I’m also taking the opportunity to move to a new interface library (Bootstrap) so had to rework the page layout to take this into consideration too.  I managed to get an initial version of the browse facility working now, which works in much the same way as the main HT site:  clicking on a heading allows you to view its words and clicking on a ‘plus’ sign allows you to view the child categories.  As with the HT you can link directly to a category too.  I do still need to work on the formatting of the category contents, though.  Currently words are just listed all together, with their type (AN or ME) listed first, then the word, then the POS in brackets, then dates (if available).  I haven’t included data about languages of source or citation yet, or URLs.  I’m also going to try and get the timeline visualisations working as well.  I’ll probably split the AN and ME words into separate tabs, and maybe split the list up by POS too.  I’m also wondering whether the full category hierarchy should be represented above the selected category (the right pane), as unlike the HT there’s no category number to show your position in the thesaurus.  Also, as a lot of the categories are empty I’m thinking of making the ones with words in them bold in the tree, or even possibly adding a count of words in brackets after the category heading.  I’ve also updated the project’s homepage to include the ‘sample category’ feature, allowing you to press the ‘reload’ icon to load a new random category.

I also made some further tweaks to the Seeing Speech data (fixing some of the video titles and descriptions) and had a chat with Thomas Clancy about his Iona proposal, which is starting to come together again after several months of inactivity.  Also for the ‘Books and Borrowing’ proposal I replied to a request for more information on the technical side of things from Stirling’s IT Services, who will be hosting the website and database for the project.  I met with Luca this week as well, to discuss how best to grab XML data via an AJAX query and process it using client-side JavaScript.  This is something I had already tackled as part of the ‘New Modernist Editing’ project, so was able to give Luca some (hopefully) useful advice.  I also continued an email conversation with Bryony Randall and Ronan Crowley about the workshop we’re running on digital editions later in the month.

On Friday I spent most of the day working on the RNSN project, adding direct links to the ‘nation’ introductions to the main navigation menu and creating new ‘storymap’ stories based on Powerpoint presentations that had been sent to me.  This is actually quite a time-consuming process as it involves grabbing images from the PPT, reformatting them, uploading them to WordPress, linking to them from the Storymap pages, creating Zoomified versions of the image or images that will be used as the ‘map’ for the story, extracting audio files from the PPT and uploading them, grabbing all of the text and formatting it for display and other such tasks.  However, despite being a long process the end result is definitely worth it as the storymaps work very nicely.  I managed to get two such stories completed today, and now I’ve re-familiarised myself with the process it should be quicker when the next set get sent to me.

I’m going to be on holiday next week so there won’t be another report from me until the week after that.

Week Beginning 24th September 2018

Having left Rob Maslen’s Fantasy blog in a somewhat unfinished state last Friday due to server access issues, I jumped straight into completing this work on Monday morning.  Thankfully I could access the server again and after spending an hour or so tweaking header images, choosing colour schemes and fonts, reinstating widgets, menus and such things I managed to get the site fully working again, with a fully responsive theme: http://fantasy.glasgow.ac.uk/.  I also updated some content on the Burns Paper Database website for Ronnie Young, completed my PDR, responded to a query about TheGlasgowStory and met with Matt Barr in Computing Science to discuss some possible future developments.  I also made some further tweaks to the Seeing Speech and Dynamic Dialect website upgrades that are still ongoing.  Eleanor had created new versions of some of the videos, so I uploaded them, and also updated all of the images in the image carousels for both sites.

I spent a fair amount of time this week updating the maps on the ‘Saints in Scottish Place-Names’ website.  As mentioned in a previous post, the maps on this site all use Google maps, and Google now blocks access to their maps API unless you connect via an account that has a credit card associated with it.  This is not very good for legacy research projects such as this one, so the plan was that I’d migrate the maps from Google to the free and open source Leaflet.js mapping library.  Another advantage of Leaflet is that the scripts are all stored on the same server as the rest of the resource – we’re no longer reliant on a third-party server so there should be less risk of the maps becoming unavailable in future.  Of course the map layers themselves are all stored on other third-party servers, but the ones I’ve chosen (based on the ones I selected for the REELS project) are all free to use, and another benefit of Leaflet is that it’s very simple to switch out one map layer for another – so if one tileset becomes unavailable I can replace it very quickly with another.

I created a new Leaflet powered version of the website in a subdirectory so I could test things out without messing up the live site.  As far as I could tell there were four pages that featured maps, each using them in different ways.  I migrated all of them over to the Leaflet mapping library and incorporated base maps and other features from the REELS and KCB map interface, namely:

  1. A map ‘display options’ button in the top left of the map that opens a panel through which you can change the base map.
  2. A choice of 6 base maps, as with REELS and KCB:
    1. A default topographical map
    2. A satellite map
    3. A satellite map with things like roads, rivers and settlements marked on it
    4. A modern OS map
    5. A historical OS map from 1840-1888
    6. A historical OS map from 1920-1933
  3. An ‘Attribution and copyright’ popup linked to from the bottom right of the map, which I adapted from REELS.
  4. A ‘full screen’ button in the bottom right of the map that allows you to view any map full screen. I’ve removed the ‘view larger map’ option on the Saints page as I didn’t think this was really necessary when the ‘full screen’ option is available anyway.
  5. A map scale (metric and imperial) appears in the bottom left of the map.

 

Here’s some information about the four map types that I updated:

 

  1. Place map

This is the simplest map and displays a marker showing the location of the place.  Hover over the marker to view the place-name.

  2. Saint map

This map colour codes the markers based on ‘certainty’.  I used the same coloured markers as found on the original map.  I also added a map legend to the top right that shows you what the colours represent.  You can turn any of the layers on or off to make it easier to see the markers you’re interested in (e.g. hide all ‘certain’ markers).  I removed the legend section that appeared underneath the original map as this is no longer needed due to the in-map version.

  3. Search map

As with the original version, when you zoom in on an area any place-names found in the vicinity appear as red dots.  I updated the functionality slightly so that as you pan round the map at one zoom level new markers continue to load (with the previous version you had to change the zoom level to initiate the loading of new markers).  Now as you pan around new red spots appear all over the place like measles.

  4. Search results map

I couldn’t get the original version of this map to work at all, so I think there must have been some problem with it in addition to the Google Maps issue.  Anyway, the new version displays the search results on a map, and if the search included a saint then the results are categorised by ‘certainty’ as with the saint map.  You can turn certainty levels on or off.  You can also open the marker pop-ups to link through to the place-name record and the saint record too.

There will no doubt be a few further tweaks that will be required before I replace the live site with the new version I’ve been working on, but I reckon that bulk of the work is now done.

I also continued with the Bilingual Thesaurus project, although I didn’t have as much time as I had hoped to work on this.  However, I updated the ‘language of origin’ data for the 1829 headwords that had no language of origin, assigning ‘uncertain’ to all of them.  I also noticed that 15 headwords have no ‘date of citation’ and I asked Louise whether this was ok.  I also updated the way I’m storing dates.  Previously I had set up a separate table where any number of date fields could be associated with a headword.  Instead I have now added two new columns to the main ‘lexeme’ table: startdate and enddate.  I then wrote a script that went through the originally supplied dates (e.g. [1230,1450]), adding the first date to the startdate column and the second date to the enddate column.  Where an enddate is not supplied or is ‘0’ I’ve added the startdate to this column, just to make it clearer that this is a single year.  Louise had mentioned that some dates would have ‘1450+’ as a second date but I’ve checked the original JSON file I was given and no dates have the plus sign, so I’ve checked with her in case this data has somehow been lost.  I also discovered that there are 16 headwords that have an enddate but no startdate (e.g. the date in the original JSON file is something like [0,1436]) and have asked what should happen to these.  Finally, I made a start on the front-end for the resource.  There is very little in place yet, but I’ve started to create a ‘Bootstrap’ based interface using elements from the other thesaurus websites (e.g. logo, fonts).  Once a basic structure is in place I’ll get the required search and browse facilities up and running and we can then think about things such as colour schemes and site text.
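
The migration step itself is simple enough – something along these lines (the file name, table and column names and JSON structure are assumptions based on the description above):

```php
<?php
// Minimal sketch of the date migration: copy the [start, end] pairs from the
// original JSON into startdate / enddate columns, duplicating the start date
// where there is no real end date. File, table and column names are assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=bth;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('UPDATE lexemes SET startdate = ?, enddate = ? WHERE id = ?');

$headwords = json_decode(file_get_contents('bth-headwords.json'), true);
foreach ($headwords as $hw) {
    [$start, $end] = $hw['dates'];   // e.g. [1230, 1450] or [1230, 0]
    if (empty($start)) {
        continue;                    // the handful of [0, yyyy] cases are left for checking
    }
    if (empty($end)) {
        $end = $start;               // a single-year attestation
    }
    $stmt->execute([$start, $end, $hw['id']]);
}
```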

I spent the rest of the week on the somewhat Sisyphean task of matching up the HT and OED category data.  This is a task that Fraser and I have been working on, off and on, for over a year now, and it seemed like the end was in sight, as we were down to just a few thousand OED categories that were unmatched.  However, last week I noticed some errors in the category matching, and some further occasions where an OED category has been connected to multiple HT categories.  Marc, Fraser and I met on Monday to discuss the process and Marc suggested we start the matching process from scratch again.  I was rather taken aback by this as there appeared to only be a few thousand erroneous matches out of more than 220,000 and it seemed like a shame to abandon all our previous work.  However, I’ve since realised this is something that needed to be done, mainly because the previous process wasn’t very well documented and could not be easily replicated.  It’s a process that Fraser and I could only focus on between other commitments and progress was generally tracked via email conversations and a few Word documents.  It was all very experimental, and we often ran a script, which matched a group of categories, then altered the script and ran it again, often several times in succession.  We also approached the matching from what I realise now is the wrong angle – starting with the HT categories and trying to match these to the OED categories.  However, it’s the OED categories that need to be matched and it doesn’t really matter if HT categories are left unmatched (as plenty will be, since they are more recent additions or are empty place-holder categories).  We’ve also learned a lot from the initial process and have identified certain scripts and processes that we know are the most likely to result in matches.

It was a bit of a wrench, but we have now abandoned our first stab at category matching and are starting over again.  Of course, I haven’t deleted the previous matches so no data has been lost.  Instead I’ve created new ‘v2’ matching fields and I’m being much more rigorous in documenting the processes that we’re putting the data through and ensuring every script is retained exactly as it was when it performed a specific task rather than tweaking and reusing scripts.

I then ran an initial matching script that looked for identical matches – where the maincat, subcat, part of speech and ‘stripped’ heading were all identical.  This matched 202,030 OED categories, leaving just 27,295 unmatched.  However, it is possible that not all of these 202,030 matches are actually correct.  This is because quite often a category heading is reused – e.g. there are lots of subcats that have the heading ‘pertaining to’ – so it’s possible that a category might look identical but in actual fact be something completely different.  To check for this I ran a script that identifies matches where the combination of the stripped heading and the part of speech appears in more than one category.  There are 166,096 matched categories where this happens.  For these the script then compares the total number of words and the last word in each match to see whether the match looks valid.  There were 12,640 where the number of words or the last word was not the same, and I created a further script that then checked whether these had identical parent category headings.  This identified 2,414 that didn’t.  These will need further checking.
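
As a rough illustration of that secondary check (the data structures here are hypothetical stand-ins for the real database records):

```javascript
// Where a stripped heading / part of speech combination occurs more than
// once, a match is only treated as safe if the word count and the final
// word agree on both sides; anything else is flagged for manual checking.
function looksValid(htCategory, oedCategory) {
  if (htCategory.words.length !== oedCategory.words.length) {
    return false; // different numbers of words: flag for checking
  }
  var lastHt = htCategory.words[htCategory.words.length - 1];
  var lastOed = oedCategory.words[oedCategory.words.length - 1];
  return lastHt === lastOed;
}

console.log(looksValid(
  { words: ['leech', 'physician'] },
  { words: ['leech', 'physician'] }
)); // true
```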

I also noticed that a small number of HT categories had a parent whose combination of ‘oedmaincat’, ‘subcat’ and ‘pos’ information was not unique.  This is an error and I created a further script to list all such categories.  Thankfully there are only 98 and Fraser is going to look at these.  I also created a new stats page for our V2 matching process, which I will hopefully continue to make good progress with next week.

Week Beginning 11th June 2018

I met with Matthew Creasey from English Literature this week to discuss a project website for his recently funded ‘Decadence and Translation Network’ project.  The project website is going to be a fairly straightforward WordPress site, but there will also be a digital edition hosted through it, which will be sort of similar to what I did for the Woolf short story for the New Modernist Editing project (https://nme-digital-ode.glasgow.ac.uk/).  I set up an initial site for Matthew and will work on it further once he receives the images he’d like to use in the site design.

I also gave some further help to Craig Lamont in getting access to Google Analytics for the Ramsay project, and spoke to Quintin Cutts in Computing Science about publishing an iOS app they have created.  I also met with Graeme Cannon to discuss AHRC Data Management Plans, as he’s been asked to contribute to one and hasn’t worked with a plan yet.  I also made a couple of minor fixes to the RNSN timeline and storymap pages and updated the ‘attribution’ text on the REELS map.  There’s quite a lot of text relating to map attribution and copyright, so instead of cluttering up the bottom of the maps I’ve moved everything into a new pop-up window.  In addition to the statements about the map tilesets I’ve also added in a statement about our place-name data, the copyright statement that’s required for the parish boundaries, a note about Leaflet and attribution for the map icons too. I think it works a lot better.

Other than these issues I mainly focussed on three projects this week.  For the SCOSYA project I tackled an issue with the ‘or’ search, which was causing the search not to display results in a properly categorised manner when the ‘rated by’ option was set to more than one.  It took a while to work through the code, and my brain hurt a bit by the end of it, but thankfully I managed to figure out what the problem was.  Basically, when ‘rated by’ was set to 1 the code only needed to match a single result for a location: if it found one that matched, the code stopped looking any further.  However, when multiple matching results needed to be found, the code didn’t stop looking, but instead had to cycle through all the other results for the location, including those for other codes.  So if it found two matches that met the criteria for ‘A1’ it would still go on looking through the ‘A2’ results as well, would realise these didn’t match and set the flag to ‘N’.  I was keeping a count of the number of matches, but this part of the code was never reached if the ‘N’ flag was set.  I’ve now updated how the checking for matches works and thankfully the ‘or’ search now works when ‘rated by’ is set to more than 1.
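
In simplified terms the corrected logic works something like the sketch below (the names and the rating criterion are illustrative rather than taken from the actual atlas code):

```javascript
// Results for codes outside the selection are skipped rather than being
// treated as failures, and the loop stops as soon as enough matches have
// been counted for the location.
function locationMatches(results, selectedCodes, meetsCriteria, ratedBy) {
  var matches = 0;
  for (var i = 0; i < results.length; i++) {
    var result = results[i];
    if (selectedCodes.indexOf(result.code) !== -1 && meetsCriteria(result)) {
      matches++;
      if (matches >= ratedBy) {
        return true; // enough matching results found for this location
      }
    }
  }
  return false;
}

// e.g. an 'or' search on A1 / A2 with 'rated by' set to 2 (criterion illustrative):
// locationMatches(resultsForLocation, ['A1', 'A2'], function (r) { return r.score >= 4; }, 2);
```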

For the reworking of the Seeing Speech and Dynamic Dialects websites I decided to focus on the accent map and accent chart features of Dynamic Dialects.  For the map I switched to using the Leaflet.js mapping library rather than Google Maps.  This is mainly because I prefer Leaflet: you can use it with lots of different map tilesets, data doesn’t have to be posted to Google for the map to work, and you can zoom in and out with the scrollwheel of a mouse without having to also press ‘ctrl’, which gets really annoying with the existing map.  I’ve removed the option to switch from map to satellite and streetview as well, as these didn’t really seem to serve much purpose.  The new base map is a free map supplied by Esri (a big GIS company).  It isn’t cluttered up with commercial map markers when zoomed in, unlike Google.
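
The basic setup is along these lines (a minimal sketch only; the map div id is made up and the finished page may use a different Esri tileset):

```javascript
// Minimal sketch of the new Leaflet map, assuming a <div id="accent-map">
// on the page.  The tile URL is one of Esri's freely available basemaps.
var map = L.map('accent-map', {
  scrollWheelZoom: true  // zoom with the mouse wheel, no 'ctrl' required
}).setView([55.95, -3.19], 5);

L.tileLayer(
  'https://server.arcgisonline.com/ArcGIS/rest/services/World_Street_Map/MapServer/tile/{z}/{y}/{x}',
  { attribution: 'Tiles &copy; Esri' }
).addTo(map);
```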

You can now hover over a map marker to view the location and area details.  Clicking on a marker opens up a pop-up containing all of the information about the speaker and links to the videos as ‘play’ buttons.  Note that unlike the existing map, buttons for sounds only appear if there are actually videos for them.  E.g. on the existing map for Oregon there are links for every video type, but only one (spontaneous) actually works.

Clicking on a ‘play’ button brings down the video overlay, as with the other pages I’ve redeveloped.  As with other pages, the URL is updated to allow direct linking to the video.  Note that any map pop-up you have open does not remain open when you follow such a link, but as the location appears in the video overlay header it should be easy for a user to figure out where the relevant marker is when they close the overlay.

For the Accent Chart page I’ve added in some filter options, allowing you to limit the display of data to a particular area, age range and / or gender.  These options can be combined, and also bookmarked / shared / cited (e.g. so you can follow a link to view only those rows where the area is ‘Scotland’ the age range is ’18-24’ and the gender is ‘F’).  I’ve also added a row hover-over colour to help you keep your eye on a row.  As with other pages, click on the ‘play’ button and a video overlay drops down.  You can also cite / bookmark specific videos.
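
The bookmarking works by keeping the filters in the URL’s query string, roughly as in this sketch (the parameter names are illustrative):

```javascript
// Filters are read from the query string on load and written back whenever
// they change, so the resulting URL can be shared or cited.
function readFilters() {
  var params = new URLSearchParams(window.location.search);
  return {
    area: params.get('area') || '',
    age: params.get('age') || '',
    gender: params.get('gender') || ''
  };
}

function writeFilters(filters) {
  var params = new URLSearchParams();
  Object.keys(filters).forEach(function (key) {
    if (filters[key]) { params.set(key, filters[key]); }
  });
  // Update the address bar without reloading the page.
  history.replaceState(null, '', '?' + params.toString());
}

writeFilters({ area: 'Scotland', age: '18-24', gender: 'F' });
```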

I’ve made the table columns on this page as narrow as possible, but it’s still a lot of columns, and unless you have a very wide monitor you’re going to have to scroll to see everything.  There are two ways I can set this up.  Firstly, the table area of the page can be set to scroll horizontally.  This keeps the table within the boundaries of the page structure and looks tidier, but it means you have to scroll vertically to the bottom of the table before you see the scrollbar, which is probably going to get annoying and may be confusing.  The alternative is to allow the table to break out of the boundaries of the page.  This looks messier, but the advantage is that the horizontal scrollbar then appears at the bottom of your browser window and is always visible, even if you’re looking at the top section of the table.  I’ve asked Jane and Eleanor how they would prefer the page to work.

My final project of the week was the Historical Thesaurus.  I spent some time working on the new domain names we’re setting up for the thesaurus, and on Thursday I attended the lectures for the new lectureship post for the Thesaurus.  It was very interesting to hear the speakers and their potential plans for the Thesaurus in future, but obviously I can’t say much more about the lectures here.  I also attended the retirement do for Flora Edmonds on Thursday afternoon.  Flora has been a huge part of the thesaurus team since the early days of its switch to digital and I think she had a wonderful send-off from the people in Critical Studies she’s worked closely with over the years.

On Friday I spent some time adding the mini timelines to the search results page.  I haven’t updated the ‘live’ page yet but here’s an image showing how they will look:

It’s been a little tricky to add the mini-timelines in, as the search results page is structured rather differently to the ‘browse’ page.  However, they’re in place now, both for general ‘word’ results and for words within the ‘Recommended Categories’ section.  Note that if you’ve turned mini-timelines off on the ‘browse’ page they stay off on this page too.

We will probably want to add a few more things in before we make this page live.  We could add in the full timeline visualisation pop-up, that I could set up to feature all search results, or at least the results for the current page of search results.  If I did this I would need to redevelop the visualisation to try and squeeze in at least some of the category information and the pos, otherwise the listed words might all be the same.  I will probably try to add in each word’s category and pos, which should provide just enough context, although subcat names like ‘pertaining to’ aren’t going to be very helpful.

We will also need to consider adding in some sorting options.  Currently the results are ordered by ‘Tier’ number, but I could add in options to order results by ‘first attested date’, ‘alphabetically’ and ‘length of attestation’.  ‘Alphabetically’ isn’t going to be hugely useful if you’re looking at a page of ‘sausage’ results, but will be useful for wildcard searches (e.g. ‘*sage’) and other searches like dates.  I would imagine ordering results by ‘length of attestation’ is going to be rather useful in picking out ‘important’ words.  I’ll hopefully have some time to look into these options next week.
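
If I do add these options, the orderings would boil down to simple comparator functions along these lines (field names are hypothetical and the sample data is made up):

```javascript
// Possible orderings for the search results, applied client-side once the
// default 'Tier' ordering has been returned.
var results = [
  { word: 'sausage', startDate: 1450, endDate: 2000 },
  { word: 'saussiche', startDate: 1300, endDate: 1500 }
];

var comparators = {
  firstAttested: function (a, b) { return a.startDate - b.startDate; },
  alphabetical: function (a, b) { return a.word.localeCompare(b.word); },
  // Longest span of attestation first, to help surface 'important' words.
  attestationLength: function (a, b) {
    return (b.endDate - b.startDate) - (a.endDate - a.startDate);
  }
};

results.sort(comparators.attestationLength);
```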

Week Beginning 4th June 2018

I’d taken Friday off as a holiday this week, and I was also off on Monday afternoon to attend a funeral.  Despite being off for a day and a half I still managed to achieve quite a lot this week.  Over the weekend Thomas Clancy had alerted me to another excellent resource that has been developed by the NLS Maps people that plots the boundaries of all parishes in Scotland, which you can access here: http://maps.nls.uk/geo/boundaries/#zoom=10.671666666666667&lat=55.8481&lon=-2.5155&point=0,0.  For REELS we had been hoping to incorporate parish boundaries into our Berwickshire map but didn’t know where to get the coordinates from, and there wasn’t enough time in the project for us to manually create the data.  I emailed Chris Fleet at the NLS to ask where they’d got their data from, and whether we might be able to access the Berwickshire bits of it.  Chris very helpfully replied to say the boundaries were created by the James Hutton Institute and are hosted on the Scottish government’s Scottish Spatial Data Infrastructure Metadata Portal (see https://www.spatialdata.gov.scot/geonetwork/srv/eng/catalog.search#/metadata/c1d34a5d-28a7-4944-9892-196ca6b3be0c).  The data is free to use, so long as a copyright statement is displayed, and there’s even an API through which the data can be grabbed (see here: http://sedsh127.sedsh.gov.uk/arcgis/rest/services/ScotGov/AreaManagement/MapServer/1/query).  The data can even be outputted in a variety of formats, including shape files, JSON and GeoJSON.  I decided to go for GeoJSON, as this seemed like a pretty good fit for the Leaflet mapping library we use.

Initially I used the latitude and longitude coordinates for one parish (Abbey St Bathans) and added this to the map.  Unfortunately the polygon shape didn’t appear on the map, even though no errors were returned.  This was rather confusing until I realised that whereas Leaflet tends to expect latitude and then longitude as the order of its input data, GeoJSON specifies longitude first and then latitude.  This meant my polygon boundaries had been added to the map, just in a completely different part of the world!  It turns out that in order to use GeoJSON data in Leaflet it’s better to use Leaflet’s in-built ‘L.geoJSON’ functions (see https://leafletjs.com/examples/geojson/), which handle the coordinate order for you.  With this in place, Leaflet very straightforwardly plotted out the boundaries of my sample parish.
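
Here is a minimal example of the approach, using a made-up boundary and map div id rather than the real parish data:

```javascript
// L.geoJSON reads GeoJSON's [longitude, latitude] ordering itself, so there
// is no need to swap the coordinates by hand.
var map = L.map('reels-map').setView([55.85, -2.4], 10);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);

var parish = {
  "type": "Feature",
  "properties": { "name": "Abbey St Bathans" },
  "geometry": {
    "type": "Polygon",
    "coordinates": [[
      [-2.42, 55.85], [-2.38, 55.85], [-2.38, 55.88], [-2.42, 55.88], [-2.42, 55.85]
    ]]
  }
};

L.geoJSON(parish, {
  style: { color: '#e08b2d', weight: 2, fill: false }
}).addTo(map);
```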

I had intended to write a little script that would grab the GeoJSON data for each of the parishes in our system from the API mentioned above.  However, I noticed that when passing a text string to the API it does a partial match, and can return multiple parishes.  For example, our parish ‘Duns’ also brings back the data for ‘Dunscore’ and ‘Dunsyre’.  I figured therefore that it would be safer if I just manually grabbed the data and inserted it directly into our ‘parishes’ database.  This all worked perfectly, other than for the parish of Coldingham, which is a lot bigger than the rest, meaning the JSON data was also a lot larger.  The size of the data exceeded the server’s upload limit, which prevented me from importing it into MySQL, but thankfully Chris McGlashan was able to sort that out for me.
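
For reference, a scripted grab would have looked something like the sketch below; the field name in the ‘where’ clause is a hypothetical stand-in for whatever the layer actually calls its parish name column, and the partial-match behaviour of the text search I tried is partly why I went with the manual approach instead.

```javascript
// Sketch of querying the Scottish Government ArcGIS endpoint for one parish
// as GeoJSON (field name is hypothetical).
var endpoint = 'http://sedsh127.sedsh.gov.uk/arcgis/rest/services/ScotGov/AreaManagement/MapServer/1/query';
var params = new URLSearchParams({
  where: "name = 'Duns'",  // hypothetical field name; an exact match avoids 'Dunscore' and 'Dunsyre'
  outFields: '*',
  returnGeometry: 'true',
  f: 'geojson'
});

fetch(endpoint + '?' + params.toString())
  .then(function (response) { return response.json(); })
  .then(function (geojson) {
    console.log(geojson.features.length + ' feature(s) returned');
  });
```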

With all of the parish data in place I styled the lines a sort of orange colour that shows up fairly well on all of our base maps.  I also updated the ‘Display options’ to add in facilities to turn the boundary lines on or off.  This also meant updating the citation, bookmarking and page reloading code too.  I also wanted to add in the three-letter acronyms for each parish.  It turns out that adding plain text directly to a Leaflet map is not actually possible, or at least not easily.  Instead the text needs to be added as a tooltip on an invisible marker, the tooltip then has to be set as permanently visible, and it then has to be styled to remove the bubble around the text.  This still left the little arrow pointing to the marker, but a bit of Googling informed me that if I set the tooltip’s ‘direction’ to ‘center’ the arrowheads aren’t shown.  It all feels like a bit of a hack, and I hope that in future it’s a lot easier to add text to a map in a more direct manner.  However, I was glad to figure out a solution, and once I had manually grabbed the coordinates where I wanted the parish labels to appear I was all set.  Here’s an example of how the map looks with parish boundaries and labels turned on:
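
For reference, the invisible-marker trick looks roughly like this in code (the coordinates, the acronym, the map div id and the CSS class are illustrative):

```javascript
// An invisible marker carries a permanent tooltip, which is then styled via
// the CSS class to remove the bubble around the text.
var map = L.map('reels-map').setView([55.77, -2.34], 10);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);

L.marker([55.77, -2.34], { opacity: 0 })  // marker is fully transparent
  .addTo(map)
  .bindTooltip('DNS', {
    permanent: true,      // label is always visible, not just on hover
    direction: 'center',  // centring the tooltip also removes the arrowhead
    className: 'parish-label'  // CSS class used to strip the bubble styling
  });
```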

I had some other place-name related things to do this week.  On Wednesday afternoon I met with Carole, Simon and Thomas to discuss the Scottish Survey of Place-names, which I will be involved with in some capacity.  We talked for a couple of hours about how the approach taken for REELS might be adapted for other surveys, and how we might connect up multiple surveys to provide Scotland-wide search and browse facilities.  I can’t really say much more about it for now, but it’s good that such issues are being considered.

I spent about a day this week continuing to work on the new pages and videos for the Seeing Speech project.  I fixed a formatting issue with the ‘Other Symbols’ table in the IPA Charts that was occurring in Internet Explorer, which Eleanor had noticed last week.  I also uploaded the 16 new videos for /l/ and /r/ sounds that Eleanor had sent me, and created a new page for accessing these.  As with the IPA Charts page I worked on last week, the videos on this page open in an overlay, which I think works pretty well.  I also noticed that the videos kept on playing if you closed an overlay before the video finished, so I updated the code to ensure that the videos stop when the overlay is closed.

Other than these projects, I investigated an issue relating to Google Analytics that Craig Lamont was encountering for the Ramsay project, and I spent the rest of my time returning to the SCOSYA project.  I’d met with Gary last week and he’d suggested some further updates to the staff Atlas page.  It took a bit of time to get back into how the atlas works as it’s been a long time since I last worked on it, but once I’d got used to it again, and had created a new test version of the atlas for me to play with without messing up Gary’s access, I decided to try and figure out whether it would be possible to add in a ‘save map as image’ feature.  I had included this before, but as the atlas uses a mixture of image types (bitmap, SVG, HTML elements) for base layers and markers the method I’d previously used wasn’t saving everything.

However, I found a plugin called ‘easyPrint’ (https://github.com/rowanwins/leaflet-easyPrint) that does seem to be able to save everything.  By default it prints the map to a printer (or to PDF), but it can also be set up to ‘print’ to a PNG image.  It is a bit clunky, sometimes does weird things and only works in Chrome and Firefox (and possibly Safari, I haven’t tried, but definitely not MS IE or Edge).  It’s not going to be suitable for inclusion on the public atlas for these reasons, but it might be useful to the project team as a means of grabbing screenshots.
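
Wiring the plugin up is fairly simple; the snippet below is a rough sketch rather than the actual atlas code, the option values are illustrative, and the A3 size (added as a custom size mode in the real atlas) is omitted here:

```javascript
// 'exportOnly' makes the control save a PNG rather than sending the map to
// a printer; the control sits in the bottom right, above the zoom controls.
var map = L.map('atlas-map').setView([56.5, -4.2], 6);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);

L.easyPrint({
  position: 'bottomright',
  sizeModes: ['Current', 'A4Portrait', 'A4Landscape'],
  exportOnly: true,
  filename: 'scosya-atlas'
}).addTo(map);
```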

With the plugin added a new ‘download’ icon appears above the zoom controls in the bottom right.  If you move your mouse over this some options appear that allow you to save an image at a variety of sizes (current, A4 portrait, A4 landscape and A3 portrait).  The ‘current’ size should work without any weirdness, but the other ones have to reload the page, bringing in map tiles that are beyond what you currently see.  This is where the weirdness comes in, as follows:

  1. The page will display a big white area instead of the map while the saving of the image takes place.  This can take a few seconds.
  2. Occasionally the map tiles don’t load successfully and you get white areas in the image instead of the map.  If this happens pan around the map a bit to load in the tiles and then try saving the image again.
  3. Very occasionally when the map reloads it will have completely repositioned itself, and the map image will be of this location too.  Not sure why this is happening.  If it does happen, reposition the map and try again and things seem to work.

Once the processing is complete the image will be saved as a PNG.  If you select the ‘A3’ option the image will actually cover a much larger area than you see on your screen.  I think this will prove useful to the project team for getting higher resolution images and also for including Shetland, two issues Gary was struggling with.  Here’s a large image with Shetland in place:

That’s all for this week.

Week Beginning 28th May 2018

Monday was a bank holiday so this was a four-day working week.  The big news this week was that we went live with the new timeline and mini-timeline feature for the Historical Thesaurus.  This is a feature I started working on just for fun during a less busy period in the week before Christmas, and it’s grown and grown since then into what I think is a hugely useful addition to the site.  It’s great to see it live at last.  Marc has been showing the feature to people at a conference this week and the feedback so far has been very positive, which is excellent.  The only slight teething problem was that I inadvertently broke the Sparkline interface when I made this feature live (as the Sparkline page was using a test version of the site’s layout script that I deleted when the timelines went live).  Thankfully that was a two-second job to fix.  Anyway, here’s an example page with the timeline options available: https://historicalthesaurus.arts.gla.ac.uk/category/?type=search&qsearch=physician&word=physician&page=1#id=14766

I met with Gary Thoms this week to discuss the public interface for the SCOSYA atlas.  It looks like this is now going to be worked on later this year, possibly from September or October onward, with an aim of launching it in April next year.  We also talked about further updates to the staff version of the atlas that Gary would like to be incorporated, such as better options to save map images and facilities to select groups of locations and automatically display statistics about the group.  I’m hoping to spend some time on these updates over the next few weeks.

I also had a meeting with Thomas Clancy this week to discuss some possible future place-name projects that I might be involved with in some capacity, and I was in communication with SLD about some issues relating to the Google developer account for the Scots School dictionary.  I also fixed a minor error with the Corpus of Modern Scottish Writing that had cropped up during the move to HTTPS and gave further feedback to the latest (and possibly final) version of the Data Management Plan for Faye Hammill’s project.

Other than that I spent my time this week working on the redevelopment of the Seeing Speech and Dynamic Dialect websites for Jane Stuart-Smith.  I’d realised that we never decided how we’d redevelop the interface for the Dynamic Dialects website, so I spent some time setting this up.  As a starting point I took the same interface as for the new Seeing Speech website, but added in the Dynamic Dialects navigation structure (with links to the chart and map at the top).  I wasn’t sure what to do about the logo.  Unfortunately there is no version of the current logo on the server that doesn’t have the ‘Dynamic Dialects’ text in front of it.  Instead I found a couple of free images that might work and created mockups of the interface with them so that Jane and Eleanor could see which might work best.

I then decided to focus on the redevelopment of the Seeing Speech IPA chart interface to the videos, as I figured that in terms of content there probably wouldn’t be many changes to be made.  The charts now appear within the overall site structure rather than on an isolated page.  I’ve split the four charts into separate tabs.  Within each tab there are then buttons for setting the video type and speaker.  The charts all now use the ‘Doulos SIL’ font automatically, so no need to worry about missing symbols.

I’ve added a line of help text above the tables just in case people don’t know they can click on a symbol to open videos.  I can change this text if required.  The charts themselves should be pretty much identical to the existing charts.  The only difference is I’ve removed the hover-over title text, as to me it didn’t seem necessary for things like ‘U+00F0: LATIN SMALL LETTER ETH’ to be visible.  One other tiny difference is I’ve greyed out the ‘Affricates and double articulation’ symbols in the ‘other’ tab as these don’t have videos.

Regarding the videos, these now open in an overlay rather than in a new browser window.  The page greys out and the overlay drops down from the top.  When you click outside of the overlay, or on the ‘close’ button in the top right of the overlay, the page fades back into view and the overlay slides up the screen and disappears.  Most browsers now also display a ‘full screen’ button in the video player options if people want to see a bigger video, and some browsers (e.g. Chrome) also give the user a ‘download’ option.  When the video overlay is open an extra ID is added to the browser’s address bar.  If you copy the full URL when the overlay is open you can then link to a specific video.  This means we could add ‘cite’ text to the overlay to allow people to cite specific videos.  When you close the overlay the information is removed from the address bar, to allow people to bookmark / cite the full page.
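
Behind the scenes the deep-linking is very simple; something along these lines (a sketch only, with hypothetical element and video ids):

```javascript
// Opening an overlay puts the video id in the address bar so the URL can be
// copied; closing it removes the id again.
function openOverlay(videoId) {
  document.getElementById('video-overlay').classList.add('open');
  history.replaceState(null, '', '#' + videoId);
}

function closeOverlay() {
  document.getElementById('video-overlay').classList.remove('open');
  // Strip the hash so the plain page can be bookmarked or cited.
  history.replaceState(null, '', window.location.pathname + window.location.search);
}

// On page load, reopen the overlay if the URL already contains a video id.
if (window.location.hash) {
  openOverlay(window.location.hash.substring(1));
}
```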

I haven’t copied all of the copyright text across as it seemed a bit confusing.  The link to the International Phonetic Association was broken and it was unclear why the chart has copyright attributed to three organisations.  The ‘Weston Ruter’ one is particularly confusing as the link just leads to a personal website for a WordPress developer.  So for now what is displayed is ‘Charts reprinted with permission from The International Phonetic Association’ (with a link to https://www.internationalphoneticassociation.org/).

In terms of responsiveness (i.e. things working on all screen sizes), I’ve tested things out on my phone and the charts and video overlays work fine.  The tabs end up stacked vertically, which I think is fine.  Once the screen narrows beyond a certain point the tables (particularly the pulmonic consonants table) stop getting narrower and instead a scrollbar appears underneath the table.  This ensures the structure of the table is never compromised (i.e. no dropping down of columns onto new lines or anything).  As this feature and the new site design are still in development I can’t post any screenshots yet, but I think it’s coming along nicely.  Eleanor noticed some strange formatting with one of the tables in Internet Explorer, so I’ll have to investigate this next week.