Week Beginning 17th June 2019

I seem to be heading through a somewhat busy patch at the moment, and had to focus my efforts on five major projects and several other smaller bits of work this week.  The major projects were SCOSYA, Books and Borrowing, DSL, HT and Bess of Hardwick’s Account books.  For SCOSYA I continued to implement the public atlas, this week focussing on the highlighting of groups.  I had hoped that this would be a relatively straightforward feature to implement, as I had already created facilities to create and view groups in the atlas I’d made for the content management system.  However, it proved to be much trickier than I’d anticipated as I’d rewritten much of the atlas code in order to incorporate the GeoJSON areas as well as purely point-based data, plus I needed to integrate the selection of groups and the loading of group locations with the API.  My existing code for finding the markers for a specified group and adding a coloured border was just not working, and I spent a frustratingly long amount of time debugging the code to find out what had changed to stop the selection from finding anything.  It turned out that in my new code I was reinstantiating the variable I was using to hold all of the point data within a function, meaning that the scope of the variable containing the data was limited to that function rather than being available to other functions.  Once I figured this out it was a simple fix to make the data available to the parts of the code that needed to find and highlight relevant markers and I then managed to make groups of markers highlight or ‘unhighlight’ at the press of a button, as the following screenshot demonstrates:

You can now select one or more groups and the markers in the group are highlighted in green.  Press a group button a second time to remove the highlighting.  However, there is still a lot to be done.  For one thing, only the markers highlight, not the areas.  It’s proving to be rather complicated to get the areas highlighted as these GeoJSON shapes are handled quite differently to markers.  I spent a long time trying to get the areas to highlight without success and will need to return to this another week.  I also need to implement highlighting in different colours, so each group you choose to highlight is given a different colour to the last.  Also, I need to find a way to make the selected groups be remembered as you change from points to areas to both, and change speaker type, and also possibly as you change between examples.  Currently the group selection resets but the selected group buttons remain highlighted, which is not ideal.

I also spend time this week on the pilot project for Matthew Sangster’s Books and Borrowing project, which is looking at University student (and possibly staff) borrowing records from the 18th century.  Matthew has compiled a spreadsheet that he wants me to create a searchable / browsable online resource for and my first task was to extract the data from the spreadsheet, create an online database and write a script to migrate the data to this database.  I’ve done this sort of task many times before, but unfortunately things are rather more complicated this time because Matthew has included formatting within the spreadsheet that needs to be retained in the online version.  This includes superscript text throughout the more than 8000 records and simply saving the spreadsheet as a CSV file and writing a script to go through each cell and upload the data won’t work as the superscript style will be lost in the conversion to CSV.  PHPMyAdmin also includes a facility to import a spreadsheet in the OpenDocument format, but unfortunately this not only removes the superscript format but also the text that is specified as superscript as well.

Therefore I had to investigate other ways of getting the data out of the spreadsheet while somehow retaining the superscript formatting.  The only means of doing so that I could think of was to save the spreadsheet as an HTML document, which would convert Excel’s superscript formatting into HTML superscript tags, which is what we’d need for displaying the data on a website anyway.  Unfortunately the HTML generated by Excel is absolutely awful and filled with lots of unnecessary junk that I then needed to strip out manually.  I managed to write a script that extracted the data (including the formatting for superscript) and import this into the online database for about 8000 of the 8200 rows, but the remainder had problems that prevented the insertion from taking place.  I’ll need to think about creating multiple passes for the data when I return to it next week.

For the DSL this week I spent rather a lot of time engaged in email conversations with Rhona Alcorn about the tasks required to sort out the data that the team have been working on for several years and which now needs to be extracted from older systems and migrated to a new system, plus the API that I am working on.  It looked like there would be a lot of work for me to do with this, but thankfully midway through the week it became apparent that the company who are supplying the new system for managing the DSL’s data have a member of staff who is expecting to do a lot of the tasks that had previously been assigned to me.  This is really good news as I was beginning to worry about the amount of work I wold have to do for the DSL and how I would fit this in around other work commitments.  We’ll just need to see how this all pans out.

I also spent some time implementing a Boolean search for the new DSL API.  I now have this in place and working for headword searches, which can be performed via the ‘quick search’ box on the test sites I’ve created.  It’s possible to use Boolean AND, OR and NOT (all must be entered upper case to be picked up) and a search can be used in combination with wildcards, and speech-marks can now be used to specify an exact search.  So, for example, if you want to find all the headwords beginning with ‘chang’ but wish to exclude results for ‘change’ and ‘chang’ you can enter ‘chang* NOT “change” NOT “chang”’.

OR searches are likely to bring back lots of results and at the moment I’ve not put a limit on the results, but I will do so before things go live.  Also, while there are no limits on the number of Booleans that can be added to a query, results when using multiple Booleans are likely to get a little weird due to there being multiple ways a query could be interpreted.  E.g. ‘Ran* OR run* NOT rancet’ still brings back ‘rancet’ because the query is interpreted as ‘get all the ‘ran*’ results OR all the ‘run*’ results so long as they don’t include ‘rancet’ – so ran* OR (run* NOT rancet).  But without complicating things horribly with brackets or something similar there’s no way of preventing such ambiguity when multiple different Booleans are used.

For the Historical Thesaurus I met with Marc and Fraser on Monday to discuss our progress with the HT / OED linking and afterwards continued with a number of tasks that were either ongoing or had been suggested at the meeting.  This included ticking off some matches from a monosemous script, creating a new script that brings back up to 1000 random unmatched lexemes at a time for spot-checking and creating an updated Levenshtein script for lexemes, which is potentially going to match a further 5000 lexemes.  I also wrote a document detailing how I think that full dates should be handled in the HT, to replace the rather messy way dates are currently recorded.  We will need to decide on a method in order to get the updated dates from the OED into a comparable format.

Also this week I returned to Alison Wiggins’s Account Books project, or rather a related output about the letters of Mary, Queen of Scots.  Alison had sent me a database containing a catalogue of letters and I need to create a content management system to allow her and other team members to work on this together.  I’ve requested a new subdomain for this system and have begun to look at the data and will get properly stuck into this next week, all being well.

Other than these main projects I also gave feedback on Thomas Clancy’s Iona project proposal, including making some changes to the Data Management Plan, helped sort out access to logo files for the Seeing Speech project, sorted out an issue with the Editing Burns blog that was displaying no content since the server upgrade (it turns out it was using a very old plugin that was not compatible with the newer version of PHP on the server) and helped sort out some app issues.  All in all a very busy week.