I returned to work on Monday after being off last week. As usual there were a bunch of things waiting for me to sort out when I got back, so most of Monday was spent catching up with things. This included replying to Scott Spurlock about his Crowdsourcing project, responding to a couple of DSL related issues, updating access restrictions on the SPADE website, reading through the final versions of the DMP and other documentation for Matt Sangster and Katie Halsey’s project, updating some details on the Medical Humanities Network website, responding to a query about the use of the Thesaurus of Old English and speaking to Thomas Clancy about his Iona proposal.
With all that out of the way I returned to the OED / HT data linking issues for the Historical Thesaurus. In my absence last week Marc and Fraser had made some further progress with the linking, and had made further suggestions as to which strategies I should attempt to implement next. Before I left I was very much in the middle of working on a script that matched words and dates, and I hadn’t had time to figure out why this script was bringing back no matches. It turns out the HT ‘fulldate’ field was using en dashes, whereas I was joining the OED GHT dates with a plain hyphen, so every match failed. I replaced the en dashes with hyphens and the script then displayed 2733 ‘full matches’ (where every stripped lexeme and its dates match) and 99 ‘partial matches’ (where more than 6 words and more than 80% of words match on both dates and stripped lexeme text). I also added in a new column that counts the number of matches not including dates.
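The fix itself is trivial once spotted; a minimal sketch of the kind of normalisation involved (function name and sample dates are mine, purely for illustration):

```python
def normalise_date(date_str):
    """Replace en/em dashes with a plain hyphen before comparing date strings."""
    return date_str.replace("\u2013", "-").replace("\u2014", "-")

ht_fulldate = "1440\u20131886"   # as stored in the HT, with an en dash
oed_date = "1440-1886"           # OED GHT dates joined with a short dash

assert ht_fulldate != oed_date                   # the original silent failure
assert normalise_date(ht_fulldate) == oed_date   # matches once normalised
```

It is a good reminder that visually identical strings can still fail an equality check.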
Marc had alerted me to an issue where the number of OED matches was coming back as more than 100%, so I then spent some time trying to figure out what was going on here. I updated both the ‘with dates’ and ‘no date check’ versions of the lexeme pattern matching scripts to add the text ‘perc error’ to any percentage that’s greater than 100, to make all occurrences easier to search for. There are none to be found in the script with dates, as matches are only added to the percentage score if their dates match too. On the ‘no date check’ script there are several of these ‘perc error’ rows, and they’re caused for the most part by a stripped form of a word being identical to an existing non-stripped form. E.g. there are separate lexemes ‘she’ and ‘she-’ in the HT data, and the dash gets stripped, so ‘she’ in the OED data ends up matching two HT words. There are some other cases that look like errors in the original data, though. E.g. in OED catid 91505 ‘severity’ there are the HT words ‘hard (OE-)’ and ‘hard (c1205-)’, and we surely shouldn’t have this word twice. Finally there are some forms where stripping out words results in duplicates – e.g. ‘pro and con’ and ‘pro or con’ both end up as ‘pro con’ in both OED and HT lexemes, leading to 4 matches where there should only be 2. There are no doubt situations where a duplicate match pushes the total percentage over the 80% threshold or up to 100% – any duplicate matches where the percentage doesn’t get over 100 are not currently noted in the output. This might need some further work. Or, as I previously said, with the date check incorporated the duplicates are already filtered out, so it might not be so much of an issue.
I also then moved on to a new script that looks at monosemous forms. This script gets all of the unmatched OED categories that have a POS and at least one word and for each of these categories it retrieves all of the OED words. For each word the script queries the OED lexeme table to get a count of the number of times the word appears. Note that this is the full word, not the ‘stripped’ form, as the latter might end up with erroneous duplicates, as mentioned above. Each word, together with its OED date and GHT dates (in square brackets) and a count of the number of times it appears in the OED lexeme table is then listed. If an OED word only appears once (i.e. is monosemous) it appears in bold text. For each of these monosemous words the script then queries the HT data to find out where and how many times each of these words appears in the unmatched HT categories. All queries keep to the same POS but otherwise look at all unmatched categories, including those without an OEDmaincat. Four different checks are done, with results appearing in different columns: HT words where full word (not the stripped variety) matches and the GHT start date matches the HT start date; failing that, HT words where the full word matches but the dates don’t; failing either of these, HT words where the stripped forms of the words match and the dates match; failing all these, HT words where the stripped forms match but the dates don’t. For each of these the HT catid, OEDmaincat (or the text ‘No Maincat’ if there isn’t one), subcat, POS, heading, lexeme and fulldate are displayed. There are lots of monosemous words that just don’t appear in the HT data. These might be new additions or we might need to try pattern matching. Also, sometimes words that are monosemous in the OED data are polysemous in the HT data. These are marked with a red background in the data (as opposed to green for unique matches). Examples of these are ‘sedimental’, ‘meteorologically’, ‘of age’. 
Any category that has a monosemous OED word that is polysemous in the HT has a red border. I also added in some stats below the table. In our unmatched OED categories there are 24184 monosemous forms. There are 8086 OED categories that have at least one monosemous form that matches exactly one HT form. There are 220 OED monosemous forms that are polysemous in the HT. Now we just need to decide how to use this data.
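The four fallback checks described above can be sketched as a chain that stops at the first tier producing any matches (the field names like ‘stripped’ and ‘start’ are my own shorthand, not the actual database columns):

```python
def match_monosemous(oed_word, oed_start, ht_words):
    """Apply the four checks in priority order.

    oed_word: dict with 'full' and 'stripped' forms of the OED word.
    ht_words: list of dicts with 'word', 'stripped' and 'start' keys,
              drawn from unmatched HT categories of the same POS.
    Returns (check_number, matches) for the first check that finds any.
    """
    checks = [
        # 1: full word matches and GHT start date matches HT start date
        lambda h: h["word"] == oed_word["full"] and h["start"] == oed_start,
        # 2: full word matches but the dates don't
        lambda h: h["word"] == oed_word["full"],
        # 3: stripped forms match and the dates match
        lambda h: h["stripped"] == oed_word["stripped"] and h["start"] == oed_start,
        # 4: stripped forms match but the dates don't
        lambda h: h["stripped"] == oed_word["stripped"],
    ]
    for i, check in enumerate(checks, start=1):
        matches = [h for h in ht_words if check(h)]
        if matches:
            return i, matches
    return None, []
```

A word that is monosemous in the OED but polysemous in the HT would simply come back with more than one match in whichever tier fires first.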
Also this week I looked into an issue one of the REELS team was having when accessing the content management system (it turns out that some anti-virus software was mislabelling the site as having some kind of phishing software in it), and responded to a query about the Decadence and Translation Network website I’d set up. I also started to look at sourcing some Data Management Plans for an Arts Lab workshop that Dauvit Broun has asked me to help with next week. I also started to prepare my presentation for the Digital Editions workshop next week, which took a fair amount of time. I also met with Jennifer Smith and a new member of the SCOSYA project team on Friday morning to discuss the project and to show the new member of staff how the content management system works. It looks like my involvement with this project might be starting up again fairly soon.
On Tuesday Jeremy Smith contacted me to ask me to help out with a very last minute proposal that he is putting together. I can’t say much about the proposal, but it had a very tight deadline and required rather a lot of my time from the middle of the week onwards (and even into the weekend). This involved lots of email exchanges, time spent reading documentation, meeting with Luca, who might be doing the technical work for the project if it gets funded, and writing a Data Management Plan for the project. This all meant that I was unable to spend time working on other projects I’d hoped to work on this week, such as the Bilingual Thesaurus. Hopefully I’ll have time to get back into this next week, once the workshops are out of the way.
This was a week of many different projects, most of which required fairly small jobs doing, but some of which required most of my time. I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app. I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on. I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and I had a chat with Rachel Macdonald about some further updates to the SPADE website. I also made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies. I also had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly. I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley. I also made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.
I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website. There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through. I also need to look into developing an overarching timeline with contextual events.
I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock. Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image. I updated the site so that users can no longer right click on an image to save or view it. This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images. If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image. I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.
I also spent a bit of time continuing to work on the Bilingual Thesaurus. I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously. This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’. I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me. I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”. My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”). I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one. If needs be I’ll change this. Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus. I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this. I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout. Next week I’ll see about starting to integrate the data itself.
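The dedupe rule I settled on (collapse duplicate languages, and where both a certain and an uncertain form exist, keep the uncertain one) can be sketched roughly like this, using the pipe-delimited format from the original data:

```python
def dedupe_languages(raw):
    """Collapse duplicate languages in a pipe-delimited field.

    Where both "Old English" and "?Old English" appear, keep only the
    uncertain "?Old English" form, pending confirmation from Louise.
    """
    langs = raw.split("|")
    # languages that appear at least once in uncertain ("?") form
    uncertain = {l[1:] for l in langs if l.startswith("?")}
    result = []
    for l in langs:
        base = l.lstrip("?")
        keep = "?" + base if base in uncertain else base
        if keep not in result:
            result.append(keep)
    return result

dedupe_languages("Old English|?Old English|Middle Dutch|Middle Dutch|Old English")
# one entry per language, with the uncertain form preferred
```

If Louise tells me a word really can carry both the certain and uncertain connection, the rule is easy to reverse.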
This just leaves the big project of the week to discuss: the ongoing work to align the HT and OED datasets. I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week. Last week I ‘dematched’ 2412 categories that have the same parent category but not a perfect lexeme-count match. I created a further script that checks how many lexemes in these potentially matched categories are the same. This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped). A percentage of the number of HT words that are matched is also displayed. If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green. If the number of HT words is the same as the total number of matches and the count of OED words is within 1 of the number of HT words this is also considered a match, and likewise if the number of OED words is the same as the total number of matches and the count of HT words is within 1 of the number of OED words. The total matches given are 1154 out of 2412.
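That tolerance rule reads a little densely in prose, so here is my reading of it as a small function (a sketch, not the actual script):

```python
def is_category_match(ht_count, oed_count, matched):
    """Decide whether a potentially matched category pair counts as a match.

    ht_count / oed_count: number of words in the HT and OED categories.
    matched: number of identical (stripped) words shared between them.
    """
    if ht_count == oed_count == matched:
        return True   # perfect match: the green row
    if matched == ht_count and abs(oed_count - ht_count) == 1:
        return True   # every HT word matched, OED count off by one
    if matched == oed_count and abs(ht_count - oed_count) == 1:
        return True   # every OED word matched, HT count off by one
    return False
```

So a pair like 5 HT words, 6 OED words and 5 shared words still counts, but 5 / 7 / 5 does not.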
I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process. There are 1407 manual matches in the system. Of these:
- 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1 and 100% of HT words match OED words, or the categories are empty)
- There are 205 rows where all words match or the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1, or the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1
- There are 122 rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
- There are 18 part of speech mismatches
- There are 267 rows where nothing matches
I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches. The following patterns were attempted:
- inhabitant of the -> inhabitant
- inhabitant of -> inhabitant
- relating to -> pertaining to
- spec. -> specific
- spec -> specific
- specific -> specifically
- assoc. -> associated
- esp. -> especially
- north -> n.
- south -> s.
- january -> jan.
- march -> mar.
- august -> aug.
- september -> sept.
- october -> oct.
- november -> nov.
- december -> dec.
- Levenshtein difference of 1
- Adding ‘ing’ onto the end
The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum. Where there is a matching category number, the lexeme count / last lexeme / total matched lexeme checks described above are applied and rows are colour coded accordingly.
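For illustration, the pattern pass might be sketched like this, with an abbreviated version of the pattern list above (the real script presumably runs over the category table; note that ordering matters, e.g. trying ‘inhabitant of the’ before ‘inhabitant of’):

```python
# Abbreviated pattern list mirroring the bullets above (illustrative only)
PATTERNS = [
    ("inhabitant of the", "inhabitant"),
    ("inhabitant of", "inhabitant"),
    ("relating to", "pertaining to"),
    ("spec.", "specific"),
    ("assoc.", "associated"),
    ("esp.", "especially"),
    ("north", "n."),
    ("south", "s."),
    ("january", "jan."),
    ("december", "dec."),
]

def candidate_headings(heading):
    """Yield rewritten headings to test against the HT category headings."""
    for old, new in PATTERNS:
        if old in heading:
            yield heading.replace(old, new)
    yield heading + "ing"   # the 'adding ing onto the end' check
```

Each candidate is then compared against the HT headings (with a Levenshtein distance of 1 also accepted) to see whether the rewrite produces a match.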
On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week. It feels like real progress is being made.
It’s been a rather full-on week this week. It always tends to get busy for me at this time of year as staff tend to return to work after their holidays itching to get things done before the new term starts, which tends to mean lots of requests come my way. I also needed to get things done before the end of the week as I’m attending the ICEHL conference (http://www.conferences.cahss.ed.ac.uk/icehl20/) in Edinburgh all next week so won’t be able to do much regular work then.
I spent quite a bit of time on Historical Thesaurus related duties this week. This included reading through the new proposal Fraser is working on and commenting on it, and continuing the email discussion about the new thesaurus that we’re hoping to host. I met with Marc and Fraser on Tuesday morning to continue our discussions about the integration of the HT and the OED data, and we made a bit more progress on the linking up of the categories. I have to say I wasn’t feeling all that great on Tuesday, Wednesday and Thursday this week and it was a bit of a struggle to make it through these days, so I don’t feel like I contributed as much to this meeting as I would normally have done. However, struggle through the days I did, and by Friday I was thankfully back to normal again.
Fraser is presenting a session about the HT visualisations at a workshop at ICEHL next week and I was asked to write some slides about the visualisations for him to use. When it came down to writing these I figured that a few slides in isolation without background material and context would not be much use, so instead I ended up writing a document about the four types of visualisations that are currently part of the HT website. This took some time, and ended up being almost 3000 words in length, but I felt it was useful to document how and why the visualisations were created – for future reference for me if not for Fraser to use next week!
I had my PDR session on Tuesday afternoon, which took up a fair bit of time, both attending the meeting and acting on some of the outcomes from the meeting. Overall I think it went really well, though.
Gerry Carruthers emailed me this week about a new proposal he is in the middle of writing. He wanted me to supply a few sentences about the technical aspects of the proposal, so after reading through his document and sending a few questions his way I spent a bit of time writing the required sections.
I also met with Matthew Sangster from English Literature and Katie Halsey from Stirling, who are in the middle of putting a proposal together. I’m going to write the Data Management Plan for the proposal and will also undertake the development work. I can’t really go into any details about the project at this stage, but it seems like just the sort of project I enjoy being involved with, and I managed to give some suggestions and feedback on their existing documentation.
Bryony Randall from English Literature was also in touch this week, asking if I would like to participate in a workshop she’s hoping to run in the Autumn. I said I would help out and she introduced me by email to Ronan Crowley, who will also be involved in the workshop. We had an email conversation about what the workshop should contain and other such matters – a conversation that will no doubt continue over the coming weeks.
As mentioned earlier, I’ll be at the ICEHL conference next week, so the next report will be more of a conference review than a summary of work done.
I spent a fair amount of time this week working on Historical Thesaurus duties following our team meeting on Friday last week. At the meeting Marc had mentioned another thesaurus that we are potentially going to host at Glasgow, the Bilingual Thesaurus of Medieval England. Marc gave me access to the project’s data and I spent some time looking through it and figuring out how we might be able to develop an online resource for it that would be comparable to the thesauri we currently host. I met with Marc and Fraser again on Tuesday to discuss the ongoing issue of matching up the HT and OED datasets, and prior to the meeting I spent some time getting back up to speed with the issues involved and figuring out where we’d left off. I also created some CSV files containing the unmatched data for us to use.
The meeting itself was pretty useful and I came out of it with a list of several new things to do, which I focussed on for much of the remainder of the week. This included writing a script that goes through each unmatched HT category, brings back the non-OE words and compares these with the ght_lemma field of all the words in unmatched OE categories. The script outputs a table featuring information about the categories as well as the words, and I think the output will be useful for identifying unmatched categories as well as words contained therein. Also at the meeting we’d noticed that if you perform a search on the front-end that contains an apostrophe the search itself works, but following a link in the search results to a word that also contains an apostrophe wasn’t working. I added in a bit of urlencoding magic and that sorted the issue.
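The apostrophe fix boils down to percent-encoding the word before it is embedded in the link; in Python terms (the live site does the equivalent in its own server-side language), an apostrophe becomes %27 and the round trip is lossless:

```python
from urllib.parse import quote, unquote

word = "bo's'n"                 # a hypothetical word with apostrophes
encoded = quote(word, safe="")  # percent-encode everything non-unreserved

assert "%27" in encoded          # apostrophes no longer break the URL
assert unquote(encoded) == word  # decoding recovers the original word
```

Without the encoding, the apostrophe terminates the quoted attribute in the generated link, which is exactly the breakage we saw in the search results.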
I also created a few more scripts aimed at identifying categories and words to match (or to identify things that would have no matches). This included a script to display unmatched HT and OED categories that have non-alphanumeric characters in them, a CSV output of HT words that excludes OE words (as the OED does not include OE words), and another script that identifies categories that have ‘pertaining to’ in their headings.
I also created a script that generated the full hierarchical pathway for each unmatched HT and OED category and then ran a Levenshtein test to figure out which OED path was the closest to which HT path (in the same part of speech). It took the best part of a morning to write the script, and the script itself took about 30 minutes to run, but unfortunately the output is not going to be much use in identifying potential matches.
For every unmatched HT category the script currently displays the OED category with the lowest Levenshtein score when comparing the full hierarchy of each. There’s very little in the way of matches that are of any value, but things might improve with some tweaking. As it stands the script generates the full HT hierarchy within the chosen POS, meaning for non-nouns the hierarchy generally doesn’t go all the way to the top. I could potentially use the noun hierarchy instead. Similarly, for the OED data I’ve kept within the POS, which means it hasn’t taken into consideration the top level OED categories that have no POS. Also, rather than generating the full hierarchy we might have more luck if we just looked at a smaller slice, for example two levels up from the current main cat, plus full subcat hierarchy. But even this might result in some useless results – e.g. the HT adverb ‘>South>most’ currently has as its closest match the OED adverb ‘>four>>four’ with a Levenshtein score of 6. But clearly it’s not a valid match.
My final script was one that identifies empty HT categories (or those that only include OE words). I figured that a lot of these probably don’t need to match up to an OED category. I also included any empty OED categories (not including the ‘top level’ OED categories that have no part of speech and are empty). Out of the 12034 unmatched HT cats 4977 are empty or only contain OE words. Out of the 6648 unmatched OED categories that have a POS there are 1918 that are empty. Hopefully we can do something about ticking these off as checked at some point.
While going through this data I made a slightly worrying discovery: at the meeting we’d found an OED word that referenced an OED category ID that didn’t exist in our database, which seemed rather odd. The next day I discovered another, and I figured out what was going on. It would appear that when uploading the OED data from their XML files to our database any OED category or word that included an apostrophe silently failed to upload. This unfortunately is not good news, as it means many potential matches that should have been spotted by the countless sweeps through the data we’ve already done have been missed due to the corresponding OED data simply not being there. I ran the XML through another script to count the OED categories and words that include apostrophes and there are 1843 categories and 26,729 words (the latter because apostrophes in word definitions also caused words to fail to upload). It’s something we’re going to have to investigate next week. However, it does mean we should be able to match up more HT categories and words than we had previously matched, which is at least a sort of silver lining.
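I can’t yet be certain of the exact cause, but silent failures on apostrophes are the classic symptom of building SQL statements by string concatenation; bound parameters sidestep the problem entirely. A minimal sketch using sqlite3 (the real database, table and columns will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oed_lexeme (word TEXT)")

# A word like "bo's'n" would break a naively concatenated INSERT
# statement; passed as a bound parameter it is stored intact.
conn.execute("INSERT INTO oed_lexeme (word) VALUES (?)", ("bo's'n",))

row = conn.execute("SELECT word FROM oed_lexeme").fetchone()
assert row[0] == "bo's'n"
```

Re-running the upload with parameterised inserts (and logging any rows that still fail, rather than failing silently) should recover the missing 1843 categories and 26,729 words.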
Other than HT duties I did small bits of work for a number of different projects. I generated some data for Carole for the REELS project from the underlying database, and investigated a possible issue with the certainty levels for place-names (which thankfully turn out to not be an issue at all). I also responded to a couple of queries from Thomas Widmann of SLD, started to think about the new Galloway Glens place-name project and updated the images and image credits that appear on Matthew Creasy’s Decadence and Translation website.
I also spent the best part of a day preparing for this year’s P&DR process, ahead of my meeting next week.
I met with Matthew Creasy from English Literature this week to discuss a project website for his recently funded ‘Decadence and Translation Network’ project. The project website is going to be a fairly straightforward WordPress site, but there will also be a digital edition hosted through it, which will be sort of similar to what I did for the Woolf short story for the New Modernist Editing project (https://nme-digital-ode.glasgow.ac.uk/). I set up an initial site for Matthew and will work on it further once he receives the images he’d like to use in the site design.
I also gave some further help with Craig Lamont in getting access to Google Analytics for the Ramsay project, and spoke to Quintin Cutts in Computing Science about publishing an iOS app they have created. I also met with Graeme Cannon to discuss AHRC Data Management Plans, as he’s been asked to contribute to one and hasn’t worked with a plan yet. I also made a couple of minor fixes to the RNSN timeline and storymap pages and updated the ‘attribution’ text on the REELS map. There’s quite a lot of text relating to map attribution and copyright so instead of cluttering up the bottom of the maps I’ve moved everything into a new pop-up window. In addition to the statements about the map tilesets I’ve also added in a statement about our place-name data, the copyright statement that’s required for the parish boundaries, a note about Leaflet and attribution for the map icons too. I think it works a lot better.
Other than these issues I mainly focussed on three projects this week. For the SCOSYA project I tackled an issue with the ‘or’ search that was causing the search to not display results in a properly categorised manner when the ‘rated by’ option was set to more than one. It took a while to work through the code, and my brain hurt a bit by the end of it, but thankfully I managed to figure out what the problem was. Basically, when ‘rated by’ was set to 1 the code only needed to match a single result for a location: if it found one that matched, the code stopped looking any further. However, when multiple matching results needed to be found, the code didn’t stop looking, but instead had to cycle through all other results for the location, including those for other codes. So if it found two matches that met the criteria for ‘A1’ it would still go on looking through the ‘A2’ results as well, would realise these didn’t match and set the flag to ‘N’. I was keeping a count of the number of matches, but this part of the code was never reached if the ‘N’ flag was set. I’ve now updated how the checking for matches works and thankfully the ‘or’ search now works when you set ‘rated by’ to be more than 1.
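My reading of the fix, as a sketch: tally matches for the target code only, and test the tally against the ‘rated by’ threshold after the scan, so that rows for other codes can never reset the result (all names here are illustrative, not the actual SCOSYA code):

```python
def location_matches(results, code, min_rating, rated_by):
    """Decide whether a location qualifies for one attribute code.

    results: list of (code, rating) tuples for a single location.
    Rows for other codes are simply ignored rather than being allowed
    to flip a shared flag to 'N' partway through the scan.
    """
    count = sum(1 for c, r in results if c == code and r >= min_rating)
    return count >= rated_by

results = [("A1", 4), ("A1", 5), ("A2", 1)]
assert location_matches(results, "A1", 4, 2)      # two A1 ratings of 4+
assert not location_matches(results, "A2", 4, 1)  # the A2 row is too low
```

The key change is that the count is always completed and compared at the end, instead of a per-row flag short-circuiting the logic.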
For the reworking of the Seeing Speech and Dynamic Dialects websites I decided to focus on the accent map and accent chart features of Dynamic Dialects. For the map I switched to using the Leaflet.js mapping library rather than Google Maps. This is mainly because I prefer Leaflet: you can use it with lots of different map tilesets, data doesn’t have to be posted to Google for the map to work, and you can zoom in and out with the scrollwheel of a mouse without having to also press ‘ctrl’, which gets really annoying with the existing map. I’ve also removed the options to switch from map to satellite and streetview, as these didn’t really seem to serve much purpose. The new base map is a free map supplied by Esri (a big GIS company). It isn’t cluttered up with commercial map markers when zoomed in, unlike Google.
You can now hover over a map marker to view the location and area details. Clicking on a marker opens up a pop-up containing all of the information about the speaker and links to the videos as ‘play’ buttons. Note that unlike the existing map, buttons for sounds only appear if there are actually videos for them. E.g. on the existing map for Oregon there are links for every video type, but only one (spontaneous) actually works.
Clicking on a ‘play’ button brings down the video overlay, as with the other pages I’ve redeveloped. As with other pages, the URL is updated to allow direct linking to the video. Note that any map pop-up you have open does not remain open when you follow such a link, but as the location appears in the video overlay header it should be easy for a user to figure out where the relevant marker is when they close the overlay.
For the Accent Chart page I’ve added in some filter options, allowing you to limit the display of data to a particular area, age range and / or gender. These options can be combined, and also bookmarked / shared / cited (e.g. so you can follow a link to view only those rows where the area is ‘Scotland’ the age range is ’18-24’ and the gender is ‘F’). I’ve also added a row hover-over colour to help you keep your eye on a row. As with other pages, click on the ‘play’ button and a video overlay drops down. You can also cite / bookmark specific videos.
I’ve made the table columns on this page as narrow as possible, but it’s still a lot of columns and unless you have a very wide monitor you’re going to have to scroll to see everything. There are two ways I can set this up. Firstly the table area of the page itself can be set to scroll horizontally. This keeps the table within the boundaries of the page structure and looks more tidy, but it means you have to vertically scroll to the bottom of the table before you see the scrollbar, which is probably going to get annoying and may be confusing. The alternative is to allow the table to break out of the boundaries of the page. This looks messier, but the advantage is the horizontal scrollbar then appears at the bottom of your browser window and is always visible, even if you’re looking at the top section of the table. I’ve asked Jane and Eleanor how they would prefer the page to work.
My final project of the week was the Historical Thesaurus. I spent some time working on the new domain names we’re setting up for the thesaurus, and on Thursday I attended the lectures for the new lectureship post for the Thesaurus. It was very interesting to hear the speakers and their potential plans for the Thesaurus in future, but obviously I can’t say much more about the lectures here. I also attended the retirement do for Flora Edmonds on Thursday afternoon. Flora has been a huge part of the thesaurus team since the early days of its switch to digital and I think she had a wonderful send-off from the people in Critical Studies she’s worked closely with over the years.
On Friday I spent some time adding the mini timelines to the search results page. I haven’t updated the ‘live’ page yet but here’s an image showing how they will look:
It’s been a little tricky to add the mini-timelines in as the search results page is structured rather differently to the ‘browse’ page. However, they’re in place now, both for general ‘word’ results and for words within the ‘Recommended Categories’ section. Note that if you’ve turned mini-timelines off in the ‘browse’ page they stay off on this page too.
We will probably want to add a few more things in before we make this page live. We could add in the full timeline visualisation pop-up, which I could set up to feature all search results, or at least the results for the current page of search results. If I did this I would need to redevelop the visualisation to try and squeeze in at least some of the category information and the pos, otherwise the listed words might all be the same. I will probably try to add in each word’s category and pos, which should provide just enough context, although subcat names like ‘pertaining to’ aren’t going to be very helpful.
We will also need to consider adding in some sorting options. Currently the results are ordered by ‘Tier’ number, but I could add in options to order results by ‘first attested date’, ‘alphabetically’ and ‘length of attestation’. ‘Alphabetically’ isn’t going to be hugely useful if you’re looking at a page of ‘sausage’ results, but will be useful for wildcard searches (e.g. ‘*sage’) and other searches like dates. I would imagine ordering results by ‘length of attestation’ is going to be rather useful in picking out ‘important’ words. I’ll hopefully have some time to look into these options next week.
I’d taken Friday off as a holiday this week, and I was also off on Monday afternoon to attend a funeral. Despite being off for a day and a half I still managed to achieve quite a lot this week. Over the weekend Thomas Clancy had alerted me to another excellent resource developed by the NLS Maps people, which plots the boundaries of all parishes in Scotland and which you can access here: http://maps.nls.uk/geo/boundaries/#zoom=10.671666666666667&lat=55.8481&lon=-2.5155&point=0,0. For REELS we had been hoping to incorporate parish boundaries into our Berwickshire map but didn’t know where to get the coordinates from, and there wasn’t enough time in the project for us to manually create the data. I emailed Chris Fleet at the NLS to ask where they’d got their data from, and whether we might be able to access the Berwickshire bits of it. Chris very helpfully replied to say the boundaries were created by the James Hutton Institute and are hosted on the Scottish government’s Scottish Spatial Data Infrastructure Metadata Portal (see https://www.spatialdata.gov.scot/geonetwork/srv/eng/catalog.search#/metadata/c1d34a5d-28a7-4944-9892-196ca6b3be0c). The data is free to use, so long as a copyright statement is displayed, and there’s even an API through which the data can be grabbed (see here: http://sedsh127.sedsh.gov.uk/arcgis/rest/services/ScotGov/AreaManagement/MapServer/1/query). The data can be output in a variety of formats, including shapefiles, JSON and GeoJSON. I decided to go for GeoJSON, as this seemed like a pretty good fit for the Leaflet mapping library we use.
Initially I used the latitude and longitude coordinates for one parish (Abbey St Bathans) and added this to the map. Unfortunately the polygon shape didn’t appear on the map, even though no errors were returned. This was rather confusing until I realised that whereas Leaflet generally expects coordinates as latitude then longitude, the GeoJSON specification puts longitude first and latitude second. This meant my polygon boundaries had been added to the map, just in a completely different part of the world! It turns out that the best way to use GeoJSON data in Leaflet is via Leaflet’s built-in ‘L.geoJSON’ functions (see https://leafletjs.com/examples/geojson/). With this in place, Leaflet very straightforwardly plotted out the boundaries of my sample parish.
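The mix-up is easy to see with a small helper that flips a GeoJSON ring into the order Leaflet’s plain ‘L.polygon’ expects. This is just an illustrative sketch with made-up coordinates (the function name is mine); in practice ‘L.geoJSON’ does this conversion for you:

```javascript
// GeoJSON stores positions as [longitude, latitude]; Leaflet's L.polygon
// and L.marker expect [latitude, longitude]. Passing one to the other
// unswapped plots your shape in a completely different part of the world.
function geoJsonRingToLatLngs(ring) {
  return ring.map(([lng, lat]) => [lat, lng]);
}

// A made-up ring roughly in Berwickshire, in GeoJSON order:
const ring = [[-2.38, 55.85], [-2.36, 55.87], [-2.33, 55.84]];
const latLngs = geoJsonRingToLatLngs(ring);
// latLngs[0] → [55.85, -2.38], i.e. latitude first, as Leaflet expects
```

With ‘L.geoJSON’ none of this is necessary, as the library reads the GeoJSON coordinate order directly.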
I had intended to write a little script that would grab the GeoJSON data for each of the parishes in our system from the API mentioned above. However, I noticed that when passing a text string to the API it does a partial match, and can return multiple parishes. For example, our parish ‘Duns’ also brings back the data for ‘Dunscore’ and ‘Dunsyre’. I figured therefore that it would be safer to grab the data manually and insert it directly into our ‘parishes’ database. This all worked perfectly, other than for the parish of Coldingham, which is a lot bigger than the rest, meaning the JSON data was also a lot larger. The data exceeded a server setting that limits what can be uploaded to MySQL, but thankfully Chris McGlashan was able to sort that out for me.
With all of the parish data in place I styled the lines a sort of orange colour that shows up fairly well on all of our base maps. I also updated the ‘Display options’ to add in facilities to turn the boundary lines on or off, which meant updating the citation, bookmarking and page reloading code too. I also wanted to add in the three-letter acronyms for each parish. It turns out that adding plain text directly to a Leaflet map is not actually possible, or at least not easily. Instead the text needs to be added as a tooltip on an invisible marker, the tooltip then has to be set as permanently visible, and then styled to remove the bubble around the text. This still left the little arrow pointing to the marker, but a bit of Googling informed me that if I set the tooltip’s ‘direction’ to ‘center’ the arrowheads aren’t shown. It all feels like a bit of a hack, and I hope that in future it will be easier to add text to a map in a more direct manner. However, I was glad to figure out a solution, and once I had manually grabbed the coordinates where I wanted the parish labels to appear I was all set. Here’s an example of how the map looks with parish boundaries and labels turned on:
I had some other place-name related things to do this week. On Wednesday afternoon I met with Carole, Simon and Thomas to discuss the Scottish Survey of Place-names, which I will be involved with in some capacity. We talked for a couple of hours about how the approach taken for REELS might be adapted for other surveys, and how we might connect up multiple surveys to provide Scotland-wide search and browse facilities. I can’t really say much more about it for now, but it’s good that such issues are being considered.
I spent about a day this week continuing to work on the new pages and videos for the Seeing Speech project. I fixed a formatting issue with the ‘Other Symbols’ table in the IPA Charts that was occurring in Internet Explorer, which Eleanor had noticed last week. I also uploaded the 16 new videos for /l/ and /r/ sounds that Eleanor had sent me, and created a new page for accessing these. As with the IPA Charts page I worked on last week, the videos on this page open in an overlay, which I think works pretty well. I also noticed that the videos kept on playing if you closed an overlay before the video finished, so I updated the code to ensure that the videos stop when the overlay is closed.
Other than these projects, I investigated an issue relating to Google Analytics that Craig Lamont was encountering for the Ramsay project, and I spent the rest of my time returning to the SCOSYA project. I’d met with Gary last week and he’d suggested some further updates to the staff Atlas page. It took a bit of time to get back into how the atlas works as it’s been a long time since I last worked on it, but once I’d got used to it again, and had created a new test version of the atlas for me to play with without messing up Gary’s access, I decided to try and figure out whether it would be possible to add in a ‘save map as image’ feature. I had included this before, but as the atlas uses a mixture of image types (bitmap, SVG, HTML elements) for base layers and markers the method I’d previously used wasn’t saving everything.
However, I found a plugin called ‘easyPrint’ (https://github.com/rowanwins/leaflet-easyPrint) that does seem to be able to save everything. By default it prints the map to a printer (or to PDF), but it can also be set up to ‘print’ to a PNG image. It is a bit clunky, sometimes does weird things and only works in Chrome and Firefox (and possibly Safari, I haven’t tried, but definitely not MS IE or Edge). It’s not going to be suitable for inclusion on the public atlas for these reasons, but it might be useful to the project team as a means of grabbing screenshots.
With the plugin added a new ‘download’ icon appears above the zoom controls in the bottom right. If you move your mouse over this some options appear that allow you to save an image at a variety of sizes (current, A4 portrait, A4 landscape and A3 portrait). The ‘current’ size should work without any weirdness, but the other ones have to reload the page, bringing in map tiles that are beyond what you currently see. This is where the weirdness comes in, as follows:
- The page will display a big white area instead of the map while the saving of the image takes place. This can take a few seconds.
- Occasionally the map tiles don’t load successfully and you get white areas in the image instead of the map. If this happens pan around the map a bit to load in the tiles and then try saving the image again.
- Very occasionally when the map reloads it will have completely repositioned itself, and the map image will be of this location too. Not sure why this is happening. If it does happen, reposition the map and try again and things seem to work.
Once the processing is complete the image will be saved as a PNG. If you select the ‘A3’ option the image will actually be of a much larger area than you see on your screen. I think this will prove useful for getting higher resolution images and also for including Shetland, two issues Gary was struggling with. Here’s a large image with Shetland in place:
That’s all for this week.
I spent most of this week working on the new timeline features for the Historical Thesaurus. Marc, Fraser and I had a useful meeting on Wednesday where we discussed some final tweaks to the mini-timelines and the category page in general, and also discussed some future updates to the sparklines.
I made the mini-timelines slightly smaller than they were previously, and Marc changed the colours used for them. I also updated the script that generates the category page content via an AJAX call so that an additional ‘sort by’ option could be passed to it. I then implemented sorting options that matched up with those available through the full Timeline feature, namely sorting by first attested date, alphabetically, and length of use. I also updated this script to allow users to control whether the mini-timelines appear on the page or not. With these options available via the back-end script I then set up the choices to be stored as a session variable, meaning the user’s choices are ‘remembered’ as they navigate throughout the site and can be applied automatically to the data.
While working on the sorting options I noticed that the alphabetical ordering of the main timeline didn’t properly order ashes and thorns – e.g. words beginning with these were appearing at the end of the list when ordered alphabetically. I fixed this so that for ordering purposes an ash is considered ‘ae’ and a thorn ‘th’. This doesn’t affect how words are displayed, just how they are ordered.
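The fix amounts to building a normalised copy of each word purely for comparison purposes, while leaving the displayed form alone. A minimal sketch (the function name is mine, not the actual HT code):

```javascript
// Build a sort key in which ash ('æ') sorts as 'ae' and thorn ('þ')
// sorts as 'th'; the word itself is displayed unchanged.
function sortKey(word) {
  return word.toLowerCase().replace(/æ/g, 'ae').replace(/þ/g, 'th');
}

const words = ['word', 'þing', 'æfen', 'apple'];
const ordered = [...words].sort((a, b) => sortKey(a).localeCompare(sortKey(b)));
// ordered → ['æfen', 'apple', 'þing', 'word']
```

This way words beginning with ash file alongside the a’s and thorn words alongside the t’s, rather than falling to the end of the list.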
We also decided at the meeting that we would move the thesaurus sites that were on a dedicated (but old) server (namely HT, Mapping Metaphor, Thesaurus of Old English and a few others) to a more centrally hosted server that is more up to date. This switch allows these sites to be made available via HTTPS as opposed to HTTP, which is a really good thing, and will free up the old server for us to use for other things, such as some potential corpus-based resources. Chris migrated the content over and, after we’d sorted a couple of initial issues with the databases, all of the sites appear to be working well. We are also now considering setting up a top-level ‘.ac.uk’ address for the HT and spent some time making a case for this.
A fairly major feature I added to the HT this week was a ‘menu’ section for main categories, which contains some additional options, such as the options to change the sorting of the category pages and turn the mini-timelines on and off. For the button to open the section I decided to use the ‘hamburger’ icon, which Marc favoured, rather than a cog, which I was initially thinking of using, because a cog suggests managing options whereas this section contains both options and additional features. I initially tried adding the drop-down section as near to the icon as possible, but I didn’t like the way it split up the category information, so instead I set it to appear beneath the part of speech selection. I think this will be ok as it’s not hugely far away from the icon. I did wonder whether instead I should have a section that ‘slides up’ above the category heading, but decided this was a bad idea as if the user has the heading at the very top of the screen it might not be obvious that anything has happened.
The new section contains buttons to open the ‘timeline’ and ‘cite’ options. I’ve expanded the text to read ‘Timeline visualization’ and ‘Cite this category’ respectively. Below these buttons there are the options to sort the words. Selecting a sort option reloads the content of the category pane (maincat and subcats), while keeping the drop-down area open. Your choice is ‘remembered’ for the duration of your session, so you don’t have to keep changing the ordering as you navigate about. Changing to another part of speech or to a different category closes the drop-down section. I also updated the ‘There are xx words’ text to make it clearer how the words are ordered if the drop-down section is not open.
Below the sorting option is a further option that allows you to turn on or off the mini-timelines. As with the sorting option, your choice is ‘remembered’. I also added some tooltip text to the ‘hamburger’ icon, as I thought it was useful to have some detail about what the button does.
I then updated the main timeline so that the default sorting option aligns itself with the choice you made on the category page. E.g. If you’ve ordered the category by ‘length of use’ then the main timeline will be ordered this way too when you open it. I also set things up so that if you change the ordering via the main timeline pop-up then the ordering of the category will be updated to reflect your choice when you close the popup, although Fraser didn’t like this so I’ll probably remove this feature next week. Here’s how the new category page looks with the options menu opened:
I spent some more time on the REELS project this week, as Eila had got back to me with some feedback about the front-end. This included changing the ‘Other’ icon, which Eila didn’t like. I wasn’t too keen on it either, so I was happy to change it. I now use a sort of archway instead of the tall, thin monument, which I think works better. I also removed non-Berwickshire parishes from the Advanced Search page, tweaked some of the site text and also fixed the search for element language, which I had inadvertently broken when changing the way date searches worked last week.
Also this week I fixed an issue with the SCOTS corpus, which was giving 403 errors instead of playing the audio and video files, and was giving no results on the Advanced Search page. It turned out that this was being caused by a security patch that had been installed on the server recently, which was blocking legitimate requests for data. I was also in touch with Scott Spurlock about his crowdsourcing project, which looks to be going ahead in some capacity, although not with the funding that was initially hoped for.
Finally, I had received some feedback from Faye Hammill and her project partners about the data management plan I’d written for her project. I responded to some queries and finalised some other parts of the plan, sending off a rather extensive list of comments to her on Friday.
I again split my time mostly between REELS and Linguistic DNA and the Historical Thesaurus this week. For REELS, Carole had sent an email with lots of feedback and suggestions, so I spent some time addressing these. This included replacing the icon I’d chosen for settlements, and updating the default map zoom level to be a bit further out, so that the entire county fits on screen initially. I also updated the elements glossary ordering so that Old English “æ” and “þ” appear as if they were ‘ae’ and ‘th’ rather than at the end of the lists, and set the ordering to ignore diacritics, which were messing up the ordering a little. I also took the opportunity to update the display of the glossary so that the whole entry box for each item isn’t a link. This is because I’ve realised that some entries (e.g. St Leonard) have their own ‘find out more’ link and having a link within a link is never a good idea. Instead, there is now a ‘Search’ button at the bottom right of the entry, and if the ‘find out more’ button is present this appears next to it. I’ve changed the styling of the number of place-names and historical forms in the top right to make them look less like buttons too.
I also updated the default view of the map so that the ‘unselected’ data doesn’t appear on the map by default. You now have to manually tick the checkbox in the legend to add these in if you want them. When they are added in they appear ‘behind’ the other map markers rather than appearing on top of them, which was previously happening if you turned off the grey dots then turned them on again.
Leaflet has a method called ‘bringToBack’, which can be used to change the ordering of markers. Unfortunately you can’t apply this to an entire layer group (i.e. apply it to all grey dots in my grey dots group with one call). It took me a bit of time to figure out why this wasn’t working, but eventually I figured out I needed to call the ‘eachLayer’ method on my layer group to iterate over the contents and apply the ‘bringToBack’ method to each individual grey dot.
In addition to this update, I also set it so that changing marker categorisation in the ‘Display Options’ section now keeps the ‘Unselected’ dots off unless you choose to turn them on. I think this will be better for most users. I know when testing the map and changing categorisation the first thing I always then did was turn off the grey dots to reduce the clutter.
Carole had also pointed out an issue with the browse for sources, in that one source was appearing out of its alphabetical order and with more associated place-names than it should have. It turned out that this was a bug introduced when I’d previously added a new field for the browse list that strips out all tags (e.g. italics) from the title. This field gets populated when the source record is created or edited in the CMS. Unfortunately, I’d forgotten that sources can be added and edited directly through the add / edit historical forms page too, and I hadn’t added in the code to populate the field in these places. This meant that the field was being left blank, resulting in strange ordering and place-name numbering in the browse source page.
The biggest change that Carole had suggested was to the way in which date searches work. Rather than having the search and browse options allow the user to find place-names that have historical forms with a start / end date within the selected date or date range, Carole reckoned that identifying the earliest date for a place-name would be more useful. This was actually a pretty significant change, requiring a rewrite of large parts of the API, but I managed to get it all working. End dates have now been removed from the search and browse. The ‘browse start date’ looks for the earliest recorded start date rather than bringing back a count of place-names that have any historical form with the specified year, which I agree is much more useful. The advanced search now allows you to specify a single year, a range of years, or you can use ‘<’ and ‘>’ to search for place-names whose earliest historical form has a start date before or after a particular date.
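The way the new advanced search interprets its date input might be sketched along these lines (a hypothetical illustration in JavaScript, not the actual REELS API, which is server-side):

```javascript
// Build a predicate that tests a place-name's earliest attested year
// against a query string: a single year ('1460'), a range ('1400-1500'),
// or '<' / '>' for before / after a particular date.
function earliestDateFilter(query) {
  query = query.trim();
  if (query.startsWith('<')) {
    const y = parseInt(query.slice(1), 10);
    return earliest => earliest < y;
  }
  if (query.startsWith('>')) {
    const y = parseInt(query.slice(1), 10);
    return earliest => earliest > y;
  }
  if (query.includes('-')) {
    const [from, to] = query.split('-').map(s => parseInt(s, 10));
    return earliest => earliest >= from && earliest <= to;
  }
  const y = parseInt(query, 10);
  return earliest => earliest === y;
}

const before1500 = earliestDateFilter('<1500');
// before1500(1296) → true; before1500(1603) → false
```

The key difference from the old behaviour is that the predicate runs against a single ‘earliest recorded start date’ per place-name, rather than against every historical form’s date range.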
I also finally got round to replacing the base maps with free alternatives this week. I was previously using MapBox maps for all but one of our base maps, but as MapBox only allows 50,000 map views a month, and I’d managed almost 10,000 myself, we agreed that we couldn’t rely so heavily on the service, as the project has no ongoing funds. Thanks to some very useful advice from Chris Fleet at the NLS, I managed to switch to some free alternatives, including three that are hosted by the NLS Maps people themselves. The default view is now Esri Topomap and the satellite view is now Esri WorldImagery (both free), while ‘satellite with labels’ is the only base map still provided by MapBox. I’ve also included modern OS maps courtesy of the NLS, OS maps 1840-1880 from the NLS, and OS maps 1920-1933 as before. We now have six base maps to choose from, and I think the resource is looking pretty good. Here’s an example with OS Maps from the 1840s to 1880s selected:
For Linguistic DNA this week I continued to monitor my script that I’d set running last week to extract frequency data about the usage of Thematic Headings per decade in all of the EEBO data I have access to. I had hoped that the process would have completed by Monday, and it probably would have done, were it not for the script running out of memory as it tried to tackle the category ‘AP:04 Number’. This category is something of an outlier, and contains significantly more data than the other categories. It contains more than 2,600,000 rows, of which almost 200,000 are unique. My script stores all unique words in an associative array, with frequencies for each decade then added to it. The more unique words the larger the array and the more memory required. I skipped over the category and my script successfully dealt with the remaining categories, finishing the processing on Wednesday. I then temporarily updated the PHP settings to remove memory restrictions and set my script to deal with ‘AP:04’, which took a while but completed successfully, resulting in a horribly large spreadsheet containing almost 200,000 rows. I zipped the resulting 2,077 CSV files up and sent them on to the DHI people in Sheffield, who are going to incorporate this data into the LDNA resource.
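The memory behaviour is easy to see if the aggregation is sketched out. The real script was PHP; this is an illustrative JavaScript version with invented sample rows, showing why the array grows with the number of unique words rather than the number of occurrences:

```javascript
// Sketch of the aggregation: one entry per unique word, holding a count
// per decade. Memory grows with the number of unique words, which is why
// the 200,000-unique-word 'AP:04 Number' category blew the default limit.
function aggregate(rows, decades) {
  const freq = new Map();
  for (const { word, decade } of rows) {
    if (!freq.has(word)) {
      // Initialise a zeroed count for every decade column.
      freq.set(word, Object.fromEntries(decades.map(d => [d, 0])));
    }
    freq.get(word)[decade] += 1;
  }
  return freq;
}

const decades = [1470, 1480];
const rows = [
  { word: 'twa', decade: 1470 },
  { word: 'twa', decade: 1470 },
  { word: 'thre', decade: 1480 },
];
const freq = aggregate(rows, decades);
// freq.get('twa')[1470] → 2
```

Each unique word then becomes one row in the output spreadsheet, with one column per decade.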
For the Historical Thesaurus I continued to work on the new Timeline feature, this time adding in mini-timelines that will appear beside each word on the category page. Marc suggested using the ‘Bullet Chart’ option that’s available in the jQuery Sparkline library found here: https://omnipotent.net/jquery.sparkline/#s-about and I’ve been looking into this.
Initially I ran into some difficulty with the limited number of options available. E.g. you can’t specify a start value for the chart, only an end value (although I later discovered that there is an undocumented setting for this in the source code), and individual blocks also don’t have start and end points but instead are single points that take their start value from the previous block. Also, data needs to be added in reverse order or things don’t display properly.
I must admit that trying to figure out how to hack our data about to fit what the library required gave me a splitting headache, and I eventually abandoned the library and wondered whether I could just make a ‘mini’ version using the D3 timeline plugin I was already using. After all, there are lots of examples of single-bar timelines in the documentation: https://github.com/denisemauldin/d3-timeline. However, after more playing around with this library I realised that it just wasn’t very well suited to being shrunk to an inline size. Things started to break in weird ways when the dimensions were made very small and I didn’t want to have to furtle about with the library’s source code too much, after already having had to do so for the main timeline.
So, after taking some ibuprofen I returned to the ‘Bullet Chart’ and finally managed to figure out how to make our data work and get it all added in reverse order. As the start has to be zero, I made the end of the chart 1000, and all data has 1000 years taken off it. If I hadn’t done this then OE would have started midway through the chart. Individual years were not displaying due to being too narrow so I’ve added a range of 50 years on to them, which I later reduced to 20 years after feedback from Fraser. I also managed to figure out how to reduce the thickness of the bar running along the middle of the visualisation. This wasn’t entirely straightforward as the library uses HTML Canvas rather than SVG. This means you can’t just view the source of the visualisation using the browser’s ‘select element’ feature and tinker with it. Instead I had to hack about with the library’s source code to change the coordinates of the rectangle that gets created. Here’s an example of where I’d got to during the week:
I positioned the timelines to the right of each word’s section, next to the magnifying glass. There’s a tooltip that displays the fulldate field on hover. I figured out how to position the ‘line’ at the bottom of the timeline rather than in the middle, and I’ve disabled highlighting of sections on mouse over and have made the background look transparent. It’s not actually transparent: I tried this, but the ‘white’ blocks actually cover up unwanted sections of the other colour, so setting things to transparent messed up the timeline. Instead the code works out whether the row is odd or even and grabs the row colour based on this. I had to remove the shades of grey from the subcat backgrounds to make this work, but actually I think the page looks better without the subcats being in grey. So, here is an example of the mini timelines in the category test page:
I think it’s looking pretty good. The only downside is these mini-timelines sort of make my original full timeline a little obsolete.
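The offset and minimum-width adjustments described above can be sketched like this. This is an illustrative reconstruction, not the actual HT code; in particular, clamping pre-1000 (OE) start dates to the chart start is my assumption:

```javascript
// Turn a word's attestation span into values for the sparkline bullet
// chart. The chart runs 0-1000, standing for the years 1000-2000, so
// 1000 is subtracted from each year (otherwise OE would start midway
// through the chart); spans narrower than 20 years are widened so that
// they remain visible at mini-timeline size.
function toBulletRange(startYear, endYear, minWidth = 20) {
  // Assumption: OE dates earlier than 1000 are clamped to the chart start.
  let start = Math.max(0, startYear - 1000);
  let end = Math.max(0, endYear - 1000);
  if (end - start < minWidth) end = start + minWidth;
  return [start, end];
}

// A word attested only in 1620:
const range = toBulletRange(1620, 1620);
// range → [620, 640]
```

The remaining wrinkle from the library’s side is that these values then have to be fed to the bullet chart in reverse order, with each block taking its start value from the previous one.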
I worked on a few other projects this week as well. I sorted out access to the ‘Editing Burns’ website for a new administrator who has started, and I investigated some strange errors with the ‘Seeing Speech’ website whereby the video files were being blocked. It turned out to be down to a new security patch that had been installed on the server and after Chris updated this things started working again.
I also met with Megan Coyer to discuss her ‘Hogg in Fraser’s Magazine’ project. She had received XML files containing OCR text and metadata for all of the Fraser’s Magazine issues and wanted me to process the files to convert them to a format that she and her RA could more easily use. Basically she wanted the full OCR text, plus the Record ID, title, volume, issue, publication date and contributor information to be added to one Word file.
There were 17,072 XML files and initially I wrote a script that grabbed the required data and generated a single HTML file, which I was then going to convert into DOCX format. However, the resulting file was over 600Mb in size, which was too big to work with. I decided therefore to generate individual documents for each volume in the data. This resulted in 81 files (including one for all of the XML files that don’t seem to include a volume). The files are a more manageable size, but are still thousands of pages long in Word. This seemed to suit Megan’s needs and I moved the Word files to her shared folder for her to work with.
Monday this week was the May Day holiday, so it was a four-day week for me. I divided my time primarily between REELS and Linguistic DNA and updates to the Historical Thesaurus timeline interface. For REELS I contacted Chris Fleet at the NLS about using one of their base maps in our map interface. I’d found one that was apparently free to use and I wanted to check we had the details and attribution right. Thankfully we did, and Chris very helpfully suggested another base map of theirs that we might be able to incorporate too. He also pointed me towards an amazing crowdsourced resource that they had set up that has gathered more than 2 million map labels from the OS six-inch to the mile, 1888-1913 maps (see http://geo.nls.uk/maps/gb1900/). It’s very impressive.
I also tackled the issue of adding icons to the map for classification codes rather than just having coloured spots. This is something I’d had in mind from the very start of the project, but I wasn’t sure how feasible it would be to incorporate. I started off by trying to add in Font Awesome icons, which is pretty easy to do with a Leaflet plugin. However, I soon realised that Font Awesome just didn’t have the range of icons that I required for things like ‘coastal’, ‘antiquity’, ‘ecclesiastical’ and the like. Instead I found some more useful icons: https://mapicons.mapsmarker.com/category/markers/. The icons are released under a Creative Commons license and are free to use. Unfortunately they are PNG rather than SVG icons, so they won’t scale quite as nicely, but they don’t look too bad on an iPad’s ‘retina’ display, so I think they’ll do. I created custom markers for each icon and gave them additional styling with CSS. I updated the map legend to incorporate them as well, and I think they’re looking pretty good. It’s certainly easier to tell at a glance what each marker represents. Here’s a screenshot of how things currently look (but this of course still might change):
I also slightly changed all of the regular coloured dots on the map to give them a dark grey border, which helps them stand out a bit more on the maps, and I have updated the way map marker colours are used for the ‘start date’ and ‘altitude’ maps. If you categorise the map by start date the marker colours now have a fixed gradient, ranging from dark blue for 1000-1099 to red for after 1900 (the idea being things that are in the distant past are ‘cold’ and more recent things are still ‘hot’). Hopefully this will make it easier to tell at a glance which names are older and which are more recent. Here’s an example:
For the ‘categorised by altitude’ view I made the fixed gradient use the standard way of representing altitude on maps – ranging from dark green for low altitude, through browns and dark reds for high altitude, as this screenshot shows:
From the above screenshots you can see that I’ve also updated the map legend so that the coloured areas match the map markers, and I also added a scale to the map, with both metric and imperial units shown, which is what the team wanted. There are still some further changes to be made, such as updating the base maps, and I’ll continue with this next week.
For Linguistic DNA and the Historical Thesaurus I met with Marc and Fraser on Wednesday morning to discuss updates. We agreed that I would return to working on the sparklines in the next few weeks and I received a few further suggestions regarding the Historical Thesaurus timeline feature. Marc has noticed that if your cursor was over the timeline then it wasn’t possible to scroll the page, even though a long timeline might go off the bottom of the screen. If you moved your cursor to the sides of the timeline graphic scrolling worked normally, though. It turned out that the SVG image was grabbing all of the pointer events so the HTML in the background never knew the scroll event was happening. By setting the SVG to ‘pointer-events: none’ in the CSS the scroll events cascade down to the HTML and scrolling can take place. However, this then stops the SVG being able to process click events, meaning the tooltips break. Thankfully adding in ‘pointer-events: all’ to the bars, spots and OE label fixes this, apart from one oddity: if your cursor is positioned over a bar, spot or the OE label and you try to scroll then nothing happens. This is a relatively minor thing, though. I also updated the timeline font so that it uses the font we use elsewhere on the site.
I also made the part of speech in the timeline heading lower-case to match the rest of the site, and I also realised that the timeline wasn’t using the newer versions of the abbreviations we’d decided upon (e.g. ‘adj.’ rather than ‘aj.’) so I updated this, and also added in the tooltip. Finally, I addressed another bug whereby very short timelines were getting cut off. I added extra height to the timeline when there are only a few rows, which stops this happening.
I had a Skype meeting with Mike Pidd and his team at DHI about the EEBO frequency data for Linguistic DNA on Wednesday afternoon. We agreed that I would write a script that would output the frequency data for each Thematic Heading per decade as a series of CSV files that I would then send on to the team. We also discussed the Sparkline interface and the HT’s API a bit more, and I gave some further explanation as to how the sparklines work. After the meeting I started work on the export script, which does the following:
- It goes through every thematic heading down to the third level of the hierarchy.
- If the heading in question is a third level one then all lexemes from any lower levels are added into the output for this level
- Each CSV is given the heading number as a filename, but with dashes instead of colons as colons are bad characters for filenames
- Columns one and two are the heading and title of the thematic heading. This is the same for every row in a file – i.e. words from lower down the hierarchy do not display their actual heading. E.g. words from ‘AA:03:e:01 Volcano’ will display ‘AA:03:e High/rising ground’
- Column 3 contains the word and column 4 the part of speech.
- Column 5 contains the number of senses in the HT. I had considered excluding words that had zero senses in the HT as a means of cutting out a lot of noise from the data, but decided against this in the end, as it would also remove a lot of variant spellings and proper names, which might turn out to be useful at some point. It will be possible to filter the data to remove all zero sense rows at a later date.
- The next 24 columns contain the data per decade, starting at 1470-1479 and ending with 1700-1709
- The final column contains the total frequency count
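As a rough illustration, the core of the export logic described above looks something like this. It’s a sketch in Python with made-up field names – the real script reads the lexemes from the HT database rather than taking them as dicts:

```python
import csv
import io

# The 24 decade columns, 1470-1479 through 1700-1709
DECADES = [f"{d}-{d+9}" for d in range(1470, 1710, 10)]

def heading_filename(heading):
    """Colons are bad characters for filenames, so swap them for dashes."""
    return heading.replace(":", "-") + ".csv"

def export_heading(heading, title, lexemes):
    """Build the CSV content for one thematic heading.

    Each lexeme is a dict with 'word', 'pos', 'senses' and a 'freqs'
    list of 24 per-decade counts (field names are my own for this
    sketch). Every row repeats the heading and title, even for words
    pulled up from lower levels of the hierarchy.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["heading", "title", "word", "pos", "senses"]
                    + DECADES + ["total"])
    for lex in lexemes:
        writer.writerow([heading, title, lex["word"], lex["pos"],
                         lex["senses"]] + lex["freqs"]
                        + [sum(lex["freqs"])])
    return heading_filename(heading), out.getvalue()
```

So a heading such as ‘AA:03:e’ would be written out to a file named ‘AA-03-e.csv’, with the final column summing the 24 decade counts for each word.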
I started my script running on Thursday and left my PC on overnight to try to get the processing completed. I left it running when I went home at 5pm, expecting to find several hundred CSV files had been output. Instead, Windows had automatically installed an update and restarted my PC at 5:30, cancelling the script, which was seriously annoying. It doesn’t seem to be possible to stop Windows doing such things: although there are plenty of Google results about how to stop Windows automatically restarting when installing updates, Microsoft changes Windows so often that none of the listed methods I’ve looked at still work. It’s absolutely ridiculous, as it means running batch processes that might take a few days is basically impossible to do with any reliability on a Windows machine.
Moving on to other tasks I undertook this week: I sorted out payment for the annual Apple Developer subscription, which is necessary for our apps to continue to be listed on the App Store. I responded to a couple of app-related queries from an external developer who is making an app for the University, and after Marc asked me to look into it, I sorted out the retention period for Google Analytics user statistics for all of the sites we host.
I continued to work on the REELS website for a lot of this week, and attended a team meeting for the project on Wednesday afternoon. In the run-up to the meeting I worked towards finalising the interface for the map. Previously I’d just been using colour schemes and layouts I’d taken from previous projects I’d worked on, but I needed to develop an interface that was right for the current project. I played around with some different colour schemes before settling on one that’s sort of green and blue, with red as a hover-over. I also updated the layout of the textual list of records to make the buttons display a bit more nicely, and updated the layout of the record page to place the description text above the map. Navigation links and buttons also now appear as buttons across the top of pages, whereas previously they were all over the place. Here’s an example of the record page:
The team meeting was really productive, as Simon had some useful feedback on the CMS and we all went through the front-end and discussed some of the outstanding issues. By the end of the meeting I had accumulated quite a number of items to add to my ‘to do’ list, and I worked my way through these during the rest of the week. These included:
- Unique record IDs now appear in the cross reference system in the CMS, so the team can more easily figure out which place-name to select if there is more than one with the same name. I’ve also added this unique record ID to the top of the ‘edit place’ page.
- I’ve added cross references to the front-end record page, as I’d forgotten to add these in before
- I’ve replaced the ‘export’ menu item in the CMS with a new ‘Tools’ menu item. This page includes a link to the ‘export’ page plus links to new pages I’m adding in
- I’ve created a script that lists all duplicate elements within each language. It is linked to from the ‘tools’ page. Each duplicate is listed, together with its unique ID and the number of current and historical names each is associated with and a link through to the ‘edit element’ page
- The ‘edit element’ page now lists all place-names and historical forms that the selected element is associated with. These are links leading to the ‘manage elements’ page for the item.
- When adding a new element the element ID appears in the autocomplete in addition to the element and language, hopefully making it easier to ensure you link to the correct element.
- ‘Description’ has been changed to ‘analysis’ in both the CMS and in the API (for the CSV / JSON downloads)
- ‘Proper name’ language has been changed to ‘Personal name’
- The new roles ‘affixed name’ and ‘simplex’ have been added
- The new part of speech ‘Numeral’ has been added.
- I’ve created a script that lists all elements that have a role of ‘other’, linked to from the ‘tools’ menu in the CMS. The page lists the element that has this role, its language, the ID and name of the place-name this appears in, and a link to the ‘manage elements’ page for the item. For historical forms the historical form name also appears.
- I’ve fixed the colour of the highlighted item in the elements glossary when reached via a link on the record page
- I’ve changed the text in the legend for grey dots from ‘Other place-names’ to ‘unselected’. We had decided on ‘Unselected place-names’ but this made the box too wide and I figured ‘unselected’ worked just as well (we don’t say ‘Settlement place-names’, after all, but just ‘Settlement’).
- I’ve removed place-name data from the API that doesn’t appear in the front-end. This is basically just the additional element fields
- I’ve checked that records that are marked as ‘on website’ but don’t appear on landranger maps are set to appear on the website. They weren’t, but they are now.
- I’ve also made the map on the record page use the base map you had selected on the main map, rather than always loading the default view. Similarly, if you change the base map on the record page and then return to the map using the ‘return’ button, your chosen base map is retained.
I also investigated some issues with the Export script that Daibhidh had reported. It turned out that these were being caused by Excel. The output file is a comma separated value file encoded in UTF-8. I’d included instructions on how to import the file into Excel to allow UTF-8 characters to display properly, but for some reason this method was causing some of the description fields to be incorrectly split up. If, instead of following the import instructions, the file was opened directly in Excel, the fields were split into their proper columns correctly, but you ended up with a bunch of garbled UTF-8 characters.
After a bit of research I figured out a way for the CSV file to be directly opened in Excel with the UTF-8 characters intact (and with the columns not getting split up where they shouldn’t). By setting my script to include a ‘Byte Order Mark’ (BOM) at the top of the file, Excel magically knows to render the UTF-8 characters properly.
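For illustration, here is the same BOM trick sketched in Python (the real export script isn’t necessarily Python, and the function name is my own). The ‘utf-8-sig’ codec writes the three BOM bytes (EF BB BF) at the start of the file, which is what tips Excel off about the encoding:

```python
import csv

def write_excel_friendly_csv(path, rows):
    """Write rows to a CSV that Excel will open with UTF-8 intact.

    The 'utf-8-sig' encoding prepends the UTF-8 byte order mark,
    so Excel detects the encoding when the file is opened directly,
    with no import wizard needed.
    """
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
```

Opening a file written this way straight from Explorer gives correctly split columns and correctly rendered accented characters.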
In addition to the REELS project, I attended an IT Services meeting on Wednesday morning. It was billed as a ‘Review of IT Support for Researchers’ meeting but in reality the focus of pretty much the whole meeting was on the proposal for the high performance compute cluster, with most of the discussions being about the sorts of hardware setup it should feature. This is obviously very important for researchers dealing with petabytes and exabytes of data and there were heated debates about whether there were too many GPUs when CPUs would be more useful (and vice versa) but really this isn’t particularly important for anything I’m involved with. The other sections of the agenda (training, staff support etc) were also entirely focussed on HPC and running intensive computing jobs, not on things like web servers and online resources. I’m afraid there wasn’t really anything I could contribute to the discussions.
I did learn a few interesting things, though. IT Services are going to start offering a training course in R, which might be useful. Also, Machine Learning is very much considered the next big thing and is already being used quite heavily in other parts of the University; it works better with GPUs than CPUs, and there are apparently some quite easy-to-use Machine Learning packages out there now. Google has an online tool called Colaboratory (https://colab.research.google.com) for Machine Learning education and research, which might be useful to investigate. Also, IT Services offer Unix tutorials here: http://nyx.cent.gla.ac.uk/unix/ and other help documentation about HPC, R and other software here: http://nyx.cent.gla.ac.uk/unix/ These don’t seem to be publicised anywhere, but might be useful.
I also worked on a number of other projects this week, including creating a timeline feature based on data about the Burns song ‘Afton Water’ that Brianna had sent me for the RNSN project. I created this using the timeline.js library (https://timeline.knightlab.com/), which is a great library and really easy to use. I also responded to a query about some maps for the Ramsay AHRC project, which is now underway. Also, Jane and Eleanor got back to me with some feedback on my mock-up designs for the new Seeing Speech website. They have decided on a version that is very similar in layout to the old site, and suggested several further tweaks. I created a new mock-up with these tweaks in place, which they both seem happy with. Once they have worked a bit more on the content of the site I will then be able to begin the full migration to the new design.