After an enjoyable week’s holiday I returned to work on Monday, spending quite a bit of Monday catching up with some issues people had emailed me about whilst I was away, such as making further tweaks to the ‘Concise Scots Dictionary’ page on the DSL website for Rhona Alcorn (the page is now live if you’d like to order the book: http://dsl.ac.uk/concise-scots-dictionary/), speaking with Luca about a project he’s involved in the planning of that’s going to use some of the DSL data, helping Carolyn Jess-Cooke with some issues she was encountering when accessing one of her websites, giving some information to Brianna of the RNSN project about timeline tools we might use, and a few other such things.
I had a couple of queries from Wendy Anderson this week. The first was for Mapping Metaphor. Wendy wanted to grab all of the bidirectional metaphors in both the main and OE datasets, including all of their sample lexemes. I wrote a script that extracted the required data and formatted it as a CSV file, which is just the sort of thing she wanted. The second query was for all of the metadata associated with the Corpus of Modern Scots Writing texts. A researcher had contacted Wendy to ask for a copy but although the metadata is in the database and can be viewed on a per-text basis through the website, we didn’t have the complete dataset in an easy-to-share format. I wrote a little script that queried the database and retrieved all of the data. I had to do a little digging into how the database was structured in order to do this, as it is a system that wasn’t developed by me. However, after a little bit of exploration I managed to write a script that grabbed the data about each text, including multiple authors that can be associated with each text. I then formatted this as a CSV file and sent the outputted file to Wendy.
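As a rough illustration of this kind of export (not the actual script), the sketch below flattens a one-to-many text/author relationship into one CSV row per text. The table and column names (`texts`, `authors`, `text_authors`) and the demo rows are assumptions made up for the example, and an in-memory SQLite database stands in for the real one.

```python
import csv
import io
import sqlite3


def export_text_metadata(conn):
    """Return CSV text with one row per text; multiple authors are joined with '; '."""
    rows = conn.execute(
        "SELECT t.id, t.title, t.year, a.name "
        "FROM texts t "
        "LEFT JOIN text_authors ta ON ta.text_id = t.id "
        "LEFT JOIN authors a ON a.id = ta.author_id "
        "ORDER BY t.id, a.name"
    ).fetchall()

    # Group the joined rows by text so each text appears once in the output.
    grouped = {}
    for text_id, title, year, author in rows:
        entry = grouped.setdefault(
            text_id, {"id": text_id, "title": title, "year": year, "authors": []}
        )
        if author is not None:
            entry["authors"].append(author)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "title", "year", "authors"])
    for entry in grouped.values():
        writer.writerow(
            [entry["id"], entry["title"], entry["year"], "; ".join(entry["authors"])]
        )
    return out.getvalue()


def build_demo_db():
    """Create a tiny in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE text_authors (text_id INTEGER, author_id INTEGER);
        INSERT INTO texts VALUES (1, 'Sample text A', 1750), (2, 'Sample text B', 1801);
        INSERT INTO authors VALUES (1, 'Author One'), (2, 'Author Two');
        INSERT INTO text_authors VALUES (1, 1), (1, 2), (2, 2);
        """
    )
    return conn


if __name__ == "__main__":
    print(export_text_metadata(build_demo_db()))
```

Joining the authors into a single cell keeps the file at one row per text, which is usually easier to share than a row per text/author pair.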
I met with Gary on Monday to discuss some changes to the SCOSYA atlas and CMS that he wanted me to implement ahead of an event the team are at next week. This included adding Google Analytics to the website, updating the legend of the Atlas to make it clearer what the different rating levels meant, separating out the grey squares (which mean no data is present) and the grey circles (meaning data is present but doesn’t meet the specified criteria) into separate layers so they can be switched on and off independently of each other, making the map markers a little smaller, and adding in facilities to allow Gary to delete codes, attributes and code parents via the CMS. This all took a fair amount of time to implement, and unfortunately I lost a lot of time on Thursday due to a very strange situation with my access to the server.
I work from home on Thursdays and I had intended to work on the ‘delete’ facilities that day, but when I came to log into the server the files and the database appeared to have reverted back to the state they were in in May – i.e. it looked like we had lost almost six months of data, plus all of the updates to the code I’d implemented during this time. This was obviously rather worrying and I spent a lot of time toing and froing with Arts IT Support to try and figure out what had gone wrong. This included restoring a backup from the weekend before, which strangely still seemed to reflect the state of things in May. I was getting very concerned about this when Gary noted that he was seeing two different views of the data on his laptop. In Safari on his laptop his view of the data appeared to have ‘stuck’ at May while in Chrome he could see the up to date dataset. I then realised that perhaps the issue wasn’t with the server after all but instead the problem was my home PC (and Safari on Gary’s laptop) was connecting to the wrong server. Arts IT Support’s Raymond Brasas suggested it might be an issue with my ‘hosts’ file and that’s when I realised what had happened. As the SCOSYA domain is an ‘ac.uk’ domain and it takes a while for these domains to be set up, we had set up the server long before the domain was running, so to allow me to access the server I had added a line to the ‘hosts’ file on my PC to override what happens when the SCOSYA URL is requested. Instead of it being resolved by a domain name service my PC pointed at the IP address of the server as I had entered it in my ‘hosts’ file. Now in May, the SCOSYA site was moved to a new server, with a new IP address, but the old server had never been switched off, so my home PC was still connecting to this old server. I had only encountered the issue this week because I hadn’t worked on SCOSYA from home since May. So, it turned out there was no problem with the server, or the SCOSYA data. 
I removed the line from my ‘hosts’ file, restarted my browser and immediately I could access the up to date site. All this took several hours of worry and stress, but it was quite a relief to actually figure out what the issue was and to be able to sort it.
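For anyone hitting a similar problem, a quick check for stale ‘hosts’ overrides might look something like the sketch below. The hostname and IP address in the sample are placeholders, not the real SCOSYA details, and the parsing is deliberately minimal.

```python
def find_hosts_overrides(hosts_text, hostname):
    """Return the IP addresses that a hosts file maps to `hostname`."""
    matches = []
    for line in hosts_text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        if hostname.lower() in (name.lower() for name in names):
            matches.append(ip)
    return matches


# Placeholder content; 192.0.2.10 is a documentation-only address.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
# old override added before the domain went live
192.0.2.10  project.example.ac.uk
"""

if __name__ == "__main__":
    print(find_hosts_overrides(SAMPLE_HOSTS, "project.example.ac.uk"))
```

On a real machine the text would be read from `/etc/hosts` (Linux and macOS) or `C:\Windows\System32\drivers\etc\hosts` (Windows). Any address returned for a project domain means the browser is bypassing DNS for that site.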
I had intended to start setting up the server for the SPADE project this week, but the machine has not yet been delivered, so I couldn’t work on this. I did make a few further tweaks to the SPADE website, however, and responded to a couple of queries from Rachel about the SCOTS data and metadata, which the project will be using.
I also met with Fraser to discuss the ongoing issue of linking up the HT and OED data. We’re at the stage now where we can think about linking up the actual words with categories. I’d previously written a script that goes through each HT category that matches an OED category and compares the words in each, checking whether an HT word matches the text found in either the OED ‘ght_lemma’ or ‘lemma’ fields. After our meeting I updated the HT lexeme table to include extra fields for the ID of a matching OED lexeme and whether the lexeme had been checked. After that I updated the script to go through every matching category in order to ‘tick off’ the matching words within. The first time I ran my script it crashed the browser, but with a bit of tweaking I got it to successfully complete the second time. Here are some stats:
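The ‘ticking off’ step for a single pair of matched categories could be sketched roughly as follows. This is not the real script: the `id` and `word` keys are hypothetical, exact string equality is an assumption about how words are compared, and only the ‘ght_lemma’ and ‘lemma’ field names come from the OED data described above.

```python
def match_lexemes(ht_lexemes, oed_lexemes):
    """Tick off HT lexemes against OED lexemes within one matched category pair.

    Returns {ht_id: oed_id} for every HT lexeme whose word equals the
    'ght_lemma' or 'lemma' of an OED lexeme that has not already been used.
    """
    matches = {}
    used_oed_ids = set()
    for ht in ht_lexemes:
        for oed in oed_lexemes:
            if oed["id"] in used_oed_ids:
                continue  # each OED lexeme can be ticked off only once
            if ht["word"] in (oed.get("ght_lemma"), oed.get("lemma")):
                matches[ht["id"]] = oed["id"]
                used_oed_ids.add(oed["id"])
                break
    return matches


if __name__ == "__main__":
    ht = [{"id": 1, "word": "hound"}, {"id": 2, "word": "dog"}, {"id": 3, "word": "cur"}]
    oed = [
        {"id": 10, "ght_lemma": "dog", "lemma": "dog, n."},
        {"id": 11, "ght_lemma": None, "lemma": "hound"},
    ]
    print(match_lexemes(ht, oed))
```

The returned mapping corresponds to the new ‘matching OED lexeme ID’ field, and any HT lexeme absent from it is one that has been checked but not matched.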
There are 655513 HT lexemes that are now matched up with an OED lexeme. There are 47074 HT lexemes that only have OE forms, so with 793733 HT lexemes in total this means there are 91146 HT lexemes that should have an OED match but don’t. Note, however, that we still have 12373 HT categories that don’t match OED categories and these categories contain a total of 25772 lexemes.
On the OED side of things, we have a total of 688817 lexemes, and of these 655513 now match an HT lexeme, meaning there are 33304 OED lexemes that don’t match anything. At least some of these will also be cleared up by future HT / OED category matches. Of the 655513 OED lexemes that now match, 243521 of them are ‘revised’. There are 262453 ‘revised’ OED lexemes in total, meaning there are 18932 ‘revised’ lexemes that don’t currently match an HT lexeme. I think this is all pretty encouraging as it looks like my script has managed to match up the bulk of the data. It’s just the several thousand edge cases that are going to be a bit more work.
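The arithmetic behind these figures can be double-checked with a few lines of Python. The totals are the ones quoted above; the variable names and the percentage calculations are additions for illustration only.

```python
def coverage_summary():
    """Recompute the unmatched counts from the totals quoted in the post."""
    ht_total = 793_733
    ht_oe_only = 47_074
    matched = 655_513  # the same figure applies on both the HT and OED side
    oed_total = 688_817
    oed_revised_total = 262_453
    oed_revised_matched = 243_521

    return {
        "ht_unmatched": ht_total - matched - ht_oe_only,
        "oed_unmatched": oed_total - matched,
        "oed_revised_unmatched": oed_revised_total - oed_revised_matched,
        "ht_matched_pct": 100 * matched / ht_total,
        "oed_matched_pct": 100 * matched / oed_total,
    }


if __name__ == "__main__":
    for name, value in coverage_summary().items():
        print(f"{name}: {value:.1f}" if isinstance(value, float) else f"{name}: {value}")
```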
On Wednesday I met with Thomas Widmann of Scots Language Dictionaries to discuss our plans to merge all three of the SLD websites (DSL, SLD and Scuilwab) into one resource that will have the DSL website’s overall look and feel. We’re going to use WordPress as a CMS for all of the site other than the DSL’s dictionary pages, so as to allow SLD staff to very easily update the content of the site. It’s going to take a bit of time to migrate things across (e.g. making a new WordPress theme based on the DSL website, creating quick search widgets, updating the DSL dictionary pages to work with the WordPress theme), but we now have the basis of a plan. I’ll try to get started on this before the year is out.
Finally this week, I responded to a request from Simon Taylor to make a few updates to the REELS system, and I replied to Thomas Clancy about how we might use existing Ordnance Survey data in the Scottish Place-Names survey. All in all it has been a very busy week.
It was another week of working on fairly small tasks for lots of different projects. I helped Gerry McKeever to put the finishing touches to his new project website, and this has now gone live and can be accessed here: http://regionalromanticism.glasgow.ac.uk/. I also spent some further time making updates to the Burns Paper Database website for Ronnie Young. This included adding in a site menu to facilitate navigation, adding a subheader to the banner, creating new pages for ‘about’ and ‘contact’, adding some new content, making repositories appear with their full names rather than acronyms, updating the layout of the record page and tweaking how the image pop-up works. It’s all pretty much done and dusted now, although I can’t share the URL as the site is password protected due to the manuscript images being under copyright restrictions.
I spent about a day this week on AHRC review duties and also spent some time working on the new interface for Kirsteen McCue’s ‘Romantic National Song Network’ project website. This took up a fair amount of time as I had to try out a few different designs, work with lots of potential images, set up a carousel, and experiment with fonts for the site header. I’m pretty pleased with how things are looking now, although we still need to choose one of four different font styles.
I had a couple of conference calls and a meeting with Marc and Fraser about the Linguistic DNA project. I met with Marc and Fraser first, in order to go over the work Fraser is currently doing and how my involvement in the project might proceed. Fraser and I then had a Skype call with Iona and Seth in Sheffield about the work the researchers are currently doing and some of the issues they are encountering when dealing with the massive dataset they’re working with. After the call Fraser sent me a sample of the data, which really helped me to understand some of the technical issues that are cropping up. On Friday afternoon the whole project had a Skype call. This included the DHI people in Sheffield and it was useful to hear something about the technical work they are currently doing.
I had a couple of other meetings this week too. On Wednesday morning I had a meeting with Jennifer Smith about a new pilot project she’s putting together in order to record Scots usage in schools. We talked through a variety of technical solutions and I was able to give some advice on how the project might be managed from a technical point of view. On Wednesday afternoon I had a meeting for The People’s Voice project, at which I met with the new project RA, who has taken over from Michael Shaw as he’s now moved to a different institution. I helped the new RA get up to speed with the database and how to update the front-end.
Also this week I had an email conversation with the SPADE people about how we will set up a server for the project’s infrastructure at Glasgow. I’m going to be working on this the week after next. I also made a few further updates to the DSL website and had a chat with Thomas Widmann about a potential reworking of some of the SLD’s websites.
There’s not a huge amount more to say about the work I did this week. I was feeling rather unwell all week and it was a bit of a struggle getting through some days during the middle of the week, but I made it through to the end. I’m on holiday all of next week so there won’t be an update from me until the week after.
This was another week of doing lots of fairly small bits of work for many different projects. I was involved in some discussions about some possible updates to websites with Scottish Language Dictionaries, and created a new version of a page for the Concise Scots Dictionary for them. I also made a couple of minor tweaks to a DSL page for them as well.
For the Edinburgh Gazetteer project I added in all of the ancillary material that Rhona Brown had sent me, added in some new logos, set up a couple of new pages and made a couple of final tweaks to the Gazetteer and reform societies map pages. The site is now live and can be accessed here: http://edinburghgazetteer.glasgow.ac.uk/
I also read through the Case for Support for Thomas Clancy’s project proposal and made a couple of updates to the Technical Plan based on this, and I spent some time reading over the applications for a post that I’m on the interview panel for. I also spent a bit more time on the Burns Paper Database project. There were some issues with the filenames of the images used. Some included apostrophes and ampersands, which meant the images wouldn’t load on the server. I decided to write a little script to rename all of the images in a more uniform way, while keeping a reference to the original filenames in the database for display and for future imports. It took a bit of time to get this sorted but the images work a lot better now.
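One possible way to plan such a rename is sketched below. The `image-0001.jpg` naming scheme is purely an assumption for the example (the post doesn’t say what scheme was used), and the function only builds the old-to-new mapping rather than touching any files, so the originals can be stored in the database first.

```python
from pathlib import PurePosixPath


def safe_filename(original, index):
    """Build a uniform name such as 'image-0001.jpg', keeping only the extension."""
    extension = PurePosixPath(original).suffix.lower()
    return f"image-{index:04d}{extension}"


def plan_renames(filenames):
    """Map each original filename to a new, server-friendly one.

    The keys are the original names, which would be kept in the database
    for display and for future imports.
    """
    ordered = sorted(filenames)
    return {name: safe_filename(name, i) for i, name in enumerate(ordered, start=1)}


if __name__ == "__main__":
    # Hypothetical filenames containing the problem characters.
    for old, new in plan_renames(["Burns's letter & note.JPG", "a.tif"]).items():
        print(f"{old!r} -> {new!r}")
```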
I met with Fraser on Wednesday to get back into the whole issue of merging the new OED data with the HT data. It had been a few months since either of us had looked at the issues relating to this, so it took a bit of time to get back up to speed with things. The outcome of our meeting was that I would create three new scripts. The first would find all of the categories where there was no ‘oedmaincat’ and the part of speech was not a noun. The script would then check to see whether there was a noun at the same level and if so grab its ‘oedmaincat’ and then see if this matched anything in the OED data for the given part of speech. This managed to match up a further 183 categories that weren’t previously matched so we could tick these off. The second script generated a CSV for Fraser to use that ordered unmatched categories by size. This is going to be helpful for manual checking and it thankfully demonstrated that of the more than 12,000 non-matched categories only about 750 have more than 5 words in them. The final script was an update to the ‘all the non-matches’ script that added in counts of the number of words within the non-matching HT and OED categories. It’s now down to Fraser and some assistants to manually go through things.
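The logic of the first of these scripts might be sketched as below. Only ‘oedmaincat’ is a real field name from the data; the `id`, `level` and `pos` keys, the `"n"` code for nouns and the idea of looking up OED categories by a `(maincat, pos)` pair are all assumptions made for the sake of a self-contained example.

```python
def borrow_oedmaincat(ht_categories, oed_keys):
    """Find new HT/OED category matches by borrowing the noun's 'oedmaincat'.

    ht_categories: dicts with 'id', 'level', 'pos' and 'oedmaincat' keys.
    oed_keys: set of (maincat, pos) pairs that exist in the OED data.
    Returns a list of (ht_category_id, borrowed_oedmaincat) pairs.
    """
    # The 'oedmaincat' of the noun category at each level, where one exists.
    noun_maincat = {
        cat["level"]: cat["oedmaincat"]
        for cat in ht_categories
        if cat["pos"] == "n" and cat.get("oedmaincat")
    }

    new_matches = []
    for cat in ht_categories:
        if cat["pos"] == "n" or cat.get("oedmaincat"):
            continue  # only non-noun categories with no 'oedmaincat'
        candidate = noun_maincat.get(cat["level"])
        if candidate and (candidate, cat["pos"]) in oed_keys:
            new_matches.append((cat["id"], candidate))
    return new_matches


if __name__ == "__main__":
    ht = [
        {"id": 1, "level": "01.01", "pos": "n", "oedmaincat": "A1"},
        {"id": 2, "level": "01.01", "pos": "vt", "oedmaincat": None},
        {"id": 3, "level": "01.02", "pos": "aj", "oedmaincat": None},
    ]
    print(borrow_oedmaincat(ht, {("A1", "vt")}))
```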
I did some further work for the SPADE project this week, extracting some information about the SCOTS corpus. I wrote a script that queries the SCOTS database and pulls out some summary information about the audio recordings. For each audio recording the ID, title, year recorded and duration in minutes are listed. Details for each participant (there are between 1 and 6) are also listed: ID, Gender, decade of birth (this is the only data about the age of the person that there is), place of birth and occupation (there is no data about ‘class’). This information appears in a table. Beneath this I also added some tallies: the total number of recordings, the total duration, the number of unique speakers (as a speaker can appear in multiple recordings) and a breakdown of how many of these are male, female or not specified. Hopefully this will be of use to the project.
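The tallies can be illustrated with a small sketch. The record structure and key names here are hypothetical and the sample data is invented; the point is simply that speakers are counted once by ID even when they appear in several recordings.

```python
from collections import Counter


def summarise(recordings):
    """Tally recordings, total duration, unique speakers and a gender breakdown."""
    speakers = {}
    for recording in recordings:
        for participant in recording["participants"]:
            # Keyed by ID so a speaker in several recordings is counted once.
            speakers[participant["id"]] = participant.get("gender") or "not specified"
    return {
        "recordings": len(recordings),
        "total_minutes": sum(r["minutes"] for r in recordings),
        "unique_speakers": len(speakers),
        "by_gender": dict(Counter(speakers.values())),
    }


if __name__ == "__main__":
    sample = [
        {"id": 1, "minutes": 30, "participants": [
            {"id": "p1", "gender": "female"}, {"id": "p2", "gender": "male"}]},
        {"id": 2, "minutes": 45, "participants": [
            {"id": "p1", "gender": "female"}, {"id": "p3", "gender": None}]},
    ]
    print(summarise(sample))
```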
Finally, I had a meeting with Kirsteen McCue and project RA Brianna Robertson-Kirkland about the Romantic National Song Network project. We discussed potential updates to the project website, how it would be structured, how the song features might work and other such matters. I’m intending to produce a new version of the website next week.
My time this week was split amongst many different projects. I continued to work on the Burns Paper Database, setting up a proper subdomain for the project and creating a more unified interface for the site, which previously just used styles taken from previous sites that I had borrowed functionality from. I think my work on this website is now pretty much complete and it’s been a useful experience, especially working with image pan and zoom libraries, which I will no doubt make further use of in future projects. It’s a shame I can’t share the URL, though, as the site needs to be password protected due to the use of high resolution copyrighted images.
I had a chat with Chris McGlashan this week about maybe migrating the project websites to HTTPS rather than just using HTTP. This week the main University website was being migrated over to this more secure, encrypted protocol for accessing web pages, and I wondered whether all of the project websites that exist as subdomains of the main University URL could also maybe make use of the main site’s SSL certificate. This would be good because we have lots of log-in forms for accessing content management systems and as these all submit data using non-encrypted HTTP that data could be intercepted. Browsers these days are also flagging up ‘insecure’ forms, which makes our sites look bad. However, the University’s IT people have advised against migrating to HTTPS for a couple of reasons. Firstly, we couldn’t just use the certificate from the main site (well, technically we might have been able to but from a security point of view this would be a bad idea as this certificate is used for finance sites and such things). This would mean we’d have to pay for our own certificates and projects generally don’t have the funds for that. Secondly, most of the data we deal with is considered ‘low risk’, and therefore doesn’t warrant an SSL certificate. So, we’re keeping things as they are, for the time being at least.
I spent a bit of time this week reworking the site design for Gerry McKeever’s Regional Romanticism project, as he wasn’t too keen on the design I’d previously created. The new design looks a lot better, and he seems happy with it, so all is well there. I also spent quite a bit of time this week working with Rachel MacDonald on the interface for the SPADE website. I created an initial website for the project many weeks ago and just left it with a placeholder interface, but this week I implemented a proper interface, which I think looks pretty good. I just need to wait for feedback from the project team now, though. Neither of these websites is officially live yet, so I can’t share the URLs for them.
Also this week I had a chat with the DSL people about a new page they want me to make on the website. I also created a new version of the Technical Plan for Thomas Clancy’s Iona project, based on feedback, and also created the Technical Summary paragraph for the main part of the proposal. I spent a bit of time following up on a task for my PDR too, and responded to a request for me to be on another interview panel.
I also returned to some Historical Thesaurus duties. A couple of weeks ago I was alerted to the existence of a non-noun category that didn’t have a noun category at the same level. This meant the category didn’t appear within the new ‘tree browse’ interface and neither it nor its child categories could be found by browsing. This issue was fixed by creating a new empty noun category. I wondered whether there might be any other similar categories in the database, so this week I wrote a little script to check. It turns out that there are similar categories, but thankfully not too many – between 20 and 30, in fact. After identifying these I asked Fraser to check what the headings of the empty noun categories should be, and once I heard back from him I created them, meaning all of the previously inaccessible categories can now be found. There may be more HT stuff to come back to next week, but we’ll see what is sent my way.
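A check along these lines could be sketched as follows. The representation of categories as `(level, part_of_speech)` pairs, the level numbers and the `"n"` code for nouns are assumptions for the example rather than the real HT schema.

```python
def levels_missing_noun(categories):
    """Return the levels that have categories but no noun category among them.

    categories: iterable of (level, part_of_speech) pairs.
    """
    parts_by_level = {}
    for level, pos in categories:
        parts_by_level.setdefault(level, set()).add(pos)
    return sorted(level for level, parts in parts_by_level.items() if "n" not in parts)


if __name__ == "__main__":
    # Hypothetical sample: level 01.03 has only a verb category.
    sample = [
        ("01.02", "n"), ("01.02", "aj"),
        ("01.03", "vt"),
        ("01.04", "av"), ("01.04", "n"),
    ]
    print(levels_missing_noun(sample))
```

Each level returned is one where an empty noun category would need to be created before the non-noun categories become reachable through the tree browse.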