This week I spent a lot of time continuing with the HT/OED linking task, tackling the outstanding items on my ‘to do’ list before I met with Marc and Fraser on Friday. This included the following:
Re-running category pattern matching scripts on the new OED categories: The bulk of the category matching scripts rely on matching the HT’s oedmaincat field against the OED’s path field (and then doing other things like comparing category contents). However, these scripts aren’t really very helpful with the new OED category table as the path has changed for a lot of the categories. The script that seemed the most promising was number 17 in our workflow document, which compares first dates of all lexemes in all unmatched OED and HT categories and doesn’t check anything else. I’ve created an updated version of this that uses the new OED data, and the script only brings back unmatched categories that have at least one word that has a GHT date, and interestingly the new data has fewer unmatched categories featuring GHT dates than the old data (591 as opposed to 794). I’m not really sure why this is, or what might have happened to the GHT dates. The script brings back five 100% matches (only 3 more than the old data, all but one containing just one word) and 52 matches that don’t meet our criteria (down from 56 with the old data), so it was not massively successful.
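The core of script 17’s comparison, matching unmatched categories purely on the first dates of their lexemes, might be sketched like this. This is a hypothetical reconstruction in Python; the real script runs against the database and its scoring details may differ:

```python
# Hypothetical sketch of the first-date comparison used in script 17.
# Function and variable names are illustrative, not the production code.

def date_match_score(oed_first_dates, ht_first_dates):
    """Fraction of the OED category's lexeme first (GHT) dates that
    also occur among the HT category's lexeme first dates.
    A score of 1.0 corresponds to a 100% match."""
    if not oed_first_dates:
        return 0.0
    ht = set(ht_first_dates)
    hits = sum(1 for d in oed_first_dates if d in ht)
    return hits / len(oed_first_dates)
```

Under this reading, a candidate pair only counts as a full match when every GHT-dated word in the unmatched OED category finds a same-dated word in the unmatched HT category.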
Ticking off all matching HT/OED lexemes rather than just those within completely matched categories: 627863 lexemes are now matched. There are 731307 non-OE words in the HT, so about 86% of these are ticked off. There are 751156 lexemes in the new OED data, so about 84% of these are ticked off. Whilst doing this task I noticed another unexpected thing about the new OED data: the numbers of categories in ‘01’ and ‘02’ have decreased while the number in ‘03’ has increased. In the old OED data we have the following number of matched categories:
In the new OED data we have the following number of matched categories:
The totals match up, other than the 42 matched categories that have been deleted in the new data, so (presumably) some categories have changed their top level. Matching up the HT and OED lexemes has introduced a few additional duplicates, caused when a ‘stripped’ form means multiple words within a category match. There aren’t too many, but they will need to be fixed manually.
Identifying all words in matched categories that have no GHT dates and seeing which of these can be matched on stripped form alone: I created a script to do this, which lists every unmatched OED word that doesn’t have a GHT date in every matched OED category and then tries to find a matching HT word from the remaining unmatched words within the matched HT category. Perhaps I misunderstood what was being requested because there are no matches returned in any of the top-level categories. But then maybe OED words that don’t have a GHT date are likely to be new words that aren’t in the HT data anyway?
Creating a monosemous script that finds all unmatched HT words that are monosemous and seeing whether there are any matching OED words that are also monosemous: Again, I think the script I created will need more work. It is currently set to only look at lexemes within matched categories. It finds all the unmatched HT words that are in matched categories, then checks how many times each word appears amongst the unmatched HT words in matched categories of the same POS. If the word only appears once then the script looks within the matched OED category to find a currently unmatched word that matches. At the moment the script does not check to see if this word is monosemous as I figured that if the word matches and is in a matched category it’s probably a correct match. Of the 108212 unmatched HT words in matched categories, 70916 are monosemous within their POS and of these 14474 can be matched to an OED lexeme in the corresponding OED category.
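The logic of the monosemous check might be sketched like this, assuming simplified in-memory data rather than the real database tables (all names here are illustrative):

```python
from collections import Counter

def monosemous_matches(unmatched_ht, unmatched_oed_by_cat):
    """unmatched_ht: list of (stripped_word, pos, ht_cat_id, oed_cat_id)
    tuples for unmatched HT words in matched categories.
    unmatched_oed_by_cat: dict mapping an OED category ID to the set of
    stripped, currently unmatched OED words in that category.
    Returns (word, pos, ht_cat_id) tuples where the HT word occurs only
    once within its POS and a matching word exists in the matched OED
    category."""
    counts = Counter((w, pos) for w, pos, _, _ in unmatched_ht)
    matches = []
    for w, pos, ht_cat, oed_cat in unmatched_ht:
        if counts[(w, pos)] == 1 and w in unmatched_oed_by_cat.get(oed_cat, set()):
            matches.append((w, pos, ht_cat))
    return matches
```

As in the post, this only confirms the HT side is monosemous; the OED word is accepted on the strength of sitting in the matched category.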
Deciding which OED dates to use: I created a script that gets all of the matched HT and OED lexemes in one of the top-level categories (e.g. 01) and then for each matched lexeme works out the largest difference between OED sortdate and HT firstd (if sortdate is later then sortdate-firstd, otherwise firstd-sortdate); works out the largest difference between OED enddate and HT lastd in the same way; adds these two differences together to work out the largest overall difference. It then sorts the data on the largest difference and then displays all lexemes in a table ordered by largest difference, with additional fields containing the start difference, end difference and total difference for info. I did, however, encounter a potential issue: Not all HT lexemes have a firstd and lastd. E.g. words that are ‘OE-‘ have nothing in firstd and lastd but instead have ‘OE’ in the ‘oe’ column and ‘_’ in the ‘current’ column. In such cases the difference between HT and OED dates is massive, but not accurate. I wonder whether using HT’s apps and appe columns might work better.
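The difference calculation itself is straightforward; here is a sketch. Field names follow the post, and missing firstd/lastd values (the ‘OE-’ case) are exactly the unhandled problem noted above:

```python
def date_differences(ht_firstd, ht_lastd, oed_sortdate, oed_enddate):
    """Start difference, end difference and their total, computed as the
    absolute gap in each direction (sortdate-firstd if sortdate is later,
    otherwise firstd-sortdate, and likewise for enddate/lastd)."""
    start_diff = abs(oed_sortdate - ht_firstd)
    end_diff = abs(oed_enddate - ht_lastd)
    return start_diff, end_diff, start_diff + end_diff

# The results table is then ordered by the total, largest first, e.g.:
# rows.sort(key=lambda r: r[2], reverse=True)
```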
Looking at lexemes that have an OED citation after 1945, which should be marked as ‘current’: I created a script that goes through all of the matched lexemes and lists all of the ones that either have an OED sortdate greater than 1945 or an OED enddate greater than 1945 where the matched HT lexeme does not have the ‘current’ flag set to ‘_’. There are 73919 such lexemes.
On Friday afternoon I had a meeting with Marc and Fraser where we discussed the above and our next steps. I now have a further long ‘to do’ list, which I will no doubt give more information about next week.
Other than HT duties I helped out with some research proposals this week. Jane Stuart-Smith and Eleanor Lawson are currently putting a new proposal together and I helped to write the data management plan for this. I also met with Ophira Gamliel in Theology to discuss a proposal she’s putting together. This involved reading through a lot of materials and considering all the various aspects of the project and the data requirements of each, as it is a highly multifaceted project. I’ll need to spend some further time next week writing a plan for the project.
I also had a chat with Wendy Anderson about updating the Mapping Metaphor database, and also the possibility of moving the site to a different domain. I also met with Gavin Miller to discuss the new website I’ll be setting up for his new Glasgow-wide Medical Humanities Network, and I ran some queries on the DSL database in order to extract entries that reference the OED for some work Fraser is doing.
Finally, I had to make some changes to the links from the Bilingual Thesaurus to the Middle English dictionary website. The site has had a makeover, and is looking great, but unfortunately when they redeveloped the site they didn’t put redirects from the old URLs to the new ones. This is pretty bad as it means anyone who has cited or bookmarked a page will end up with broken links, not just BTh. I would imagine entries have been cited in countless academic papers and all these citations will now be broken, which is not good. Anyway, I’ve fixed the MED links in BTh now. Unfortunately there are two forms of link in the database, for example: http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466 and http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact. I’m not sure why this is the case and I’ve no idea what the ‘byte’ number refers to in the second link type. The first type includes the entry ID, which is still used in the new MED URLs. This means I can get my script to extract the ID from the URL in the database and then replace the rest with the new URL, so the above becomes https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466 as the target for our MED button and links directly through to the relevant entry page on their new site.
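The ID extraction and rewriting can be sketched like this (a hypothetical reconstruction of the fix-up script, using the URL forms quoted above):

```python
import re

# New MED entry URL base, as given in the post.
NEW_BASE = "https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/"

def fix_med_link(old_url):
    """Pull the MED entry ID out of an old 'type=id' URL and rebuild the
    link against the new site. The 'type=byte' form carries no entry ID,
    so it can't be converted this way and None is returned."""
    m = re.search(r"[?&]id=(MED\d+)", old_url)
    return NEW_BASE + m.group(1) if m else None
```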
Unfortunately there doesn’t seem to be any way to identify an individual entry page for the second type of link. This means there is no way to link directly to the relevant entry page. However, I can link to the search results page by passing the headword, and this works pretty well. So, for example the three words on this page: https://thesaurus.ac.uk/bth/category/?type=search&hw=2&qsearch=catourer&page=1#id=1393 have the second type of link, but if you press on one of the buttons you’ll find yourself at the search results page for that word on the MED website, e.g. https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer.
I returned to work after my Easter holiday on Tuesday this week, making it another four-day week for me. On Tuesday I spent some time going through my emails and dealing with some issues that had arisen whilst I’d been away. This included sorting out why plain text versions of the texts in the Corpus of Modern Scottish Writing were giving 403 errors (it turned out the server was set up to not allow plain text files to be accessed and an email to Chris got this sorted). I also spent some time going through the Mapping Metaphor data for Wendy. She wanted me to structure the data to allow her to easily see which metaphors continued from Old English times and I wrote a script that gave a nice colour-coded output to show those that continued or didn’t. I also created another script that lists the number (and the details of) metaphors that begin in each 50-year period across the full range. In addition, I spoke to Gavin Miller about an estimate of my time for a potential follow-on project he’s putting together.
The rest of my week was split between two projects: LinguisticDNA and REELS. For LinguisticDNA I continued to work on the search facilities for the semantically tagged EEBO dataset. Chris gave me a test server on Tuesday (just an old desktop PC to add to the several others I now have in my office) and I managed to get the database and the scripts I’d started working on before Easter transferred onto it. With everything set up I continued to add new features to the search facility. I completed the second search option (Choose a Thematic Heading and a specific book to view the most frequent words) which allows you to specify a Thematic Heading, a book, a maximum number of returned words and whether the theme selection includes lower levels. I also made it so that you can miss out the selection of a thematic heading to bring back all of the words in the specified book listed by frequency. If you do this each word’s thematic heading is also listed in the output, and it’s a useful way of figuring out which thematic headings you might want to focus on.
I also added a new option to both searches 1 and 2 that allows you to amalgamate the different noun and verb types. There are several different types (e.g. NN1 and NN2 for singular and plural forms of nouns) and it’s useful to join these together as single frequency counts rather than having them listed separately.
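The amalgamation amounts to merging frequency counts across tag subtypes. A simplified sketch, assuming CLAWS-style tags where subtypes are distinguished by trailing digits (the real grouping may be more involved):

```python
from collections import Counter

def amalgamate(freqs):
    """freqs: dict mapping (word, pos_tag) -> frequency count.
    Collapses tag subtypes that differ only by a trailing digit
    (e.g. NN1 and NN2 both become NN) into a single count per word."""
    merged = Counter()
    for (word, tag), n in freqs.items():
        base = tag.rstrip("0123456789")  # NN1 -> NN, NN2 -> NN
        merged[(word, base)] += n
    return merged
```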
I also completed search option 3 (Choose a specific book to view the most frequent Thematic Headings). This allows the user to select a book from an autocomplete list and optionally provide a limit to the returned headings. The results display the thematic headings found in the book listed in order of frequency. The returned headings are displayed as links that perform a ‘search 2’ for the heading in the book, allowing you to more easily ‘drill down’ into the data. For all results I have added in a count column, so you can easily see how many results are returned or reference a specific result, and I also added titles to the search results pages that tell you exactly what it is you’ve searched for. I also created a list of all thematic headings, as I thought it might be handy to be able to see what’s what. When looking at this list you can perform a ‘search 1’ for any of the headings by clicking on one, and similarly, I created an option to list all of the books that form the dataset. This list displays each book’s ID, author, title, terms and number of pages, and you can perform a ‘search 3’ for a book by clicking on its ID.
On Friday I participated in the Linguistic DNA project conference call, following which I wrote a document describing the EEBO search facilities, as project members outside of Glasgow can’t currently access the site I’ve put together.
For REELS I continued to work on the public interface for the place-name data, which included the following:
- The number of returned place-names is now displayed in the ‘you searched for…’ box
- The textual list of results now features two buttons for each result, one to view the record and one to view the place-name on the map. I’m hoping the latter might be quite useful as I often find an interesting name in the textual list and wonder which dot on the map it actually corresponds to. Now with one click I can find it.
- Place-name labels on the map now appear when you zoom in past a certain level (currently set to zoom level 12). Note that only results rather than grey spots get the visible labels as otherwise there’s too much clutter and the map takes ages to load too.
- The record page now features a map with the place-name at the centre, and all other place-names as grey dots. The marker label is automatically visible.
- Returning back to the search results from a record when you’ve done a quick search now works – previously this was broken.
- The map zoom controls have been moved to the bottom right, and underneath them is a new icon for making the map ‘full screen’. Pressing on this will make the map take up the whole of your screen. Press ‘Esc’ or on the icon again to return to the regular view. Note that this feature requires a modern web browser, although I’ve just tested it in IE on Windows 10 and it works. Using full screen mode makes working with the map much more pleasant. Note, however, that navigating away from the map (e.g. if you click a ‘view record’ button) will return you to the regular view.
- There is a new ‘menu’ icon in the top-left of the map. Press on this and a menu slides out from the left. This presents you with options to change how the results are categorised on the map. In addition to the ‘by classification code’ option that has always been there, you can now categorise and colour code the markers by start date, altitude and element language. As with code, you can turn on and off particular levels using the legend in the top right. E.g. if you only want to display markers that have an altitude of 300m or more.
This was a short week for me as I only worked from Monday to Wednesday due to Christmas coming along. I spent most of Monday and Tuesday continuing to work on the Technical Plan for Joanna Kopaczyk’s proposal. As it’s a project with quite a large technical component there was a lot to think about and lots of detail to try and squeeze into the maximum of four pages allowed for a Plan. My first draft was five pages long, so I had to chop some information out and reformat things to try and bring the length down a bit, but thankfully I managed to get it within the limit whilst still making sense and retaining the important points. I also chatted with Graeme some more about some of the XML aspects of the project and had an email conversation with Luca about it too. It was good to get the Plan sent on to Joanna, although it’s still very much a first draft that will need some further tweaking as other aspects of the proposal are firmed up.
I had to fix an issue with the Thesaurus of Old English staff pages on Monday. The ‘edit lexemes’ form was set to not allow words to be more than 21 characters long. Jane Roberts had been trying to update the positioning of the word ‘(ge)mearcian mid . . . rōde’, and as this is more than 21 characters any changes made to this row were being rejected. I’m not sure why I’d set the maximum word length to 21 as the database allows up to 60 characters in this field. But I updated the check to allow up to 60 characters and that fixed the problem. I also spent a bit of time on Tuesday gathering some stats for Wendy about the various Mapping Metaphor resources (i.e. the main website, the blog, the iOS app and the Android app). I also had a chat with Jane Stuart-Smith about an older but still very important site that she would like me to redesign at some point next year, and I started looking through this and thinking how it could be improved.
On Wednesday, as it was my last day before the hols, I decided to focus on something from my ‘to do’ list that would be fun. I’d been wanting to make a timeline for the Historical Thesaurus for a while so I thought I’d look into that. What I’ve created so far is a page through which you can pass a category ID and then see all of the words in the category in a visualisation that shows when the word was used, based on the ‘apps’ and ‘appe’ fields in the database. When a word’s ‘apps’ and ‘appe’ fields are the same it appears as a dot in the timeline, and where the fields are different the word appears as a coloured bar showing the extent of the attested usage. Note that more complicated date structures such as ‘a1700 + 1850–‘ are not visualised yet, but could be incorporated (e.g. a dot for 1700 then a bar from 1850 to 2000).
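The dot-versus-bar decision is simple enough to show. This is a sketch, not the actual D3 code, which would also have to cope with the compound date structures mentioned above:

```python
def timeline_shape(apps, appe):
    """'apps' and 'appe' are the HT attestation start and end fields.
    Identical values render as a single dot at that year; otherwise the
    word is drawn as a bar spanning the attested range."""
    if apps == appe:
        return ("dot", apps)
    return ("bar", apps, appe)
```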
When you hover over a dot or bar the word and its dates appear below the visualisation. Eventually (if we’re going to use this anywhere) I would instead have this as a tool-tip pop-up sort of thing.
Here are a couple of screenshots of fitting examples for the festive season. First up is words for ‘Be gluttonous’:
And here are words for ‘Excess in drinking’:
The next step with this would be to incorporate all subcategories for a category, with different shaded backgrounds for sections for each subcategory and a subcategory heading added in. I’m not entirely sure where we’d link to this, though. We could allow people to view the timeline by clicking on a button in the category browse page. Or we might not want to incorporate it at all, as it might just clutter things up. BTW, this is a D3-based visualisation created by adapting this code: https://github.com/denisemauldin/d3-timeline
That’s all from me for 2017. Best wishes for Christmas and the New Year to one and all!
I was struck down with some sort of tummy bug at the weekend and wasn’t well enough to come into work on Monday, but I worked from home instead. Unfortunately although I struggled through the day I was absolutely wiped out by the end of it and ended up being off work sick on Tuesday and Wednesday. I was mostly back to full health on Thursday, which is the day I normally work from home anyway, so I made it through that day and was back to completely full health on Friday, thankfully. So I only managed to work for three days this week, and for two of those I wasn’t exactly firing on all cylinders. However, I still managed to get a few things done this week.
Last week I’d migrated the Mapping Metaphor blog site and after getting approval from Wendy I deleted the old site on Monday. I took a backup of the database and files before I did so, and then I wrote a little redirect that ensures Google links and bookmarks to specific blog pages point to the correct page on the main Metaphor site. I also had some further AHRC review duties to take care of, plus I spent some time reading through the Case for Support for Joanna Kopaczyk’s project and thinking about some of the technical implications. Pauline Mackay also sent me a sample of an Access database she’s put together for her Scots Bawdry project. I’m going to create an online version of this so I spent a bit of time going through it and thinking about how it would work.
I spent most of Thursday and Friday working on this new system for Pauline, and by the end of the week I had created an initial structure for the online database, had created some initial search and browse facilities and I also created some management pages to allow Pauline to add / edit / delete records. The search page allows users to search for any combination of the following fields:
Verse title, first line, language, theme, type, ms title, publication year, place, publisher and location. Verse title, first line and ms title are free text and will bring back any records with matching text – e.g. if you enter ‘the’ into ‘verse title’ you can find all records where these three characters appear together in a title. Publication year allows users to search for an individual year or a range of years (e.g. 1820-1840 brings back everything that has a date between and including these years). Language, place, publisher and location are drop-down lists that allow you to select one option. Themes and type are checkboxes allowing you to select any number of options, with each joined by an ‘or’ (e.g. all the records that have a theme of ‘illicit love’ or ‘marriage’). I can change any of the single selection drop-downs to multiple options (or vice versa) if required. If multiple boxes are filled in these are joined by ‘and’ – e.g. publication place is Glasgow AND publication year is 1820.
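The way the clauses combine (AND between fields, OR within the theme checkboxes) might look roughly like this. This is an illustrative sketch with placeholder field names, not the real system’s code:

```python
def build_where(filters):
    """filters: dict of field -> value for the single-selection drop-downs,
    plus an optional 'themes' list for the checkboxes. Scalar fields are
    joined with AND; the selected themes are OR'd together inside one
    bracketed clause. Returns (where_sql, params) for a parameterised query."""
    clauses, params = [], []
    for field in ("place", "publisher", "language"):
        if filters.get(field):
            clauses.append(f"{field} = %s")
            params.append(filters[field])
    themes = filters.get("themes") or []
    if themes:
        clauses.append("(" + " OR ".join(["theme = %s"] * len(themes)) + ")")
        params.extend(themes)
    return " AND ".join(clauses), params
```

Using parameter placeholders rather than pasting the values into the SQL keeps the free-text fields safe from injection.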
The browse page presents all of the options in the search form as clickable lists, with each entry having a count to show you how many records match. For ‘publication year’ only those records with a year supplied are included. Clicking on a search or browse result displays the full record. Any content that can be searched for (e.g. publication type) is a link and clicking on it performs a search for that thing.
For the management pages, once logged in a staff user can browse the data, which displays all of the records in one big table. From here the user can access options to edit or delete a record. Deleting a record simply deactivates it in the database and I can retrieve it again if required. Users can also add new records by clicking on the ‘add new row’ link. I also created a script for importing all of the data from the Access database and I will run this again on a more complete version of the database when Pauline is ready to import everything. This is all just an initial version, and there will no doubt be a few changes required, but I think it’s all come together pretty well so far.
I was off on Tuesday this week to attend my uncle’s funeral. I spent the rest of the week working on a number of relatively small tasks for a variety of different projects. The Dictionary of Old English people got back to me on Monday to say they had updated their search system to allow our Thesaurus of Old English site to link directly from our word records to a search for that word on their site. This was really great news, and I updated our site to add in the direct links. This is going to be very useful for users of both sites. I spent a bit more time on AHRC review duties this week, and I also had an email discussion with Joanna Kopaczyk in English Language about a proposal she is putting together. She sent me on the materials she is working on and I read through them all and gave some feedback about the technical aspects. I’m going to help her to write the Technical Plan for her project soon too. I also met with Rachel Douglas from the School of Modern Languages to offer some advice on technical matters relating to a project she’s putting together. Although Rachel is not in my School and I therefore can’t be involved in her project it was still good to be able to give her a bit of help and show her some examples of digital outputs similar to the sorts of thing she is hoping to produce.
I also spent some further time working on the integration of OED data with the Historical Thesaurus data with Fraser. Fraser had sent me some further categories that he and a student had manually matched up, and had also asked me to write another script that picks out all of the unmatched HT categories and all of the unmatched OED categories and for each HT category goes through all of the OED categories and finds the one with the lowest Levenshtein score (an algorithm that returns a number showing how many steps it would take to turn one string into another). My initial version of this script wasn’t ideal, as it included all unmatched OED categories and I’d forgotten that this included several thousand that are ‘top level’ categories that don’t have a part of speech and shouldn’t be matched with our categories at all. I also realised that the script should only compare categories that have the same part of speech, as my first version was ending up with (for example) a noun category being matched up with an adjective. I updated the script to bear these things in mind, but unfortunately the output still doesn’t look all that useful. However, there are definitely some real matches that can be manually picked out from the list, e.g. 31890 ‘locustana pardalina or rooibaadjie’ and ‘locustana pardalina (rooibaadjie)’ and some others around there. Also 14149 ‘applied to weapon etc’ and ‘applied to weapon, etc’. It’s over to Fraser again to continue with this.
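The matching step can be sketched as below: a standard Levenshtein distance plus the part-of-speech filter that the second version of the script added. This is a hypothetical reconstruction (the production script presumably runs in the web scripting environment against the database):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def best_match(ht_heading, ht_pos, oed_cats):
    """oed_cats: iterable of (heading, pos) for unmatched OED categories.
    Only same-POS categories are compared, and POS-less 'top level'
    categories are skipped, as described above. Returns the lowest
    (distance, heading) pair, or None if nothing qualifies."""
    candidates = [(levenshtein(ht_heading, h), h)
                  for h, p in oed_cats if p and p == ht_pos]
    return min(candidates) if candidates else None
```

On the post’s own example, ‘applied to weapon etc’ and ‘applied to weapon, etc’ are a single insertion apart, which is why they surface near the top of the list.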
I mentioned last week that I’d updated all of our WordPress sites to version 4.9, but that 4.9.1 would no doubt soon be released. And in fact it was released this week, so I had to update all of the sites once more. It’s a bit of a tedious task but it doesn’t really take too long – maybe about half an hour in total. I also decided to tick an item off my long-term ‘to do’ list as I had a bit of time available. The Mapping Metaphor site had a project blog, located at a different URL from the main site. As the project has now ended there are no more blog posts being made so it seems a bit pointless hosting this WordPress site, and having to keep it maintained, when I could just migrate the content to the main MM website as static HTML and delete the WordPress site. I spent some time investigating WordPress plugins that could export entire sites as static HTML, for example https://en-gb.wordpress.org/plugins/static-html-output-plugin/ and https://wordpress.org/plugins/simply-static/. These plugins go through a WordPress site, convert all pages and posts to static HTML, pull in the WordPress file uploads folder and wrap everything up as a ZIP file. This seemed ideal, and the tools both worked very well, but I realised they weren’t exactly what I needed. Firstly, the Metaphor blog (which was set up before I was involved with the project) just uses page IDs in the URLs, not other sorts of permalinks. Neither plugin works with the default URL style in place, so I’d need to change the link type, meaning the new pages would have different URLs from the old pages, which would be a problem for redirects. Secondly, both plugins pull in all of the page elements, including the page design, the header and all the rest. I didn’t actually want all of this stuff but just the actual body of the posts (plus titles and a few other details) so I could slot this into the main MM website template.
So instead of using a plugin I realised it was probably simpler and easier if I just wrote my own little export script that grabbed just the published posts (not pages), for each getting the ID, the title, the main body, the author and the date of creation. My script hooked into the WordPress functions to make use of the ‘wpautop’ function, which adds paragraph markup to texts, and I also replaced absolute URLs with relative ones. I then created a temporary table to hold just this data, set my script to insert into it and then I exported this table. I imported this into the main MM site’s database and wrote a very simple script to pull out the correct post based on the passed ID and that was that. Oh, I also copied the WordPress uploads directory across too, so images and PDFs and such things embedded in posts would continue to work. Finally, I created a simple list of posts. It’s exactly what was required and was actually pretty simple to implement, which is a good combination.
On Thursday I heard that the Historical Thesaurus had been awarded the ‘Queen’s Anniversary Prize for Higher Education’, which is a wonderful achievement for the project. Marc had arranged a champagne reception on Friday afternoon to celebrate the announcement, so I spent most of the afternoon sipping champagne and eating chocolates, which was a nice way to end the week.
After an enjoyable week’s holiday I returned to work on Monday, spending quite a bit of Monday catching up with some issues people had emailed me about whilst I was away, such as making further tweaks to the ‘Concise Scots Dictionary’ page on the DSL website for Rhona Alcorn (the page is now live if you’d like to order the book: http://dsl.ac.uk/concise-scots-dictionary/), speaking with Luca about a project he’s involved in the planning of that’s going to use some of the DSL data, helping Carolyn Jess-Cooke with some issues she was encountering when accessing one of her websites, giving some information to Brianna of the RNSN project about timeline tools we might use, and a few other such things.
I had a couple of queries from Wendy Anderson this week. The first was for Mapping Metaphor. Wendy wanted to grab all of the bidirectional metaphors in both the main and OE datasets, including all of their sample lexemes. I wrote a script that extracted the required data and formatted it as a CSV file, which is just the sort of thing she wanted. The second query was for all of the metadata associated with the Corpus of Modern Scots Writing texts. A researcher had contacted Wendy to ask for a copy but although the metadata is in the database and can be viewed on a per text basis through the website, we didn’t have the complete dataset in an easy-to-share format. I wrote a little script that queried the database and retrieved all of the data. I had to do a little digging into how the database was structured in order to do this, as it is a system that wasn’t developed by me. However, after a little bit of exploration I managed to write a script that grabbed the data about each text, including multiple authors that can be associated with each text. I then formatted this as a CSV file and sent the outputted file to Wendy.
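The export side of that task can be sketched as follows, assuming the multi-author relationship is flattened into one cell per text so each text stays a single CSV row (field names are illustrative, not the real CMSW schema):

```python
import csv

def export_metadata(texts, authors_by_text, path):
    """texts: list of dicts with 'id' and 'title' (plus whatever other
    metadata fields come back from the database query).
    authors_by_text: dict mapping a text ID to its list of author names.
    Writes one CSV row per text, joining multiple authors into one cell."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["id", "title", "authors"])
        for t in texts:
            authors = "; ".join(authors_by_text.get(t["id"], []))
            w.writerow([t["id"], t["title"], authors])
```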
I met with Gary on Monday to discuss some changes to the SCOSYA atlas and CMS that he wanted me to implement ahead of an event the team are at next week. This included adding Google Analytics to the website, updating the legend of the Atlas to make it clearer what the different rating levels meant, separating out the grey squares (which mean no data is present) and the grey circles (meaning data is present but doesn’t meet the specified criteria) into separate layers so they can be switched on and off independently of each other, making the map markers a little smaller, and adding in facilities to allow Gary to delete codes, attributes and code parents via the CMS. This all took a fair amount of time to implement, and unfortunately I lost a lot of time on Thursday due to a very strange situation with my access to the server.
I work from home on Thursdays and I had intended to work on the ‘delete’ facilities that day, but when I came to log into the server the files and the database appeared to have reverted back to the state they were in in May – i.e. it looked like we had lost almost six months of data, plus all of the updates to the code I’d implemented during this time. This was obviously rather worrying and I spent a lot of time toing and froing with Arts IT Support to try and figure out what had gone wrong. This included restoring a backup from the weekend before, which strangely still seemed to reflect the state of things in May. I was getting very concerned about this when Gary noted that he was seeing two different views of the data on his laptop. In Safari on his laptop his view of the data appeared to have ‘stuck’ at May while in Chrome he could see the up to date dataset. I then realised that perhaps the issue wasn’t with the server after all but instead the problem was my home PC (and Safari on Gary’s laptop) was connecting to the wrong server. Arts IT Support’s Raymond Brasas suggested it might be an issue with my ‘hosts’ file and that’s when I realised what had happened. As the SCOSYA domain is an ‘ac.uk’ domain and it takes a while for these domains to be set up, we had set up the server long before the domain was running, so to allow me to access the server I had added a line to the ‘hosts’ file on my PC to override what happens when the SCOSYA URL is requested. Instead of it being resolved by a domain name service my PC pointed at the IP address of the server as I had entered it in my ‘hosts’ file. Now in May, the SCOSYA site was moved to a new server, with a new IP address, but the old server had never been switched off, so my home PC was still connecting to this old server. I had only encountered the issue this week because I hadn’t worked on SCOSYA from home since May. So, it turned out there was no problem with the server, or the SCOSYA data. 
I removed the line from my ‘hosts’ file, restarted my browser and immediately I could access the up to date site. All this took several hours of worry and stress, but it was quite a relief to actually figure out what the issue was and to be able to sort it.
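For anyone unfamiliar with the mechanism, a ‘hosts’ file override is just a single line mapping a hostname to an IP address; while the line is present the operating system uses it instead of asking a DNS server, which is why my PC kept finding the old server. The address and hostname below are purely hypothetical, not the real SCOSYA details:

```
# Hypothetical hosts file entry – any request for this hostname
# now goes straight to this IP, bypassing DNS entirely.
192.0.2.10    project.example.ac.uk
```

The catch, as I discovered, is that such a line never expires: if the real server later moves, the override silently keeps pointing at the old machine until you remember to remove it.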
I had intended to start setting up the server for the SPADE project this week, but the machine has not yet been delivered, so I couldn’t work on this. I did make a few further tweaks to the SPADE website, however, and responded to a couple of queries from Rachel about the SCOTS data and metadata, which the project will be using.
I also met with Fraser to discuss the ongoing issue of linking up the HT and OED data. We’re at the stage now where we can think about linking up the actual words with categories. I’d previously written a script that goes through each HT category that matches an OED category and compares the words in each, checking whether an HT word matches the text found in either the OED ‘ght_lemma’ or ‘lemma’ fields. After our meeting I updated the HT lexeme table to include extra fields for the ID of a matching OED lexeme and whether the lexeme had been checked. After that I updated the script to go through every matching category in order to ‘tick off’ the matching words within. The first time I ran my script it crashed the browser, but with a bit of tweaking I got it to successfully complete the second time. Here are some stats:
There are 655513 HT lexemes that are now matched up with an OED lexeme. There are 47074 HT lexemes that only have OE forms, so with 793733 HT lexemes in total this means there are 91146 HT lexemes that should have an OED match but don’t. Note, however, that we still have 12373 HT categories that don’t match OED categories and these categories contain a total of 25772 lexemes.
On the OED side of things, we have a total of 688817 lexemes, and of these 655513 now match an HT lexeme, meaning there are 33304 OED lexemes that don’t match anything. At least some of these will also be cleared up by future HT / OED category matches. Of the 655513 OED lexemes that now match, 243521 of them are ‘revised’. There are 262453 ‘revised’ OED lexemes in total, meaning there are 18932 ‘revised’ lexemes that don’t currently match an HT lexeme. I think this is all pretty encouraging as it looks like my script has managed to match up the bulk of the data. It’s just the several thousand edge cases that are going to be a bit more work.
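The category-by-category comparison described above can be sketched roughly as follows. The field names (‘lemma’, ‘ght_lemma’) come from the post, but the in-memory structures and the exact ‘stripped’ form comparison are my assumptions standing in for the real MySQL tables and matching rules:

```python
# Sketch of the lexeme-matching pass, assuming the data for one matched
# HT/OED category pair has already been fetched into lists of dicts.

def strip_form(word):
    """Reduce a form to lowercase alphanumerics so that minor
    punctuation and spacing differences don't block a match."""
    return ''.join(ch for ch in word.lower() if ch.isalnum())

def match_lexemes(ht_words, oed_words):
    """Return {ht_id: oed_id} for every HT word whose stripped form
    equals the stripped 'lemma' or 'ght_lemma' of an OED word in the
    same category. First match wins, mirroring a 'tick off' pass."""
    matches = {}
    for ht in ht_words:
        target = strip_form(ht['word'])
        for oed in oed_words:
            if target in (strip_form(oed['lemma']),
                          strip_form(oed.get('ght_lemma') or '')):
                matches[ht['id']] = oed['id']
                break
    return matches
```

Note that the stripping step is also what can introduce duplicates: two distinct HT words within a category may share the same stripped form, which is why some matches need fixing manually afterwards.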
On Wednesday I met with Thomas Widmann of Scots Language Dictionaries to discuss our plans to merge all three of the SLD websites (DSL, SLD and Scuilwab) into one resource that will have the DSL website’s overall look and feel. We’re going to use WordPress as a CMS for all of the site other than the DSL’s dictionary pages, so as to allow SLD staff to very easily update the content of the site. It’s going to take a bit of time to migrate things across (e.g. making a new WordPress theme based on the DSL website, creating quick search widgets, updating the DSL dictionary pages to work with the WordPress theme), but we now have the basis of a plan. I’ll try to get started on this before the year is out.
Finally this week, I responded to a request from Simon Taylor to make a few updates to the REELS system, and I replied to Thomas Clancy about how we might use existing Ordnance Survey data in the Scottish Place-Names survey. All in all it has been a very busy week.
On Monday this week I spent a bit of time creating a new version of the MetaphorIC app, featuring the ‘final’ dataset from Mapping Metaphor. This new version features almost 12,000 metaphorical connections between categories and more than 30,000 examples of metaphor. Although the creation of the iOS version went perfectly smoothly (this time), I ran into some difficulties updating the Android app as the build process started giving me some unexplained errors. I eventually tried dropping the Android app in order to rebuild it, but that didn’t work either and unfortunately dropping the app also deleted its icon files. After that I had to build the app in a new location, which thankfully worked. Also thankfully I still had the source files for the icons so I could create them again. There’s always something that doesn’t go smoothly when publishing apps. The new version of the app was made available on the Apple App Store and Google Play by the end of the week and you can download either version by following the links here: http://mappingmetaphor.arts.gla.ac.uk/metaphoric/. That’s Mapping Metaphor and its follow-on project MetaphorIC completely finished now, other than the occasional tweak that will no doubt be required.
I spent the bulk of the rest of the week working on the Burns Paper Database for Ronnie Young. Last week I started looking at the Access version of the database that Ronnie had sent me, and I managed to make an initial version of a MySQL database to hold the data and created an upload script that populated it with the data via a CSV file. This week I met with Ronnie to discuss how to take the project further. We agreed that rather than having an online content management system through which Ronnie would continue to update the database, he would instead continue to use his Access version and I would then run this through my ‘import’ script to replace the old online version whenever updates are required. This is a more efficient approach as I already have an upload script and Ronnie is already used to working with his Access database.
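The replace-on-import workflow described above can be sketched like this. The column names are hypothetical and the real script targets MySQL; I’ve used sqlite3 here purely to keep the sketch self-contained:

```python
import csv
import io
import sqlite3

# Sketch of the Access-export -> database import, assuming the data
# arrives as a CSV file. The table is dropped and rebuilt on each run,
# so re-running the import replaces the old online data wholesale.

def import_csv(conn, csv_text):
    conn.execute('DROP TABLE IF EXISTS paper')
    conn.execute('CREATE TABLE paper (id INTEGER PRIMARY KEY, '
                 'title TEXT, date TEXT)')
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany('INSERT INTO paper (title, date) VALUES (?, ?)',
                     [(row['title'], row['date']) for row in reader])
    conn.commit()
```

The drop-and-rebuild approach suits this project because the Access database remains the single source of truth; there’s no risk of the online copy drifting out of sync with edits made elsewhere.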
We went through the data together and worked out which fields would need to be searchable and browseable, and how the data should be presented. This was particularly useful as there are some consistency issues with the data, for example in how uncertain dates are recorded, which may include square brackets, asterisks, question marks, the use of ‘or’ and also date ranges.
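As a first pass at those date inconsistencies, something like the following could flag every record that uses one of the uncertain notations mentioned above, so they can be reviewed before the data goes online. This is just a hypothetical sketch, not part of the actual import script:

```python
import re

# Matches any of the uncertain-date notations seen in the data:
# square brackets, asterisks, question marks, the word 'or',
# or a digit-to-digit range like 1788-1790.
UNCERTAIN = re.compile(r'[\[\]*?]|\bor\b|\d\s*-\s*\d')

def is_uncertain(date_str):
    """True if a date string uses any uncertain notation."""
    return bool(UNCERTAIN.search(date_str))
```

Flagging rather than normalising seems the safer choice at this stage: the different notations may carry distinct meanings (a guess, a range, alternatives) that shouldn’t be collapsed automatically.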
Other than the above I also did a little bit of work for the SPADE project, beginning to create a proper interface for the website with Rachel MacDonald, and I had a further chat with Gerry McKeever regarding the website for his new project.
I spent a lot of this week continuing with the redevelopment of the ARIES app and thankfully after laying the groundwork last week (e.g. working out the styles and the structure, implementing a couple of exercise types) my rate of progress this week was considerably improved. In fact, by the end of the week I had added in all of the content and had completed an initial version of the web version of the app. This included adding in some new quiz types, such as one that allows the user to reorder the sentences in a paragraph by dragging and dropping them, and also a simple multiple choice style quiz. I also received some very useful feedback from members of the project team and made a number of refinements to the content based on this.
This included updating the punctuation quiz so that if you get three incorrect answers in a quiz a ‘show answer’ button is displayed. Clicking on this puts in all of the answers and shows the ‘well done’ box. This was rather tricky to implement as the script needed to reset the question, including removing all previous answers and ticks, and resetting the initial letter case (if you select a full stop the following letter is automatically capitalised). I also implemented a workaround for answers where a space is acceptable. These no longer count towards the final tally of correct answers, so leaving a space rather than selecting a comma can now result in the ‘well done’ message being displayed. Again, this was rather tricky to implement and it would be good if you could test out this quiz thoroughly to make sure there aren’t any occasions where the quiz breaks.
I also improved navigation throughout the app. I added ‘next’ buttons to all of the quizzes, which either take you to the next section, or to the next part of the quiz, as applicable. I think this works much better than just having the option to return to the page the quiz was linked from. I also added in a ‘hamburger’ button to the footer of every page within a section. Pressing on this takes you to the section’s contents page, and I added ‘next’ and ‘previous’ buttons to the contents pages too, so you can navigate between sections without having to go back to the homepage.
I spent a bit of time fixing the drag / drop quizzes so that the draggable boxes were constrained to each exercise’s boundaries. This seemed to work great until I got to the references quiz, which has quite long sections of draggable text. With the constraint in place it became impossible for the part of the draggable button that triggers the drop to reach the boxes nearest the boundaries of the question, as no part of the button could pass the border. So rather annoyingly I had to remove this feature and just allow people to drag the buttons all over the page. But dropping a button from one question into another will always give you an incorrect answer now, so it’s not too big a problem.
With all of this in place I’ll start working on the app version of the resource next week and will hopefully be able to submit it to the app stores by the end of the week, all being well.
In addition to my work on ARIES, I completed some other tasks for a number of other projects. For Mapping Metaphor I created a couple of scripts for Wendy that output some statistics about the metaphorical connections in the data. For the Thesaurus of Old English I created a facility to enable staff to create new categories and subcategories (previously it was only possible to edit existing categories or add / edit / remove words from existing categories). I met with Nigel Leask and some of the Curious Travellers team on Friday to discuss some details for a new post associated with this project. I had an email discussion with Ronnie Young about the Burns database he wants me to make an online version of. I also met with Jane Stuart-Smith and Rachel MacDonald, who is the new project RA for the SPADE project, and set up a user account for Rachel to manage the project website. I had a chat with Graeme Cannon about a potential project he’s helping put together that may need some further technical input and I updated the DSL website and responded to a query from Ann Ferguson regarding a new section of the site.
I also spent most of a day working on the Edinburgh Gazetteer project, during which I completed work on the new ‘keywords’ feature. It was great to be able to do this as I had been intending to work on this last week but just didn’t have the time. I took Rhona’s keywords spreadsheet, which had page ID in one column and keywords separated by a semi-colon in another and created two database tables to hold the information (one for information about keywords and a joining table to link keywords to individual pages). I then wrote a little script that went through the spreadsheet, extracted the information and added it to my database. I then set to work on adding the actual feature to the website.
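The spreadsheet-to-tables step described above is a standard many-to-many structure, and can be sketched roughly as follows. The in-memory dicts here stand in for the two MySQL tables, and the exact row format is my assumption:

```python
# Sketch of turning the keywords spreadsheet into the two tables
# described above: a keyword lookup table, and a joining table that
# links keywords to individual pages.

def build_keyword_tables(rows):
    """rows: iterable of (page_id, 'keyword; keyword; ...') tuples.
    Returns (keywords, page_keywords): a {keyword: id} lookup plus a
    list of (page_id, keyword_id) join rows."""
    keywords = {}
    page_keywords = []
    for page_id, cell in rows:
        for kw in (part.strip() for part in cell.split(';')):
            if not kw:
                continue
            # Reuse the existing id if the keyword has been seen before.
            kw_id = keywords.setdefault(kw, len(keywords) + 1)
            page_keywords.append((page_id, kw_id))
    return keywords, page_keywords
```

Splitting the keywords out into their own table like this is what makes the keyword pages possible: listing every page for a keyword is just a lookup on the joining table.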
The index page of the Gazetteer now has a section where all of the keywords are listed. There are more than 200 keywords, so currently they appear like ‘bricks’ in a scrollable section, but this might need to be updated as it’s maybe a bit overwhelming. If you click on a keyword a page loads that lists all of the pages that the keyword is associated with. When you load a specific page, either from the keyword page or from the regular browse option, there’s now a section above the page image that lists the associated keywords. Clicking on one of these loads the keyword’s page, allowing you to access any other pages that are associated with it. It’s a pretty simple system but it works well enough. The actual keywords need a bit of work, though, as some are too specific and there are some near duplications due to typos and things like that. Rhona is going to send me an updated spreadsheet and I will hopefully upload this next week.
Oh yes, it was five years ago this week that I started in this post. How time flies.
This week was rather a hectic one as I was contacted by many people who wanted my help and advice with things. I think it’s the time of year – the lecturers are returning from their holidays but the students aren’t back yet, so they start getting on with other things, meaning busy times for me. I had my PDR session on Monday morning, so I spent a fair amount of time at this and then writing things up afterwards. All went fine, and it’s good to know that the work I do is appreciated. After that I had to do a few things for Wendy for Mapping Metaphor. I’d forgotten to run my ‘remove duplicates’ script after I’d made the final update to the MM data, which meant that many of the sample lexemes were appearing twice. Thankfully Wendy spotted this and a quick execution of my script removed 14,286 duplicates in a flash. I also had to update some of the text on the site and change the way search terms are highlighted in the HT, so that links through from MM no longer highlight multiple terms. I also wrote a little script that displays the number of strong and weak metaphorical connections there are for each of the categories, which Wendy wanted.
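The ‘remove duplicates’ pass mentioned above amounts to keeping one copy of each repeated sample lexeme. A minimal sketch, with hypothetical field names and in-memory rows standing in for the real database table:

```python
# Sketch of de-duplicating sample lexemes: where the same lexeme
# appears more than once for the same metaphorical connection, only
# the first occurrence is kept.

def remove_duplicates(lexemes):
    """Return (kept_rows, number_removed) for a list of lexeme dicts,
    treating (connection_id, lexeme) as the identity of a row."""
    seen = set()
    kept = []
    for row in lexemes:
        key = (row['connection_id'], row['lexeme'])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept, len(lexemes) - len(kept)
```

Keeping the first occurrence (rather than an arbitrary one) makes repeated runs of the script idempotent: running it a second time removes nothing.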
My big task for the week was to start on the redevelopment of the ARIES app. I had been expecting to receive the materials for this several weeks earlier as Marc wanted the new app to be ready to launch at the beginning of term. As I’d heard nothing I assumed that this was no longer going to happen, but on Monday Marc gave me access to the files and said the launch must still go ahead at the start of term. There is rather a lot to do and very little time to do it in, especially as preparing stuff for the App Store takes so much time once the app is actually developed. Also, Marc is still revising the materials so even though I’m now creating the new version I’m still going to have to go back and make further updates later on. It’s not exactly an ideal situation. However, I did manage to get started on the redevelopment on Tuesday, and spent pretty much all of my time on Tuesday, Wednesday and Thursday on this task. This involved designing a new interface based on the colours found in the logo file, creating the structure of the app, and migrating the static materials that the team had created in HTML to the JSON file I’m creating for the app contents. This included creating new styles for the new content where required and testing things out on various devices to make sure everything works ok. I also implemented two of the new quizzes, which also took quite a bit of time, firstly because I needed to manually migrate the quiz contents to a format that my scripts could work with and secondly because although the quizzes were similar to ones I’ve written before they were not identical in structure, so needed some reworking in order to meet the requirements. I’m pretty happy with how things are developing, but progress is slow. I’ve only completed the content for three subsections of the app, and there are a further nine sections remaining. 
Hopefully the pace will quicken as I proceed, but I’m worried that the app is not going to be ready for the start of term, especially as the quizzes should really be tested out by the team and possibly tweaked before launch.
I spent most of Friday this week writing the Technical Plan for Thomas Clancy’s new place-name project. Last week I’d sent off a long list of questions about the project and Thomas got back to me with some very helpful answers this week, which really helped in writing the plan. It’s still only a first version and will need further work, but I think the bulk of the technical issues have been addressed now.
Other than these tasks, I responded to a query from Moira Rankin from the Archives about an old project I was involved with, I helped Michael Shaw deal with some more data for The People’s Voice project, I had a chat to Catriona MacDonald about backing up The People’s Voice database, I looked through a database that Ronnie Young had sent me, which I will be turning into an online resource sometime soon (hopefully), I replied to Gerry McKeever about a project he’s running that’s just starting up which I will be involved with, and I replied to John Davies in History about a website query he had sent me. Unfortunately I didn’t get a chance to continue with the Edinburgh Gazetteer work I’d started last week, but I’ll hopefully get a chance to do some further work on this next week.
I was on holiday last week but was back to work on Monday this week. I’d kept tabs on my emails whilst I was away but as usual there were a number of issues that had cropped up in my absence that I needed to sort out. I spent some time on Monday going through emails and updating my ‘to do’ list and generally getting back up to speed again after a lazy week off.
I had rather a lot of meetings and other such things to prepare for and attend this week. On Monday I met with Bryony Randall for a final ‘sign off’ meeting for the New Modernist Editing project. I’ve really enjoyed working on this project, both the creation of the digital edition and taking part in the project workshop. We have now moved the digital edition of Virginia Woolf’s short story ‘Ode written partly in prose on seeing the name of Cutbush above a butcher’s shop in Pentonville’ to what will hopefully be its final and official URL and you can now access it here: http://nme-digital-ode.glasgow.ac.uk
On Tuesday I was on the interview panel for Jane Stuart-Smith’s SPADE project, which I’m also working on for a small percentage of my time. After the interviews I also had a further meeting with Jane to discuss some of the technical aspects of her project. On Wednesday I met with Alison Wiggins to discuss her ‘Archives and Writing Lives’ project, which is due to begin next month. This project will involve creating digital editions of several account books from the 16th century. When we were putting the bid together I did quite a bit of work creating a possible TEI schema for the account books and working out how best to represent all of the various data contained within the account entries. Although this approach would work perfectly well, now that Alison has started transcribing some entries herself we’ve realised that managing complex relational structures via taxonomies in TEI via the Oxygen editor is a bit of a cumbersome process. Instead Alison herself investigated using a relational database structure and had created her own Access database. We went through the structure when we met and everything seems to be pretty nicely organised. It should be possible to record all of the types of data and the relationships between these types using the Access database and so we’ve decided that Alison should just continue to use this for her project. I did suggest making a MySQL database and creating a PHP based content management system for the project, but as there’s only one member of staff doing the work and Alison is very happy using Access it seemed to make sense to just stick with this approach. Later on in the project I will then extract the data from Access, create a MySQL database out of it and develop a nice website for searching, browsing and visualising the data. I will also write a script to migrate the data to our original TEI XML structure as this might prove useful in other projects.
It’s Performance and Development Review time again, and I have my meeting with my line manager coming up, so I spent about a day this week reviewing last year’s objectives and writing all of the required sections for this year. Thankfully having my weekly blog posts makes it easier to figure out exactly what I’ve been up to in the review period.
Other than these tasks I helped Jane Roberts out with an issue with the Thesaurus of Old English, I fixed an issue with the STARN website that Jean Anderson had alerted me to, I had an email conversation with Rhona Brown about her Edinburgh Gazetteer project and I discussed data management issues with Stuart Gillespie. I also uploaded the final set of metaphor data to the Mapping Metaphor database. That’s all of the data processing for this project now completed, which is absolutely brilliant. All categories are now complete and the number of metaphors has gone down from 12938 to 11883, while the number of sample lexemes (including first lexemes) has gone up from 25129 to a whopping 45108.
Other than the above I attended the ‘Future proof IT’ event on Friday. This was an all-day event organised by the University’s IT services and included speakers from JISC, Microsoft, Cisco and various IT related people across the University. It was an interesting day with some excellent speakers, although the talks weren’t as relevant to my role as I’d hoped they would be. I did get to see Microsoft’s HoloLens technology in action, which was great, although I didn’t personally get a chance to try the headset on, which was a little disappointing.