Week Beginning 7th December 2020

I spent most of the week working on the Anglo-Norman Dictionary as we’re planning on launching this next week and there was still much to be done before that.  One of the big outstanding tasks was to reorder all of the citations in all senses within all entries so they are listed by their date.  This was a pretty complex task as each entry may contain up to four different types of sense:  main senses, subsenses and then main senses and subsenses within locutions.  My script needed to be able to extract the dates for each citation within each of these blocks, figure out their date order, rearrange the citations into this order and then overwrite the XML section with the reordered data.  Any loss or mangling of the data would be disastrous, and with almost 60,000 entries being updated it would not be possible to manually check that everything worked in all circumstances.

Updating the XML proved to be a little tricky as I had been manipulating the data with PHP’s SimpleXML functions, and these don’t include a facility to replace a child node.  This meant that I couldn’t tell the script to identify a sense and replace its citations with a new block.  In addition, the XML was not structured to include a ‘citations’ element that contained all of the individual citations for an entry but instead just listed each citation as an ‘attestation’ element within the sense, so it wasn’t possible simply to replace the block of citations with an updated one.  Instead I needed to reconstruct the sense XML in its entirety, including both the complete set of citations and all other elements and attributes contained within the sense, such as IDs, categories and labels.  With a completely new version of the sense XML stored in memory by the script I then needed to write this back to the XML, and for this I needed to use PHP’s DOM manipulation functions because (as mentioned earlier) SimpleXML has no means of identifying and replacing a child node.
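
To illustrate the DOM approach, here’s a minimal sketch (not the actual script) of how a rebuilt sense might be swapped back into the entry XML; the element name ‘sense’, the ‘id’ attribute and the variable names are assumptions for the example:

```php
<?php
// Replace an existing <sense> element with a reconstructed one using DOM,
// since SimpleXML has no way to replace a child node.
// Element names and attributes here are illustrative assumptions.
function replaceSense(string $entryXml, string $senseId, string $newSenseXml): string {
    $doc = new DOMDocument();
    $doc->loadXML($entryXml);

    // Parse the rebuilt sense separately and import it into the entry document.
    $fragment = new DOMDocument();
    $fragment->loadXML($newSenseXml);
    $newSense = $doc->importNode($fragment->documentElement, true);

    // Find the old sense by its ID and swap it out wholesale.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//sense[@id="' . $senseId . '"]') as $oldSense) {
        $oldSense->parentNode->replaceChild($newSense, $oldSense);
    }

    return $doc->saveXML();
}
```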

I managed to get a version of my script working and all seemed to be well with the entries I was using for test purposes, so I ran the script on the full dataset and replaced the data on the website (ensuring that I kept a record of the pre-reordered data handy in case of any problems).  When the editors reviewed the data they noticed that while the reordering had worked successfully for some senses, it had not reordered others.  This was a bit strange and I therefore had to return to my script to figure out what had gone wrong.  I noticed that only the citations in the first sense / subsense / locution sense / locution subsense had been reordered, with the others being skipped.  But when I commented out the part of the script that updated the XML all senses were successfully being picked out.  This seemed strange to me as I didn’t see why the act of identifying senses should be affected by the writing of data.  After some investigation I discovered that with PHP’s SimpleXML implementation, if you iterate through nodes using a ‘foreach’ and then update the item picked out by the loop (so, for example, in ‘foreach($sense as $s)’ updating $s) then subsequent iterations fail.  It would appear that updating $s in this example changes the XML string that’s loaded into memory, which then means the loop reckons it’s reached the end of the matching elements and stops.  My script had different loops for going through senses / subsenses / locution senses / locution subsenses, which is why the first of each type was being updated while the others weren’t.  After I figured this out I updated my script to use a ‘for’ loop instead of a ‘foreach’ and stored $s within the scope of the loop only, and this worked.  With the change in place I reran the script on the full dataset and uploaded it to the website, and thankfully everything appears to have worked.
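
For anyone who runs into the same thing, here’s a stripped-down sketch of the pattern involved (element and variable names are assumed rather than copied from the real script):

```php
<?php
// Problem pattern: writing the rebuilt sense back inside the foreach meant later
// iterations stopped, so only the first sense of each type was reordered.
foreach ($entry->sense as $s) {
    // ...rebuild the citations and write the sense back here: the next iteration then fails...
}

// Workaround: count the senses first and re-fetch the current one each time round,
// so $s only exists within the scope of a single iteration.
$total = count($entry->sense);
for ($i = 0; $i < $total; $i++) {
    $s = $entry->sense[$i];
    // ...rebuild the citations and write this sense back safely...
}
```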

For the rest of the week I worked through my ‘to do’ list, ticking items off. I updated the ‘Blog’ menu item to point to the existing blog site (this will eventually be migrated across).  The ‘Textbase’ menu item now loads a page stating that this feature will be added in 2021.  I managed to implement the ‘source texts’ page as it turns out that I’d already developed much of the underpinnings for this page whilst developing other features.  As with citation popups, it links into the advanced search and also to the DEAF website.  I figured out how to ensure that words with accented characters in citation searches now appear separately in the list from their non-accented versions.  E.g. a search for ‘apres*’ now has ‘apres (28)’ separate from ‘après (4)’ and ‘aprés (2229)’.  We may need to think about the ordering, though, as accented characters are currently appearing at the end of the list.  I also made the words lower case here – they were previously being transformed into upper case.  Exact searches (surrounded by quotes) are still accent-sensitive.  This is required so that the link through from the list of forms to the search results works (otherwise the results display all accented and non-accented forms).  I also ensured that word highlighting in snippets in results now works as it should with accented characters, and upper case initial letters are now retained too.
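
The grouping itself is simple enough; a minimal sketch of the idea, assuming the attested forms have already been pulled out of the citations, might look like this:

```php
<?php
// Lower-case each form while keeping accented and unaccented variants distinct:
// mb_strtolower() changes the case but preserves the accents.
$forms = ['Apres', 'après', 'aprés', 'apres', 'Aprés'];   // assumed sample data

$counts = [];
foreach ($forms as $form) {
    $key = mb_strtolower($form, 'UTF-8');
    $counts[$key] = ($counts[$key] ?? 0) + 1;
}
print_r($counts);   // ['apres' => 2, 'après' => 1, 'aprés' => 2]
```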

I added in an option to return to the list of forms (i.e. the intermediate page) from the search results.  In addition to ‘Refine your search’ there is also a ‘Select another form’ button, and I ensured that the search results page now still appears when there is only one result for citation and translation searches.  I also figured out why multiple words were sometimes being returned in the citation and translation searches.  This was because what looked like spaces between words in the XML were sometimes not regular spaces but non-breaking space characters (\u00a0).  As my script split up citations and translations on spaces these were not being picked up as divisions between words.  I needed to update my script to deal with these characters and then regenerate all of the citation and translation data in order to fix this.
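
The fix itself boils down to treating the non-breaking space as a word division when splitting the text; a rough sketch (variable names and sample text are illustrative):

```php
<?php
// Split citation or translation text on ordinary whitespace OR a non-breaking
// space (U+00A0); the /u flag puts the regex into UTF-8 mode.
$citationText = "pur ceo\u{00a0}que il";   // contains a non-breaking space

$words = preg_split('/[\s\x{00a0}]+/u', $citationText, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);   // ['pur', 'ceo', 'que', 'il']
```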

I also ensured that when conducting a label search the matching labels in an entry page are now highlighted and the page automatically scrolls down to the first matching label.  I also made several tweaks to the XSLT, ensuring that where there are no dates for citations the text ‘TBD’ appears instead and ensuring a number of tags that were not getting properly transformed were handled.

Also this week I made some final changes to the interactive map of Burns Suppers, including tweaking the site icon so it looks a bit nicer, adding a ‘read more’ button to the intro text, fixing the scrolling issue on small screens and updating the text to show 17 filters.  I fixed the issue with the attendance filter and also updated the layout of the filters so they look better on both monitors and mobile devices.

My other main task of the week was to restructure the Mapping Metaphor website based on suggestions for REF from Wendy and Carole.  This required a lot of work as the visualisations needed to be moved to different URLs and the Old English map, which was previously a separate site in a subdirectory, needed to be amalgamated with the main site.

I removed the top-level tabs that linked between MM, MMOE and MetaphorIC and also the ‘quick search’ box.  The ‘metaphor of the day’ page now displays both a main and an OE connection and the ‘Metaphor Map of English’ / ‘Metaphor Map of Old English’ text in the header has been removed.  I reworked the navigation bar in order to allow a sub-navigation bar to appear.  It is now positioned within the header and is centre-aligned.  ‘Home’ now features introductory text rather than the visualisation.  ‘About the project’ now has the new secondary menu rather than the old left-panel menu.  This is because the map pages couldn’t have secondary menu links in the left-hand panel as it’s already used for something else, and it’s better to have the sub-menu displaying consistently across different sections of the site.  I updated the text within several ‘About’ pages and ‘How to Use’, which also now has the new secondary menu.  The main metaphor map is now in the ‘Metaphor Map of English’ menu item.  This has sub-menu items for ‘search’ and ‘browse’.  The OE metaphor map is now in the ‘Metaphor Map of Old English’ menu item.  It also has sub-menu items for ‘search’ and ‘browse’.  The OE pages retain their purple colour to make a clear distinction between the OE map and the main one.  MetaphorIC retains the top-level navigation bar but now only features one link back to the main MM site.  This is right-aligned to avoid getting in the way of the ‘Home’ icon that appears in the top left of sub-pages.  The new site replaced the old one on Friday and I also ensured that all of the old URLs continue to work (e.g. the ‘cite this’ links will continue to work).

Week Beginning 18th March 2019

This week I spent a lot of time continuing with the HT/OED linking task, tackling the outstanding items on my ‘to do’ list before I met with Marc and Fraser on Friday.  This included the following:

Re-running category pattern matching scripts on the new OED categories:  The bulk of the category matching scripts rely on matching the HT’s oedmaincat field against the OED’s path field (and then doing other things like comparing category contents).  However, these scripts aren’t really very helpful with the new OED category table as the path has changed for a lot of the categories.  The script that seemed the most promising was number 17 in our workflow document, which compares first dates of all lexemes in all unmatched OED and HT categories and doesn’t check anything else.  I’ve created an updated version of this that uses the new OED data, and the script only brings back unmatched categories that have at least one word with a GHT date; interestingly, the new data has fewer unmatched categories featuring GHT dates than the old data (591 as opposed to 794).  I’m not really sure why this is, or what might have happened to the GHT dates.  The script brings back five 100% matches (only 3 more than the old data, all but one containing just one word) and 52 matches that don’t meet our criteria (down from 56 with the old data), so it was not massively successful.

Ticking off all matching HT/OED lexemes rather than just those within completely matched categories: 627863 lexemes are now matched.  There are 731307 non-OE words in the HT, so about 86% of these are ticked off.  There are 751156 lexemes in the new OED data, so about 84% of these are ticked off.  Whilst doing this task I noticed another unexpected thing about the new OED data:  the number of categories in ‘01’ and ‘02’ has decreased while the number in ‘03’ has increased.  In the old OED data we have the following number of matched categories:

01: 114968

02: 29077

03: 79282

In the new OED data we have the following number of matched categories:

01: 109956

02: 29069

03: 84260

The totals match up, other than the 42 matched categories that have been deleted in the new data, so (presumably) some categories have changed their top level.  Matching up the HT and OED lexemes has introduced a few additional duplicates, caused when a ‘stripped’ form means multiple words within a category match.  There aren’t too many, but they will need to be fixed manually.

Identifying all words in matched categories that have no GHT dates and seeing which of these can be matched on stripped form alone: I created a script to do this, which lists every unmatched OED word that doesn’t have a GHT date in every matched OED category and then tries to find a matching HT word from the remaining unmatched words within the matched HT category.  Perhaps I misunderstood what was being requested, because there are no matches returned in any of the top-level categories.  But then maybe OED words that don’t have a GHT date are likely to be new words that aren’t in the HT data anyway?

Creating a monosemous script that finds all unmatched HT words that are monosemous and sees whether there are any matching OED words that are also monosemous: Again, I think the script I created will need more work.  It is currently set to only look at lexemes within matched categories.  It finds all the unmatched HT words that are in matched categories, then checks how many times each word appears amongst the unmatched HT words in matched categories of the same POS. If the word only appears once then the script looks within the matched OED category to find a currently unmatched word that matches.  At the moment the script does not check to see if this word is monosemous as I figured that if the word matches and is in a matched category it’s probably a correct match.  Of the 108212 unmatched HT words in matched categories, 70916 are monosemous within their POS and of these 14474 can be matched to an OED lexeme in the corresponding OED category.

Deciding which OED dates to use: I created a script that gets all of the matched HT and OED lexemes in one of the top-level categories (e.g. 01) and then for each matched lexeme works out the largest difference between OED sortdate and HT firstd (if sortdate is later then sortdate-firstd, otherwise firstd-sortdate); works out the largest difference between OED enddate and HT lastd in the same way; and adds these two differences together to work out the largest overall difference.  It then sorts the data on the largest difference and displays all lexemes in a table ordered by largest difference, with additional fields containing the start difference, end difference and total difference for info.  I did, however, encounter a potential issue:  not all HT lexemes have a firstd and lastd.  E.g. words that are ‘OE-’ have nothing in firstd and lastd but instead have ‘OE’ in the ‘oe’ column and ‘_’ in the ‘current’ column.  In such cases the difference between HT and OED dates is massive, but not accurate.  I wonder whether using HT’s apps and appe columns might work better.
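
The ranking logic is essentially just a couple of absolute differences and a sort; a sketch of it (field names follow the description above, and it deliberately reproduces the problem with empty HT dates):

```php
<?php
// For each matched pair, work out the start, end and total date differences and
// sort with the biggest total difference first. Field names are assumptions.
function rankByDateDifference(array $matchedLexemes): array {
    foreach ($matchedLexemes as &$lex) {
        // Empty firstd/lastd values (e.g. 'OE-' words) are treated as 0 here,
        // which is what produces the inflated differences mentioned above.
        $lex['start_diff'] = abs((int)$lex['oed_sortdate'] - (int)$lex['ht_firstd']);
        $lex['end_diff']   = abs((int)$lex['oed_enddate'] - (int)$lex['ht_lastd']);
        $lex['total_diff'] = $lex['start_diff'] + $lex['end_diff'];
    }
    unset($lex);

    usort($matchedLexemes, fn($a, $b) => $b['total_diff'] <=> $a['total_diff']);
    return $matchedLexemes;
}
```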

Looking at lexemes that have an OED citation after 1945, which should be marked as ‘current’:  I created a script that goes through all of the matched lexemes and lists all of the ones that either have an OED sortdate greater than 1945 or an OED enddate greater than 1945 where the matched HT lexeme does not have the ‘current’ flag set to ‘_’.  There are 73919 such lexemes.

On Friday afternoon I had a meeting with Marc and Fraser where we discussed the above and our next steps.  I now have a further long ‘to do’ list, which I will no doubt give more information about next week.

Other than HT duties I helped out with some research proposals this week.  Jane Stuart-Smith and Eleanor Lawson are currently putting a new proposal together and I helped to write the data management plan for this.  I also met with Ophira Gamliel in Theology to discuss a proposal she’s putting together.  This involved reading through a lot of materials and considering all the various aspects of the project and the data requirements of each, as it is a highly multifaceted project.  I’ll need to spend some further time next week writing a plan for the project.

I also had a chat to Wendy Anderson about updating the Mapping Metaphor database, and also the possibility of moving the site to a different domain.  I also met with Gavin Miller to discuss the new website I’ll be setting up for his new Glasgow-wide Medical Humanities Network, and I ran some queries on the DSL database in order to extract entries that reference the OED for some work Fraser is doing.

Finally, I had to make some changes to the links from the Bilingual Thesaurus to the Middle English Dictionary website.  The site has had a makeover, and is looking great, but unfortunately when they redeveloped the site they didn’t put redirects in place from the old URLs to the new ones.  This is pretty bad as it means anyone who has cited or bookmarked a page, not just BTh, will end up with broken links.  I would imagine entries have been cited in countless academic papers and all these citations will now be broken, which is not good.  Anyway, I’ve fixed the MED links in BTh now.  Unfortunately there are two forms of link in the database, for example: http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466 and http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact.  I’m not sure why this is the case and I’ve no idea what the ‘byte’ number refers to in the second link type.  The first type includes the entry ID, which is still used in the new MED URLs.  This means I can get my script to extract the ID from the URL in the database and then replace the rest with the new URL, so the above becomes https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466 as the target for our MED button and links directly through to the relevant entry page on their new site.

Unfortunately there doesn’t seem to be any way to identify an individual entry page for the second type of link.  This means there is no way to link directly to the relevant entry page.  However, I can link to the search results page by passing the headword, and this works pretty well.  So, for example the three words on this page: https://thesaurus.ac.uk/bth/category/?type=search&hw=2&qsearch=catourer&page=1#id=1393 have the second type of link, but if you press on one of the buttons you’ll find yourself at the search results page for that word on the MED website, e.g. https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer.
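
Both link types can be handled with one small rewriting function; a sketch of the approach (the new-site URL patterns are taken from the examples above, everything else is illustrative):

```php
<?php
// Rewrite an old MED link: if it contains an entry ID, link straight to the new
// entry page; otherwise ('byte' style links) fall back to a headword search.
function rewriteMedLink(string $oldUrl, string $headword): string {
    if (preg_match('/[?&]id=(MED\d+)/', $oldUrl, $m)) {
        return 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/' . $m[1];
    }
    return 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary'
        . '?utf8=%E2%9C%93&search_field=hnf&q=' . urlencode($headword);
}

// e.g. rewriteMedLink('http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466', 'catourer')
// returns 'https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466'
```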


Week Beginning 9th April 2018

I returned to work after my Easter holiday on Tuesday this week, making it another four-day week for me.  On Tuesday I spent some time going through my emails and dealing with some issues that had arisen whilst I’d been away.  This included sorting out why plain text versions of the texts in the Corpus of Modern Scottish Writing were giving 403 errors (it turned out the server was set up to not allow plain text files to be accessed and an email to Chris got this sorted).  I also spent some time going through the Mapping Metaphor data for Wendy.  She wanted me to structure the data to allow her to easily see which metaphors continued from Old English times and I wrote a script that gave a nice colour-coded output to show those that continued or didn’t.  I also created another script that lists the number (and the details of) metaphors that begin in each 50-year period across the full range.  In addition, I spoke to Gavin Miller about an estimate of my time for a potential follow-on project he’s putting together.

The rest of my week was split between two projects:  LinguisticDNA and REELS.  For LinguisticDNA I continued to work on the search facilities for the semantically tagged EEBO dataset.  Chris gave me a test server on Tuesday (just an old desktop PC to add to the several others I now have in my office) and I managed to get the database and the scripts I’d started working on before Easter transferred onto it.  With everything set up I continued to add new features to the search facility.  I completed the second search option (Choose a Thematic Heading and a specific book to view the most frequent words), which allows you to specify a Thematic Heading, a book, a maximum number of returned words and whether the theme selection includes lower levels.  I also made it so that you can miss out the selection of a thematic heading to bring back all of the words in the specified book listed by frequency.  If you do this each word’s thematic heading is also listed in the output, and it’s a useful way of figuring out which thematic headings you might want to focus on.

I also added a new option to both searches 1 and 2 that allows you to amalgamate the different noun and verb types.  There are several different types (e.g. NN1 and NN2 for singular and plural forms of nouns) and it’s useful to join these together into single frequency counts rather than having them listed separately.
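
A rough sketch of the amalgamation step, assuming the frequency rows have already been retrieved and that the tag-to-group mapping is along these lines:

```php
<?php
// Merge related POS tags (e.g. NN1/NN2) into a single frequency count per word.
// The mapping and the row shape are assumptions for illustration.
$posGroups = ['NN1' => 'NN', 'NN2' => 'NN', 'VVB' => 'VV', 'VVD' => 'VV', 'VVG' => 'VV'];

$merged = [];
foreach ($rows as $row) {   // each $row: ['word' => ..., 'pos' => ..., 'freq' => ...]
    $pos = $posGroups[$row['pos']] ?? $row['pos'];
    $key = $row['word'] . '|' . $pos;
    $merged[$key] = ($merged[$key] ?? 0) + $row['freq'];
}
arsort($merged);   // most frequent amalgamated forms first
```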

I also completed search option 3 (Choose a specific book to view the most frequent Thematic Headings).  This allows the user to select a book from an autocomplete list and optionally provide a limit to the returned headings.  The results display the thematic headings found in the book listed in order of frequency.  The returned headings are displayed as links that perform a ‘search 2’ for the heading in the book, allowing you to more easily ‘drill down’ into the data.  For all results I have added in a count column, so you can easily see how many results are returned or reference a specific result, and I also added titles to the search results pages that tell you exactly what it is you’ve searched for.  I also created a list of all thematic headings, as I thought it might be handy to be able to see what’s what.  When looking at this list you can perform a ‘search 1’ for any of the headings by clicking on one, and similarly, I created an option to list all of the books that form the dataset.  This list displays each book’s ID, author, title, terms and number of pages, and you can perform a ‘search 3’ for a book by clicking on its ID.

On Friday I participated in the Linguistic DNA project conference call, following which I wrote a document describing the EEBO search facilities, as project members outside of Glasgow can’t currently access the site I’ve put together.

For REELS I continued to work on the public interface for the place-name data, which included the following:

  1. The number of returned place-names is now displayed in the ‘you searched for…’ box
  2. The textual list of results now features two buttons for each result, one to view the record and one to view the place-name on the map.  I’m hoping the latter might be quite useful as I often find an interesting name in the textual list and wonder which dot on the map it actually corresponds to.  Now with one click I can find it.
  3. Place-name labels on the map now appear when you zoom in past a certain level (currently set to zoom level 12).  Note that only results rather than grey spots get the visible labels as otherwise there’s too much clutter and the map takes ages to load too.
  4. The record page now features a map with the place-name at the centre, and all other place-names as grey dots.  The marker label is automatically visible.
  5. Returning back to the search results from a record when you’ve done a quick search now works – previously this was broken.
  6. The map zoom controls have been moved to the bottom right, and underneath them is a new icon for making the map ‘full screen’.  Pressing on this will make the map take up the whole of your screen.  Press ‘Esc’ or on the icon again to return to the regular view.  Note that this feature requires a modern web browser, although I’ve just tested it in IE on Windows 10 and it works.  Using full screen mode makes working with the map much more pleasant.  Note, however, that navigating away from the map (e.g. if you click a ‘view record’ button) will return you to the regular view.
  7. There is a new ‘menu’ icon in the top-left of the map.  Press on this and a menu slides out from the left.  This presents you with options to change how the results are categorised on the map.  In addition to the ‘by classification code’ option that has always been there, you can now categorise and colour code the markers by start date, altitude and element language.  As with code, you can turn on and off particular levels using the legend in the top right. E.g. if you only want to display markers that have an altitude of 300m or more.


Week Beginning 18th December 2017

This was a short week for me as I only worked from Monday to Wednesday due to Christmas coming along.  I spent most of Monday and Tuesday continuing to work on the Technical Plan for Joanna Kopaczyk’s proposal.  As it’s a project with quite a large technical component there was a lot to think about and lots of detail to try and squeeze into the maximum of four pages allowed for a Plan.  My first draft was five pages long, so I had to chop some information out and reformat things to try and bring the length down a bit, but thankfully I managed to get it within the limit whilst still making sense and retaining the important points.  I also chatted with Graeme some more about some of the XML aspects of the project and had an email conversation with Luca about it too.  It was good to get the Plan sent on to Joanna, although it’s still very much a first draft that will need some further tweaking as other aspects of the proposal are firmed up.

I had to fix an issue with the Thesaurus of Old English staff pages on Monday.  The ‘edit lexemes’ form was set to not allow words to be more than 21 characters long.  Jane Roberts had been trying to update the positioning of the word ‘(ge)mearcian mid . . . rōde’, and as this is more than 21 characters any changes made to this row were being rejected.  I’m not sure why I’d set the maximum word length to 21 as the database allows up to 60 characters in this field.  But I updated the check to allow up to 60 characters and that fixed the problem.  I also spent a bit of time on Tuesday gathering some stats for Wendy about the various Mapping Metaphor resources (i.e. the main website, the blog, the iOS app and the Android app).  I also had a chat with Jane Stuart-Smith about an older but still very important site that she would like me to redesign at some point next year, and I started looking through this and thinking how it could be improved.

On Wednesday, as it was my last day before the hols, I decided to focus on something from my ‘to do’ list that would be fun.  I’d been wanting to make a timeline for the Historical Thesaurus for a while so I thought I’d look into that.  What I’ve created so far is a page through which you can pass a category ID and then see all of the words in the category in a visualisation that shows when the word was used, based on the ‘apps’ and ‘appe’ fields in the database.  When a word’s ‘apps’ and ‘appe’ fields are the same it appears as a dot in the timeline, and where the fields are different the word appears as a coloured bar showing the extent of the attested usage.  Note that more complicated date structures such as ‘a1700 + 1850–‘ are not visualised yet, but could be incorporated (e.g. a dot for 1700 then a bar from 1850 to 2000).
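
Behind the scenes the data preparation is straightforward; a simplified sketch of how the rows might be turned into dots and bars for the visualisation (complex date strings are ignored here, as in this first version):

```php
<?php
// Turn 'apps'/'appe' values into timeline items: a dot when the two dates match,
// a bar covering the attested range when they differ. Row shapes are assumptions.
$timelineData = [];
foreach ($words as $word) {   // each $word: ['word' => ..., 'apps' => ..., 'appe' => ...]
    if ($word['apps'] === $word['appe']) {
        $timelineData[] = ['label' => $word['word'], 'type' => 'dot', 'year' => (int)$word['apps']];
    } else {
        $timelineData[] = [
            'label' => $word['word'], 'type' => 'bar',
            'start' => (int)$word['apps'], 'end' => (int)$word['appe'],
        ];
    }
}
echo json_encode($timelineData);   // consumed by the D3 timeline on the page
```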

When you hover over a dot or bar the word and its dates appear below the visualisation.  Eventually (if we’re going to use this anywhere) I would instead have this as a tool-tip pop-up sort of thing.

Here are a couple of screenshots of fitting examples for the festive season.  First up is words for ‘Be gluttonous’:

And here are words for ‘Excess in drinking’:

The next step with this would be to incorporate all subcategories for a category, with different shaded backgrounds for sections for each subcategory and a subcategory heading added in.  I’m not entirely sure where we’d link to this, though.  We could allow people to view the timeline by clicking on a button in the category browse page.  Or we might not want to incorporate it at all, as it might just clutter things up.  BTW, this is a D3 based visualisation created by adapting this code: https://github.com/denisemauldin/d3-timeline

That’s all from me for 2017.  Best wishes for Christmas and the New Year to one and all!

Week Beginning 4th December 2017

I was struck down with some sort of tummy bug at the weekend and wasn’t well enough to come into work on Monday, but I worked from home instead.  Unfortunately although I struggled through the day I was absolutely wiped out by the end of it and ended up being off work sick on Tuesday and Wednesday.  I was mostly back to full health on Thursday, which is the day I normally work from home anyway, so I made it through that day and was back to completely full health on Friday, thankfully.  So I only managed to work for three days this week, and for two of those I wasn’t exactly firing on all cylinders.  However, I still managed to get a few things done this week.

Last week I’d migrated the Mapping Metaphor blog site and after getting approval from Wendy I deleted the old site on Monday.  I took a backup of the database and files before I did so, and then I wrote a little redirect that ensures Google links and bookmarks to specific blog pages point to the correct page on the main Metaphor site.  I also had some further AHRC review duties to take care of, plus I spent some time reading through the Case for Support for Joanna Kopaczyk’s project and thinking about some of the technical implications.  Pauline Mackay also sent me a sample of an Access database she’s put together for her Scots Bawdry project.  I’m going to create an online version of this so I spent a bit of time going through it and thinking about how it would work.

I spent most of Thursday and Friday working on this new system for Pauline, and by the end of the week I had created an initial structure for the online database, some initial search and browse facilities, and some management pages to allow Pauline to add / edit / delete records.  The search page allows users to search for any combination of the following fields:

Verse title, first line, language, theme, type, ms title, publication year, place, publisher and location.  Verse title, first line and ms title are free text and will bring back any records with matching text – e.g. if you enter ‘the’ into ‘verse title’ you can find all records where these three characters appear together in a title.  Publication year allows users to search for an individual year or a range of years (e.g. 1820-1840 brings back everything that has a date between and including these years).  Language, place, publisher and location are drop-down lists that allow you to select one option.  Themes and type are checkboxes allowing you to select any number of options, with each joined by an ‘or’ (e.g. all the records that have a theme of ‘illicit love’ or ‘marriage’).  I can change any of the single selection drop-downs to multiple options (or vice versa) if required.  If multiple boxes are filled in these are joined by ‘and’ – e.g. publication place is Glasgow AND publication year is 1820.
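
The query building follows the usual pattern of ANDing the filled-in boxes together and ORing the checkbox options; a sketch with a few of the fields (table and column names are assumptions, and $pdo is an existing PDO connection):

```php
<?php
// Build the WHERE clause from whichever search boxes were filled in.
$where  = [];
$params = [];

if ($search['verse_title'] !== '') {             // free-text field: partial match
    $where[]  = 'verse_title LIKE ?';
    $params[] = '%' . $search['verse_title'] . '%';
}
if ($search['place'] !== '') {                   // drop-down: exact match
    $where[]  = 'publication_place = ?';
    $params[] = $search['place'];
}
if ($search['year_from'] !== '' && $search['year_to'] !== '') {
    $where[]  = 'publication_year BETWEEN ? AND ?';
    $params[] = $search['year_from'];
    $params[] = $search['year_to'];
}
if (!empty($search['themes'])) {                 // checkboxes: any selected theme matches
    // (in practice the themes live in a joining table, so this would be a join or subquery)
    $where[]  = 'theme_id IN (' . implode(',', array_fill(0, count($search['themes']), '?')) . ')';
    $params   = array_merge($params, $search['themes']);
}

$sql = 'SELECT * FROM verses' . ($where ? ' WHERE ' . implode(' AND ', $where) : '');
$statement = $pdo->prepare($sql);
$statement->execute($params);
$results = $statement->fetchAll();
```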

The browse page presents all of the options in the search form as clickable lists, with each entry having a count to show you how many records match.  For ‘publication year’ only those records with a year supplied are included.  Clicking on a search or browse result displays the full record.  Any content that can be searched for (e.g. publication type) is a link and clicking on it performs a search for that thing.

For the management pages, once logged in a staff user can browse the data, which displays all of the records in one big table.  From here the user can access options to edit or delete a record.  Deleting a record simply deactivates it in the database and I can retrieve it again if required.  Users can also add new records by clicking on the ‘add new row’ link.  I also created a script for importing all of the data from the Access database and I will run this again on a more complete version of the database when Pauline is ready to import everything.  This is all just an initial version, and there will no doubt be a few changes required, but I think it’s all come together pretty well so far.

Week Beginning 27th November 2017

I was off on Tuesday this week to attend my uncle’s funeral.  I spent the rest of the week working on a number of relatively small tasks for a variety of different projects.  The Dictionary of Old English people got back to me on Monday to say they had updated their search system to allow our Thesaurus of Old English site to link directly from our word records to a search for that word on their site.  This was really great news, and I updated our site to add in the direct links.  This is going to be very useful for users of both sites.  I spent a bit more time on AHRC review duties this week, and I also had an email discussion with Joanna Kopaczyk in English Language about a proposal she is putting together.  She sent me the materials she is working on and I read through them all and gave some feedback about the technical aspects.  I’m going to help her to write the Technical Plan for her project soon too.  I also met with Rachel Douglas from the School of Modern Languages to offer some advice on technical matters relating to a project she’s putting together.  Although Rachel is not in my School and I therefore can’t be involved in her project, it was still good to be able to give her a bit of help and show her some examples of digital outputs similar to the sorts of thing she is hoping to produce.

I also spent some further time working on the integration of OED data with the Historical Thesaurus data with Fraser.  Fraser had sent me some further categories that he and a student had manually matched up, and had also asked me to write another script that picks out all of the unmatched HT categories and all of the unmatched OED categories and for each HT category goes through all of the OED categories and finds the one with the lowest Levenshtein score (an algorithm that returns a number showing how many steps it would take to turn one string into another).  My initial version of this script wasn’t ideal, as it included all unmatched OED categories and I’d forgotten that this included several thousand that are ‘top level’ categories that don’t have a part of speech and shouldn’t be matched with our categories at all.  I also realised that the script should only compare categories that have the same part of speech, as my first version was ending up with (for example) a noun category being matched up with an adjective.  I updated the script to bear these things in mind, but unfortunately the output still doesn’t look all that useful.  However, there are definitely some real matches that can be manually picked out from the list, e.g. 31890 ‘locustana pardalina or rooibaadjie’ and ‘locustana pardalina (rooibaadjie)’ and some others around there.  Also 14149 ‘applied to weapon etc’ and ‘applied to weapon, etc’.  It’s over to Fraser again to continue with this.
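
PHP’s built-in levenshtein() function does the heavy lifting here; a rough sketch of the matching pass described above (not the actual script, and the array shapes are assumptions):

```php
<?php
// For an unmatched HT category, find the unmatched OED category of the same part
// of speech whose heading has the smallest Levenshtein distance.
function bestOedMatch(array $htCat, array $unmatchedOedCats): ?array {
    $best = null;
    $bestScore = PHP_INT_MAX;
    foreach ($unmatchedOedCats as $oedCat) {
        // Skip top-level OED categories with no part of speech, and only
        // compare categories of the same part of speech.
        if ($oedCat['pos'] === '' || $oedCat['pos'] !== $htCat['pos']) {
            continue;
        }
        $score = levenshtein($htCat['heading'], $oedCat['heading']);
        if ($score < $bestScore) {
            $bestScore = $score;
            $best = $oedCat + ['score' => $score];
        }
    }
    return $best;   // null if no category of the same POS was found
}
```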

I mentioned last week that I’d updated all of our WordPress sites to version 4.9, but that 4.9.1 would no doubt soon be released.  And in fact it was released this week, so I had to update all of the sites once more.  It’s a bit of a tedious task but it doesn’t really take too long – maybe about half an hour in total.  I also decided to tick an item off my long-term ‘to do’ list as I had a bit of time available.  The Mapping Metaphor site had a project blog, located at a different URL from the main site.  As the project has now ended there are no more blog posts being made so it seems a bit pointless hosting this WordPress site, and having to keep it maintained, when I could just migrate the content to the main MM website as static HTML and delete the WordPress site.  I spent some time investigating WordPress plugins that could export entire sites as static HTML, for example https://en-gb.wordpress.org/plugins/static-html-output-plugin/ and https://wordpress.org/plugins/simply-static/.  These plugins go through a WordPress site, convert all pages and posts to static HTML, pull in the WordPress file uploads folder and wrap everything up as a ZIP file.  This seemed ideal, and the tools both worked very well, but I realised they weren’t exactly what I needed.  Firstly, the Metaphor blog (which was set up before I was involved with the project) just uses page IDs in the URLs, not other sorts of permalinks.  Neither of the plugins works with the default URL style in place, so I’d need to change the link type, meaning the new pages would have different URLs to the old pages, which would be a problem for redirects.  Secondly, both plugins pull in all of the page elements, including the page design, the header and all the rest.  I didn’t actually want all of this stuff but just the actual body of the posts (plus titles and a few other details) so I could slot this into the main MM website template.  So instead of using a plugin I realised it was probably simpler and easier if I just wrote my own little export script that grabbed just the published posts (not pages), for each getting the ID, the title, the main body, the author and the date of creation.  My script hooked into the WordPress functions to make use of the ‘wpautop’ function, which adds paragraph markup to texts, and I also replaced absolute URLs with relative ones.  I then created a temporary table to hold just this data, set my script to insert into it and then I exported this table.  I imported this into the main MM site’s database and wrote a very simple script to pull out the correct post based on the passed ID and that was that. Oh, I also copied the WordPress uploads directory across too, so images and PDFs and such things embedded in posts would continue to work.  Finally, I created a simple list of posts.  It’s exactly what was required and was actually pretty simple to implement, which is a good combination.
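
The export script really was just a few lines; a simplified sketch of the approach (run inside WordPress so get_posts() and wpautop() are available; the blog URL and everything else here are assumptions rather than the actual code):

```php
<?php
// Collect the published posts with just the fields needed for the static version.
require_once 'wp-load.php';   // bootstrap WordPress so its functions can be used

$posts = get_posts(['post_type' => 'post', 'post_status' => 'publish', 'numberposts' => -1]);

$export = [];
foreach ($posts as $post) {
    // wpautop() adds the paragraph markup WordPress would normally apply on display.
    $body = wpautop($post->post_content);
    // Make embedded links relative so they keep working on the main MM site (URL assumed).
    $body = str_replace('http://mappingmetaphorblog.example/', '/', $body);

    $export[] = [
        'id'      => $post->ID,
        'title'   => $post->post_title,
        'author'  => get_the_author_meta('display_name', $post->post_author),
        'created' => $post->post_date,
        'body'    => $body,
    ];
}
// $export is then inserted into a temporary table and moved to the main MM database.
```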

On Thursday I heard that the Historical Thesaurus had been awarded the ‘Queen’s Anniversary Prize for Higher Education’, which is a wonderful achievement for the project.  Marc had arranged a champagne reception on Friday afternoon to celebrate the announcement, so I spent most of the afternoon sipping champagne and eating chocolates, which was a nice way to end the week.

Week Beginning 23rd October 2017

After an enjoyable week’s holiday I returned to work on Monday, spending quite a bit of Monday catching up with some issues people had emailed me about whilst I was away, such as making further tweaks to the ‘Concise Scots Dictionary’ page on the DSL website for Rhona Alcorn (the page is now live if you’d like to order the book: http://dsl.ac.uk/concise-scots-dictionary/), speaking with Luca about a project he’s involved in the planning of that’s going to use some of the DSL data, helping Carolyn Jess-Cooke with some issues she was encountering when accessing one of her websites, giving some information to Brianna of the RNSN project about timeline tools we might use, and a few other such things.

I also spent some time adding paragraph IDs to the ‘Scots Language’ page of the DSL (http://dsl.ac.uk/about-scots/the-scots-language/) for Ann Fergusson to enable references to specific paragraphs to be embedded in other pages.  Implementing this was somewhat complicated by the ‘floating’ contents section on the left as when a ‘hash’ is included in a URL a browser automatically jumps to the ID of the element that has this value.  But for the contents section to float or be fixed to the top of the page depending on which section the user is viewing the page needs to load at the top for the position to be calculated.  If the page loads halfway down then the contents section remains fixed at the top of the page, which is not much use.  However, I managed to get the ‘jump to paragraph from a URL’ feature working with the floating contents section now with a bit of a hack.  Basically, I’ve made it so that the ‘hash’ that gets passed to the page doesn’t actually correspond to an element on the page, so the browser doesn’t jump anywhere.  But my JavaScript grabs the hash after the page has loaded, reworks it to a format that does match an actual element and then smoothly scrolls to this element.   I’ve tested this in Firefox, Chrome, Internet Explorer and Edge and it works pretty well.

I had a couple of queries from Wendy Anderson this week.  The first was for Mapping Metaphor.  Wendy wanted to grab all of the bidirectional metaphors in both the main and OE datasets, including all of their sample lexemes.  I wrote a script that extracted the required data and formatted it as a CSV file, which is just the sort of thing she wanted.  The second query was for all of the metadata associated with the Corpus of Modern Scots Writing texts.  A researcher had contacted Wendy to ask for a copy but although the metadata is in the database and can be viewed on a per text basis through the website, we didn’t have the complete dataset in an easy to share format.  I wrote a little script that queried the database and retrieved all of the data.  I had to do a little digging into how the database was structured in order to do this, as it is a system that wasn’t developed by me.  However, after a little bit of exploration I managed to write a script that grabbed the data about each text, including the multiple authors that can be associated with each text.  I then formatted this as a CSV file and sent the outputted file to Wendy.

I met with Gary on Monday to discuss some changes to the SCOSYA atlas and CMS that he wanted me to implement ahead of an event the team are at next week.  This included adding Google Analytics to the website, updating the legend of the Atlas to make it clearer what the different rating levels meant, separating out the grey squares (which mean no data is present) and the grey circles (meaning data is present but doesn’t meet the specified criteria) into separate layers so they can be switched on and off independently of each other, making the map markers a little smaller, and adding in facilities to allow Gary to delete codes, attributes and code parents via the CMS.  This all took a fair amount of time to implement, and unfortunately I lost a lot of time on Thursday due to a very strange situation with my access to the server.

I work from home on Thursdays and I had intended to work on the ‘delete’ facilities that day, but when I came to log into the server the files and the database appeared to have reverted back to the state they were in in May – i.e. it looked like we had lost almost six months of data, plus all of the updates to the code I’d implemented during this time.  This was obviously rather worrying and I spent a lot of time toing and froing with Arts IT Support to try and figure out what had gone wrong.  This included restoring a backup from the weekend before, which strangely still seemed to reflect the state of things in May.  I was getting very concerned about this when Gary noted that he was seeing two different views of the data on his laptop.  In Safari on his laptop his view of the data appeared to have ‘stuck’ at May while in Chrome he could see the up to date dataset.  I then realised that perhaps the issue wasn’t with the server after all but instead the problem was my home PC (and Safari on Gary’s laptop) was connecting to the wrong server.  Arts IT Support’s Raymond Brasas suggested it might be an issue with my ‘hosts’ file and that’s when I realised what had happened.  As the SCOSYA domain is an ‘ac.uk’ domain and it takes a while for these domains to be set up, we had set up the server long before the domain was running, so to allow me to access the server I had added a line to the ‘hosts’ file on my PC to override what happens when the SCOSYA URL is requested.  Instead of it being resolved by a domain name service my PC pointed at the IP address of the server as I had entered it in my ‘hosts’ file.  Now in May, the SCOSYA site was moved to a new server, with a new IP address, but the old server had never been switched off, so my home PC was still connecting to this old server.  I had only encountered the issue this week because I hadn’t worked on SCOSYA from home since May.  So, it turned out there was no problem with the server, or the SCOSYA data.  I removed the line from my ‘hosts’ file, restarted my browser and immediately I could access the up to date site.  All this took several hours of worry and stress, but it was quite a relief to actually figure out what the issue was and to be able to sort it.

I had intended to start setting up the server for the SPADE project this week, but the machine has not yet been delivered, so I couldn’t work on this.  I did make a few further tweaks to the SPADE website, however, and responded to a couple of queries from Rachel about the SCOTS data and metadata, which the project will be using.

I also met with Fraser to discuss the ongoing issue of linking up the HT and OED data.  We’re at the stage now where we can think about linking up the actual words with categories.  I’d previously written a script that goes through each HT category that matches an OED category and compares the words in each, checking whether an HT word matches the text found in either the OED ‘ght_lemma’ or ‘lemma’ fields.  After our meeting I updated the HT lexeme table to include extra fields for the ID of a matching OED lexeme and whether the lexeme had been checked.  After that I updated the script to go through every matching category in order to ‘tick off’ the matching words within.  The first time I ran my script it crashed the browser, but with a bit of tweaking I got it to successfully complete the second time.  Here are some stats:

There are 655513 HT lexemes that are now matched up with an OED lexeme.  There are 47074 HT lexemes that only have OE forms, so with 793733 HT lexemes in total this means there are 91146 HT lexemes that should have an OED match but don’t.  Note, however, that we still have 12373 HT categories that don’t match OED categories and these categories contain a total of 25772 lexemes.

On the OED side of things, we have a total of 688817 lexemes, and of these 655513 now match an HT lexeme, meaning there are 33304 OED lexemes that don’t match anything.  At least some of these will also be cleared up by future HT / OED category matches.  Of the 655513 OED lexemes that now match, 243521 of them are ‘revised’.  There are 262453 ‘revised’ OED lexemes in total, meaning there are 18932 ‘revised’ lexemes that don’t currently match an HT lexeme.  I think this is all pretty encouraging as it looks like my script has managed to match up the bulk of the data.  It’s just the several thousand edge cases that are going to be a bit more work.

On Wednesday I met with Thomas Widmann of Scots Language Dictionaries to discuss our plans to merge all three of the SLD websites (DSL, SLD and Scuilwab) into one resource that will have the DSL website’s overall look and feel.  We’re going to use WordPress as a CMS for all of the site other than the DSL’s dictionary pages, so as to allow SLD staff to very easily update the content of the site.  It’s going to take a bit of time to migrate things across (e.g. making a new WordPress theme based on the DSL website, creating quick search widgets, updating the DSL dictionary pages to work with the WordPress theme), but we now have the basis of a plan.  I’ll try to get started on this before the year is out.

Finally this week, I responded to a request from Simon Taylor to make a few updates to the REELS system, and I replied to Thomas Clancy about how we might use existing Ordnance Survey data in the Scottish Place-Names survey.  All in all it has been a very busy week.

Week Beginning 18th September 2017

On Monday this week I spent a bit of time creating a new version of the MetaphorIC app, featuring the ‘final’ dataset from Mapping Metaphor.  This new version features almost 12,000 metaphorical connections between categories and more than 30,000 examples of metaphor.  Although the creation of the iOS version went perfectly smoothly (this time), I ran into some difficulties updating the Android app as the build process started giving me some unexplained errors.  I eventually tried dropping the Android app in order to rebuild it, but that didn’t work either and unfortunately dropping the app also deleted its icon files.  After that I had to build the app in a new location, which thankfully worked.  Also thankfully I still had the source files for the icons so I could create them again.  There’s always something that doesn’t go smoothly when publishing apps.  The new version of the app was made available on the Apple App and Google Play stores by the end of the week and you can download either version by following the links here: http://mappingmetaphor.arts.gla.ac.uk/metaphoric/.  That’s Mapping Metaphor and its follow-on project MetaphorIC completely finished now, other than the occasional tweak that will no doubt be required.

I spent the bulk of the rest of the week working on the Burns Paper Database for Ronnie Young.  Last week I started looking at the Access version of the database that Ronnie had sent me, and I managed to make an initial version of a MySQL database to hold the data and created an upload script that populated this table with the data via a CSV file.  This week I met with Ronnie to discuss how to take the project further.  We agreed that rather than having an online content management system through which Ronnie would continue to update the database, he would instead continue to use his Access version and I would then run this through my ‘import’ script to replace the old online version whenever updates are required.  This is a more efficient approach as I already have an upload script and Ronnie is already used to working with his Access database.

We went through the data together and worked out which fields would need to be searchable and browseable, and how the data should be presented.  This was particularly useful as there are some consistency issues with the data, for example in how uncertain dates are recorded, which may include square brackets, asterisks, question marks, the use of ‘or’ and also date ranges.

After the meeting I set to work creating an updated structure for the database and an updated ‘import’ script that would enable the extraction and storage of the data required for search purposes.  This included creating separate tables for year searches, manuscript types, watermarks and countermarks, and also images of both the documents and the watermarks.  It took quite some time to get the import script working properly, but now that it is in place I will be able to run any updated version of the data through this in order to create a new online version.  With this in place I set to work on the actual pages for searching and browsing, viewing results and viewing an individual record.  Much of this I managed to repurpose from my previous work on The People’s Voice database of poems, which helped speed things up considerably.  The biggest issue I encountered was with working with the images of the manuscript pages.  The project contains over 1200 high-resolution images that Ronnie wants users to be able to zoom into and pan around.  In order to work with these images I had to batch process the creation of thumbnails and also the renaming of the images, as they had a mixture of upper and lower case file extensions, which causes problems for case sensitive servers.  I then had to decide on a library that would provide the required zoom and pan functionality.  Previously I’ve used OpenLayers, but this requires large images to be split into tiles, and I didn’t want to have to do this.  Instead I looked at some other JavaScript libraries.  What I really wanted was a ‘google maps’ style interface that allowed multiple levels of zoom.  Unfortunately most libraries didn’t seem to offer this.  I found one called ‘jQuery Panzoom’ (http://timmywil.github.io/jquery.panzoom/demo/) that fitted the bill, and I tried working with this for a while.  Unfortunately, my images were all very large and the pane they will be viewed in is considerably smaller, and it didn’t seem very straightforward to reposition the zoomed image so that it actually appeared visible in the pane when zoomed out by default.  Instead I tried another library called magnifier.js (http://mark-rolich.github.io/Magnifier.js/) that can be set up to have a thumbnail navigation window and a larger main window.  I spent quite a bit of time working with this library and thought everything was going to work out perfectly, but then I encountered a bug:  If you manually set the dimensions of the pane in which the zoomed in image appears and these dimensions are different to the image then the zoomed in image is distorted to fit the pane.  After investigating this issue I discovered it had been raised by someone in 2014 and had not been addressed (see https://github.com/mark-rolich/Magnifier.js/issues/4).  As a distorted image was no good I had to look elsewhere once again.  My third attempt was using the ‘Elevate Zoom’ plugin (http://www.elevateweb.co.uk/image-zoom/examples).  Thankfully I managed to get this working.  It also can be set up to have a thumbnail navigation window and then a larger pane for viewing the zoomed in image.  It can also be set up to use the mouse wheel to zoom in and out, which is ideal.  The only downside is without physical zoom controls there’s no way to zoom in and out when using a touchscreen device.  But as it’s still possible to view the full image at one zoom level I think this is good enough.  
By the end of the week I had pretty much completed the online database and I emailed the details to Ronnie for feedback.
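
For what it’s worth, the batch processing was nothing fancy; a rough sketch of the sort of thing involved, assuming GD is available and with directory names and the thumbnail width made up for the example:

```php
<?php
// Lower-case the image file extensions (to avoid problems on a case-sensitive
// server) and generate a proportionally scaled thumbnail for each image.
$sourceDir  = 'images/';
$thumbDir   = 'images/thumbs/';
$thumbWidth = 200;

foreach (glob($sourceDir . '*.{JPG,jpg,JPEG,jpeg}', GLOB_BRACE) as $path) {
    $newPath = preg_replace('/\.(jpe?g)$/i', '.jpg', $path);
    if ($newPath !== $path) {
        rename($path, $newPath);   // normalise the extension
    }

    [$width, $height] = getimagesize($newPath);
    $thumbHeight = (int)($height * ($thumbWidth / $width));
    $src   = imagecreatefromjpeg($newPath);
    $thumb = imagecreatetruecolor($thumbWidth, $thumbHeight);
    imagecopyresampled($thumb, $src, 0, 0, 0, 0, $thumbWidth, $thumbHeight, $width, $height);
    imagejpeg($thumb, $thumbDir . basename($newPath));
    imagedestroy($src);
    imagedestroy($thumb);
}
```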

Other than the above I also did a little bit of work for the SPADE project, beginning to create a proper interface for the website with Rachel MacDonald, and I had a further chat with Gerry McKeever regarding the website for his new project.

Week Beginning 4th September 2017

I spent a lot of this week continuing with the redevelopment of the ARIES app and thankfully after laying the groundwork last week (e.g. working out the styles and the structure, implementing a couple of exercise types) my rate of progress this week was considerably improved.  In fact, by the end of the week I had added in all of the content and had completed an initial version of the web version of the app.  This included adding in some new quiz types, such as one that allows the user to reorder the sentences in a paragraph by dragging and dropping them, and also a simple multiple choice style quiz.  I also received some very useful feedback from members of the project team and made a number of refinements to the content based on this.

This included updating the punctuation quiz so that if you get three incorrect answers in a quiz a ‘show answer’ button is displayed.  Clicking on this puts in all of the answers and shows the ‘well done’ box.  This was rather tricky to implement as the script needed to reset the question, including removing all previous answers and ticks, and resetting the initial letter case, as if you select a full stop the following letter is automatically capitalised.  I also implemented a workaround for answers where a space is acceptable.  These no longer count towards the final tally of correct answers, so leaving a space rather than selecting a comma can now result in the ‘well done’ message being displayed.  Again, this was rather tricky to implement, and it will be good for the team to test out this quiz thoroughly to make sure there aren’t any occasions where it breaks.

I also improved navigation throughout the app.  I added ‘next’ buttons to all of the quizzes, which either take you to the next section, or to the next part of the quiz, as applicable.  I think this works much better than just having the option to return to the page the quiz was linked from.  I also added in a ‘hamburger’ button to the footer of every page within a section.  Pressing on this takes you to the section’s contents page, and I added ‘next’ and ‘previous’ buttons to the contents pages too, so you can navigate between sections without having to go back to the homepage.

I spent a bit of time fixing the drag / drop quizzes so that the draggable boxes were constrained to each exercise’s boundaries.  This seemed to work great until I got to the references quiz, which has quite long sections of draggable text.  With the constraint in place it became impossible for the part of the draggable button that triggers the drop to reach the boxes nearest the boundaries of the question as none of the button could pass the borders.  So rather annoyingly I had to remove this feature and just allow people to drag the buttons all over the page.  But dropping a button from one question into another will always give you an incorrect answer now, so it’s not too big a problem.

With all of this in place I’ll start working on the app version of the resource next week and will hopefully be able to submit it to the app stores by the end of the week, all being well.

In addition to my work on ARIES, I completed some other tasks for a number of other projects.  For Mapping Metaphor I created a couple of scripts for Wendy that output some statistics about the metaphorical connections in the data.  For the Thesaurus of Old English I created a facility to enable staff to create new categories and subcategories (previously it was only possible to edit existing categories or add / edit / remove words from existing categories).  I met with Nigel Leask and some of the Curious Travellers team on Friday to discuss some details for a new post associated with this project.  I had an email discussion with Ronnie Young about the Burns database he wants me to make an online version of.  I also met with Jane Stuart-Smith and Rachel MacDonald, who is the new project RA for the SPADE project, and set up a user account for Rachel to manage the project website.  I had a chat with Graeme Cannon about a potential project he’s helping put together that may need some further technical input and I updated the DSL website and responded to a query from Ann Ferguson regarding a new section of the site.

I also spent most of a day working on the Edinburgh Gazetteer project, during which I completed work on the new ‘keywords’ feature.  It was great to be able to do this as I had been intending to work on this last week but just didn’t have the time.  I took Rhona’s keywords spreadsheet, which had page ID in one column and keywords separated by a semi-colon in another and created two database tables to hold the information (one for information about keywords and a joining table to link keywords to individual pages).  I then wrote a little script that went through the spreadsheet, extracted the information and added it to my database.  I then set to work on adding the actual feature to the website.
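
The import script itself was very small; a sketch of the idea, assuming the spreadsheet is saved as a CSV with the page ID in the first column and the semi-colon separated keywords in the second (table and column names are made up, and $pdo is an existing PDO connection):

```php
<?php
// Populate a keywords table and a joining table from the keywords spreadsheet.
$insertKeyword = $pdo->prepare('INSERT INTO keywords (keyword) VALUES (?)');
$insertJoin    = $pdo->prepare('INSERT INTO page_keywords (page_id, keyword_id) VALUES (?, ?)');
$keywordIds    = [];   // keyword text => database ID, so each keyword is only stored once

$handle = fopen('keywords.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    $pageId   = (int)$row[0];
    $keywords = array_filter(array_map('trim', explode(';', $row[1])));

    foreach ($keywords as $keyword) {
        if (!isset($keywordIds[$keyword])) {
            $insertKeyword->execute([$keyword]);
            $keywordIds[$keyword] = $pdo->lastInsertId();
        }
        $insertJoin->execute([$pageId, $keywordIds[$keyword]]);
    }
}
fclose($handle);
```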

The index page of the Gazetteer now has a section where all of the keywords are listed.  There are more than 200 keywords so it’s a lot of information.  Currently the keywords appear like ‘bricks’ in a scrollable section, but this might need to be updated as it’s maybe a bit much information.  If you click on a keyword a page loads that lists all of the pages that the keyword is associated with.  When you load a specific page, either from the keyword page or from the regular browse option, there’s now a section above the page image that lists the associated keywords.  Clicking on one of these loads the keyword’s page, allowing you to access any other pages that are associated with it.  It’s a pretty simple system but it works well enough.  The actual keywords need a bit of work, though, as some are too specific and there are some near duplications due to typos and things like that.  Rhona is going to send me an updated spreadsheet and I will hopefully upload this next week.

Oh yes, it was five years ago this week that I started in this post.  How time flies.

Week Beginning 28th August 2017

This week was rather a hectic one as I was contacted by many people who wanted my help and advice with things.  I think it’s the time of year – the lecturers are returning from their holidays but the students aren’t back yet so they start getting on with other things, meaning busy times for me.  I had my PDR session on Monday morning, so I spent a fair amount of time at this and then writing things up afterwards.  All went fine, and it’s good to know that the work I do is appreciated.  After that I had to do a few things for Wendy for Mapping Metaphor.  I’d forgotten to run my ‘remove duplicates’ script after I’d made the final update to the MM data, which meant that many of the sample lexemes were appearing twice.  Thankfully Wendy spotted this and a quick execution of my script removed 14,286 duplicates in a flash.  I also had to update some of the text on the site, update the way search terms are highlighted in the HT to avoid links through from MM highlighting multiple terms.  I also wrote a little script that displays the number of strong and weak metaphorical connections there are for each of the categories, which Wendy wanted.

My big task for the week was to start on the redevelopment of the ARIES app.  I had been expecting to receive the materials for this several weeks earlier as Marc wanted the new app to be ready to launch at the beginning of term.  As I’d heard nothing I assumed that this was no longer going to happen, but on Monday Marc gave me access to the files and said the launch must still go ahead at the start of term.  There is rather a lot to do and very little time to do it in, especially as preparing stuff for the App Store takes so much time once the app is actually developed.  Also, Marc is still revising the materials so even though I’m now creating the new version I’m still going to have to go back and make further updates later on.  It’s not exactly an ideal situation.  However, I did manage to get started on the redevelopment on Tuesday, and spent pretty much all of my time on Tuesday, Wednesday and Thursday on this task.  This involved designing a new interface based on the colours found in the logo file, creating the structure of the app, and migrating the static materials that the team had created in HTML to the JSON file I’m creating for the app contents.  This included creating new styles for the new content where required and testing things out on various devices to make sure everything works ok.  I also implemented two of the new quizzes, which also took quite a bit of time, firstly because I needed to manually migrate the quiz contents to a format that my scripts could work with and secondly because although the quizzes were similar to ones I’ve written before they were not identical in structure, so needed some reworking in order to meet the requirements.  I’m pretty happy with how things are developing, but progress is slow.  I’ve only completed the content for three subsections of the app, and there are a further nine sections remaining.  Hopefully the pace will quicken as I proceed, but I’m worried that the app is not going to be ready for the start of term, especially as the quizzes should really be tested out by the team and possibly tweaked before launch.

I spent most of Friday this week writing the Technical Plan for Thomas Clancy’s new place-name project.  Last week I’d sent off a long list of questions about the project and Thomas got back to me with some very helpful answers this week, which really helped in writing the plan.  It’s still only a first version and will need further work, but I think the bulk of the technical issues have been addressed now.

Other than these tasks, I responded to a query from Moira Rankin from the Archives about an old project I was involved with, I helped Michael Shaw deal with some more data for The People’s Voice project, I had a chat to Catriona MacDonald about backing up The People’s Voice database, I looked through a database that Ronnie Young had sent me, which I will be turning into an online resource sometime soon (hopefully), I replied to Gerry McKeever about a project he’s running that’s just starting up which I will be involved with, and I replied to John Davies in History about a website query he had sent me.  Unfortunately I didn’t get a chance to continue with the Edinburgh Gazetteer work I’d started last week, but I’ll hopefully get a chance to do some further work on this next week.